I need to drive a website that is rendered almost entirely with JavaScript. I have been able to detect the rendered page and navigate it so far, but there are variables in the script that I'd like to read to make some navigation decisions. I can identify tags using XPath, but I can't get the text between them. To be clear, I do not wish to execute JavaScript, just read the variables in the JavaScript on the page. I'm having trouble finding any documentation that spells out what I need. In one thread someone mentioned using a document object, but I'm not sure how to get to that programmatically.
I'd really appreciate a hint here. Thanks very much in advance for your help.
I figured it out: WebDriver.getPageSource(). Since I couldn't find a JavaScript parser, I located the bits I wanted with a regular expression and then converted the JSON into an object with JSON.simple.
private String getRandomProvider() {
    String shortName = "";
    JSONArray providers;
    String page = this.getPageSource();
    // Capture the JSON array literal assigned to domainBootstrap.providers
    Pattern pattern = Pattern.compile("domainBootstrap\\.providers = (\\[,?\\{.*\\}\\]);");
    Matcher matcher = pattern.matcher(page);
    if (matcher.find()) {
        try {
            providers = (JSONArray) new JSONParser().parse(matcher.group(1));
            // Pick one provider at random and read its short name
            int randomProvider = (int) (Math.random() * providers.size());
            JSONObject provider = (JSONObject) providers.get(randomProvider);
            shortName = provider.get("shortName").toString();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
    return shortName;
}
Goal
Transform HTML extracted from Telligent (an extranet platform) to plain text and send to Slack
Setup
A Telligent webhook is triggered when an event occurs. An Azure Logic App receives the event JSON. The JSON value is in HTML. A JavaScript Azure Function inside the Azure Logic App pipeline transforms the JSON value to plain text. The final step in the pipeline posts the plain text in Slack.
Example of incoming code to the Azure Function
"body": "<p>" '</p><div style=\"clear:both;\"></div>"
Transformation method
This is the basic code in the Azure Function. I have left out parts that seem irrelevant to this question but can provide the entire script if that is necessary.
module.exports = function (context, data) {
    var html = data.body;
    // Change HTML to plain text
    var text = JSON.stringify(html.body);
    var noHtml = text.replace(/<(?:.|\n)*?>/gm, '');
    var noHtmlEncodeSingleQuote = noHtml.replace(/'/g, "'");
    var noHtmlEncodeDoubleQuote = noHtmlEncodeSingleQuote.replace(/"/g, "REPLACEMENT");
    // Compile body for Slack
    var readyString = "Slack text: " + noHtmlEncodeDoubleQuote;
    // Response of the function to be used later
    context.res = {
        body: readyString
    };
    context.done();
};
Results
The single quote is replaced successfully and resolves accurately when posted in Slack.
The following replacement methods for the double quote throw a Status: 500 Internal Server Error within the Azure Function.
Unsuccessful replacement methods
"\""
'"'
"
"'"'"
"["]"
"(")"
Putting these replacement methods in their own var also throws the same error. E.g.:
var replace = "\""
...
var noHtmlEncodeDoubleQuote = noHtmlEncodeSingleQuote.replace(/"/g, replace);
The code appears to be correct because when I replace " with something like abc, the replacement is successful.
Thank you
Please forgive my JavaScript as I am not a programmer and am seeking to streamline a process for my job. However I am grateful for any advice both about the code or my entire approach.
Generally, you don't want to try to parse HTML with regular expressions or string replacement. There are just too many things that can go wrong. See this now famous StackOverflow answer. (It was even made into a T-Shirt.)
Instead, you should use a technique that is built for this purpose. If you were in a web browser, you could use the techniques described in the answers to this question. But in Azure Functions, your JavaScript doesn't run in a browser; it runs in a Node.js environment. Therefore, you will need to use a library such as Cheerio or htmlparser2 (there are others).
Here is an example using Cheerio:
var cheerio = require('cheerio');
var text = cheerio.load(html.body).text();
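A side note on the snippet in the question, offered as something to check rather than a diagnosis: JSON.stringify applied to a value that is already a string returns a JSON string literal, so it wraps the text in double quotes and backslash-escapes any inner ones. The sketch below uses a made-up payload (the data object and its contents are assumptions) to show the difference.

```javascript
// Hypothetical payload, shaped like data.body.body in the question.
const data = { body: { body: '<p>say "hi"</p>' } };

const raw = data.body.body;               // already a plain string
const stringified = JSON.stringify(raw);  // a JSON string literal of that string

console.log(raw);          // the markup, unchanged
console.log(stringified);  // same text wrapped in quotes, inner quotes escaped
```

If the stringified form is not needed downstream, handing the raw string to the HTML parser avoids the extra quoting altogether.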
Also, regarding this part:
... as I am not a programmer ...
Yes you are. You are clearly programming right now. Anyone who writes code is a programmer. There is no club or secret handshake. We all start out exactly like this. Good job asking questions, and good luck on your journey!
http://www.biletix.com/search/TURKIYE/en#!subcat_interval:12/12/15TO19/12/15
I want to get data from this website. When I use jsoup, it can't process the page because of the JavaScript. Despite all my efforts, I still couldn't manage it.
As you can see, I only want to get the name and URL. Then I can go to that URL and get the begin and end times and the location.
I don't want to use headless browsers. Do you know any alternatives?
Sometimes javascript and json based web pages are easier to scrape than plain html ones.
If you carefully inspect the network traffic (for example, with browser developer tools), you'll see that the page makes a GET request that returns a JSON string with all the data you need. You'll be able to parse that JSON with any JSON library.
URL is:
http://www.biletix.com/solr/en/select/?start=0&rows=100&fq=end%3A[2015-12-12T00%3A00%3A00Z%20TO%202015-12-19T00%3A00%3A00Z%2B1DAY]&sort=vote%20desc,start%20asc&&wt=json
You can generate this URL in a similar way you are generating the URL you put in your question.
A fragment of the json you'll get is:
....
"id":"SZ683",
"venuecount":"1",
"category":"ART",
"start":"2015-12-12T18:30:00Z",
"subcategory":"tiyatro$ART",
"name":"The Last Couple to Meet Online",
"venuecode":"BT",
.....
There you can see the name and URL is easily generated using id field (SZ683), for example: http://www.biletix.com/etkinlik/SZ683/TURKIYE/en
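As a rough sketch of that URL generation (written in JavaScript for brevity; buildSearchUrl and eventUrl are hypothetical helpers, and the parameter set simply mirrors the request shown above, so it may need adjusting):

```javascript
// Build the Solr search URL for events ending between two dates (YYYY-MM-DD).
function buildSearchUrl(from, to) {
  const range = `[${from}T00:00:00Z TO ${to}T00:00:00Z+1DAY]`;
  const params = [
    'start=0',
    'rows=100',
    'fq=' + encodeURIComponent('end:' + range),
    'sort=' + encodeURIComponent('vote desc,start asc'),
    'wt=json',
  ];
  return 'http://www.biletix.com/solr/en/select/?' + params.join('&');
}

// Build the event page URL from the "id" field of a result.
function eventUrl(id) {
  return 'http://www.biletix.com/etkinlik/' + encodeURIComponent(id) + '/TURKIYE/en';
}

console.log(buildSearchUrl('2015-12-12', '2015-12-19'));
console.log(eventUrl('SZ683'));
```

Note that encodeURIComponent also percent-encodes the square brackets and the comma, which is an equivalent spelling of the URL shown above.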
------- EDIT -------
Getting the JSON data is more difficult than I initially thought. The server requires a cookie in order to return the correct data, so we need:
To do a first GET, fetch the cookie, and then do a second GET to obtain the JSON data. This is easy using Jsoup.
Then we will parse the response using org.json.
This is a working example:
// Example only: please DON'T use this in production code without error handling and more robust parsing.
// Note that the smallest change on the server will break this code!
public static void main(String[] args) throws IOException {
    // Initial GET to retrieve the cookie
    Document doc = Jsoup.connect("http://www.biletix.com/").get();
    Element head = doc.head();
    // Needs error control
    String script = head.select("script").get(0).html();
    // Not the most robust way of doing it ...
    Pattern p = Pattern.compile("document\\.cookie\\s*=\\s*'(\\w+)=(.*?);");
    Matcher m = p.matcher(script);
    if (!m.find()) {
        throw new IllegalStateException("Cookie assignment not found in page script");
    }
    String cookieName = m.group(1);
    String cookieValue = m.group(2);
    // I'm supposing the URL is already built.
    // With the last part of the URL (json.wrf=jsonp1450136314484) removed, the result is easier to parse.
    String url = "http://www.biletix.com/solr/tr/select/?start=0&rows=100&q=subcategory:tiyatro$ART&qt=standard&fq=region:%22ISTANBUL%22&fq=end%3A%5B2015-12-15T00%3A00%3A00Z%20TO%202017-12-15T00%3A00%3A00Z%2B1DAY%5D&sort=start%20asc&&wt=json";
    Document document = Jsoup.connect(url)
            .cookie(cookieName, cookieValue) // with the cookie set we get the correct results
            .get();
    String bodyText = document.body().text();
    // Parse the JSON and extract the data
    JSONObject jsonObject = new JSONObject(bodyText);
    JSONArray jsonArray = jsonObject.getJSONObject("response").getJSONArray("docs");
    for (Object object : jsonArray) {
        JSONObject item = (JSONObject) object;
        System.out.println("name = " + item.getString("name"));
        System.out.println("link = " + "http://www.biletix.com/etkinlik/" + item.getString("id") + "/TURKIYE/en");
        // Similarly, you can fetch more info ...
        System.out.println();
    }
}
I skipped the URL generation as I suppose you know how to generate it.
I hope the explanation is clear; English isn't my first language, so it is difficult for me to explain myself.
How can I prevent XSS attacks in a JSP/Servlet web application?
XSS can be prevented in JSP by using JSTL <c:out> tag or fn:escapeXml() EL function when (re)displaying user-controlled input. This includes request parameters, headers, cookies, URL, body, etc. Anything which you extract from the request object. Also the user-controlled input from previous requests which is stored in a database needs to be escaped during redisplaying.
For example:
<p><c:out value="${bean.userControlledValue}" /></p>
<p><input name="foo" value="${fn:escapeXml(param.foo)}"></p>
This will escape characters which may malform the rendered HTML such as <, >, ", ' and & into HTML/XML entities such as &lt;, &gt;, &quot;, &apos; and &amp;.
Note that you don't need to escape them in the Java (Servlet) code, since they are harmless over there. Some may opt to escape them during request processing (as you do in a Servlet or Filter) instead of response processing (as you do in JSP), but this way you risk that the data unnecessarily gets double-escaped (e.g. & becomes &amp;amp; instead of &amp; and ultimately the end user would see &amp; being presented), or that the DB-stored data becomes unportable (e.g. when exporting data to JSON, CSV, XLS, PDF, etc., which don't require HTML-escaping at all). You'll also lose social control because you no longer know what the user actually filled in. As a site admin you'd really like to know which users/IPs are trying to perform XSS, so that you can easily track them and take action accordingly. Escaping during request processing should only be used as a last resort, when you really need to fix a train wreck of a badly developed legacy web application in the shortest time possible. Still, you should ultimately rewrite your JSP files to become XSS-safe.
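The double-escaping risk is easy to reproduce in any language. The sketch below uses JavaScript and a hypothetical escapeHtml helper standing in for fn:escapeXml(); it only illustrates the effect and is not the JSTL implementation.

```javascript
// Hypothetical stand-in for fn:escapeXml(): escapes the five special characters.
function escapeHtml(s) {
  return s
    .replace(/&/g, '&amp;')   // must run first so the entities added below are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

const input = 'Tom & Jerry <b>';
const once = escapeHtml(input);   // escaped at output time only
const twice = escapeHtml(once);   // escaped on input AND again on output

console.log(once);   // Tom &amp; Jerry &lt;b&gt;
console.log(twice);  // Tom &amp;amp; Jerry &amp;lt;b&amp;gt;
```

A browser renders the first result as the text the user typed, while the second shows up with visible entity names in it.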
If you'd like to redisplay user-controlled input as HTML wherein you would like to allow only a specific subset of HTML tags like <b>, <i>, <u>, etc, then you need to sanitize the input by a whitelist. You can use a HTML parser like Jsoup for this. But, much better is to introduce a human friendly markup language such as Markdown (also used here on Stack Overflow). Then you can use a Markdown parser like CommonMark for this. It has also builtin HTML sanitizing capabilities. See also Markdown or HTML.
The only concern in the server side with regard to databases is SQL injection prevention. You need to make sure that you never string-concatenate user-controlled input straight in the SQL or JPQL query and that you're using parameterized queries all the way. In JDBC terms, this means that you should use PreparedStatement instead of Statement. In JPA terms, use Query.
An alternative would be to migrate from JSP/Servlet to Java EE's MVC framework JSF. It has built-in XSS (and CSRF!) prevention all over the place. See also CSRF, XSS and SQL Injection attack prevention in JSF.
The question of how to prevent XSS has been asked several times. You will find a lot of information on Stack Overflow. Also, the OWASP website has an XSS prevention cheat sheet that you should go through.
As for libraries, OWASP's ESAPI library has a Java flavour. You should try that out. Besides that, every framework that you use has some protection against XSS. Again, the OWASP website has information on the most popular frameworks, so I would recommend going through their site.
I had great luck with OWASP Anti-Samy and an AspectJ advisor on all my Spring Controllers that blocks XSS from getting in.
public class UserInputSanitizer {
    private static Policy policy;
    private static AntiSamy antiSamy;

    // Lazily load the policy and create the scanner on first use
    private static AntiSamy getAntiSamy() throws PolicyException {
        if (antiSamy == null) {
            policy = getPolicy("evocatus-default");
            antiSamy = new AntiSamy();
        }
        return antiSamy;
    }

    public static String sanitize(String input) {
        CleanResults cr;
        try {
            cr = getAntiSamy().scan(input, policy);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return cr.getCleanHTML();
    }

    private static Policy getPolicy(String name) throws PolicyException {
        Policy policy =
            Policy.getInstance(Policy.class.getResourceAsStream("/META-INF/antisamy/" + name + ".xml"));
        return policy;
    }
}
You can get the AspectJ advisor from this Stack Overflow post.
I think this is a better approach than c:out, particularly if you do a lot of JavaScript.
Managing XSS requires multiple validations of the data coming from the client side.
Input validation (form validation) on the server side. There are multiple ways of going about it. You can try JSR 303 bean validation (Hibernate Validator) or the ESAPI input validation framework. Though I've not tried it myself (yet), there is an annotation that checks for safe HTML (@SafeHtml). You could in fact use Hibernate Validator with Spring MVC for bean validation -> Ref
Escaping URL requests - For all your HTTP requests, use some sort of XSS filter. I've used the following for our web app and it takes care of cleaning up the HTTP URL request - http://www.servletsuite.com/servlets/xssflt.htm
Escaping data/HTML returned to the client (see @BalusC's explanation above).
I would suggest regularly testing for vulnerabilities using an automated tool, and fixing whatever it finds. It's a lot easier to suggest a library to help with a specific vulnerability than for all XSS attacks in general.
Skipfish is an open source tool from Google that I've been investigating: it finds quite a lot of stuff, and seems worth using.
There is no easy, out-of-the-box solution against XSS. The OWASP ESAPI API has some very useful support for escaping, and it comes with tag libraries.
My approach was basically to extend the Struts 2 tags in the following ways.
Modify s:property tag so it can take extra attributes stating what sort of escaping is required (escapeHtmlAttribute="true" etc.). This involves creating a new Property and PropertyTag classes. The Property class uses OWASP ESAPI api for the escaping.
Change freemarker templates to use the new version of s:property and set the escaping.
If you don't want to modify the classes in step 1, another approach would be to import the ESAPI tags into the FreeMarker templates and escape as needed. Then if you need to use an s:property tag in your JSP, wrap it with an ESAPI tag.
I have written a more detailed explanation here.
http://www.nutshellsoftware.org/software/securing-struts-2-using-esapi-part-1-securing-outputs/
I agree escaping inputs is not ideal.
My personal opinion is that you should avoid using JSP/ASP/PHP/etc pages. Instead output to an API similar to SAX (only designed for calling rather than handling). That way there is a single layer that has to create well formed output.
If you want to automatically escape all JSP variables without having to explicitly wrap each variable, you can use an EL resolver as detailed here with full source and an example (JSP 2.0 or newer), and discussed in more detail here:
For example, by using the above mentioned EL resolver, your JSP code will remain like so, but each variable will be automatically escaped by the resolver
...
<c:forEach items="${orders}" var="item">
<p>${item.name}</p>
<p>${item.price}</p>
<p>${item.description}</p>
</c:forEach>
...
If you want to force escaping by default in Spring, you could consider this as well, but it doesn't escape EL expressions, just tag output, I think:
http://forum.springsource.org/showthread.php?61418-Spring-cross-site-scripting&p=205646#post205646
Note: Another approach to EL escaping that uses XSL transformations to preprocess JSP files can be found here:
http://therning.org/niklas/2007/09/preprocessing-jsp-files-to-automatically-escape-el-expressions/
If you want to make sure that your $ operator does not suffer from XSS hack you can implement ServletContextListener and do some checks there.
The complete solution at: http://pukkaone.github.io/2011/01/03/jsp-cross-site-scripting-elresolver.html
// Imports assume the javax.* namespace and Apache Commons Lang 3
import java.beans.FeatureDescriptor;
import java.util.Iterator;
import javax.el.ELContext;
import javax.el.ELResolver;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;
import javax.servlet.jsp.JspFactory;
import org.apache.commons.lang3.StringEscapeUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@WebListener
public class EscapeXmlELResolverListener implements ServletContextListener {
    private static final Logger LOG = LoggerFactory.getLogger(EscapeXmlELResolverListener.class);

    @Override
    public void contextInitialized(ServletContextEvent event) {
        LOG.info("EscapeXmlELResolverListener initialized ...");
        JspFactory.getDefaultFactory()
                .getJspApplicationContext(event.getServletContext())
                .addELResolver(new EscapeXmlELResolver());
    }

    @Override
    public void contextDestroyed(ServletContextEvent event) {
        LOG.info("EscapeXmlELResolverListener destroyed");
    }

    /**
     * {@link ELResolver} which escapes XML in String values.
     */
    public class EscapeXmlELResolver extends ELResolver {
        private ThreadLocal<Boolean> excludeMe = new ThreadLocal<Boolean>() {
            @Override
            protected Boolean initialValue() {
                return Boolean.FALSE;
            }
        };

        @Override
        public Object getValue(ELContext context, Object base, Object property) {
            try {
                if (excludeMe.get()) {
                    return null;
                }
                // This resolver is in the original resolver chain. To prevent
                // infinite recursion, set a flag to prevent this resolver from
                // invoking the original resolver chain again when its turn in the
                // chain comes around.
                excludeMe.set(Boolean.TRUE);
                Object value = context.getELResolver().getValue(
                        context, base, property);
                if (value instanceof String) {
                    value = StringEscapeUtils.escapeHtml4((String) value);
                }
                return value;
            } finally {
                excludeMe.remove();
            }
        }

        @Override
        public Class<?> getCommonPropertyType(ELContext context, Object base) {
            return null;
        }

        @Override
        public Iterator<FeatureDescriptor> getFeatureDescriptors(ELContext context, Object base) {
            return null;
        }

        @Override
        public Class<?> getType(ELContext context, Object base, Object property) {
            return null;
        }

        @Override
        public boolean isReadOnly(ELContext context, Object base, Object property) {
            return true;
        }

        @Override
        public void setValue(ELContext context, Object base, Object property, Object value) {
            throw new UnsupportedOperationException();
        }
    }
}
Again: This only guards the $. Please also see other answers.
<%@ page import="org.apache.commons.lang.StringEscapeUtils" %>
String str=request.getParameter("urlParam");
String safeOuput = StringEscapeUtils.escapeXml(str);
Maybe this is simple, maybe this is a bug on Parse - would like to know if anyone has had the same problem and a possible solution.
What I'm trying to do:
I'm sending a JSON request from an app called FormEntry to my Parse app
The body comes in like this: json={"someLabel" : "someValue"}
I would like to take the entire body and create a Parse.Cloud.httpRequest over to Zapier to perform some functions.
Now, the problem seems to be this:
On random occasions (i.e. I have no idea why), the body is sent (as shown by the logs) where there is a trailing comma at the end of the last pair in the JSON object. e.g. like this json={"lastLabel" : "lastValue",}
The number of elements in 'normal' and 'incorrect' objects seem to be the same, so it's simply just another comma added. And I have no idea why.
My setup:
Using app.use(parseExpressRawBody()); only and not the standard app.use(express.bodyParser()); which doesn't provide access to the raw body.
Because parseExpressRawBody converts the body to a buffer I need to turn it back into a string to send it in the HTTP request in a meaningful way. Therefore I use: var body = req.body.toString();
When I log this variable to the Parse console, it appears to have been converted back from the buffer correctly.
And that's about it. Nothing complex going on here but a real annoying bug that I just haven't found a sensible way of understanding. Would SUPER appreciate anyone who has seen this before or who could point me in a direction to focus on.
Just an update on this. Not a solution that answers why there is malformed JSON but a hack to get the right result.
The purpose of the HTTP request was to point over to Zapier so I wrote a Zapier script that would deal with the malformed JSON. Added here for anyone else who needs it.
"use strict";
var Zap = {
    newSubmission_catch_hook: function(bundle) {
        var body = bundle.request.content;
        // Drop the leading "json=" prefix
        var cleanTop = body.substring(5, body.length);
        var cleanChar = cleanTop.length;
        // Last two characters of the payload
        var condition = cleanTop.substring(cleanChar - 2, cleanChar);

        function testCase(condition, cleanTop) {
            if (condition != ",}") {
                console.log("Everything is fine, returning JSON");
                return cleanTop;
            }
            else {
                console.log("Nope! We have an error, cleaning end");
                // Replace the trailing ",}" with "}"
                var cleanEnd = cleanTop.substr(0, cleanChar - 2) + '}';
                console.log("The object now ends with: " + cleanEnd.substr(-10));
                return cleanEnd;
            }
        }

        var newBody = JSON.parse(testCase(condition, cleanTop));
        return newBody;
    }
};
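For comparison, the same cleanup can be sketched more compactly with a regular expression. This is a workaround under the same assumptions as the script above (the body starts with json= and the only defect is a single trailing comma before the final brace); parseFormEntryBody is a hypothetical name.

```javascript
// Strip the "json=" prefix and a trailing comma before the closing brace, then parse.
function parseFormEntryBody(body) {
  const json = body
    .replace(/^json=/, '')
    .replace(/,\s*\}\s*$/, '}');
  return JSON.parse(json);
}

console.log(parseFormEntryBody('json={"lastLabel" : "lastValue",}'));
console.log(parseFormEntryBody('json={"someLabel" : "someValue"}'));
```

Like the original script, this hides the symptom rather than explaining where the extra comma comes from.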
I am semi-new to ASP.NET MVC. I am building an app that is used internally for my company.
The scenario is this: there are two Html.Listbox controls. One has all the database information, and the other is initially empty. The user adds items from the database list box to the empty list box.
Every time the user adds a command, I call a js function that calls an ActionResult "AddCommand" in my EditController. In the controller, the selected items that are added are saved to another database table.
Here is the code (this gets called every time an item is added):
function Add(listbox) {
    ...
    // skipping initializing code for brevity
    var url = "/Edit/AddCommand/" + cmd;
    $.post(url);
}
So the problem occurs when the 'cmd' is an item that has a '/', ':', '%', '?', etc (some kind of special character)
So what I'm wondering is, what's the best way to escape these characters? Right now I'm checking the database's listbox item's text, and rebuilding the string, then in the Controller, I'm taking that built string and turning it back into its original state.
So for example, if the item they are adding is 'Cats/Dogs', I am posting 'Cats[SLASH]Dogs' to the controller, and in the controller changing it back to 'Cats/Dogs'.
Obviously this is a horrible hack, so I must be missing something. Any help would be greatly appreciated.
Why not just take this out of the URI? You're doing a POST, so put it in the form.
If your action is:
public ActionResult AddCommand(string cmd) { // ...
...then you can do:
var url = "/Edit/AddCommand";
var data = { cmd: cmd };
$.post(url, data);
... and everything will "just work" with no separate encoding step.
Have you tried using the 'escape' function, before sending the data? This way, all special characters are encoded in safe characters. On the server-side, you can decode the value.
function Add(listbox) { ...
    // skipping initializing code for brevity
    var url = "/Edit/AddCommand/" + escape(cmd);
    $.post(url);
}
Use JavaScript escaping; it does URL encoding.
Javascript encoding
Then in C# you can simply decode it.
It will look like this:
function Add(listbox) { ...
    // skipping initializing code for brevity
    var url = "/Edit/AddCommand/" + escape(cmd);
    $.post(url);
}
Have you tried just wrapping your cmd variable in a call to escape()?
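One caveat worth checking before relying on escape(): it is deprecated and leaves a few characters, including /, untouched, whereas encodeURIComponent percent-encodes them. A quick sketch follows (the first sample value comes from the question, the second is made up; whether the server accepts an encoded slash inside a path segment is a separate matter that is not shown here):

```javascript
const cmd = 'Cats/Dogs';

console.log(escape(cmd));              // slash is left as-is
console.log(encodeURIComponent(cmd));  // slash becomes %2F

// Sending a value as a query-string parameter instead of a path segment:
const url = '/Edit/AddCommand?cmd=' + encodeURIComponent('Cats & Dogs?');
console.log(url);
```

Putting the value in the POST body or the query string sidesteps the routing problem, since the special characters never become part of the path.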
You could pass the details as a query string. At the moment I'm guessing your action looks like:
public virtual ActionResult AddCommand( string id )
you could change it to:
public virtual ActionResult AddCommand( string cmd )
and then in your JavaScript call:
var url = "/Edit/AddCommand?cmd=" + encodeURIComponent(cmd);
That way the value travels as a query-string parameter rather than a path segment, so characters like '/' no longer interfere with routing.
A better way would be to pass the database item's ID rather than a string. This would probably perform better for your DB as well.