Extracting data from JavaScript (Python Scraper)

Extracting data from JavaScript (Python Scraper) - javascript

I'm currently using a fusion of urllib2, pyquery, and json to scrape a site, and now I find that I need to extract some data from JavaScript. One thought would be to use a JavaScript engine (like V8), but that seems like overkill for what I need. I would use regular expressions, but the expression for this seems way to complex.
JavaScript:
(function(){DOM.appendContent(this, HTML("<html>"));;})
I need to extract the <html>, but I'm not entirely sure how to do so. The <html> itself can contain basically every character under the sun, so [^"] won't work.
Any thoughts?

Why regex? Can't you just use two substrings as you know how many characters you want to trim off the beginning and end?
string[42:-7]
As well as being quicker than a regex, it then doesn't matter if quotes inside <html> are escaped or not.

If every occurance of " inside the html code would be escaped by using \" (it is a JavaScript string after all), you could use
HTML\("((?:\\"|.)*?)"\)
to get the parameter to HTML into the first capturing group.
Note that this Regex is not yet escaped to be a Javascript String itself.

Related

Escape characters in jQuery variables not working

I'm having issues with escaping characters (namely period) found in variables when using selectors in jQuery. I was going to type this all out, but it was just easier taking a screenshot of my console window in Chrome.
It looks like the variables and the clear text versions match up. I expect $('#'+escName) to return the div, just like $('#jeffrey\\.lamb') returns a div. It does not. Why?

You have to think in terms of the individual parsers that will be examining your string values. The very first one, of course, is the JavaScript parser itself. Backslash characters have a meaning in the string grammar, so if you want a single backslash in a string it needs to be doubled.
After the string is parsed from the source code into an internal string value, the next thing that'll pay attention to its contents (in this case) is the CSS selector evaluator (either Sizzle or the native querySelector code; not sure which in the case of strings with escapes like this). That code only needs one backslash to quote the . in order that it not be interpreted as introducing a class name match.
Thus, escName = "jeffrey\\.lamb"; is all you need in this case.

How can I remove escaping from a RegExp pattern?

I'm trying to simplify input for a particular regex for my users. A simple example of the regex might be
\b(C|C\+\+|Java)\b
I'm now giving the user the option of appending another branch at the end of the regex by inputting the raw string into a <input type="text"> field. The branch will be interpreted literally, so I need to escape it. I've used https://stackoverflow.com/a/2593661/785663 to get RegExp.quote to do this. I then store the complete regex in a database.
Now, when I retrieve the regex from the database and split it back up and display the branches to the user, I need to remove all the escape characters again. Is there some pre-made function for this or do I need to roll my own?
Yes, I know I ought to replace this with a list of strings to search for. But this only a part of a larger (regex based) picture.

The optimal solution is to change your design: store the unescaped regex, then only escape it when you actually use it. That way you don't have to worry about this messy business of converting it back and forth all the time.
If you use this regex a lot and are worried about the overhead of having to escape it all the time, then store both the unescaped and escaped versions. Update both whenever the user makes a change.
p.s. Allowing user-entered regexes may make your site vulnerable to attack. (Update: Though in this case it is less likely to be a problem, since you are only allowing literal strings)

Highlighting HTML parts matching a perl regex - do it in server side perl or client side javascript?

A Perl CGI application is providing a search function. The application writes matching snippets to the HTML page. Now I would like to highlight the matches inside the snippets. I could use something like
s/($searchregex)/<span class="highlight">$1<\/span>/gi
to highlight the matches. This is working fine for text only cases, but breaks sometimes with snippets containing itself HTML tag, e.g. for links or images with references. In failing cases the above replacement is destroying the HTML links by inserting the span tag inside the href value.
At the moment I am seeing three possible solutions:
Write a regex that is not replacing matches inside of html tags, e.g. inside <>. I am not aware how to write a replacement regex for this case. Is there a perl regex to allow this replacement and how does it look like?
Write a regex that replaces all wrong replacements of the above replacement. This would fix the wrong span tags inside the href.
Use Javascript to highlight the matches inside the resulting DOM tree. Possible ways using jQuery are outlined in highlight html with matching text. Even normal Javascript may be enough JavaScript’s Regular Expression Flavor. There are special jQuery plugins for highlighting highlight regular expressions , too. I am new to Javascript so some more advise is appreciated, too.
What is the preferable solution? The best way would to it as 1. - but that seems not possible. So the remaining question is: Do the work in an ugly way on the server side or introduce Javascript to solve the problem in a cleaner way on the client side.

in perl with a lookahead after pattern
s/($searchregex)(?=[^>]*<)/<span class="highlight">$1<\/span>/gi
or shorter
s/$searchregex(?=[^>]*<)/<span class="highlight">$&<\/span>/gi
but maybe you will need to read the whole file in a string or change the input record separator ($/) to '<', because the regexp matches the pattern if it's followed by a sequence of any character except '>' and by '<' because will not match if ($/="\n" and there is a newline between pattern and next '<'.

You could use an HTML parser on the server side, which is the correct tool for the job you are doing.
Or you could do it with javascript as you say, which I prefer myself as it is more versatile, and could lead to more interactivity, although you would probably be facing a similar issue to what you are facing now (just that you have moved it to the client side).
It is actually a more complex question than it first appears. Without more information, it is impossible to try to come up with a better solution.
One good solution would be to traverse the DOM tree and match against each text node, but you have a problem then that you would not match text that spans several text nodes - for example "John the Con Johnson" would not match the search for "John the Con" as they would be in separate nodes. This might or might not be a problem for you, depending on your use case.

escape exactly what in javascript

Being a newbie in javascript I came to a situation where I need more information on escaping characters in a string.
Basically I know that in order to escape " I need to replace it with \" but what I don't know is for which characters I need to escape a particular string for. Is there a list of these "characters to escape"? or is it any character that is not a-zA-Z0-9 ?
In my situation, I don't have control over the content that is being displayed on my page. Users enter some text and save it. I then use a webservice to extract them from the database, build a json array of objects, then iterate the array when I need to display them. In this case, I have - naturally - no idea of what the text the user has entered and therefore for what characters I need to escape. I also use jQuery for this specific project (just in case it has a function I am not aware of, to do what I need)
Providing examples would be appreciated but I also want to learn the theory and logic behind it.
Hope someone can be of any help.

There's no need to escape everything that's not a-zA-Z0-9, take a look at this example:
http://www.c-point.com/javascript_tutorial/special_characters.htm
You may also want to check out this site which holds information about escaping string, especially URLs, etc. etc.
http://www.the-art-of-web.com/javascript/escape/

Finding beginning and end quotations

I'm starting to write a code syntax highlighter in JavaScript, and I want to highlight text that is in quotes (both "s and 's) in a certain color. I need it be able to not be messed up by one of one type of quote being in the middle of a pair of the other quotes as well, but i'm really not sure where to even start. I'm not sure how I should go about finding the quotes and then finding the correct end quote.

Unless you're doing this for the challenge, have a look at Google Code Prettify.
For your problem, you could read up on parsing (and lexers) at Wikipedia. It's a huge topic and you'll find that you'll come upon bigger problems than parsing strings.
To start, you could use regular expressions (although they rarely have the accuracy of a true lexer.) A typical regular expression for matching a string is:
/"(?:[^"\\]+|\\.)*"/
And then the same for ' instead of ".
Otherwise, for a character-by-character parser, you would set some kind of state that you're in a string once you hit ", then when you hit " that is not preceded by an uneven amount of backslashes (an even amount of backslashes would escape eachother), you exit the string.

You can find quotes using regular expressions but if you're writing a syntax highlighter then the only reliable way is to step through the code, character by character, and decide what to do from there.
E.g. of a Regex
/("|')((?:\\\1|.)+?)\1/g
(matches "this" and 'this' and "thi\"s")

use stack.. if unmatched quote found push it.. if match found pop

I did it with a single regular expression in php using backwards references. JS does not support it and i think that's what you need if you really want to detect undefined backslashes.

We Keep Coding

JavaScript is the programming language of the Web.