web crawler HTML and JS "understanding" with python - javascript

I am making a web crawler that tries to find security issues in web sites (something like w3af), for example finding XSS and so on.
But I have a problem: my crawler tries to find the forms in the pages using regex (I know regex can't parse HTML, but I'm only parsing the form tag). Anyway, I found that some pages (like Google) have the entire page generated by some sort of ugly JS code, and then nothing works. In addition, when I inject scripts into the URLs and want to check whether they execute (reflected XSS), I need something that can tell me if (for example) an alert showed up.
So I need a module that will take the source code of a page and be able to give me the rendered page, or tell me whether an alert occurred.
Does anyone know something that might work? (I've found some solutions such as Selenium, but it requires a browser and is therefore way too slow...)
Thanks in advance!
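A rough sketch of the alert-detection part, assuming Selenium with headless Chrome is acceptable despite the speed concern; the target URL and payload below are made-up examples:

# Rough sketch only: Selenium with headless Chrome; the URL and payload are placeholders.
import urllib.parse

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

opts = Options()
opts.add_argument("--headless")
driver = webdriver.Chrome(options=opts)

# Hypothetical reflected-XSS probe: inject a payload through a query parameter.
payload = "<script>alert('xss-probe')</script>"
driver.get("http://example.com/search?q=" + urllib.parse.quote(payload))

try:
    # Wait briefly for an alert; if one appears, the payload executed.
    WebDriverWait(driver, 3).until(EC.alert_is_present())
    alert = driver.switch_to.alert
    print("Alert fired with text:", alert.text)
    alert.accept()
except TimeoutException:
    print("No alert appeared; the payload probably did not execute.")
finally:
    driver.quit()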

Related

Scrape Javascript Files with Python

I've searched high and low, but all I can find is questions (and answers) about scraping content that is dynamically generated by Javascript.
I'm putting together a simple tool to audit client websites by finding text in the HTML source and comparing it to a dictionary.
For example, "ga.js" = Google Analytics.
However, I'm noticing that comparable tools are picking up scripts that mine is not... because they don't actually appear in the HTML source. I can only see them through Chrome's Developer Tools:
Here's a capture from Chrome, since I can't post the image...
Those scripts, such as the "reflektion_b.js", are nowhere to be found in the HTML source.
My script, as it stands now, is using urllib2 (urlopen) to fetch and then BeautifulSoup to parse. Can anyone help me out re: getting the list of script sources? Or maybe even with being able to read them as well (not 100% necessary, but it could come in handy)?
Any help would be much appreciated.
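A sketch of the static part, assuming requests and BeautifulSoup (urllib2's urlopen would work the same way); note it only lists script sources that actually appear in the HTML source, not ones injected later by JavaScript:

# Lists the <script src> values declared in the markup. Dynamically injected
# scripts (like "reflektion_b.js") will NOT show up here. The URL is a placeholder.
import requests
from bs4 import BeautifulSoup

html = requests.get("http://example.com", timeout=30).text
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all("script", src=True):
    print(tag["src"])   # external script URLs present in the static HTML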
You need a headless browser with a Python API. Ghost will probably do what you want.
http://jeanphix.me/Ghost.py/
"Content that is dynamically generated by Javascript" implies that the JavaScript in question is interpreted, which involves a JavaScript interpreter.
You probably need an instance of a web view with a mechanism to intercept requests, to figure out which JavaScript is being loaded in the page.
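A sketch of that headless-browser route, using Selenium with headless Chrome here rather than Ghost.py (either plays the same role): let the page run its JavaScript, then ask the live DOM which scripts ended up loaded.

# After the page has executed its JavaScript, document.scripts in the live DOM
# also contains dynamically injected <script> tags. The URL is a placeholder.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless")
driver = webdriver.Chrome(options=opts)
try:
    driver.get("http://example.com")
    srcs = driver.execute_script(
        "return Array.from(document.scripts).map(s => s.src).filter(Boolean);"
    )
    for src in srcs:
        print(src)
finally:
    driver.quit()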

Ruby on Rails code for taking a screenshot of different sections of a page

I am creating a Ruby on Rails app. A specific page in my app is divided into several sections by <div> tags. Each <div> includes a combination of text (using different fonts), symbols and mathematical formulas. I use MathJax and a few other JavaScript libraries to display them correctly, and everything works great on my computer. However, JavaScript is not enabled in everyone's browser, and some of the JavaScript might not load correctly in other people's browsers.
One solution I was thinking of is this: after all the JavaScript has finished processing and the page is displayed correctly on my computer (server), I use some code to generate a snapshot of each <div> as a PNG and send them to the server (for example, I click a <button> tag on the page to activate this code once I'm happy that what is displayed is correct). Then I'll save these images in the database and serve them, so the page will look the same on everyone's computer regardless of whether JavaScript is enabled, what browser they're using, etc. Is anyone aware of code or a command that I can use?
Please note: currently, after the page is loaded, JavaScript processes the HTML content and produces the correct display. Also, I don't want to take a snapshot of the whole page, but of each <div> separately.
Thanks a lot.
Well, this is a client-side problem; here is a JavaScript tool that will work for you: http://experiments.hertzen.com/jsfeedback/
You've got a bit of a problem there. Javascript is not executed until the page has finished loading, i.e. all of the information has already been sent to the client. You're not executing javascript at the server level, so you wouldn't be able to do that kind of processing at all. If they have javascript disabled, your code will never get executed.
You could generate the images using ImageMagick or something similar; I know PHP has bindings for that. There are a couple of extremely messy solutions, like rendering the page in a browser on the server side with something like Selenium, but I definitely wouldn't recommend doing that. Overall, it depends on the platform on which you're developing, but most major languages have support for generating images without relying entirely on client-side JavaScript.
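For completeness, a rough sketch of that server-side-browser route using Selenium with headless Chrome; the URL, the wait, and the div id are placeholders:

# Sketch only: render the page in a headless browser, then screenshot one element.
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

opts = Options()
opts.add_argument("--headless")
driver = webdriver.Chrome(options=opts)
try:
    driver.get("http://localhost:3000/formulas")       # placeholder URL for the Rails page
    time.sleep(5)                                       # crude wait for MathJax to finish typesetting
    section = driver.find_element(By.ID, "section-1")   # hypothetical <div id="section-1">
    section.screenshot("section-1.png")                 # PNG of just that element
finally:
    driver.quit()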

Are search-bots or spam-bots able to emulate/trigger JavaScript events?

Are search bots or spam bots able to emulate/trigger JavaScript events while they read the page?
No, because search bots fetch a static HTML stream. They don't run any of the initialization functions like init() or myObj.init() in your JavaScript code. They don't load any external libraries like jQuery, nor execute the $(document).ready code or any of the standard .click() listeners. So unless a search bot author has a specific reason to intentionally build their bot to trigger or execute the <script> blocks on a page, they usually won't run JavaScript code.
I've written a search bot. All that I care about is extracting the links & text from the page. However, I don't want to run someone else's client-side calendar component nor video player component. I don't want that JS code to be inserted into my database, where it could end up on the Search Engine Results Page (SERP). So there is no reason to run an eval() command on any code in the <script> blocks, nor trigger any of the initialization events in the JS layer.
When search bots load the HTML DOM, there are usually embedded external .js files in them. So to execute the JS would require parsing out the strings for multiple .js files, then building a concatenator for those files & then trying to execute everything that's been downloaded. That's extra work for a search bot author, for no net gain at all. We simply don't want that JS code to appear anywhere in our SERPs. Otherwise, seeing JS code on the SERP looks like a bad search result. However, bots can see content in <script> tags & are only looking for links to crawl. So that may be why people start to think that bots can execute JavaScript, but they are only really parsing them for their text links.
Here’s someone who makes the case that Google is loading pages in a headless WebKit when crawling them to get a chance to index AJAX content and for other reasons. Search bots don’t generally submit forms though.
I’ve taken a look at your site and the protection is entirely client-side. Since an HTML form really is just a description of what key/values to submit to some URL, there’s no reason anyone couldn’t just POST this data with a bot.
Example:
POST /contact
/* ... */
fullname=SO+test&email=test%40example.com&reason=test&message=test
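The same request, scripted (a sketch using Python's requests library; the host name is a placeholder):

# Scripted version of the POST above. The field names come from the example; the
# host is a placeholder. No JavaScript ever runs, so client-side-only protection is skipped.
import requests

resp = requests.post(
    "http://example.com/contact",
    data={
        "fullname": "SO test",
        "email": "test@example.com",
        "reason": "test",
        "message": "test",
    },
)
print(resp.status_code)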
Also, and this is important, you are penalising legitimate visitors this way. There are all kinds of reasons why JavaScript could be blocked, fail to load, or simply not work.

Check if viewed source not render

Is there a way to check whether the current request was for the page source (HTML) rather than the actual site?
And if not (which I think is the case), is there a way I could somehow "parse" this out of the request parameters, or maybe timing, or something?
I need this to serve the real source when it is being viewed, and a trimmed one when it is being rendered.
Is there a way to check whether the current request was for the page source (HTML) rather than the actual site?
No. The request is always for the page source. There is no way to distinguish what the browser is going to do with it.
Also, many browsers (like IE) can't make a request for "view source" at all - you always load the whole site, render it, and then do a "view source".
Workaround ideas: (All terribly flawed)
Add some JavaScript to the page making an Ajax call. If the call is made, the page was rendered.
Add some image resource to the page. If it's loaded, the page was rendered.
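A minimal sketch of the beacon idea, assuming a Flask app (the routes and IDs are placeholders):

# Sketch of the beacon workaround. The HTML embeds a beacon image; if /beacon.gif
# is requested shortly after /page, the page was (probably) rendered rather than
# only fetched for its source.
from flask import Flask, request

app = Flask(__name__)
rendered = set()

@app.route("/page")
def page():
    return '<html><body>Hello<img src="/beacon.gif?id=abc123"></body></html>'

@app.route("/beacon.gif")
def beacon():
    rendered.add(request.args.get("id"))   # remember which page views got rendered
    return ("", 204)                       # a real 1x1 GIF could be returned instead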
If this is to protect your HTML source code, forget it and go do something productive instead. :)
It sounds like you want to send optimized/compressed/bandwidth-friendly code when it's being viewed in the browser, and readable/understandable/indented code when the user wants to view the source, right?
Unfortunately, that's not possible, and I suspect it would be counter-productive, since it would prevent you from debugging problems if your JavaScript minifier or HTML compressor were introducing them. It would be far better to use something that reintroduces whitespace and indentation for readability, for example the View Source Chart extension for Firefox. (I don't know what options there are for unminifying JavaScript code.)

Using injected JavaScript to copy text from a web page

As part of a job I'm doing on a web site I have to copy a few thousand lines of text from several pages of the old site and paste them into the HTML for the new site. The long and painstaking way of going to the old page and copying the many lines of text and then going to my editor and pasting it there line by line is getting really old. I thought of using injected JavaScript to do this but I'm not quite sure where to start. Thanks in advance for any help.
Here are links to a page of the old site and a page of the new site. As you can see in the tables on each page it would take a ton of time to copy it all manually.
Old site: http://temp.delridgelegalformscom.officelive.com/macorporation1.aspx
New Site: http://ezwebsites.us/delridge/macorporation1.html
In order to do this type of work, you need two things: a way of injecting or executing your script on that page, and a good working knowledge of the Document Object Model for the target site.
I highly recommend using the Firefox plugin Firebug, or some equivalent tool in your browser of choice. Firebug lets you execute commands from a JavaScript console, which will help. Hopefully the old site does not have a bunch of <FONT>, <OBJECT> or <IFRAME> tags, which would make this even more tedious.
Using a library like Prototype or jQuery will also help with selecting the parts of the website you need. You can submit results using jQuery like this:
$(function() {
  var snippet = $('#content-id').html();                // grab the markup inside the target element
  $.post('http://myserver/page', {content: snippet});   // send it to your own server
});
A problem you will very likely run into is the "same-origin policy" many browsers enforce for JavaScript. So if your JavaScript was loaded from http://myserver as in this example, you would be OK.
Perhaps another route you can take is to use a scripting language like Ruby, Python, or (if you really have patience) VBA. The script can automate the list of pages to scrape and a target location for the information. It can just as easily package it up as a request to the new server if that's how pages get updated. This way you don't have to worry about injecting the JavaScript and hoping all works without problems.
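A rough sketch of that scripting route in Python, assuming requests and BeautifulSoup and that the old pages keep their text in ordinary <table> rows (the selector is a guess and would need adjusting to the real markup):

# Sketch only: pull the table text out of each old page so it can be pasted or
# converted into the new site's HTML.
import requests
from bs4 import BeautifulSoup

OLD_PAGES = [
    "http://temp.delridgelegalformscom.officelive.com/macorporation1.aspx",
    # ...add the other old pages here
]

for url in OLD_PAGES:
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    for row in soup.select("table tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:
            print("\t".join(cells))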
I think you need Greasemonkey: http://www.greasespot.net/
