I've searched high and low, but all I can find is questions (and answers) about scraping content that is dynamically generated by Javascript.
I'm putting together a simple tool to audit client websites by finding text in the HTML source and comparing it to a dictionary.
For example, "ga.js" = Google Analytics.
However, I'm noticing that comparable tools are picking up scripts that mine is not... because they don't actually appear in the HTML source. I can only see them through Chrome's Developer Tools:
Here's a capture from Chrome, since I can't post the image...
Those scripts, such as the "reflektion_b.js", are nowhere to be found in the HTML source.
My script, as it stands now, is using urllib2 (urlopen) to fetch and then BeautifulSoup to parse. Can anyone help me out re: getting the list of script sources? Or maybe even with being able to read them as well (not 100% necessary, but could come in handy)?
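To illustrate, the core of what I have now is roughly this (simplified; the URL is a placeholder):

import urllib2
from bs4 import BeautifulSoup

url = "http://example.com"  # placeholder for the client site
soup = BeautifulSoup(urllib2.urlopen(url).read(), "html.parser")

# Only scripts present in the raw HTML show up here; anything injected
# later by JavaScript (like reflektion_b.js) never appears.
for script in soup.find_all("script", src=True):
    print(script["src"])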
Any help would be much appreciated.
You need to use a headless browser with a Python API. Ghost will probably do what you want.
http://jeanphix.me/Ghost.py/
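A rough sketch of the idea, following Ghost.py's older documented usage (the API and attribute names have changed between versions, so treat them as assumptions):

from ghost import Ghost

ghost = Ghost()
# open() loads the page in a real WebKit engine, so the JavaScript runs and
# dynamically injected scripts are requested just as in a normal browser.
page, resources = ghost.open("http://example.com")  # placeholder URL

for resource in resources:
    print(resource.url)  # every resource the page requested, including injected .js files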
"Content that is dynamically generated by JavaScript" implies that the JavaScript in question is interpreted, which involves a JavaScript interpreter.
You probably need an instance of a web view with a mechanism to intercept requests, to figure out which JavaScript is being loaded in the page.
A little info:
When 'inspected' (Google Chrome), the website displays the information I need (namely, a simple link to a .pdf).
When I cURL the website, only a part of it gets saved. This, coupled with the fact that there are functions and <script> tags, leads me to believe that JavaScript is the culprit (I'm honestly not 100% sure, as I'm pretty new at this).
I need to pull this link periodically, and it changes each time.
The question:
Is there a way for me, in bash, to run this javascript and save the new HTML code it generates to a file?
Not trivially.
Typically, for that approach, you need to:
Construct a DOM from the HTML
Execute the JavaScript in the context of that DOM while resolving URLs relative to the URL you fetched the HTML from
There are tools which can help with this, such as Puppeteer, PhantomJS, and Selenium, but they generally lend themselves to being driven with beefier programming languages than bash.
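For example, a short Selenium script in Python (a sketch; it assumes Firefox and geckodriver are installed, and the URL is a placeholder) can be invoked from bash and writes the rendered HTML to a file:

# save_rendered.py -- run from bash as: python save_rendered.py
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")  # run Firefox without opening a window
driver = webdriver.Firefox(options=options)
try:
    driver.get("https://example.com/page-with-js")  # placeholder URL
    with open("rendered.html", "w") as f:
        f.write(driver.page_source)  # the HTML after the scripts have run
finally:
    driver.quit()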
As an alternative, you can look at reverse engineering the page. It gets the data from somewhere. You can probably work out the URLs (the Network tab of a browser's developer tools is helpful there) and access them directly.
If you want to download a web page that generates itself with JavaScript, you'll need to execute that JavaScript in order to load the page. To achieve this, you can use libraries like Puppeteer with Node.js. There are a lot of other libraries, but that's the most popular.
If you're wondering why this happens, it's because web developers often use frameworks like React, Vue, or Angular (to name the most popular ones), which generate JavaScript output that isn't executed by common HTTP-requesting libraries.
I've been recently scraping some JS-driven pages. As far as I know, there are two ways of loading the content: statically (ready-to-use HTML pages) and dynamically (building HTML in place from raw data). I know about XHR, and I've successfully intercepted some.
But now I've run into a strange thing: the site loads content dynamically after the page has fully loaded, but there are no XHRs. How can that be?
My guess is: the inner JS files are making some hidden requests (which transfer the data) and building the page based on the responses.
What should I do?
P.S. I'm not interested in selenium-based solutions - they are well-known, but slow and inefficient.
P.P.S. I'm a back-end developer mostly, so I'm not familiar with JS.
Nowadays you do not need to use Selenium for scraping any more. The Chrome browser can now be used in headless mode, and you can then run your scraping script after the page is fully loaded.
There is a simple guide:
https://developers.google.com/web/updates/2017/04/headless-chrome
There is a Node.js library for driving it (chrome-remote-interface), but the downside is that I could not find a Python one.
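Even without a dedicated Python library, you can drive the command-line interface from Python; a minimal sketch (the Chrome binary name varies by platform, e.g. google-chrome, chromium-browser, or chrome, and the URL is a placeholder):

import subprocess

url = "https://example.com/js-driven-page"  # placeholder URL
# --dump-dom prints the DOM after scripts have run (see the guide above).
result = subprocess.run(
    ["google-chrome", "--headless", "--disable-gpu", "--dump-dom", url],
    capture_output=True, text=True, check=True,
)
with open("rendered.html", "w") as f:
    f.write(result.stdout)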
I am making a web crawler that tries to find security issues in web sites (something like w3af), for example finding XSS, etc.
But I have a problem: my crawler tries to find the forms in the pages using regex (I know regex can't parse HTML, but I'm only parsing the form tag). Anyway, I found that some pages (like Google) have the entire page encoded in some sort of ugly JS code, and then nothing works. In addition, when I inject scripts into the URLs and want to check whether they execute (reflected XSS), I need something that could tell me if (for example) an alert showed up.
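To be concrete, the form matching I'm doing is roughly this (simplified sketch with inline sample HTML):

import re

html_source = """<html><body>
<form action="/login" method="post"><input name="user"></form>
</body></html>"""  # in the crawler this is the fetched page text

# Naive extraction of <form ...>...</form> blocks from the raw HTML.
# It finds nothing on pages whose markup is built entirely by JavaScript,
# which is exactly the problem described above.
form_pattern = re.compile(r"<form\b[^>]*>.*?</form>", re.IGNORECASE | re.DOTALL)
for form in form_pattern.findall(html_source):
    print(form)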
So I need a module that will take the source code of a page and be able to give me the rendered page, or tell me whether an alert occurred.
Does anyone know something that might work? (I've found some solutions such as Selenium, but it requires a browser and is therefore way too slow...)
thanks ahead!
I have had a lot of trouble trying to find information or possible examples of this being done.
I would like to render HTML in a window, take the JS from the HTML, and output that to Python code.
The HTML is local and there will never be an internet connection for it to run off. Every time I search for possible answers, everything seems to relate back to using some small lightweight browser, which in my case isn't an option to use.
For some more detail, I am running Selenium WebDriver (Python) and Iceweasel (Raspberry Pi B+) to get the value of an element from an HTML page. So using a different browser isn't possible, as the lightweight ones are not compatible with Selenium. Using Selenium and Iceweasel takes in excess of 2 minutes to fully load up, which for what I need it for is far too long.
I had a look into Awesomium, but I think it lacks compatibility with the Raspberry Pi.
My other thought was to use OpenGL to render the HTML, but I found no easily explained examples.
Currently I'm looking into LibRocket, Berkelium, and QWebView, but again I don't think they will have anything I need with the compatibility I need.
EDIT:
Basically, I want a canvas capable of rendering HTML to a screen using X11. On the HTML there will be buttons. I want those buttons to perform actions inside a Python script.
The way I see it, a browser is basically a toolbar, a canvas, and a lot of networking. I want to strip away as much of that as possible and be left with just the canvas.
First, go to the directory that has the local webpage. Then run python -m SimpleHTTPServer 8000. This will "render the HTML" in a window. Then view the source and paste the JavaScript into a Python file. Alternatively, if you would like to automate piping the JavaScript into an output file, you can use Beautiful Soup to select the JavaScript and write it to any file you want. Then manipulate it in Python however you want.
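A rough sketch of the Beautiful Soup part (the file names are placeholders):

from bs4 import BeautifulSoup

with open("page.html") as f:  # the local HTML file
    soup = BeautifulSoup(f.read(), "html.parser")

with open("extracted.js", "w") as out:  # placeholder output name
    for script in soup.find_all("script"):
        if script.string:  # skip <script src=...> tags with no inline code
            out.write(script.string + "\n")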
As part of a job I'm doing on a web site I have to copy a few thousand lines of text from several pages of the old site and paste them into the HTML for the new site. The long and painstaking way of going to the old page and copying the many lines of text and then going to my editor and pasting it there line by line is getting really old. I thought of using injected JavaScript to do this but I'm not quite sure where to start. Thanks in advance for any help.
Here are links to a page of the old site and a page of the new site. As you can see in the tables on each page it would take a ton of time to copy it all manually.
Old site: http://temp.delridgelegalformscom.officelive.com/macorporation1.aspx
New Site: http://ezwebsites.us/delridge/macorporation1.html
In order to do this type of work, you need two things: a way of injecting or executing your script on that page, and a good working knowledge of the Document Object Model for the target site.
I highly recommend using the Firefox plugin FireBug, or some equivalent tool in your browser of choice. FireBug lets you execute commands from a JavaScript console, which will help. Hopefully the old site does not have a bunch of <FONT>, <OBJECT>, or <IFRAME> tags, which would make this even more tedious.
Using a library like Prototype or jQuery will also help with selecting the parts of the website you need. You can submit results using jQuery like this:
$(function() {
  var snippet = $('#content-id').html();
  $.post('http://myserver/page', { content: snippet });
});
A problem you will very likely run into is the "same-origin policy" many browsers enforce for JavaScript. So if your JavaScript was loaded from http://myserver as in this example, you would be OK.
Perhaps another route you can take is to use a scripting language like Ruby, Python, or (if you really have patience) VBA. The script can automate the list of pages to scrape and a target location for the information. It can just as easily package it up as a request to the new server if that's how pages get updated. This way you don't have to worry about injecting the JavaScript and hoping all works without problems.
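For instance, a rough Python sketch of that route (the table selector is a guess; adjust it to the old site's actual markup):

import urllib2
from bs4 import BeautifulSoup

old_page = "http://temp.delridgelegalformscom.officelive.com/macorporation1.aspx"
soup = BeautifulSoup(urllib2.urlopen(old_page).read(), "html.parser")

# Print the text of every table cell; tweak the selector to match the old markup.
for cell in soup.select("table td"):
    print(cell.get_text(strip=True))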
I think you need Greasemonkey: http://www.greasespot.net/