I've recently been scraping some JS-driven pages. As far as I know, there are two ways of loading content: static (ready-to-use HTML pages) and dynamic (building the HTML in place from raw data). I know about XHR, and I've been successfully intercepting some requests.
But now I've run into something strange: a site loads its content dynamically after the page has fully loaded, but there are no XHRs. How can that be?
My guess is that the page's JS files are making some hidden requests (which transfer the data) and building the page from the responses.
What should I do?
P.S. I'm not interested in selenium-based solutions - they are well-known, but slow and inefficient.
P.P.S. I'm a back-end developer mostly, so I'm not familiar with JS.
Nowadays you no longer need Selenium for scraping. The Chrome browser can now be run in headless mode, and you can then run a scraping script after the page has fully loaded.
There is a simple guide:
https://developers.google.com/web/updates/2017/04/headless-chrome
There is a Node.js library for driving it (chrome-remote-interface), but the downside is that I could not find a Python one.
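Even without a dedicated Python library, headless Chrome can be started from Python as an ordinary subprocess. The following is a minimal sketch, not a full driver: it relies on the --headless, --disable-gpu and --dump-dom flags described in the guide above. The executable names in CHROME_CANDIDATES and the helper names are assumptions for illustration, and content that arrives long after page load may not be captured this way.

```python
import shutil
import subprocess

# Assumption: the executable name differs per platform, so try a few common ones.
CHROME_CANDIDATES = ("google-chrome", "chromium", "chromium-browser", "chrome")


def build_dump_dom_command(url, chrome_path="google-chrome"):
    """Return the argv list that asks headless Chrome to print the page's DOM."""
    return [chrome_path, "--headless", "--disable-gpu", "--dump-dom", url]


def find_chrome():
    """Return the first Chrome/Chromium executable found on PATH, or None."""
    for name in CHROME_CANDIDATES:
        path = shutil.which(name)
        if path:
            return path
    return None


def render_page(url, timeout=60):
    """Run headless Chrome on the URL and return whatever it prints to stdout."""
    chrome = find_chrome()
    if chrome is None:
        raise RuntimeError("no Chrome/Chromium executable found on PATH")
    result = subprocess.run(
        build_dump_dom_command(url, chrome),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

Calling render_page("https://example.com/") should return the HTML as Chrome sees it after loading the page, which can then be parsed like any static document.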
A little info:
When 'inspected' (Google Chrome), the website displays the information I need (namely, a simple link to a .pdf).
When I cURL the website, only a part of it gets saved. This coupled with the fact that there are functions and <script> tags leads me to believe that javascript is the culprit (I'm honestly not 100% sure, as I'm pretty new at this).
I need to pull this link periodically, and it changes each time.
The question:
Is there a way for me, in bash, to run this javascript and save the new HTML code it generates to a file?
Not trivially.
Typically, for that approach, you need to:
Construct a DOM from the HTML
Execute the JavaScript in the context of that DOM while resolving URLs relative to the URL you fetched the HTML from
There are tools which can help with this, such as Puppeteer, PhantomJS, and Selenium, but they generally lend themselves to being driven with beefier programming languages than bash.
As an alternative, you can look at reverse engineering the page. It gets the data from somewhere. You can probably work out the URLs (the Network tab of a browser's developer tools is helpful there) and access them directly.
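If the Network tab reveals the URL the data comes from, the link can often be pulled out of that response without running any JavaScript. Here is a rough sketch, assuming the response body (HTML or JSON) contains the .pdf URL as a quoted string. The function names and the regular expression are illustrative, and the URL passed to fetch_text would be whatever you find in the developer tools.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Matches a quoted string that ends in .pdf, optionally followed by a query string.
PDF_PATTERN = re.compile(r"""["']([^"']+?\.pdf(?:\?[^"']*)?)["']""", re.IGNORECASE)


def fetch_text(url, timeout=30):
    """Download a URL and decode the body using the declared charset (UTF-8 fallback)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def find_pdf_links(body, base_url):
    """Return absolute .pdf URLs found in the body, in order, without duplicates."""
    links = []
    for match in PDF_PATTERN.finditer(body):
        absolute = urljoin(base_url, match.group(1))
        if absolute not in links:
            links.append(absolute)
    return links
```

A periodic job (cron, for example) could call find_pdf_links(fetch_text(url), url) and save the first result.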
If you want to download a web page that generates itself with JavaScript, you'll need to execute that JavaScript in order to load the page. To achieve this, you can use a library such as Puppeteer with Node.js. There are a lot of other libraries, but that's the most popular.
If you're wondering why this happens, it's because web developers often use frameworks like React, Vue or Angular (to name the most popular ones), which ship the page as JavaScript that common HTTP request libraries do not execute.
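To see what a non-browser HTTP client actually receives from such a page, you can strip the scripts and look at what text is left. The sketch below uses only the standard library. The class and function names are made up for illustration, and the sample markup in the usage note is a typical application shell, not taken from any particular site.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring the contents of <script>, <style> and <title>."""

    SKIPPED = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside the skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a client sees when it does not run any JavaScript."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For a shell such as `<div id="root"></div><script src="/static/bundle.js"></script>`, visible_text returns an empty string: all the content would be created later by the script, which a plain HTTP library never runs.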
How can I screen scrape a multi page application? I want to do this using Javascript. Here are the approaches I have considered and the problems I have encountered.
Using the Fetch web API in a Node application to get the web pages
Problem: The web pages won't load properly when being fetched. I guess all javascript on the page does not run when the page is fetched.
Running JavaScript from the console
This is a very simple way to inject JavaScript straight into the document. But one problem is that opening the web page in a browser and pasting into the console is manual work. Another problem is that while this works for a single-page application, it becomes very cumbersome for multi-page applications.
What better approach exists that solves the problems I have encountered?
It depends on what you are doing. If you just want to get some data from some website, then injecting JS into the page is the way to go.
But as you said, it's manual work, from which I deduce you want to scrape the sites and maybe save the data. In this case a server-side script is better suited. To fix the problem with the JavaScript not being run, you can use things like PhantomJS or Horseman.
Take a look at this: https://medium.com/#designman/building-a-performant-web-scraper-in-node-js-5f4449674163
If you want to save website content (HTML, JS and CSS files, images) to the file system, you can take a look at the website-scraper package for Node.js: https://www.npmjs.com/package/website-scraper
It also has a plugin for PhantomJS, which allows it to handle single-page applications.
We have a web app whose content is generated by JavaScript. Can Google index those pages?
When we investigated this issue, we only found solutions on old pages about using "#!" in links.
In our app the links are like this:
domain.com/paris
domain.com/london
When we use these kinds of links, JavaScript populates the content.
Is it wise to use HTML snapshots, or do you have any other suggestions?
Short answer
Yes, they can crawl JavaScript-generated content, as long as you are using pushState.
Detailed answer
It depends on your setup. Google and Bing can crawl JavaScript- and AJAX-based content if you are using pushState. If you are, they will handle content coming from AJAX calls, updates to the page title or meta tags made with JavaScript, and similar changes.
Most frontend frameworks, like Angular, Ember or Backbone, already work with pushState, so in these cases you don't need to do anything. Check whatever system you are using to see how it handles this. If you are not using pushState, you will need to implement it on your own or use the whole _escaped_fragment_ HTML snapshot deal.
So if you use pushState then yes, search engines can crawl your page just fine. If you don't, then no: you will need to implement pushState or do HTML snapshots.
Bonus info - Unfortunately Facebook does not handle pushState, so the Facebook crawler needs either non-dynamic og tags or HTML snapshots.
"Generated by JavaScript" is ambiguous. That could mean that you are running a JS script on the server, or it could mean that you are making an AJAX call with a JS API. The difference appears to matter as far as Googlebot is concerned. But you don't have to take my word for it: there is empirical evidence of what Googlebot currently will and won't cache when it comes to JavaScript content, in the form of live experiments using both the XMLHttpRequest API and the Fetch API. So, as you can see, server-side rendering is still going to be the best way to go for SEO.
I've searched high and low, but all I can find is questions (and answers) about scraping content that is dynamically generated by Javascript.
I'm putting together a simple tool to audit client websites by finding text in the HTML source and comparing it to a dictionary.
For example, "ga.js" = Google Analytics.
However, I'm noticing that comparable tools are picking up scripts that mine is not... because they don't actually appear in the HTML source. I can only see them through Chrome's Developer Tools:
Here's a capture from Chrome, since I can't post the image...
Those scripts, such as the "reflektion_b.js", are nowhere to be found in the HTML source.
My script, as it stands now, uses urllib2 (urlopen) to fetch and then BeautifulSoup to parse. Can anyone help me out with getting the list of script sources? Or maybe even being able to read them as well (not 100% necessary, but could come in handy)?
Any help would be much appreciated.
You need to use a headless browser with a Python API. Ghost will probably do what you want.
http://jeanphix.me/Ghost.py/
"Content that is dynamically generated by Javascript" implies that the JavaScript in question is interpreted, which requires a JavaScript interpreter.
You probably need a web view instance with a mechanism to intercept requests, so you can figure out which JavaScript is being loaded in the page.
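Once a headless browser or web view has given you the rendered HTML, scripts that were injected at runtime usually show up as ordinary script elements, so the dictionary lookup from the question can run on that output. This is a sketch under that assumption, using only the standard library. The names are hypothetical, the signature table holds just the "ga.js" entry from the question, and scripts that remove their own element after loading would still be missed (that is where request interception is needed).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Filename fragment -> tool name; only the entry from the question is included.
SIGNATURES = {"ga.js": "Google Analytics"}


class ScriptSrcParser(HTMLParser):
    """Collect the src attribute of every <script> element as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:  # inline scripts have no src and are skipped
            self.sources.append(urljoin(self.base_url, src))


def script_sources(html, base_url):
    """Return the external script URLs found in the given (rendered) HTML."""
    parser = ScriptSrcParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.sources


def detect_tools(sources, signatures=SIGNATURES):
    """Return the sorted names of known tools whose fragment occurs in any URL."""
    found = set()
    for src in sources:
        for fragment, name in signatures.items():
            if fragment in src:
                found.add(name)
    return sorted(found)
```

Running the same functions on the raw source fetched with urlopen and on the rendered HTML, then comparing the two lists, shows exactly which scripts were added dynamically.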
I'm writing a web crawler (web spider) that crawls all links in a website.
My application is a Win32 app, written in C# with .NET Framework 3.5.
Now I'm using HttpWebRequest and HttpWebResponse to communicate with the web server.
I also built my own HTTP parser that can parse anything I want.
I find all links such as "href", "src", "action"... in the parse.
But I cannot solve one problem: simulating client script in the page (like JS and VBS).
For example, if a link like:
a href = "javascript:buildLink(1)"
... where buildLink(parameter) is a JavaScript function that builds a custom link based on the parameter.
Please help me solve this problem. How can I simulate JavaScript in this app? I can parse the HTML source code and extract all the JavaScript code to another file, but how do I simulate one of its functions?
Thanks.
Your only real option is to automate a browser. As other answers have said, you cannot reliably simulate browser javascript without having a complete DOM.
Fortunately, there are ways to automate the browser; check out Selenium.
It has a C# API, so you can control the browser from C#.
Use your .NET web crawler code to crawl the site. Whenever you encounter a href="javascript:... link, handle the page containing the link in Selenium:
Use the Selenium API to tell the browser to load the page.
Use the Selenium API to find all links on the page.
This way, your spider only uses Selenium when necessary (pages without javascript links can be handled by the browser-less spider code you already got). And since this is an embarrassingly parallel workload, you could easily have multiple Selenium processes running at the same time (either on one computer or on other computers).
But remember that href="javascript: is hardly the only way a page can have dynamic links. The more common case is probably that an onload or $(document).ready() script manipulates the DOM and adds links that way.
To catch that case (and others), the spider probably will have to use Selenium for all pages that have a <script> tag.
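The routing decision described above (plain HTTP for simple pages, a real browser for the rest) boils down to a cheap check on the fetched HTML. The question is about C#, so take this Python sketch only as an illustration of the logic: the names are hypothetical, and the rule used (any script tag or any javascript: URL means "use the browser") is the conservative heuristic from this answer, not a guarantee that the page really adds links dynamically.

```python
from html.parser import HTMLParser

# Attributes that the question's parser already inspects for links.
LINK_ATTRIBUTES = ("href", "src", "action")


class DynamicHintParser(HTMLParser):
    """Record whether a page has <script> tags or javascript: URLs."""

    def __init__(self):
        super().__init__()
        self.has_script = False
        self.javascript_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.has_script = True
        for name, value in attrs:
            if name in LINK_ATTRIBUTES and value:
                if value.strip().lower().startswith("javascript:"):
                    self.javascript_links.append(value.strip())


def needs_browser(html):
    """Return True if the page should be handed to the browser-based spider."""
    parser = DynamicHintParser()
    parser.feed(html)
    parser.close()
    return parser.has_script or bool(parser.javascript_links)
```

Pages for which needs_browser returns False can stay on the fast browser-less path, and the rest can be queued for the Selenium workers.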
You are basically pretending to be a browser, except that HttpWebRequest only does the networking stuff for you.
I would recommend using the IE web browser control and interop'ing into that from your C# application. That will allow you to run JavaScript, set variables, post, etc.
Here are some basic links I found after a search for "ie web browser control":
http://www.c-sharpcorner.com/UploadFile/mahesh/WebBrowserInCSMDB12022005001524AM/WebBrowserInCSMDB.aspx
http://support.microsoft.com/kb/313068
This is a problem which is not easily solved. You could consider taking one of the existing JavaScript implementations and porting or interfacing with it somehow.
If I were tackling this problem, I'd probably build a small side application in Java on top of Rhino, with some sort of RPC framework layered on top of that so that I could communicate with it from my primary application.
Unfortunately, without having a complete DOM implementation on top of that, you would be limited to only very simple javascript.
You could execute the javascript by using the MS JScript engine or something similar.
MSDN Reference
Eric Lippert's blog on using Eval (part 1) (part 2) (part 3)
This isn't guaranteed to work, especially if the javascript tries to access the DOM, or somesuch... But for simple scripts, it might be enough.