How can I screen-scrape a multi-page application? I want to do this using JavaScript. Here are the approaches I have considered and the problems I have encountered.
Using the Fetch web API in a Node application to get the web pages
Problem: The web pages won't load properly when being fetched. I guess the JavaScript on the page does not run when the page is fetched.
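To illustrate the problem, here is a minimal sketch (assuming Node 18+, where fetch is built in; the URL is a placeholder): fetch() only returns the HTML the server sends, so anything the page builds with its own scripts is missing.

async function dump(url) {
  // fetch() downloads the document but never executes its scripts
  const response = await fetch(url);
  const html = await response.text();
  console.log(html); // raw server-sent markup only, no JS-generated content
}

dump('https://example.com/app');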
Running JavaScript from the console
This is a very simple way to inject JavaScript straight into the document. But one problem is that opening the web page in a browser and pasting into the console is manual work. Another problem is that while this works for a single-page application, it becomes very cumbersome for multi-page applications.
What better approach exists that solves the problems I have encountered?
It depends on what you are doing. If you just want to get some data from some website, then injecting JS into the page is the way to go.
But as you said it's manual work, from which I deduce you want to scrape the sites and perhaps save the data. In that case a server-side script is better suited. To fix the problem of the JavaScript not being executed, you can use things like PhantomJS or Horseman.
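For instance, a minimal sketch with node-horseman (the URL and selector are placeholders; assumes npm install node-horseman with PhantomJS available on the PATH):

const Horseman = require('node-horseman');
const horseman = new Horseman();

horseman
  .open('https://example.com/app')   // load the page in PhantomJS
  .waitForSelector('#content')       // wait until the page's JS has rendered
  .html()                            // grab the fully rendered markup
  .then(function (html) {
    console.log(html);
    return horseman.close();
  });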
Take a look at this: https://medium.com/@designman/building-a-performant-web-scraper-in-node-js-5f4449674163
If you want to save website content (HTML, JS, CSS files, images) to the file system, you can take a look at the website-scraper package for Node.js: https://www.npmjs.com/package/website-scraper
It also has a plugin for PhantomJS, which makes it possible to handle single-page applications.
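Basic usage looks roughly like this (a sketch assuming an older CommonJS version of the package; the URL and output directory are placeholders):

const scrape = require('website-scraper');

scrape({
  urls: ['https://example.com/'],
  directory: './saved-site', // created by the package; must not already exist
}).then(function (resources) {
  console.log('Saved ' + resources.length + ' resource(s)');
});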
I've recently been scraping some JS-driven pages. As far as I know there are two ways of loading the content: statically (ready-to-use HTML pages) and dynamically (building the HTML in place from raw data). I know about XHRs, and I've successfully intercepted some.
But now I've run into a strange thing: the site dynamically loads the content after the page fully loads, but there are no XHRs. How can that be?
My guess is that the page's own JS files are making some hidden requests (which transfer the data) and building the page based on the responses.
What should I do?
P.S. I'm not interested in Selenium-based solutions - they are well known, but slow and inefficient.
P.P.S. I'm a back-end developer mostly, so I'm not familiar with JS.
Nowadays you do not need to use Selenium for scraping any more. Chrome can now run in headless mode, and you can then run your scraping script after the page has fully loaded.
There is a simple guide here:
https://developers.google.com/web/updates/2017/04/headless-chrome
There is a Node.js library for driving it (chrome-remote-interface), but the downside is that I could not find a Python one.
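A minimal sketch with chrome-remote-interface (assuming Chrome was started with --headless --remote-debugging-port=9222; the URL is a placeholder):

const CDP = require('chrome-remote-interface');

CDP(async (client) => {
  const { Page, Runtime } = client;
  await Page.enable();
  await Page.navigate({ url: 'https://example.com/app' });
  await Page.loadEventFired(); // wait for the page (and its JS) to finish loading
  const { result } = await Runtime.evaluate({
    expression: 'document.documentElement.outerHTML',
  });
  console.log(result.value); // the fully rendered HTML
  await client.close();
}).on('error', console.error);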
I'm trying to get the full HTML generated by an SPA built with AngularJS.
// Fetch the template's page and inject the returned HTML into the container
$.get('http://localhost:3388/' + d.response.path, function (html) {
    $('#templateContainer').html(html);
});
But it's returning only the basic HTML structure, not the dynamic HTML which, in an SPA, is generated by AJAX (I'm wondering if this is why SPAs are not good for SEO).
I believe some technique or trick might exist to solve this problem. Chrome, for example, is able to render all the AJAX-generated HTML when you inspect elements.
Maybe I'm not using the right keywords on Google. What have people been doing to work around this problem?
UPDATE:
Just to be clear about my case: I'm trying to get the full HTML from this SPA in order to display a template preview to the user.
I have many different SPAs with different templates. I want to display these live templates via AJAX instead of an iframe. It works with an iframe, but isn't great.
You can generate a full page in an SPA, but that's not the goal; you should not use an SPA in that case.
The goal of an SPA is to fetch only some pieces of the page and load them when necessary. If you care about SEO crawlers, you should try a middleware like prerender.io: you can run your own server with it (it's open source) or use their service.
But generating a full page in an SPA kills all the reasons why you should use an SPA in the first place. Best Regards :)
You'll get the raw HTML because the server doesn't render your app; this is the expected behaviour, there is no bug.
SPAs are usually client-side applications, which means the browser has to render the Angular app at runtime!
Of course the browser doesn't render anything at first, because your code is injected asynchronously into the DOM; so you need to bootstrap AngularJS manually, not through the ng-app directive.
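A sketch of what that could look like (assuming your AngularJS module is named 'templateApp', which is hypothetical): inject the fetched HTML first, then bootstrap against the container.

$.get('http://localhost:3388/' + d.response.path, function (html) {
    var container = document.getElementById('templateContainer');
    $(container).html(html);
    angular.bootstrap(container, ['templateApp']); // manual bootstrap instead of ng-app
});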
By the way, there are many ways to get server-side rendering of your app; have a look at https://prerender.io/...
If your goal is good indexing... loading the app via jQuery is a bad solution, because search-engine crawlers aren't able to process JavaScript. This is not an Angular-specific issue; every app built in JavaScript has this problem.
The best solution is to have a full Angular app and, only for crawlers, implement server-side prerendering (using prerender.io).
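With an Express server, wiring that up could look roughly like this (a sketch; the token and port are placeholders, and it assumes npm install express prerender-node):

const express = require('express');
const prerender = require('prerender-node');

const app = express();
app.use(prerender.set('prerenderToken', 'YOUR_TOKEN')); // crawlers get prerendered HTML
app.use(express.static('public'));                      // everyone else gets the SPA
app.listen(3000);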
Hope this helps!
We have a web app whose content is generated by JavaScript. Can Google index those pages?
When we investigate this issue, we always find solutions on old pages about using "#!" in links.
In our app the links are like this:
domain.com/paris
domain.com/london
When we use these kinds of links, JavaScript populates the content.
Is it wise to use HTML snapshot or do you have any other suggestions?
Short answer
Yes, they can crawl JavaScript-generated content, as long as you are using pushState.
Detailed answer
It depends on your setup. Google and Bing CAN crawl JavaScript and AJAX-based content if you are using pushState. If you do, they will handle content coming from AJAX calls, updates to the page title or meta tags made with JavaScript, and in general any such things.
Most frontend frameworks like Angular, Ember or Backbone already work with pushState, so in those cases you don't need to do anything. Check whatever system you are using to see how it does things. If you are not using pushState, you will need to implement it on your own or use the whole _escaped_fragment_ HTML-snapshot deal.
So if you use pushState then yes, search engines can crawl your page just fine. If you don't then no, you will need to implement pushState or do HTML snapshots.
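For links like domain.com/paris, a pushState setup could look roughly like this (a sketch; loadCityContent is a hypothetical function that fetches and renders the content):

function navigateTo(city) {
  loadCityContent(city);                             // hypothetical fetch-and-render
  document.title = city;                             // keep the title in sync
  history.pushState({ city: city }, '', '/' + city); // real URL, no #!
}

// Re-render when the user navigates with the back/forward buttons.
window.onpopstate = function (event) {
  if (event.state && event.state.city) {
    loadCityContent(event.state.city);
  }
};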
Bonus info - Unfortunately Facebook does not handle pushState, so the Facebook crawler needs either non-dynamic og tags or HTML snapshots.
"Generated by JavaScript" is ambiguous. That could mean that you are running a JS script on the server or it could mean that you are making an AJAX call with a JS API. The difference appears to matter as far as Googlebot is concerned. But you don't have to take my word for it, as there is empirical proof of what Googlebot will and won't currently cache as far as JavaScript content in the form of live experiments using both the XMLHTTPRequest API and the Fetch API. So, as you can see, server-side rendering is still going to be the best way to go for SEO.
I am trying to make a Google Chrome extension using content script.
My goal is to have a display at the top of the page (which is already working on my own pages) that can interact with the page.
Due to security policies, I need things which are very complicated to put together in an extension:
Using require.js in the extension (that works for now, using this GitHub repo)
Using a templating engine to describe my display: I need to add a lot of content to the page, and I don't think writing HTML in JavaScript would be a good workflow.
For my current version I use Jade on my server, but this is not possible with an extension. I think I need to use something like Angular.js or Backbone.js, but I can't make them work in the content script.
I need a lot of communication between my extension and the page: for example, I need to detect mouse moves almost constantly.
I need communication with my server using socket.io
Every bit of functionality in my extension has been developed and tried in a standalone web page, but now I need to integrate it into a real extension, and I am really stuck.
So given these requirements, I am wondering what the right approach would be for building this: putting it all in an iframe (would the server-side communication work? and how would I communicate with the page?), finding a way to make a templating engine work nicely in there, or some solution I didn't think of?
Try this:
Develop the HUD part as a standalone page that the content script will include in an iframe. You should be able to use Angular.js etc. with this, but you will need local copies of as much as possible, and you'll need appropriate entries in manifest.json to get it working in the extension. See/create other questions for the details.
Have your content script inject the code to monitor mouse moves, etc. into the target page. Have this code digest and summarize the data, so it's not spamming the system. Maybe message the summaries to the HUD page and/or content script five or six times a second.
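A sketch of that digest-and-summarize idea in the content script (assuming the HUD page is loaded in an iframe with id "hud", which is hypothetical):

let lastMove = null;

// Record only the latest position instead of forwarding every event.
document.addEventListener('mousemove', function (e) {
  lastMove = { x: e.clientX, y: e.clientY, t: Date.now() };
});

// Five times a second, send the latest position to the HUD iframe.
setInterval(function () {
  if (!lastMove) return;
  const hud = document.getElementById('hud');
  if (hud) {
    hud.contentWindow.postMessage({ type: 'mouse-summary', move: lastMove }, '*');
  }
  lastMove = null;
}, 200);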
After that, it should just be a matter of getting the pieces working one at a time. Break it down into specific problems, and ask a question about one specific problem at a time (if you can't find the answers in previous questions).
I'm pretty sure what you appear to want is do-able, but the details are too broad for a single Stack Overflow question.
This is kind of tricky. There is this web page which, I am guessing, uses some kind of AJAX to pull in content based on the search query. When I fetch the page using get in Perl, it fetches the script code behind the PHP/HTML, but not the results which are displayed when the query is run manually. I need to be able to fetch the content of the results page. Is there any way to do this in Perl?
Take a look at Selenium RC and the WWW::Selenium module in Perl. With them you can control a real web browser.
Another option is WWW::HtmlUnit, which uses the HtmlUnit Java library to execute the JavaScript without a web browser. WWW::HtmlUnit uses Inline::Java to give Perl access to the library. I have found that when installing, it is best to say No to the question "Do you wish to build the JNI extension?".
If you are writing tests that need to check the rendered page, you can have a look at Schwern's javascript-tap-harness, which works with Selenium and handles all the scaffolding.
I also found Using WWW::Selenium To Test Or Automate An Ajax Website pretty useful.