Unable to scrape hidden content in Twitter - JavaScript

I recently posted a question about how to log in to Twitter using the requests library. I finally got a solution for that, but the next problem I am facing is that I am only able to scrape the visible content on the page. How do I scrape the dynamically loaded content on that page?
Note: I am not using Selenium. Please suggest another way to do this.
How can I load the dynamic content and then scrape it?

Without using something like Selenium or another browser (headless or otherwise) that will actually run the JavaScript in a normal-ish manner, the only other method is to manually reverse engineer the JavaScript: see what kind of calls it makes and make them yourself directly.
There isn't any other kind of "one-size-fits-all" solution.
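For example, if the page turns out to load its data from a JSON endpoint you can spot in the browser's Network tab, a rough Node sketch might look like the following. The endpoint, query parameters and headers below are invented placeholders, not Twitter's real API; the idea is simply to replay the request the page's own JavaScript makes.

```javascript
// A rough sketch only: endpoint and headers are hypothetical placeholders.
const ENDPOINT = 'https://example.com/i/api/timeline.json?count=20'; // made up

async function fetchHiddenContent(extraHeaders = {}) {
  const res = await fetch(ENDPOINT, {
    headers: {
      Accept: 'application/json',
      ...extraHeaders, // copy whatever cookies/tokens the real request carried
    },
  });
  if (!res.ok) throw new Error(`Request failed: ${res.status}`);
  return res.json(); // the "hidden" content arrives here as structured data
}

// Usage (Node 18+, which has a global fetch):
fetchHiddenContent({ /* e.g. Cookie or CSRF headers copied from your session */ })
  .then((data) => console.log(JSON.stringify(data, null, 2)))
  .catch(console.error);
```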

Related

How can I screen scrape a multi-page application using JavaScript?

How can I screen scrape a multi-page application? I want to do this using JavaScript. Here are the approaches I have considered and the problems I have encountered.
Using the Fetch web API in a Node application to get the web pages
Problem: the web pages won't load properly when fetched. I guess the JavaScript on the page does not run when the page is fetched this way.
Running JavaScript from the console
This is a very simple way to inject JavaScript straight into the document. One problem is that opening the web page in a browser and pasting into the console is manual work. Another problem is that while this works for a single-page application, it becomes very cumbersome for multi-page applications.
What better approach exists that solves the problems I have encountered?
It depends on what you are doing. If you just want to get some data from some website, then injecting JS into the page is the way to go.
But as you said it's manual work, from which I deduce that you want to scrape the sites and save the data. In that case a server-side script is better suited. To fix the problem of the JavaScript not being run, you can use things like PhantomJS or Horseman.
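To give a rough idea of the PhantomJS approach (PhantomJS is no longer maintained, but the pattern is the same for any headless browser), a minimal sketch with a placeholder URL and a naive fixed wait could look like this:

```javascript
// Run with: phantomjs scrape.js
// Minimal sketch: load the page, let its JavaScript run, then read the
// rendered DOM. The URL and the 2-second wait are placeholders; in practice
// you would poll for a selector instead of sleeping.
var page = require('webpage').create();

page.open('https://example.com/search?q=something', function (status) {
  if (status !== 'success') {
    console.error('Failed to load page');
    phantom.exit(1);
    return;
  }
  setTimeout(function () {
    var text = page.evaluate(function () {
      // This runs inside the page; return whatever rendered content you need.
      return document.body.innerText;
    });
    console.log(text);
    phantom.exit();
  }, 2000);
});
```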
Take a look at this: https://medium.com/@designman/building-a-performant-web-scraper-in-node-js-5f4449674163
If you want to save website content (HTML, JS, CSS files, images) to the file system, you can take a look at the website-scraper package for Node.js: https://www.npmjs.com/package/website-scraper
It also has a plugin for PhantomJS, which allows it to handle single-page applications.
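A minimal sketch of how the package is typically used, based on its basic documented options (check the version you install; newer major versions are ESM-only, so require() vs import depends on it):

```javascript
// Basic usage of website-scraper (npm install website-scraper).
const scrape = require('website-scraper');

scrape({
  urls: ['https://example.com'],   // pages to download
  directory: './downloaded-site',  // must not exist yet; the package creates it
}).then((resources) => {
  console.log('Saved', resources.length, 'resource(s)');
}).catch(console.error);
```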

Can Googlebot crawl JavaScript-generated content?

We have a web app whose content is generated by JavaScript. Can Google index those pages?
When we investigated this issue, we always found solutions on old pages about using "#!" in links.
In our app the links are like this:
domain.com/paris
domain.com/london
When we use these kinds of links, JavaScript populates the content.
Is it wise to use HTML snapshots, or do you have any other suggestions?
Short answer
Yes, they can crawl JavaScript-generated content, as long as you are using pushState.
Detailed answer
It depends on your setup. Google and Bing CAN crawl JavaScript- and AJAX-based content if you are using pushState. If you do, they will handle content coming from AJAX calls, updates to the page title or meta tags made with JavaScript, and in general any such things.
Most frontend frameworks like Angular, Ember or Backbone already work with pushState, so in those cases you don't need to do anything. Check whatever framework you are using to see how it does things. If you are not using pushState, you will need to implement it on your own or go with the whole _escaped_fragment_ HTML-snapshot deal.
So if you use pushState, then yes, search engines can crawl your page just fine. If you don't, then no: you will need to implement pushState or serve HTML snapshots.
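As a rough illustration only (the /content/... endpoint and element ids are made up), pushState-based navigation for paths like domain.com/paris could look something like this on the client:

```javascript
// Illustrative sketch: each view gets a real path (domain.com/paris) that
// pushState puts in the address bar without a full reload.
async function showCity(city) {
  const res = await fetch('/content/' + encodeURIComponent(city)); // hypothetical endpoint
  document.getElementById('main').innerHTML = await res.text();
  history.pushState({ city: city }, '', '/' + city); // URL becomes /paris, /london, ...
}

// Intercept in-app links so they use AJAX + pushState instead of full page loads.
document.addEventListener('click', function (e) {
  var link = e.target.closest('a[data-city]');
  if (!link) return;
  e.preventDefault();
  showCity(link.dataset.city);
});
```

For this to be crawlable (and for users landing directly on /paris), the server also has to return real content for those paths, and you would normally add a popstate handler so the Back button re-renders the right city.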
Bonus info: unfortunately, Facebook does not handle pushState, so the Facebook crawler needs either non-dynamic og tags or HTML snapshots.
"Generated by JavaScript" is ambiguous. It could mean that you are running a JS script on the server, or it could mean that you are making an AJAX call with a JS API. The difference appears to matter as far as Googlebot is concerned. But you don't have to take my word for it: there is empirical proof of what Googlebot will and won't currently cache as far as JavaScript content goes, in the form of live experiments using both the XMLHttpRequest API and the Fetch API. So, as you can see, server-side rendering is still the best way to go for SEO.

Load a webpage inside an iframe but replace part of the URLs inside the iframe page

My first question here :)
I want a way to load a page inside an iframe while changing/replacing part of the URLs of any links present in the webpage with alternate text.
Suppose we load a website like "mywebsite.com" in the iframe, and the loaded page contains links to another site, e.g.:
https://www.facebook.com/abcd?id=text
https://www.facebook.com/efgh?id=text
Then I want the website inside the iframe to be loaded with custom URLs like:
https://www.facebook.com/abcd?id=alternatetext
https://www.facebook.com/efgh?id=alternatetext
Basically, I need a way to replace "text" with "alternatetext" ON THE FLY while rendering the webpage inside the iframe.
How do I do it?
Help me, people.
Thanks.
This is completely possible, but I think you may be far off on this. Since you did not include any JavaScript, I assume you have not made any headway on that. This is going to be deep and take some fine-tuning; it's not just some code snippet that someone can give you. It can totally be done with a scripting language. I recommend you take the time to learn a server-side language. I personally use VB.NET at work. You will be amazed by the possibilities.
On another note, if Facebook found out you were displaying their pages online and modifying their links, they would surely take action.
I recommend this question be closed.
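For what it's worth, here is a minimal sketch of the server-side idea in Node rather than VB.NET; everything in it is illustrative (the port, the query-string format, the simple string replacement). Note that client-side JavaScript cannot rewrite the contents of a cross-origin iframe because of the same-origin policy, so the page has to be fetched and rewritten on your own server and the iframe pointed at your proxy URL.

```javascript
// Sketch of a "rewriting proxy" (no external packages). Run it, then point the
// iframe at: http://localhost:3000/?url=<encoded target URL>
const http = require('http');

http.createServer(async function (req, res) {
  const target = new URL(req.url, 'http://localhost').searchParams.get('url');
  if (!target) {
    res.writeHead(400);
    return res.end('Missing ?url= parameter');
  }

  const upstream = await fetch(target); // Node 18+ global fetch
  let html = await upstream.text();

  // The actual on-the-fly rewrite: swap the query-string value in every link.
  html = html.replace(/id=text\b/g, 'id=alternatetext');

  res.writeHead(200, { 'Content-Type': 'text/html' });
  res.end(html);
}).listen(3000);
```

This ignores plenty of real-world details (relative URLs, cookies, error handling, the target site's terms of service), and any link that still points straight at facebook.com will bypass the proxy unless you rewrite it to go through the proxy as well.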

How do I implement bread-crumb navigation similar to Facebook?

I'm trying to make an Ajax web application that uses bread-crumbing to allow the use of the Back and Forward buttons, but still has that slick Ajax page movement.
An excellent example is Facebook's image gallery.
When you click 'Next', the URL changes accordingly, but the entire page does not reload. It's a really smooth interface and I'd like to mimic it.
Anyone got a tutorial/write up on how this works?
Thanks.
Facebook uses the URL anchor (fragment identifier) to store the state needed for its AJAX code. This allows changing the URL without having the page reload.
Example: http://somedomain.com/#ajax_data_here
Now it's up to you to design a sensible format for your AJAX data and to parse it.
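A minimal sketch of that fragment-identifier technique (the #viewer and #next elements and the "photo=<n>" format are made up for the example):

```javascript
// State lives after the '#', e.g. http://somedomain.com/#photo=42, so clicking
// "Next" only changes the hash and no page reload happens.
function getPhotoId(hash) {
  var m = hash.match(/photo=(\d+)/);
  return m ? Number(m[1]) : 1;
}

function render() {
  var id = getPhotoId(location.hash);
  document.getElementById('viewer').textContent = 'Showing photo ' + id; // stand-in for an AJAX load
}

// Back/Forward just change the hash, so this fires and the view re-renders.
window.addEventListener('hashchange', render);
render(); // initial render

document.getElementById('next').addEventListener('click', function () {
  location.hash = 'photo=' + (getPhotoId(location.hash) + 1);
});
```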
Update Dec 2012:
I've recently encountered the following method for changing the path within the URL without reloading. Although it only works with newer browsers, I thought I'd append it:
Modify the URL without reloading the page
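A minimal sketch of that newer history API approach (the /photos/<id> path, #viewer element and loadPhoto() stand-in are hypothetical):

```javascript
// pushState changes the visible path without a reload; popstate fires when the
// user presses Back/Forward, so the matching content can be re-rendered.
function loadPhoto(id) {
  document.getElementById('viewer').textContent = 'Photo ' + id; // replace with a real AJAX render
}

function goToPhoto(id) {
  loadPhoto(id);
  history.pushState({ id: id }, '', '/photos/' + id); // path changes, page does not reload
}

// Back/Forward fire popstate; re-render the state that was pushed earlier.
window.addEventListener('popstate', function (e) {
  if (e.state) loadPhoto(e.state.id);
});
```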
As far as I am aware, there are two main ways this effect is achieved:
Using the anchor portion of the URL (#gallery)
Using a hidden iframe
There are pre-built solutions you can use to get this kind of functionality without having to write the code yourself. For example, if you are working with ASP.NET, you can use the Ajax History Control:
http://www.asp.net/ajax/videos/how-do-i-use-the-aspnet-ajax-history-control
If you are using jQuery, look at the Address plugin:
http://www.asual.com/jquery/address/
If you're using jQuery, there are lots of suggestions documented here: https://stackoverflow.com/questions/116446/what-is-the-best-back-button-jquery-plugin
I've personally used jQuery Address, and it's super easy and very effective.

How can I get dynamically generated web content using Perl?

This is kind of tricky. There is this webpage which, I am guessing, uses some kind of AJAX to pull in content based on the search query. When I fetch the page using get in Perl, it fetches the script code behind the PHP/HTML, but not the results that are displayed when the query is searched manually. I need to be able to fetch the content of the results page. Is there any way to do this in Perl?
Take a look at Selenium RC and the WWW::Selenium module in Perl. With them you can control a real web browser.
Another option is WWW::HtmlUnit which uses the HtmlUnit Java library to execute the JavaScript without a web browser. WWW::HtmlUnit uses Inline::Java to give Perl access to the library. I have found that when installing, it is best to say No to the question "Do you wish to build the JNI extension?".
If you are writing tests that need to check the rendered page, you can have a look at Schwern's javascript-tap-harness, which works with Selenium and handles all the scaffolding.
I also found Using WWW::Selenium To Test Or Automate An Ajax Website pretty useful.
