Scraping dynamic data

Scraping dynamic data - javascript

I am scraping profiles on ask.fm for a research question. The problem is that only the top most recent questions are viewable and I have to click "view more" to see the next 15.
The source code for clicking view more looks like this:
<input class="submit-button-more submit-button-more-active" name="commit" onclick="return Forms.More.allowSubmit(this)" type="submit" value="View more" />
What is an easy way of calling this 4 times before scraping it. I want the most recent 60 posts on the site. Python is preferable.

Without using Headless Browser
Open the chrome debugger tools -> Network Tab.
Now click on the View More button.
Inspect the request being fired in the Network dialog when you click on View More.
The data would be loaded from an external API in most cases so check whether the request to the API is a get or a post request and the kind of response from it.
There could be a limit or any similar query parameter which could be passed to that url so as to limit the no of response objects. In your case, it would be 15.
Try to make the request to the same URL from your script increasing the limit to 60 let's say and check the response.
The above technique works in most cases. But if this doesn't work for you try the following steps.
Using Headless Browser
Try to use a headless browser which loads the dynamic content, you have the methods to scroll down, click etc available.
Examples of Headless browsers are Selenium, Splash, PhantomJS, SlimmerJS etc.

You could probably use selenium to browse to the website and click on the button/link a few times. You can get that here:
https://pypi.python.org/pypi/selenium
Or you might be able to do it with mechanize:
http://wwwsearch.sourceforge.net/mechanize/
I have also heard good things about twill, but never used it myself:
http://twill.idyll.org/

Related

Getting initiator of XXX-xsrfstatemanager.js file using Chrome Developer Tools

In order to triage a problem with a web browser I am trying to determine the initiator of the XXX-xsrfstatemanager.js file (the XXX part seems to be something dynamic like a nonce) that occurs as part of a Google Authentication flow (using OAuth).
When I use Chrome developer tools, it says the below URL is the initiator:
https://accounts.google.com/o/oauth2/v2/auth?approval_state=%21Ch[REDACTED]Q%E2%88%99AJ[REDACTED]xq&as=-aBk[REDACTED]
Looking at the result of the above page see a lot of Javascript, but the string "xsrfstatemanager" is nowhere to be found, nor do I see any other javascript pages being included. Unless there is some really cryptic code that is somehow building this URL, the call is actually coming from some other page.
Does anyone know how I can get the 'real' initiator? Or if the above URL might be correct, if I can get more information like what exact line number of the file initiated the call?
By the way, while I edited the above URL for security reasons, if you go to (for example) www.quora.com and quick "continue with google" it is easy to see the flow in question.

The flow includes a redirection, which is why you cannot see the source code that initiates/references that script.
If you view the source of the original URL that is opened when you click on "Continue with Google", you will see the <script src> that references it. This works in Chrome and probably Safari -
view-source:https://accounts.google.com/o/oauth2/auth?redirect_uri=storagerelay%3A%2F%2Fhttps%2Fwww.quora.com%3Fid%3Dauth488109&response_type=code%20permission%20id_token&scope=email%20profile%20openid&openid.realm=&client_id=917071888555.apps.googleusercontent.com&ss_domain=https%3A%2F%2Fwww.quora.com&access_type=offline&include_granted_scopes=true&prompt=select_account&origin=https%3A%2F%2Fwww.quora.com&gsiwebsdk=2
From the source code -
<script src='https://ssl.gstatic.com/accounts/o/532969778-xsrfstatemanager.js' nonce="IgiKmQiLZIHDwGvce7/q6Q"></script>
You can also use tools like Fiddler to see the source code of the redirect, or check "Preserve log" in the Network panel of the Developer Tools feature of Chrome, or by going to the original URL with JavaScript disabled.

Clicking a Javascript link to make a post request in Python

I'm writing a webscraper/automation tool. This tool needs to use POST requests to submit form data. The final action uses this link:
<a id="linkSaveDestination" href='javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("linkSaveDestination", "", true, "", "", false, true))'>Save URL on All Search Engines</a>
to submit data from this form:
<input name="sem_ad_group__destination_url" type="text" maxlength="1024" id="sem_ad_group__destination_url" class="TextValueStyle" style="width:800px;">
I've been using requests and BeautifulSoup. I understand that these libraries can't interact with Javascript, and people recommend Selenium. But as I understand it Selenium can't do POSTs. How can I handle this? Is it possible to do without opening an actual browser like Selenium does?

Yes. You can absolutely duplicate what the link is doing by just submitting a POST to the proper url (this is, in reality, eventually going to be the same thing that the javascript that fires when the link is clicked does).
You'll find the relevant section in the requests docs here: http://docs.python-requests.org/en/latest/user/quickstart/#more-complicated-post-requests
So, that'll look something like this for your particular case:
payload = {'sem_ad_group__destination_url': 'yourTextValueHere'}
r = requests.post("theActionUrlForTheFormHere", data=payload)
If you're having trouble figuring out what url it is actually be posted to, just monitor the network tab (in chrome dev tools) while you manually click the link yourself, you should be able to find the right request and pull any information off of that.
Good Luck!

With selenium you mimic the real-user interactions in a real browser - tell it to locate an input, write a text inside, click a button etc - high-level approach - you don't even need to know what is there under-the-hood, you see what a real user sees. The downside here is that there is a real browser involved which, at least, slows things down. You can though, automate a a headless browser (PhantomJS), or use a Xvfb virtual framebuffer if you don't have conditions to open up a browser with a UI. Example:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('url here')
button = driver.find_element_by_id('linkSaveDestination')
button.click()
With requests+BeautifulSoup, you are going down to the bare metal - using browser developer tools you research/analyze what requests are made to a server and mimic them in your code. Sometimes the way a page is constructed and requests made are too complicated to automate, or there are anti-web-scraping technique used.
There are pros & cons about both approaches - which option to choose depends on many things.

"undefined" randomly appended in 1% of requested urls on my website since 12 june 2012

Since 12 june 2012 11:20 TU, I see very weirds errors in my varnish/apache logs.
Sometimes, when a user has requested one page, several seconds later I see a similar request but the all string after the last / in the url has been replaced by "undefined".
Example:
http://example.com/foo/bar triggers a http://example.com/foo/undefined request.
Of course theses "undefined" pages does not exist and my 404 page is returned instead (which is a custom page with a standard layout, not a classic apache 404)
This happens with any pages (from the homepage to the deepest)
with various browsers, (mostly Chrome 19, but also firefox 3.5 to 12, IE 8/9...) but only 1% of the trafic.
The headers sent by these request are classic headers (and there is no ajax headers).
For a given ip, this seems occur randomly: sometimes at the first page visited, sometimes on a random page during the visit, sometimes several pages during the visit...
Of course it looks like a javascript problem (I'm using jquery 1.7.2 hosted by google), but I've absolutely nothing changed in the js/html or the server configuration since several days and I never saw this kind of error before. And of course, there is no such links in the html.
I also noticed some interesting facts:
the undefined requests are never found as referer of another pages, but instead the "real" pages were used as referer for the following request of the same IP (the user has the ability to use the classic menu on the 404 page)
I did not see any trace of these pages in Google Analytics, so I assume no javascript has been executed (tracker exists on all pages including 404)
nobody has contacted us about this, even when I invoked the problem in the social networks of the website
most of the users continue the visit after that
All theses facts make me think the problem occurs silently in the browers, probably triggered by a buggy add-on, antivirus, a browser bar or a crappy manufacturer soft integrated in browsers updated yesterday (but I didn't find any add-on released yesterday for chrome, firefox and IE).
Is anyone here has noticed the same issue, or have a more complete explanation?

There is no simple straight answer.
You are going to have to debug this and it is probably JavaScript due to the 'undefined' word in the URL. However it doesn't have to be AJAX, it could be JavaScript creating any URL that is automatically resolved by the browser (e.g. JavaScript that sets the src attribute on an image tag, setting a css-image attribute, etc). I use Firefox with Firebug installed most of the time, so my directions will be with that in mind.
Firebug Initial Setup
Skip this if you already know how to use Firebug.
After the installs and restarting Firefox for Firebug, you are going to have to enable most of Firebug's 'panels'. To open Firebug there will be a little fire bug/insect looking thing in the top right corner of your browser or you can press F12. Click through the Firebug tabs 'Console', 'Script', 'Net' and enable them by opening them up and reading the panel's information. You might have to refresh the page to get them working properly.
Debugging User Interaction
Navigate to one of the pages that has the issue with Firebug open and the Net panel active. In the Net panel there will be a few options: 'Clear', 'Persist', 'All', 'Html', etc. Make sure ALL is selected. Don't do anything on the page and try not to mouse over anything on it. Look through the requests. The request for the invalid URL will be red and probably have a status of 404 Not Found (or similar).
See it on load? Skip to the next part.
Don't see it on initial load? Start using your page and continue here.
Start clicking on every feature, mouse over everything, etc. Keep your eyes on the Net panel and watch for a requests that fail. You might have to be creative, but continue using your application till you see your browser make an invalid request. If the page makes many requests, feel free to hit the 'Clear' button on the top left of the Net panel to clear it up a bit.
If you submit the page and see a failed request go out really quick but then lose it because the next page loads, enable persistence by clicking 'Persist' in the top left of the Net panel.
Once it does, and it should, consider what you did to make that happen. See if you can make it happen again. After you figure out what user interaction is making it happen, dive into that code and start looking for things that are making invalid requests.
You can use the Script tab to setup breakpoints in your JavaScript and step through them. Investigate event handlers done via $(elemment).bind/click/focus/etc or from old school event attributes like onclick=""/onfocus="" etc.
If the request is happening as soon as the page loads
This is going to be a little harder to peg down. You will need to go to the Script tab and start adding break points to every script that runs on load. You do this by clicking on the left side of the line of JavaScript.
Reload your page and your break points should stop the browser from loading the page. Press the 'Continue' button on the script panel. Go to your net panel and see if your request was made, continue till it is found. You can use this to narrow down where the request is being made from by slowly adding more and more break points and then stepping into and out of functions.
What you are looking for in your code
Something that is similar to the following:
var url = workingUrl + someObject['someProperty'];
var url = workingUrl + someObject.someProperty;
Keep in mind that someObject might be an object {}, an array [], or any of the internal browser types. The point is that a property will be accessed that doesn't exist.
I don't see any 404/red requests
Then whatever is causing it isn't being triggered by your tests. Try using more things. The point is you should be able to make the request happen somehow. You just don't know yet. It has to show up in the Net panel. The only time it won't is when you aren't doing whatever triggers it.
Conclusion
There is no super easy way to peg down what exactly is going on. However using the methods I outlined you should be at least be able to get close. It is probably something you aren't even considering.

Based on this post, I reverse-engineered the "Complitly" Chrome Plugin/malware, and found that this extension is injecting an "improved autocomplete" feature that was throwing "undefined" requests at every site that has a input text field with NAME or ID of "search", "q" and many others.
I found also that the enable.js file (one of complitly files) were checking a global variable called "suggestmeyes_loaded" to see if it's already loaded (like a Singleton). So, setting this variable to false disables the plugin.
To disable the malware and stop "undefined" requests, apply this to every page with a search field on your site:
<script type="text/javascript">
window.suggestmeyes_loaded = true;
</script>
This malware also redirects your users to a "searchcompletion.com" site, sometimes showing competitors ADS. So, it should be taken seriously.

You have correctly established that the undefined relates to a JavaScript problem and if your site users haven't complained about seeing error pages, you could check the following.
If JavaScript is used to set or change image locations, it sometimes happens that an undefined makes its way into the URI.
When that happens, the browser will happily try to load the image (no AJAX headers), but it will leave hints: it sets a particular Accept: header; instead of text/html, text/xml, ... it will use image/jpeg, image/png, ....
Once such a header is confirmed, you have narrowed down the problem to images only. Finding the root cause will possibly take some time though :)
Update
To help debugging you could override $.fn.attr() and invoke the debugger when something is being assigned to undefined. Something like this:
(function($, undefined) {
var $attr = $.fn.attr;
$.fn.attr = function(attributeName, value) {
var v = attributeName === 'src' ? value : attributeName.src;
if (v === 'undefined') {
alert("Setting src to undefined");
}
return $attr(attributeName, value);
}
}(jQuery));

Some facts that have been established, especially in this thread: http://productforums.google.com/forum/#!msg/chrome/G1snYHaHSOc/p8RLCohxz2kJ
it happens on pages that have no javascript at all.
this proves that it is not an on-page programming error
the user is unaware of the issue and continues to browse quite happily.
it happens a few seconds after the person visits the page.
it doesn't happen to everybody.
happens on multiple browsers (Chrome, IE, Firefox, Mobile Safari, Opera)
happens on multiple operating systems (Linux, Android, NT)
happens on multiple web servers (IIS, Nginx, Apache)
I have one case of googlebot following the link and claiming the same referrer. They may just be trying to be clever and the browser communicated it to the mothership who then set out a bot to investigate.
I am fairly convinced by the proposal that it is caused by plugins. Complitly is one, but that doesn't support Opera. There many be others.
Though the mobile browsers weigh against the plugin theory.
Sysadmins have reported a major drop off by adding some javascript on the page to trick Complitly into thinking it is already initialized.
Here's my solution for nginx:
location ~ undefined/?$ {
return 204;
}
This returns "yeah okay, but no content for you".
If you are on website.com/some/page and you (somehow) navigate to website.com/some/page/undefined the browser will show the URL as changed but will not even do a page reload. The previous page will stay as it was in the window.
If for some reason this is something experienced by users then they will have a clean noop experience and it will not disturb whatever they were doing.

This sounds like a race condition where a variable is not getting properly initialized before getting used. Considering this is not an AJAX issue according to your comments, there will be a couple of ways of figuring this out, listed below.
Hookup a Javascript exception Logger: this will help you catch just about all random javascript exceptions in your log. Most of the time programmatic errors will bubble up here. Put it before any scripts. You will need to catch these on the server and print them to your logs for analysis later. This is your first line of defense. Here is an example:
window.onerror = function(m,f,l) {
var e = window.encodeURIComponent;
new Image().src = "/jslog?msg=" + e(m) + "&filename=" + e(f) + "&line=" + e(l) + "&url=" + e(window.location.href);
};
Search for window.location: for each of these instances you should add logging or check for undefined concats/appenders to your window.location. For example:
function myCode(loc) {
// window.location.href = loc; // old
typeof loc === 'undefined' && window.onerror(...); //new
window.location.href = loc; //new
}
or the slightly cleaner:
window.setLocation = function(url) {
/undefined/.test(url) ?
window.onerror(...) : window.location.href = url;
}
function myCode(loc) {
//window.location.href = loc; //old
window.setLocation(loc); //new
}
If you are interested in getting stacktraces at this stage take a look at: https://github.com/eriwen/javascript-stacktrace
Grab all unhandled undefined links: Besides window.location The only thing left are the DOM links themselves. The third step is to check all unhandeled DOM links for your invalid URL pattern (you can attach this right after jQuery finishes loading, earlier better):
$("body").on("click", "a[href$='undefined']", function() {
window.onerror('Bad link: ' + $(this).html()); //alert home base
});
Hope this is helpful. Happy debugging.

I'm wondering if this might be an adblocker issue. When I search through the logs by IP address it appears that every request by a particular user to /folder/page.html is followed by a request to /folder/undefined

I don't know if this helps, but my website is replacing one particular *.webp image file with undefined after it's loaded in multiple browsers. Is your site hosting webp images?

I had a similar problem (but with /null 404 errors in the console) that #andrew-martinez's answer helped me to resolve.
Turns out that I was using img tags with an empty src field:
<img src="" alt="My image" data-src="/images/my-image.jpg">
My idea was to prevent browser from loading the image at page load to manually load later by setting the src attribute from the data-src attribute with javascript (lazy loading). But when combined with iDangerous Swiper, that method caused the error.

Full Ajax site, redirecting from "normal" URL to Ajax (fragment) URL automatically?

Ok, so all the rage these days is having a site like this:
mysite.com/
mysite.com/about
mysite.com/contact
But then if the user has Javascript enabled, then to have them browse those pages with Ajax:
mysite.com/#/
mysite.com/#/about
mysite.com/#/contact
That's all well and good. I have that all working perfectly well.
My question is, if the user arrives at "mysite.com/about", I want to automatically redirect them to "mysite.com/#/about" immediately if they have Javascript.
I have it working so if they arrive at "mysite.com/about", that page will load fine on its own (no redirects) and then all clicks after that load via ajax, but the pre-fragment URL doens't change. e.g. if they arrive on "mysite.com/about" and then click "contact", the new URL will be "mysite.com/about#/contact". I really don't like that though, it's very ugly.
The only way I can think of to automatically redirect a user arriving at "mysite.com/about" to "mysite.com/#/about" is to have some javascript in the header that is ONLY run if the page is NOT being loaded via ajax. That code looks like this ($ = jQuery):
$(function(){
if( !location.hash || location.hash.substr(1,1) != '/' ) {
location.replace( location.protocol+'//'+location.hostname+'/#'+location.pathname+location.search );
}
});
That technically works, but it causes some very strange behavior. For example, normally when you "view source" for a page that has some ajax content, that ajax content will not be in the source because you're viewing the original page's source. Well, when I view source after redirecting like this, then the code I see is ONLY the code that was loaded via Ajax - I've never seen anything like that before. This happens in both Firefox 3.6 and Chrome 6.0. I haven't verified it with other browsers yet but the fact that two browsers using completely different engines exhibit the same behavior indicates I am doing something bad (e.g. not just a bug with FF or Chrome).
So somehow the browser thinks the page I'm on "is" the Ajax page. I can continue to browse around and it works fine, but if I e.g. close Firefox and re-open it (and it re-opens the pages I was on), it only reloads the Ajax fragment of the page, and not the whole wrapper, until I do a manual refresh. (Chrome doesn't do this though, only Firefox). I've never seen anything like that.
I've tried using setTimeout so it does not do the redirect until after the page has fully loaded, but the same thing happens. Basically, as far as I can tell, this only works if the fragment is put there as the result of a user action (click), and not automatically.
So my question is - what's the best way to automatically redirect a Javascript capable browser from a "normal" URL to an Ajax URL? Anyone have experience doing this? I know there are sites that do this - e.g., http://rdio.com (a music site). No weirdness happens there, but I can't figure out how they're doing it.
Thanks for any advice.

This behavior is like the new twitter. If you type the URL:
http://twitter.com/dinizz
You will be redirected to:
http://twitter.com/#!/dinizz
I realize that this is done, not with javascript but in the server side. I am looking for a solution to implements this using ruby on rails.
Although I suggest you to take a look on this article: Making AJAX Applications Crawlable

Is there a way to mitigate downloading of resources (images/css and js files) with Javascript?

I have a html page on my localhost - get_description.html.
The snippet below is part of the code:
<input type="text" id="url"/>
<button id="get_description_button">Get description</button>
<iframe id="description_container" src="#"/>
When the button is clicked the src of the iframe is set to the url entered in the textbox. The pages fetched this way are very big with lots of linked files. What I am interested in the page is a block of text contained in a <div id="description"> element.
Is there a way to mitigate downloading of resources linked in the page that loads into the iframe?
I don't want to use curl because the data is only available to logged in users and the steps to take with curl to get the content is too complicated. The iframe is simple as I use this on a box which sends the right cookies to identify the request as coming from a logged in user, but the problem is that it is very wasteful to get nearly 1 MB of data to keep 1 KB of it and throw out the rest.
Edit
If the proposed method just works in Firefox it is fine, so I added Firefox tag. Also, it is possible that the answer actually is from the realm of Firefox add-on techniques, so I added that tag as well.
The problem is not that I cannot get at what I'm looking for, rather, the problem is the easy iframe method is wasteful.
I know that Firefox does allow loading only the text of a page. If you open a page and press Ctrl+U you are taken to 'view page source' window, There links behave as normal and are clickable, if you click on a link in source view, the source of the new page is loaded into the view source window, without the linked resources being downloaded, exactly what I'm trying to get. But I don't know how to access this behaviour.
Another example is the Adblock add-on. It somehow kills elements before they get loaded. With plain Javascript this is not possible. Because it only is triggered too late to intervene in good time.

The Same Origin Policy forbids any web page to access contents of any other web page in a different domain so basically you cannot do that.
However it seems that with some browsers it is allowed to access web pages content if you are trying to access it from a local web page which seems to be your case.
Safari, IE 6/7/8 are browser that allow a local web page to do so via XMLHttpRequest (source: Google Browser Security Handbook) so you may want to choose to use one of those browsers to do what you need (note that future versions of those browsers may not allow to do so anymore).
A part from this solution I only see two possibities:
If the web pages you need to fetch content from are somehow controlled by you, you can create a simpler interface to let other web pages to get the content you need (for example allowing JSONP requests).
If the web pages you need to fetch content from are not controlled by you the only solution I see is to fetch content server side logging in from the server directly (I know that you don't want to do so, but I don't see any other possibility if the previous I mentioned are not practicable)
Hope it helps.

Actually I've seen Cross Domain jQuery .load request before, here: http://james.padolsey.com/javascript/cross-domain-requests-with-jquery/
The author claims that codes like these found on that page
$('#container').load('http://google.com'); // SERIOUSLY!
$.ajax({
url: 'http://news.bbc.co.uk',
type: 'GET',
success: function(res) {
var headline = $(res.responseText).find('a.tsh').text();
alert(headline);
}
});
// Works with $.get too!
would work. (The BBC code might not work because of the recent redesign, but you get the idea)
Apparently it is using YQL wrapped into a jQuery plugin to do the trick. Now I cannot say I fully understand what he is doing there but it appears to work, and fits the bill. Once you load the data I suppose it is a simple matter of filtering out the data that you need.
If you prefer something that works at the browser level, may I suggest Mozilla's Jetpack framework for lightweight extensions. I've not yet read the documentations in its entirety but it should contain the APIs needed for this to work.

There are various ways to go about this in AJAX, I'm going to show the jQuery way for brevity as one option, though you could do this in vanilla JavaScript as well.
Instead of an <iframe> you can just use a container, let's say a <div> like this:
<div id="description_container"></div>
Then to load it:
$(function() {
$("#get_description_button").click(function() {
$("#description_container").load($("input").val() + " #description");
});
});
This uses the .load() method which takes a string in this format: .load("url selector"), then takes that element in the page and places it's content inside the container you're loading, in this case #description_container.
This is just the jQuery route, mainly to illustrate that yes, you can do what you want, but you don't have to do it exactly like this, just showing the concept is getting what you want from an AJAX request, rather than in an <iframe>.

Your description sounds like you are fetching pages from the same domain (you said that you need to be logged in and have session credentials) so have you tried to use async request via XMLHttpRequest? It might complain if the html on a page is particularly messed up but you chould still be able to get raw text via .responseText and extract what you need with a regex.

We Keep Coding

JavaScript is the programming language of the Web.