Browser-based client-side scraping - javascript

I wonder if it's possible to scrape an external (cross-domain) page through the user's IP?
For a shopping comparison site, I need to scrape pages of an e-commerce site, but several requests from the server would get me banned, so I'm looking for ways to do client-side scraping: requesting pages from the user's IP and sending them to the server for processing.

No, you won't be able to use your visitors' browsers to scrape content from other websites using JavaScript, because of a security measure called the same-origin policy.
There is no way to circumvent this policy, and that's for a good reason. Imagine you could instruct your visitors' browsers to do anything on any website. That's not something you want to happen automatically.
However, you could create a browser extension to do that. JavaScript browser extensions can be equipped with more privileges than regular JavaScript.
Adobe Flash has similar security features, but I guess you could use Java (not JavaScript) to create a web scraper that uses your users' IP addresses. Then again, you probably don't want to do that: Java plugins are considered insecure (and slow to load!), and not all users will even have one installed.
So now back to your problem:
I need to scrape pages of an e-com site but several requests from the server would get me banned.
If the owner of that website doesn't want you to use their service in that way, you probably shouldn't do it. Otherwise you would risk legal implications (look here for details).
If you are on the "dark side of the law" and don't care whether it's legal or not, you could use something like http://luminati.io/ to use the IP addresses of real people.

Basically browsers are made to avoid doing this…
The solution everyone thinks about first:
jQuery/JavaScript: accessing contents of an iframe
But it will not work in most cases with "recent" browsers (<10 years old)
Alternatives are:
Using the official apis of the server (if any)
Try finding if the server is providing a JSONP service (good luck)
If you are on the same domain, try cross-site scripting (if possible; not very ethical)
Using a trusted relay or proxy (but this will still use your own ip)
Pretend you are a Google web crawler (why not, but not very reliable, and no guarantees about it)
Use a hack to set up the relay/proxy on the client itself; Java or possibly Flash come to mind (will not work on most mobile devices, slow, and Flash has its own cross-site limitations too)
Ask Google or another search engine for the content (you might then have a problem with the search engine if you abuse it…)
Do the job yourself and cache the answers, in order to unload their server and decrease the risk of being banned
Index the site yourself (with your own web crawler), then use your own index (how well this works depends on how frequently the source changes)
http://www.quora.com/How-can-I-build-a-web-crawler-from-scratch
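The "do the job yourself and cache" option above can be sketched roughly like this; `cachedFetch`, the TTL value and the injected `fetcher` are illustrative names, not a real library:

```javascript
// Hypothetical sketch: fetch a page yourself, cache the body, and serve the
// cached copy for a while so the source server sees far fewer requests.
const cache = new Map();
const TTL_MS = 10 * 60 * 1000; // illustrative: re-fetch at most every 10 minutes

async function cachedFetch(url, fetcher) {
  const hit = cache.get(url);
  if (hit && Date.now() - hit.time < TTL_MS) {
    return hit.body; // fresh enough: no request to the source server
  }
  const body = await fetcher(url); // e.g. url => fetch(url).then(r => r.text())
  cache.set(url, { body, time: Date.now() });
  return body;
}
```

A real crawler would also respect robots.txt and rate-limit its own requests, which this sketch omits.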
[EDIT]
One more solution I can think of is going through a YQL service; in this manner it is a bit like using a search engine / a public proxy as a bridge to retrieve the information for you.
Here is a simple example to do so. In short, you get cross-domain GET requests.
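YQL itself has since been retired, but the pattern it relied on, JSONP, still illustrates the idea. A rough browser-side sketch follows; the endpoint and parameter names are made up for illustration:

```javascript
// Build the script URL for a JSONP request: the callback name is passed as a
// query parameter so the server can wrap its JSON in a call to that function.
function buildJsonpUrl(endpoint, params, callbackName) {
  const query = Object.entries({ ...params, callback: callbackName })
    .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
    .join('&');
  return `${endpoint}?${query}`;
}

// Browser-only part: inject a <script> tag; the response is executed as code
// and invokes our temporary global callback with the data.
function jsonp(endpoint, params, onData) {
  const cbName = 'jsonp_cb_' + Date.now();
  window[cbName] = (data) => {
    onData(data);
    delete window[cbName];
  };
  const script = document.createElement('script');
  script.src = buildJsonpUrl(endpoint, params, cbName);
  document.head.appendChild(script);
}
```

This only works when the remote server cooperates by offering a JSONP endpoint; it cannot be used against an arbitrary page.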

Have a look at http://import.io; they provide a couple of crawlers, connectors and extractors. I'm not sure how they get around bans, but they do somehow (we have been using their system for over a year now with no problems).

You could build a browser extension with artoo.
http://medialab.github.io/artoo/chrome/
That would allow you to get around the same-origin policy restrictions. It is all JavaScript and runs on the client side.

Related

Lightest single page scraper

There is a website that publishes some ever-changing data. There is no API, but I need to access that value programmatically in JS. This is a common problem for me.
I am writing a simple JS script (HTML/JS), a hacky API of sorts; HTML/JS on the client side is enough for any manipulation I want to do on the data.
But the problem is XMLHttpRequest and cross-server permissions. I do not want to introduce server elements just for this. Is there a workaround, considering I just want a typical HTML response?
CORS restrictions are enforced by browsers explicitly to prevent this kind of script (as well as more malicious XSS scripts); whoever is running that site is providing those resources (and paying to serve them up), so unless they're offering a public API it's not really fair to use that data in the way you seem to be trying to do. In this particular case it seems to be directly in conflict with the site's terms of service.

What is the rationale behind AJAX cross-domain security?

Given the simplicity of writing a server side proxy that fetches data across domains, I'm at a loss as to what the initial intention was in preventing client side AJAX from making calls across domains. I'm not asking for speculation, I'm looking for documentation from the language designers (or people close to them) for what they thought they were doing, other than simply creating a mild inconvenience for developers.
TIA
It's to prevent the browser from acting as a reverse proxy. Suppose you are browsing http://www.evil.com from a PC at your office, and suppose that office has an intranet with sensitive information at http://intranet.company.com which is only accessible from the local network.
If the cross-domain policy didn't exist, www.evil.com could make Ajax requests to http://intranet.company.com, using your browser as a reverse proxy, and send that information to www.evil.com with another Ajax request.
This is one of the reasons for the restriction, I guess.
If you're the author for myblog.com and you make an XHR to facebook.com, should the request send your facebook cookie credentials? No, that would mean that you could request users' private facebook information from your blog.
If you create a proxy service to do it, your proxy can't access the facebook cookies.
You may also be questioning why JSONP is OK. The reason is that you're loading a script you didn't write, so unless Facebook's script decides to send you the information from its JS code, you won't have access to it.
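For concreteness, a minimal sketch of what a cooperating server does to serve JSONP (the function name is illustrative): it wraps the JSON payload in a call to the callback the client named, so that loading the response as a script executes it.

```javascript
// Wrap data in a call to the client's named callback. Restricting the
// callback name to a plain identifier avoids reflected script injection.
function jsonpResponse(callbackName, data) {
  if (!/^[A-Za-z_$][\w$]*$/.test(callbackName)) {
    throw new Error('invalid callback name');
  }
  return `${callbackName}(${JSON.stringify(data)});`;
}
```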
The most important reason for this limit is a security concern: should a JSON request make the browser send and accept cookies or security credentials with a request to another domain? It is not a concern with a server-side proxy, because the proxy doesn't have direct access to the client environment. There was a proposal for safe, sanitized JSON-specific request methods, but it hasn't been implemented anywhere yet.
The difference between direct access and a proxy is cookies and other security-relevant identification/verification information, which are absolutely restricted to one origin.
With those, your browser can access sensitive data. Your proxy won't, as it does not know the user's login data.
Therefore, the proxy is only applicable to public data; as is CORS.
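A sketch of the public-data proxy idea: the same-origin server forwards requests without any user cookies, and (importantly, given the intranet example earlier in this thread) validates the target against an allowlist so the proxy itself cannot be pointed at internal hosts. Names are illustrative:

```javascript
// Decide whether a server-side proxy may forward to the requested URL.
// Returns the normalized URL, or null if the target is not allowed.
function proxyTarget(rawUrl, allowedHosts) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch (e) {
    return null; // malformed URL
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return null;
  if (!allowedHosts.has(url.hostname)) return null;
  return url.href;
}
```

The actual forwarding (an HTTP request made with no client credentials attached) is deliberately left out; the point is that the proxy only ever sees what any anonymous visitor could see.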
I know you are asking for experts' answers; I'm just a neophyte, and this is my opinion on why a server-side proxy is not a proper final solution:
Building a server-side proxy is not as easy as not building one at all.
It is not always possible, e.g. for a third-party JS widget. You are not going to ask all your publishers to declare a DNS record to integrate your widget, or to modify the document.domain of their pages, with all the collateral issues.
As I read in the book Third-Party JavaScript, "it requires loading an intermediary tunnel file before it can make cross-domain requests". At the least, that puts JSONP back in the game, with even trickier juggling.
It is not fully supported by IE8; also from the above book: "IE8 has a rather odd bug that prevents a top-level domain from communicating with its subdomain even when they both opt into a common domain namespace".
There are several security matters, as people have explained in other answers, and even more besides; you can check chapter 4.3.2, "Message exchange using subdomain proxies", of the above book.
And the most important for me:
It is a hack, like the JSONP solution. It's time for a standard, reliable, secure, clean and comfortable solution.
But, after re-reading your question, I think I still didn't answer it. So why this AJAX security? Again, I think the answer is:
Because you don't want any web page you visit to be able to make calls from your desktop to any computer or server on your office's intranet.

Are there any clear limits of JavaScript in relation to manipulating the browser and DOM?

I heard that getting access to the text of a Gmail email is very difficult, if not impossible (iframes).
Are there certain areas where JavaScript is not capable of doing something?
iframes won't prevent you from accessing content. JavaScript doesn't really have any limits with regard to manipulating the DOM. It can't, however, access files on your computer, or be used to upload files and such. It can't read content inside Flash files either. You don't really have any choice other than JS anyway; what kind of roadblocks are you anticipating?
Since you've chosen to use firefox-addon tag: no, getting access to Gmail text is unproblematic from an add-on. Doing the same from a regular website however isn't possible unless that website is hosted on mail.google.com. Reason is a security mechanism called same-origin policy. Websites are generally limited by the same-origin policy, add-ons are not.
Different browsers have different limitations that they impose on JavaScript as well as different APIs that they provide to JavaScript to grant it access to different forms of data. Until recently, it was not possible for JavaScript to access local files; however, there are now APIs in some browsers to do this.
There is a concept known as the "same origin" policy that is used to ensure that JavaScript running from the context of one domain or protocol cannot access data from another domain or protocol. However, browser add-ons or extensions can often exempt themselves from these restrictions. Also, some browsers provide APIs specifically for communication between different origins; however, these APIs generally require that this is done with the cooperation and permission of both origins.
From extension JS, you can access any part of Gmail. I wrote a browser extension that allowed me to forward a Gmail email to a Facebook contact. It also appeared in Facebook and allowed me to send Facebook message to Gmail contact. It was so that I didn't need to worry about adding contacts from Google to Facebook and vice versa.
That extension was easy. Once you get past the iframe piece, it is cake. Good luck!

JavaScript Code Signing

How can a user, using one of the major modern browsers, know for sure that he is running my unmodified javascript code even over an untrusted network?
Here is some more info about my situation:
I have a web application that deals with private information.
The login process is an implementation of a password-authenticated key agreement in JavaScript. Basically during login, a shared secret key is established between the client and the server. Once the user logs in all communication with the server is encrypted using the shared key. The system must be safe against ACTIVE man-in-the-middle attacks.
Assuming that my implementation is correct and the user is smart enough not to fall victim to a phishing attack there remains just one large hole in the system: an attacker can tamper with my application as it is being downloaded and inject code that steals the password. Basically the entire system relies on the fact that the user can trust the code running on his machine.
I want something similar to signed applets but I would prefer a pure javascript solution, if possible.
Maybe I am misunderstanding your problem, but my first thought is to use SSL. It is designed to ensure that you're talking to the server you think you are, and that no one has modified the content midstream. You do not even have to trust the network in this case, because of the nature of SSL.
The good thing about this approach is that you can fairly easily drop it into your existing web application. In most cases, you can basically configure your HTTP server to use SSL, and change your http:// requests to https://.
This is an old, open question, but the answers seemed not to do it justice.
https:// provides integrity, not true identification nor non-repudiation.
I direct you to http://www.matasano.com/articles/javascript-cryptography/
Don't do crypto in JS: a maliciously injected script can easily grab passwords or alter the library. SJCL is neat, but it offers a blatantly false sense of security (their own words, also quoted by the article above):
Unfortunately, this is not as great as in desktop applications because it is not feasible to completely protect against code injection, malicious servers and side-channel attacks.
The long-term issue is that JavaScript lacks:
Uniformly working const
The ability to make objects deeply const and not reprototypable.
Code-signing
// codesign: cert:(hex fingerprint) signature:(hex MAC)
Certs would be managed similar to CA certs. MAC would be used with appropriate sign/verify constructions.
Crypto, clipboard stuff are reasons to have JavaScript native plugins (signed, of course)
Getting JavaScript engines to all implement a standard is another thing, but it's doable, and it's absolutely necessary to end a large swath of malware.
You could have an external Javascript file which takes an MD5 hash of your login JS, and sends an Ajax request to the server to verify that it is correct and up-to-date. Use basic security or encryption practices here - public/private keys or some other method to be sure that the response came from your server.
You can then confidently display to the user that the client-side scripts are verified, and allow the login script to proceed.

Elevating a browsers javascript permission?

I'm working on an internal tool, and I recall there being some way to make your script prompt for elevated permissions and, if accepted, allow cross-site requests etc. As this is an internal tool, this may accomplish something I need.
Does anyone know how to do this?
To elaborate, I'm actually trying to read (in javascript) the contents of 3rd party tracking iframes injected into our page, to give some performance analytic information. These iframes are obviously from a different domain. If I were to proxy them, they would no longer give accurate information, so that option is out.
I don't think you can actually make your script ask for elevated permissions, but you can check for them and ask the user to change their browser security level if needed.
Mozilla.org has an interesting article about configurable security policies : http://www.mozilla.org/projects/security/components/ConfigPolicy.html
If you only need to bypass the same origin policy, you could use other tricks, e.g. a server-side proxy.
For Firefox/Mozilla, I think you mean netscape.security.PrivilegeManager.enablePrivilege() like in this other question.
You'll need the UniversalBrowserRead privilege to allow cross-site AJAX. (I've used it to build a set of local files for a simple stock-ticker page that grabs data from Yahoo's Finance service. The browser asks you (the user) the first time you run the browser on that page whether you want to grant the privilege. You don't have to grant it again until you restart the browser and read the same page again.)
