Lightest single page scraper


A website publishes some ever-changing data. There is no API, but I need to access that value programmatically in JS. This is a common problem for me.
I am writing a simple HTML/JS script, a hacky API of sorts; client-side HTML/JS is enough for any manipulation I want to do on the data.
But the problem is XMLHttpRequest and cross-origin permissions. I do not want to introduce server elements just for this. Is there a workaround, considering I just want a typical HTML response?

CORS restrictions are enforced by browsers explicitly to prevent this kind of script (as well as more malicious XSS scripts); whoever is running that site is providing those resources (and paying to serve them up), so unless they're offering a public API it's not really fair to use that data in the way you seem to be trying to do. In this particular case it seems to be directly in conflict with the site's terms of service.

Related

Safety concern about html DOM events

First of all, all of this might be a newbie stupid question.
I am developing a web application with Laravel but ended up using tons and tons of jQuery/JavaScript. I tried to think of all the possible security risks as I was developing, but the more I research this topic, the more concerned I am about my use of jQuery/JavaScript. It seems that dynamic content loading using jQuery/JavaScript is overall a very bad idea... But I don't want to rework everything, since that would add weeks of extra work on what is already developed. A quick example:
Let's say I have a method attached to my div like so
<div class="img-container" id="{{$file->id}}" onmouseover="showImageButtons({{$file->id}})"></div>
And then a part of Javascript
function showImageButtons(id) {
    console.log(id);
}
When I open this in the browser, I am able to alter the value of the parameter passed to the JavaScript function through the Chrome inspector, for example replacing the id with a string such as "some malicious code".
And it actually gets executed: I can see "some malicious code" being printed to the console.
What if I had an ajax call to server with that parameter? Would it pass?
Is there something I don't understand or is this seriously so easy to manipulate?

There are two basic aspects you need to consider regarding web security -
The connection between the browser and your server should be secure (i.e. https), that way, assuming you configured your server correctly, no one can intercept the client-server communication and you can share data through AJAX.
On the server side, you should treat all information coming from the client as hostile and validate or sanitize it. Anyone can send you anything through your webpage, even if you do input validation on the client side, because your JavaScript code is executed by the client and is therefore under the complete control of the attacker. While planting "malicious code" in the webpage in their own browser is not an actual attack, if an attacker gets you to store that malicious code on the server and send it to other clients, she can run her JavaScript in your other clients' browsers, and that is bad (look up "cross-site scripting / XSS").
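As a rough illustration of the second point, here is a minimal sketch of validating an id and escaping untrusted text before it is written into HTML. The names `parseFileId` and `escapeHtml` are hypothetical, and the sketch uses JavaScript (for example in a Node backend) rather than the asker's Laravel stack; the same idea applies in any server language.

```javascript
// Hypothetical helpers for illustration; the names are not from the question.

// Accept only a positive integer id (up to 10 digits); reject everything else.
function parseFileId(raw) {
  const text = String(raw);
  if (!/^[1-9][0-9]{0,9}$/.test(text)) {
    return null;
  }
  return Number(text);
}

// Escape the five characters that are significant in HTML text and
// quoted attribute values, so stored input is displayed instead of executed.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
```

With this in place, a tampered parameter such as "some malicious code" is simply rejected by `parseFileId`, whatever the client-side script was made to send.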

Safe way to execute JavaScript code - on same server or another server?

I have a website builder which allows users to drag and drop HTML blocks (img, div, etc...) into the page. They can save it. Once they save it, they can view the page.
I also allow custom code like JavaScript. Would it be safe to have their page be displayed on another server on a subdomain (mypage.example.com) but still fetched from the same database as the main server, or does it not matter to put it on the same server as the main server?
As far as I know, they cannot execute any PHP code since I will be using echo to display the page content.
Thanks for help!

That depends on your setup. If you allow them to run custom JavaScript, they can probably steal session tokens from other users, which could be used to steal other accounts. I would recommend reading about XSS (Cross-Site-Scripting).
In short: XSS is a vulnerability that lets an attacker inject code into a site, which will then run on other people's computers.
It wouldn't make sense to give you a strict tutorial on how to do this at this point, because every system is different and needs different configuration to be attack-resistant.
Letting users put code somewhere is always a risk!

there is no need for another server, but you do need another domain to prevent Cross Site Scripting attacks on your main page. and no, a subdomain may not be sufficient, put it on another domain altogether to be on the safe side. (luckily domains can be acquired for free if you're ok with a .tk domain)
Would it be safe to have their page be displayed on another server on a subdomain
even a subdomain could be dangerous, just put it on another domain altogether, and you'll be safe.
or does it not matter to put it on the same server as the main server?
you can have it on the same server. btw, did you know that with shared webhosting services (like GoDaddy, hostgator, etc) there's thousands of websites sharing a single physical server?
also, DO NOT listen to the people saying you need to sanitize or filter the HTML, that is NOT true. there is no need to filter out anything, in my opinion, that is corruption of data. don't do that to your users, there's no need to do it. (at least i can't think of any)
As far as I know, they cannot execute any PHP code since I will be using echo to display the page content.
correct. if you were doing include("file"); or eval($code); then they could execute server-sided code, but as long as you're just doing echo $code;, they won't be able to execute server-side code, that's not a security issue.
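One reason a subdomain may not be enough is cookie scoping. The sketch below is a deliberately simplified model of cookie domain matching (it ignores paths, the Secure and SameSite attributes, and public-suffix rules); `cookieIsSentTo` and the cookie object shape are hypothetical names used only for illustration.

```javascript
// Simplified model of cookie domain matching.
// cookie.domain    : value of the Domain attribute, or undefined if absent
// cookie.setByHost : host that set the cookie
function cookieIsSentTo(cookie, requestHost) {
  const host = requestHost.toLowerCase();
  if (!cookie.domain) {
    // Host-only cookie: returned only to the exact host that set it.
    return host === cookie.setByHost.toLowerCase();
  }
  // Domain cookie: returned to that domain and all of its subdomains.
  const domain = cookie.domain.replace(/^\./, '').toLowerCase();
  return host === domain || host.endsWith('.' + domain);
}
```

Under this model, a session cookie set with `Domain=example.com` is also visible on mypage.example.com, where user-supplied JavaScript runs, while a completely separate domain never receives it.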

Relax Same Origin Rule for Javascript Widget (deprecated google finance backfill data)

Background:
I have a small javascript-powered widget that uses data from another origin. I tested it successfully in my local IDE using external (non-origin) data from:
https://www.google.com/finance/getprices?q=.NSEI&x=NSE&i=600&p=1d&f=d,o,h,l,c,v
When I uploaded this widget to my wordpress site I got the error:
SEC7120: Origin http://www.my-wordpress-site.com not found in Access-Control-Allow-Origin header.
Which in turn made my calls to the data return something like this:
SCRIPT5007: SCRIPT5007: Unable to get property 'split' of undefined or null reference
Disclaimer:
I know that google finance has long since been deprecated, but my understanding is that the servers were left running. This means that the data is still there. Google clearly states that using the data for consumption violates the terms of usage for this data. What I am building is not an API, or anything consumption related. It's merely a front-end widget serving strictly as aesthetic ornamentation. I just want it to make the site look cooler by plotting some finance data.
Nonetheless, I'm not sure whether Google is refusing to let me fetch the data because it doesn't recognize my website. Maybe I have misunderstood the terms of use; I don't think I have, but who knows. I'm hoping it's a small protocol type of fix where I add a few lines and everything is happily ever after. Alternatively, if it is indeed the other way around, I don't have any control over what domains Google trusts, and I don't imagine contacting Google and saying "hey Google, don't leave my website out in the cold" would work.
Question: How do I relax the same origin rule? A few similar questions have been asked on stack overflow for c#, but I still wonder how to do this for my wordpress javascript widget. Can javascript alone do it?
I see a lot of examples having:
header("Access-Control-Allow-Origin: http://mozilla.com");
I don't know what language that is. I'm guessing PHP or c#. I don't know anything about those languages. I just want to design my widgets without having to learn a billion things about back-end protocols. I'm prepared to put in some effort though.
Clarification
I have made widgets this way (pointing an iframe to my widget's html located on my cpanel server where wordpress is in wp-content/uploads/2018/03) before and had no trouble at all. This time around I'm getting the origin errors as noted above in my post. The only difference is now I'm using some external data from a google server. Maybe it's google that doesn't recognize my site? The widget worked perfectly offline in my IDE. I'm not sure how to proceed and whether or not word-press has anything to bring to bear for these things to help widget designers like myself out, or if it's another matter altogether.
Error result confirmed in the following browsers:
edge
chrome

The issue is fairly straightforward. Browsers enforce the Same Origin Policy, which prevents sites from accessing data fetched from a foreign domain. CORS is a protocol that servers can use to instruct browsers to allow foreign domains to read their data.
Public APIs that are intended to be used in a browser context will generally use CORS to make this possible. For whatever reason, Google does not seem to be using CORS to allow public access to this API. In particular, their response does not include the appropriate Access-Control-Allow-Origin header (part of the CORS protocol) to allow Javascript from your site to read the data. There is really nothing you can do about that.
The standard workaround is to proxy the data. That is, instead of having your code use the API directly, it would instead hit a URL on your server, which would then fetch the requested data from Google and return it. The server will be able to read the data because the Same Origin Policy (being a browser concern) won't apply. And your Javascript will be able to read the result since the request was made to the same domain.
If there are ways to get around the Same Origin Policy in the browser, I'm not familiar with them.
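The proxy approach described above could be sketched roughly as follows for a Node server. This is only an illustration under assumptions: `UPSTREAM_URL` is a hypothetical placeholder endpoint, the function names are made up, the parameter names are taken from the URL in the question, and a global `fetch` is assumed (Node 18 or later).

```javascript
// Hypothetical upstream endpoint; replace with a data source you are allowed to use.
const UPSTREAM_URL = 'https://data.example.com/prices';

// Only forward the query parameters the widget actually needs.
const ALLOWED_PARAMS = ['q', 'x', 'i', 'p', 'f'];

// Translate the widget's same-origin request URL into the upstream URL.
function buildUpstreamUrl(requestUrl) {
  const incoming = new URL(requestUrl, 'http://localhost');
  const upstream = new URL(UPSTREAM_URL);
  for (const name of ALLOWED_PARAMS) {
    const value = incoming.searchParams.get(name);
    if (value !== null) {
      upstream.searchParams.set(name, value);
    }
  }
  return upstream.toString();
}

// Request handler: the server fetches the data (no Same Origin Policy applies
// here) and relays it to the browser from the widget's own origin.
async function handleProxyRequest(req, res, fetchImpl = fetch) {
  try {
    const upstreamRes = await fetchImpl(buildUpstreamUrl(req.url));
    const body = await upstreamRes.text();
    res.writeHead(upstreamRes.status, { 'Content-Type': 'text/plain; charset=utf-8' });
    res.end(body);
  } catch (err) {
    res.writeHead(502, { 'Content-Type': 'text/plain; charset=utf-8' });
    res.end('Upstream request failed');
  }
}
```

The handler can be mounted with `require('http').createServer(handleProxyRequest).listen(3000)`, after which the widget requests a path on its own server instead of the foreign one.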

Browser-based client-side scraping

I wonder if it's possible to scrape an external (cross-domain) page through the user's IP?
For a shopping comparison site, I need to scrape pages of an e-com site, but several requests from the server would get me banned, so I'm looking for ways to do client-side scraping, that is, request pages from the user's IP and send them to the server for processing.

No, you won't be able to use the browser of your clients to scrape content from other websites using JavaScript because of a security measure called Same-origin policy.
There should be no way to circumvent this policy and that's for a good reason. Imagine you could instruct the browser of your visitors to do anything on any website. That's not something you want to happen automatically.
However, you could create a browser extension to do that. JavaScript browser extensions can be equipped with more privileges than regular JavaScript.
Adobe Flash has similar security features but I guess you could use Java (not JavaScript) to create a web-scraper that uses your user's IP address. Then again, you probably don't want to do that as Java plugins are considered insecure (and slow to load!) and not all users will even have it installed.
So now back to your problem:
I need to scrape pages of an e-com site but several requests from the server would get me banned.
If the owner of that website doesn't want you to use his service in that way, you probably shouldn't do it. Otherwise you would risk legal implications (look here for details).
If you are on the "dark side of the law" and don't care whether that's illegal or not, you could use something like http://luminati.io/ to use the IP addresses of real people.

Basically browsers are made to avoid doing this…
The solution everyone thinks about first:
jQuery/JavaScript: accessing contents of an iframe
But it will not work in most cases with "recent" browsers (<10 years old)
Alternatives are:
Use the official APIs of the server (if any)
Check whether the server provides a JSONP service (good luck)
If you are on the same domain, try cross-site scripting (if possible, not very ethical)
Use a trusted relay or proxy (but this will still use your own IP)
Pretend you are a Google web crawler (why not, but not very reliable and no guarantees about it)
Use a hack to set up the relay / proxy on the client itself; I can think of Java or possibly Flash (will not work on most mobile devices, slow, and Flash has its own cross-site limitations too)
Ask Google or another search engine for the content (you might then have a problem with the search engine if you abuse it…)
Do this job yourself and cache the answer, in order to take load off their server and decrease the risk of being banned.
Index the site yourself (your own web crawler), then use your own index. (depends on how often the source changes)
http://www.quora.com/How-can-I-build-a-web-crawler-from-scratch
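The "cache the answer" alternative from the list above might look something like this minimal sketch. `createCachedFetcher` and its parameters are hypothetical names; `fetchPage` stands for whatever function actually retrieves the page and returns a promise.

```javascript
// Wrap a page-fetching function so that each URL is requested at most once
// per ttlMs milliseconds; repeated calls within that window reuse the result.
function createCachedFetcher(fetchPage, ttlMs, now = Date.now) {
  const cache = new Map(); // url -> { promise, expiresAt }

  return function getPage(url) {
    const entry = cache.get(url);
    if (entry && entry.expiresAt > now()) {
      return entry.promise; // still fresh: no new request to the remote site
    }
    const promise = Promise.resolve(fetchPage(url));
    cache.set(url, { promise, expiresAt: now() + ttlMs });
    // Do not keep failed requests in the cache.
    promise.catch(() => {
      const current = cache.get(url);
      if (current && current.promise === promise) {
        cache.delete(url);
      }
    });
    return promise;
  };
}
```

For example, `createCachedFetcher(fetchPage, 10 * 60 * 1000)` would contact the remote site at most once every ten minutes per URL, however many visitors ask for it.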
[EDIT]
One more solution I can think of is going through a YQL service; in this manner it is a bit like using a search engine / a public proxy as a bridge to retrieve the information for you.
Here is a simple example of doing so. In short, you get cross-domain GET requests.

Have a look at http://import.io, they provide a couple of crawlers, connectors and extractors. I'm not quite sure how they get around bans, but they do somehow (we have been using their system for over a year now with no problems).

You could build a browser extension with artoo.
http://medialab.github.io/artoo/chrome/
That would allow you to get around the same-origin policy restrictions. It is all JavaScript and on the client side.

Using a client's IP to get content

I'm a bit embarrassed here because I am trying to get content remotely, by using the client's browser and not the server. But I have specifications which make it look impossible to me, I literally spent all day on it with no success.
The data I need to fetch is on a distant server.
I don't own this server (I can't do any modification to it).
It's a string, and I need to get it and pass it to PHP.
It must be the client's (user browsing the website) browser that actually gets the data (it needs to be its IP, and not the server's).
And, with the cross-domain policy, I don't seem to be able to get around it. I already knew about it, but still tried a simple Ajax query, which failed. Then I thought 'why not use iFrames', but the same limitation seems to apply to them too. I then read about using YQL (http://developer.yahoo.com/yql/), but I noticed the server I was trying to reach blocked YQL's user-agent, making it impossible to use this technique.
So, that's all I could think about or find. But I can't believe it's not possible to achieve such a thing, that doesn't even look hard...
Oh, and my Javascript knowledge is very basic, this mustn't help either.

This is one reason that the same-origin policy exists. You're trying to have your webpage access data on a different server, without the user knowing, and without having "permission" from the other server to do so.
Without establishing a two-way trust system (ie modifying the 'other' server), I believe this is not possible.
Even with new xhr and crossdomain support, two-way trust is still required for the communication to work.
You could consider a fat-client approach, or try @selbie's suggestion and require manual user interaction.

The same origin policy prevents document or script loaded from one origin from getting or setting properties of a document from a different origin.
-- From http://www.mozilla.org/projects/security/components/same-origin.html
Now if you wish to do some hackery to get it... visit this site
Note: I have never tried any of the methods on the aforementioned site and cannot guarantee their success
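For reference, the "origin" in the policy quoted above is the combination of scheme, host, and port. A minimal sketch of that comparison, using the standard `URL` class (the function name `sameOrigin` is just illustrative):

```javascript
// Two URLs share an origin only if scheme, host, and port all match.
// URL.port is the empty string when the scheme's default port is used.
function sameOrigin(a, b) {
  const first = new URL(a);
  const second = new URL(b);
  return (
    first.protocol === second.protocol &&
    first.hostname === second.hostname &&
    first.port === second.port
  );
}
```

Under this check, a page on your own site and the remote server holding the string you want are different origins, which is why the plain Ajax query fails.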

I can only see a really ugly solution: iFrames. This article contains some good information to start with.

You could do it with a Flash application:
flash with a crossdomain.xml file (won't help though since you don't control the other server)
On new browsers there is CORS - requires Access-Control-Allow-Origin header set on the server side.
You can also try to use JSONP (but I think that won't work since you don't own the other server).
I think you need to bite the bullet and find some other way to get the content (on the server side for example).
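To make the CORS point above concrete, here is a rough model of the check a browser applies to a simple, non-credentialed cross-origin request. It is a simplification for illustration (preflight requests and credentials are not modeled), and `corsAllowsRead` is a hypothetical name.

```javascript
// Decide whether a page at pageOrigin may read a cross-origin response,
// given the value of its Access-Control-Allow-Origin header
// (null or undefined if the header is absent).
function corsAllowsRead(pageOrigin, allowOriginHeader) {
  if (allowOriginHeader === null || allowOriginHeader === undefined) {
    return false; // no header: the browser blocks the read
  }
  const value = allowOriginHeader.trim();
  return value === '*' || value === pageOrigin;
}
```

Since the header is set by the remote server, a server you do not control decides the outcome, which is why the answer falls back to fetching the content on the server side.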
