I'm working on a project that scrapes data (odds) from different sites (bookmakers).
Because of their tracking systems, I sometimes need to change my IP in order to be able to scrape their sites again.
I know there are services that can help with this, such as APIs that handle the requests for me (hiding me from the sites).
The problem with those services is that most of them don't offer Italian IPs, and because of the Italian regulations on online gambling we can only bet with Italian bookmakers, which can only be accessed from Italian IPs.
However, I know there are services that can provide plenty of Italian IPs, so my idea was to solve the problem by:
Creating my own proxy server to route my requests through (a server that changes IP when I need it to).
Implementing a function that changes the IP of the requests made by my scraping tool (the server running the scraping functions).
Honestly, I don't know whether these ideas are a valid way to solve my problem, so my question is:
How can I implement these features in my code?
And if these options are not valid, is there a solution or a service I can use (even if it requires a payment or a subscription)?
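For reference, here is roughly what I had in mind for idea #2: a sketch assuming the `selenium-webdriver` package and a hypothetical list of Italian proxies from a paid provider (all hostnames and URLs below are placeholders). As far as I know, Selenium can't swap the proxy of a running browser session, so the idea is to start a fresh driver with the next proxy whenever I get blocked:

```javascript
// Rough sketch: rotate proxies by starting a new WebDriver per proxy.
// Proxy addresses and the target URL are placeholders, not real endpoints.
const { Builder } = require('selenium-webdriver');
const proxy = require('selenium-webdriver/proxy');

const proxies = ['it1.example-proxy.com:8080', 'it2.example-proxy.com:8080'];
let current = 0;

async function newDriverWithNextProxy() {
  const address = proxies[current % proxies.length];
  current += 1;
  return new Builder()
    .forBrowser('chrome')
    .setProxy(proxy.manual({ http: address, https: address }))
    .build();
}

(async () => {
  let driver = await newDriverWithNextProxy();
  await driver.get('https://example-bookmaker.it');
  // ...scrape odds...
  await driver.quit();

  // When the site starts blocking this IP, rotate to the next one:
  driver = await newDriverWithNextProxy();
  await driver.get('https://example-bookmaker.it');
  await driver.quit();
})();
```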
Thanks in advance to everyone who can give me a hand!
P.S.: I'm still improving my English. I hope it was as correct as possible :)
P.S.2: I'm using Node.js with the Selenium JavaScript library.
Related
I wonder if it's possible to scrape an external (cross-domain) page through the user's IP?
For a shopping comparison site, I need to scrape pages of an e-commerce site, but several requests from the server would get me banned, so I'm looking for ways to do client-side scraping, that is, request pages from the user's IP and send them to the server for processing.
No, you won't be able to use your clients' browsers to scrape content from other websites using JavaScript, because of a security measure called the same-origin policy.
There should be no way to circumvent this policy, and that's for a good reason. Imagine you could instruct your visitors' browsers to do anything on any website. That's not something you want to happen automatically.
However, you could create a browser extension to do that. JavaScript browser extensions can be equipped with more privileges than regular JavaScript.
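As a sketch of what that looks like: a Chrome extension (Manifest V2 era, matching the age of this answer) with broad host permissions can fetch cross-origin pages from its background script. The URLs below are placeholders and the manifest shown in the comment is the minimum you'd need:

```javascript
// background.js, assuming a manifest.json with:
//   "permissions": ["http://*/*", "https://*/*"],
//   "background": { "scripts": ["background.js"] }
// Host permissions let the background page make cross-origin requests.
fetch('https://shop.example.com/product/123')            // placeholder target page
  .then((response) => response.text())
  .then((html) =>
    // Ship the raw HTML to your own server for parsing (placeholder endpoint).
    fetch('https://collector.example.com/scrape', {
      method: 'POST',
      headers: { 'Content-Type': 'text/html' },
      body: html,
    })
  )
  .catch((err) => console.error('scrape failed', err));
```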
Adobe Flash has similar security features, but I guess you could use Java (not JavaScript) to create a web scraper that uses your users' IP addresses. Then again, you probably don't want to do that, as Java plugins are considered insecure (and slow to load!) and not all users will even have them installed.
So now back to your problem:
I need to scrape pages of an e-com site but several requests from the server would get me banned.
If the owner of that website doesn't want you to use his service in that way, you probably shouldn't do it. Otherwise you would risk legal implications (look here for details).
If you are on the "dark side of the law" and don't care whether that's illegal or not, you could use something like http://luminati.io/ to use IP addresses of real people.
Basically browsers are made to avoid doing this…
The solution everyone thinks about first:
jQuery/JavaScript: accessing contents of an iframe
But it will not work in most cases with any "recent" browser (less than 10 years old).
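A quick illustration of why this fails: once the framed page comes from a different origin, the browser refuses to hand you its DOM (the URL below is just a placeholder):

```javascript
// Cross-origin iframe access is blocked by the same-origin policy.
const frame = document.createElement('iframe');
frame.src = 'https://other-site.example.com/';
document.body.appendChild(frame);

frame.onload = () => {
  try {
    // For a cross-origin frame this is null or throws, never the real document.
    console.log(frame.contentDocument.body.innerHTML);
  } catch (e) {
    console.error('Blocked by the same-origin policy:', e.message);
  }
};
```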
Alternatives are:
Using the official APIs of the server (if any)
Checking whether the server provides a JSONP service (good luck; see the sketch after this list)
Being on the same domain and trying cross-site scripting (if possible, not very ethical)
Using a trusted relay or proxy (but this will still use your own IP)
Pretending you are a Google web crawler (why not, but not very reliable and no guarantees)
Using a hack to set up the relay/proxy on the client itself; I can think of Java or possibly Flash (will not work on most mobile devices, is slow, and Flash has its own cross-site limitations too)
Asking Google or another search engine for the content (you might then have a problem with the search engine if you abuse it…)
Just doing this job yourself and caching the answers, in order to take the load off their server and decrease the risk of being banned
Indexing the site yourself (with your own web crawler), then using your own index (depends on how often the source changes)
http://www.quora.com/How-can-I-build-a-web-crawler-from-scratch
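As mentioned in the list above, JSONP is only an option if the target server deliberately offers it; the endpoint and callback parameter below are made up purely for illustration:

```javascript
// Hypothetical JSONP request: the server must wrap its JSON in the callback we name.
function handleProducts(data) {
  console.log('Received cross-domain data:', data);
}

const script = document.createElement('script');
script.src = 'https://api.example-shop.com/products?callback=handleProducts';
document.body.appendChild(script);
// Expected response body: handleProducts({ ... });
```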
[EDIT]
One more solution I can think of is going through a YQL service; this way it is a bit like using a search engine / a public proxy as a bridge to retrieve the information for you.
Here is a simple example of doing so. In short, you get cross-domain GET requests.
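A minimal sketch of what that looked like, assuming YQL's public endpoint (which has since been retired, so treat this as illustrative only):

```javascript
// YQL as a cross-domain bridge: Yahoo fetched the page and answered our GET request.
const yql = 'select * from html where url="http://example.com" and xpath="//title"';
const endpoint = 'https://query.yahooapis.com/v1/public/yql'
  + '?q=' + encodeURIComponent(yql)
  + '&format=json';

fetch(endpoint)
  .then((res) => res.json())
  .then((data) => console.log(data.query.results))
  .catch((err) => console.error(err));
```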
Have a look at http://import.io; they provide a couple of crawlers, connectors and extractors. I'm not quite sure how they get around bans, but they do somehow (we have been using their system for over a year now with no problems).
You could build a browser extension with artoo.
http://medialab.github.io/artoo/chrome/
That would allow you to get around the same-origin policy restrictions. It is all JavaScript and runs on the client side.
Some odd requests have been appearing in our logs since ~October 20, 2014. They've increased to a few dozen a day, so while not a big problem, it's still interesting to find out the reason.
Earlier ones:
REQUEST[/en/undefinedsf_main.jsp?clientVersion=null&dlsource=null&CTID=null&userId=userIdFail&statsReporter=false] REFERER[http://colnect.com/en/coins]
REQUEST[/fr/undefined/GoogleExtension/deals.html?url=http://colnect.com&subid=STERKLY&appName=HypeNet&pos=2&frameId=buaovbluurbavptkwyaybzjrqweypsbavwrviv] REFERER[http://colnect.com/fr]
REQUEST[/br/stamps/undefined49507173c45043eba6dfb9da540e52de&chnl=slmbBRex&evt=DailyPing&prd=vbates&seg=1&ext=1&rnd=65983fb77b62e25cc2a8ef15af18273d] REFERER[http://colnect.com/br/stamps/countries]
Some current ones:
REQ[/ru/collectors/collector/undefined] REF[http://colnect.com/ru/collectors/collector/jokitsos]
REQ[/th/collectors/collector/undefined] REF[http://colnect.com/th/collectors/collector/VRABEC]
REQUEST[/en/account/undefined] REFERER[http://colnect.com/en/account/request_password]
REQUEST[/pt/stamps/undefined] REFERER[http://colnect.com/pt/stamps/years]
Some requests are made by logged-in members and some are not.
I'd guess some JavaScript in their browsers is trying to build a URL from an uninitialized variable, hence the "undefined".
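To illustrate what I mean, this kind of bug produces exactly these URLs (the variable name is just an example):

```javascript
// A URL built from a variable that was never assigned ends up containing "undefined".
let collectorId;                                   // never set, e.g. a failed lookup
const url = '/en/collectors/collector/' + collectorId;
console.log(url);                                  // "/en/collectors/collector/undefined"
```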
Reasons may be similar to Odd requests to non-existing pages that all include "6_S3_" (perhaps malware), but I'm wondering if this might have a different cause.
I doubt it's a bug in our client-side JavaScript, as that would generate far more than a few dozen such requests a day from about a million daily page views.
Any ideas? Is it worth pursuing?
This is a big concern, but it's not coming from you.
These are JavaScript injection attacks (client-machine malware) using a self-signed root certificate.
Specifically, sf_main.html and deals.html have been linked to Superfish, which has been shipping on Lenovo machines recently. As Lenovo has been pushing its new lines of PCs, reports of the attacks have blown up recently.
These man-in-the-middle attacks start by hijacking the client's requests and then injecting HTML and JavaScript.
The reason there are so many undefined symbols is that Superfish, true to its name, is fishing for plugins, extensions, and libraries it can take advantage of, using their expected names, tokens, and paths. This is brute-force XSS.
Oh, no, what can I do??
Little. Not much.
As the requests are being hijacked on the client machine via HTTP request hijacking, you won't be able to tell the difference. You could try to "fish" for certain kinds of hostile "indicators", but then you are doing the work of anti-malware.
Lenovo claims that
SuperFish has completely disabled server side interactions (since January) on all Lenovo products so that the software product is no longer active, effectively disabling SuperFish for all products in the market
While I trust the sincerity of China-based Lenovo, which has serious market interests in the Western world, I wouldn't trust the word of the China-based malware company Superfish.
These attacks are less of a problem for you than for your customers
Unless you work for a big bank or a popular social networking site, it's highly unlikely that malware like Superfish has targeted you specifically. Your customers' bank and social network accounts are at risk, but not because of anything you did or can do to stop it.
As always, the cure for client-side fishing attacks is good client-side protection.
Any ideas?
There seem to be a few different options here:
There are mistakes in your code causing incorrect URLs to be generated
There are (search) bots trying to parse your JavaScript and failing to do it correctly
(A client-side extension is stirring up trouble)
To differentiate between the two you would need to set up more specific logging. For example, adding the user agent to any log lines containing the string undefined will answer this question. If it's your code causing the problem, you would also want to log the referrer header, as it will show on which page the faulty URLs are being generated.
Another way to identify the issue: if you have an analytics solution running on your site, such as Google Analytics, you can quite easily limit your report to only URLs containing undefined. If there are no such requests, you can conclude that it has to be a bot (as a bot wouldn't cause the client-side analytics code to run); otherwise the report provides all the information needed to identify where the problem originates.
Lastly, it might be a good idea to include a JavaScript error-logging solution (in its simplest form, a window.onerror handler with an AJAX request to \log.something). If your code is generating undefineds, it's quite likely that some errors are being triggered as well.
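As a sketch of that simplest form (the /log endpoint is a placeholder you would implement yourself):

```javascript
// Report uncaught JavaScript errors back to the server.
window.onerror = function (message, source, lineno, colno, error) {
  const xhr = new XMLHttpRequest();
  xhr.open('POST', '/log', true);
  xhr.setRequestHeader('Content-Type', 'application/json');
  xhr.send(JSON.stringify({
    message: message,
    source: source,
    line: lineno,
    column: colno,
    stack: error && error.stack,
    page: location.href,
    userAgent: navigator.userAgent,
  }));
  return false; // keep the default console reporting as well
};
```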
Is it worth pursuing?
If users are actually being served invalid pages, then yes, this is definitely something to be investigated.
After searching around a lot, I still can't find the answer to this question: Is it possible to login to a 3rd party website using GWT when the website is password protected?
I'm asking this because I would like to write a GWT application that combines information from different websites (my news account, my forum accounts, etc.), like a kind of dashboard.
I have no problem doing it for non-password protected websites.
But for websites where you have to log in and handle cookies, I'm just lost.
I found this very interesting tutorial that explains how to do it for Java: http://www.mkyong.com/java/how-to-automate-login-a-website-java-example/
But I can't figure out how to do it with GWT.
Any help will be greatly appreciated.
I thought about using RequestBuilder to send the authentication requests, and follow the tutorial you pointed to.
But after more consideration, I guess the same-origin policy is going to prevent you from making AJAX calls to another website from the client side in some browsers. You also have to make sure that your domain is served over HTTPS before calling a remote HTTPS URL. And you might run into other problems if you follow that path...
Conclusion: I suggest making those calls from the server using plain old Java code.
My question is a Node.js newbie question. I would like to somehow detect whether a user is accessing my network (Wi-Fi), and I would like to indicate the user's presence on my network.
How would I go about identifying the user's presence? I'm completely clueless, as I was unable to find any library that could help me detect this in Node. Is this even possible? If so, could you point me to the appropriate libraries and the approach that I need to use?
If you run a router like a Linksys, etc. that can support DD-WRT, you can flash that firmware and then use either SNMP or web page scraping to get the connected clients. There are Node.js libraries for SNMP, and there are a ton of examples of web page scraping with Node.js.
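For the SNMP route, here is a rough sketch with the `net-snmp` npm package, assuming a DD-WRT router at 192.168.1.1 with the default "public" community (both are assumptions). It walks the standard ARP table, which lists the devices the router has recently seen:

```javascript
// List devices known to the router by walking MIB-2 ipNetToMediaPhysAddress.
const snmp = require('net-snmp');

const session = snmp.createSession('192.168.1.1', 'public'); // assumed address/community
const arpTableOid = '1.3.6.1.2.1.4.22.1.2';

session.subtree(arpTableOid, (varbinds) => {
  varbinds.forEach((vb) => {
    if (!snmp.isVarbindError(vb)) {
      // The last four components of the OID are the client's IP address.
      const ip = vb.oid.split('.').slice(-4).join('.');
      console.log('Seen on the network:', ip);
    }
  });
}, (error) => {
  if (error) console.error(error.toString());
  session.close();
});
```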
Alternatively, you could get really complicated and add in RADIUS authentication to your DD-WRT installation and watch for an authentication event.
Another option: you could send ICMP (ping) packets to every potential address that your router would acknowledge, e.g. if it's 192.168.1.1 on a 255.255.255.0 subnet, you would have 253 addresses to ping. Of course, the connecting user's machine would have to have ICMP responses turned on (which they typically do by default).
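A rough sketch of that ping sweep, shelling out to the system ping command since Node has no built-in ICMP support (the flags shown are for Linux; the subnet is an assumption, and real code should throttle rather than launching all pings at once):

```javascript
// Ping every host on an assumed 192.168.1.0/24 network and report responders.
const { exec } = require('child_process');

const base = '192.168.1.';
for (let host = 2; host <= 254; host++) {
  const ip = base + host;
  exec(`ping -c 1 -W 1 ${ip}`, (error) => {
    if (!error) {
      console.log(`${ip} responded: someone is on the network`);
    }
  });
}
```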
I hope one of these suggestions helps.
I have recently taken a position at a large corporation as a Web Developer for one of the company's divisions. For my first task I have been asked to create a web form that submits data to a database and then outputs the ID# of that data to the user for reference later. Easy, right? Unfortunately not. Because this is a large company that has been around for a long time, their systems are relatively antiquated and none of their servers support server-side technologies (PHP, ASP, etc.), and since they are such a large company, Corporate IT is pretty much a black hole and there isn't any hope of actually getting such tech implemented.
SO! To my question... is there ANY way to do this without server-side code? To me the answer is 'no', and I have spent the last week researching on sites like this and others without finding any miraculous workarounds. Really, all I have at my disposal are things I can implement without involving IT, i.e. things I can just upload to a web server.
Also, as a note: the web server it is on is supposedly an IBM Web Server (IHS), the database I am supposed to be connecting to is an MS Access database, and the company restricts us to using IE for any web access. As this form is on an internal company INTRAnet site, IE is the only browser it will be accessed from.
I know this is a ridiculous situation but unfortunately that is what I am stuck with. Any ideas???
You must have something that takes the form data and transforms it for insertion into the database.
There are no JavaScript libraries that will do this from the browser directly to the database (security issues in traversing the network, cross-domain issues, etc.).
Something will be serving up the web pages - surely this can be the basis of the server side coding you need.
Seeing as you are using IBM HTTP Server (gleaned from comments on your question), there are server side scripting technologies available to you.
Maybe you could create a Web Database with Access Services?
Also as a note: The database I am supposed to be connecting to is a MS Access database and the company restricts us to using IE for any web access. As this form is on an internal company INTRAnet site IE is the only browser it will be accessed from.
That's easy. Use a dirty ActiveX hack to talk to MS Access directly from the browser.
That's going to be a nightmare to code, but it'll work.
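A hedged sketch of what that can look like in JScript running inside IE; the UNC path, table, and column names are made up, and the browser's security settings have to allow the ActiveX control in your intranet zone:

```javascript
// Talk to an Access .mdb directly from IE via ADO (the classic ActiveX hack).
function saveRecord(name, comment) {
  var conn = new ActiveXObject('ADODB.Connection');
  conn.Open('Provider=Microsoft.Jet.OLEDB.4.0;' +
            'Data Source=\\\\fileserver\\share\\forms.mdb;');

  var rs = new ActiveXObject('ADODB.Recordset');
  rs.Open('SELECT * FROM Submissions', conn, 1, 3); // adOpenKeyset, adLockOptimistic
  rs.AddNew();
  rs.Fields('Name').Value = name;
  rs.Fields('Comment').Value = comment;
  rs.Update();

  var id = rs.Fields('ID').Value; // autonumber assigned by Access
  rs.Close();
  conn.Close();
  return id;                      // the reference number to show the user
}
```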
You didn't say which version of Access you're using; this page has information on how to set this up for Access 2003, click on "data access pages".
It's probably better in the long run if you don't solve this problem. Management frustration with IT may help you effect change, or at least get you permission to set up a local web server so you can demonstrate what's possible with the right support.