How can I force crawlers to run the JavaScript on my pages?

I want to implement an anti-crawler mechanism to protect the data on my site. After reading many related topics on SO, I am going to focus on "forcing JavaScript execution".
My plan is:
Implement a special function F (e.g. an MD5-like hash) in a JavaScript file C
Input: the current user's cookie string (the cookie changes with each response)
Output: a verification string V
Send V along with the other parameters to the sensitive backend interface when requesting valuable data
The backend server has a validation function T that checks whether V is correct
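As a rough illustration of the intended flow, here is a simplified sketch in which F is just a plain SHA-256 of the cookie rather than an obfuscated function, and the endpoint name is made up:

// client side: F computes V from the current cookie string
async function computeV(cookieString) {
  const bytes = new TextEncoder().encode(cookieString);
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  return Array.from(new Uint8Array(digest))
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
}

// send V along with the real parameters; the backend's T recomputes the
// same hash from the cookie it issued and compares
async function fetchValuableData(params) {
  const v = await computeV(document.cookie);
  const qs = new URLSearchParams({ ...params, v });
  return fetch('/api/data?' + qs).then(r => r.json());
}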
The difficult part is how to obfuscate F. If crawler writers can easily understand F, they can compute V without running C and bypass the JavaScript entirely.
There are indeed many JS obfuscators, but I plan to achieve the goal with a generator function G that does not itself appear in C.
G(K) generates F, where K is a large integer. F should be complicated enough that crawler writers need many hours to understand it. Given another K', G(K') = F' should look, to some extent, like a brand-new function, so crawler writers again need hours to crack it.
A possible implementation of G would map the integer to a digital circuit of many connected logic gates (like a maze) and emit that circuit as JavaScript, producing F. Since F must be run as JavaScript, crawlers would have to run something like PhantomJS. Furthermore, I can insert sleeps in F to slow crawlers down, while normal users would hardly notice a 50-100 ms delay.
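A toy sketch of the G(K) idea, assuming a small linear-congruential PRNG seeded by K that emits a chain of bit operations as the body of F (purely illustrative; a real G would need far more structure to take anyone hours to unravel):

// G(K): deterministically generate a hashing function F from the integer K
function G(K) {
  let seed = K >>> 0;
  const rand = () => (seed = (seed * 1664525 + 1013904223) >>> 0);
  const ops = ['^', '&', '|', '+'];
  let body = 'let h = 0;\n';
  body += 'for (const ch of input) {\n';
  body += '  h = (h * 31 + ch.charCodeAt(0)) >>> 0;\n';
  for (let i = 0; i < 16; i++) {
    const op = ops[rand() % ops.length];
    const c = rand() % 0xffffffff;
    body += '  h = ((h ' + op + ' ' + c + ') >>> 0);\n';
  }
  body += '}\n';
  body += 'return h.toString(16);';
  return new Function('input', body); // this is F
}

// the page and the backend both run G with the same K, so the server's T
// can recompute V = F(cookie) and compare
const F = G(123456789);
const V = F(document.cookie);

Both sides need the same K per session, so K itself becomes something the server has to hand out and track.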
I know there are other methods to detect crawlers, and they will be applied as well. Let's only discuss the "forcing JavaScript execution" topic here.
Could you give me some advice? Is there any better solution?

Requiring a login, so that the whole world cannot see the data, is one option.
If you do not want logged-in users to fetch all the data you make available to them, you could limit the number of requests per minute per user, adding a delay to your page load once the limit is reached. Since the user is logged in, you can easily track the requests server-side even if they manage to change cookies/localStorage/IP/browser and whatnot.
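A minimal sketch of that per-user throttle, assuming Express-style middleware and an authenticated req.user (all names illustrative):

// slow down users who exceed a per-minute request budget
const hits = new Map(); // userId -> array of recent request timestamps

function rateLimit(maxPerMinute = 30) {
  return (req, res, next) => {
    const now = Date.now();
    const recent = (hits.get(req.user.id) || []).filter(t => now - t < 60000);
    recent.push(now);
    hits.set(req.user.id, recent);
    if (recent.length > maxPerMinute) {
      // don't hard-block; just make over-eager clients wait
      setTimeout(next, 2000);
      return;
    }
    next();
  };
}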
You can render some text as images; that forces them to use resource-heavy mechanics (OCR) to turn it back into usable information.
You could add hidden text; this would even hinder users' copy/paste (use spans filled with 3-4 random letters after every 3-4 real letters and make them font-size 0). That way they aren't seen, but they are still copied, and will most likely end up in the crawler's output.
Refuse connections from known crawler HTTP header signatures, although any crawler could spoof those. And Greasemonkey or some other scripting extension could turn a regular browser into a crawler, so this has very little effect.
Now, to force using javascript
The problem is that you cannot really force any JavaScript execution. Whatever the JavaScript does is visible to everyone who has access to the page, so if it is some kind of MD5 hash you compute, that can be reimplemented in any language.
It is mainly infeasible because the crawler has access to exactly everything the client's JavaScript has access to.
Forcing the use of a JavaScript-enabled crawler can be circumvented, and even if it can't, with the computing power available to anyone nowadays it is very easy to launch a PhantomJS instance... And as I said above, anyone with slight JavaScript knowledge can simply automate clicks on your website using their own browser, which makes everything undetectable.
What should be done
The only bulletproof way to prevent crawlers from leeching your data, and to prevent any automation, is to ask for something that only a human could do. A captcha comes to mind.
Think about your real users
The first thing you should keep in mind is that if your website starts to get annoying for normal users, they will not come back. Having to type an 8-character captcha on every page request, just because there MIGHT be someone who wants to pump the data, will become too tedious for anyone. Also, blocking unknown user agents might prevent legit users from accessing your website because, for one reason or another, they are using an unusual browser.
The impact on your legit users, and the time you'd spend working hard at fighting crawlers, might cost more than simply accepting that some crawling will happen. So your best bet is to rewrite your TOS to explicitly forbid crawling of any sort, log every HTTP access of every user, and take action when needed.
Disclaimer:
I'm scraping over a hundred websites monthly, following external links to reach about 3,000 domains in total. At the time of posting, none of them are resisting, even though they employ one or more of the techniques above. When a scraping error is detected, it does not take long to fix it...
The only thing is to crawl respectfully: don't over-crawl or make too many requests in a small time frame. Just doing that will circumvent the most popular anti-crawler measures.

Related

Computationally difficult [JavaScript] problem?

I need a problem that is computationally difficult (in any language) that I can easily implement in JavaScript. I'm trying to do a CAPTCHA-like test to make it unlikely that a hacker is accessing my page mechanically.
Yes, I know that he could use Rhino or some other JS engine and do it -- that's why I want it to be computationally expensive, so it takes him a few hours to set up and his machine a few seconds to fake each access.
I'm thinking of getting a bunch of large primes on the back end, sending over the product of two of them, and demanding that the web page factor it, but if anybody has a better idea, I'm all ears. Also, does anybody have a good library for doing that factoring?
You can use the same method as Bitcoin, i.e. a proof of work: searching for an input whose secure hash meets a required target.
Explained here:
http://www.tomshardware.com/reviews/bitcoin-mining-make-money,3514-3.html
Bitcoin source
https://github.com/bitcoin/bitcoin
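A minimal hashcash-style sketch of that idea, where the server hands out a random challenge and a difficulty and the page has to find a nonce whose SHA-256 starts with enough zero hex digits (the difficulty and names are illustrative; the server verifies the answer with a single hash):

async function sha256Hex(text) {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(text));
  return Array.from(new Uint8Array(digest))
    .map(b => b.toString(16).padStart(2, '0')).join('');
}

// brute-force a nonce; raising difficulty by one multiplies the work by 16
async function solveChallenge(challenge, difficulty) {
  const prefix = '0'.repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    if ((await sha256Hex(challenge + nonce)).startsWith(prefix)) return nonce;
  }
}

The nice property is the asymmetry: the client burns seconds of CPU, while the server checks the answer with one hash.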
You can implement a standard captcha and do some additional checking on the client side. For example, add an event listener on the captcha input to listen for keydown/keyup events, XOR the key codes, and send them along with the captcha. Add a hidden input in the form named "email" or something you find on every form; bots fill those in automatically, so if you get a value for post['email'], it's a bot, because a real user won't see that field. You can also have a piece of code in a totally unrelated JavaScript file that automatically adds a field to the form that is required for validation. So, captcha or no captcha, you can still strengthen the bot protection client-side without computationally difficult processes.
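A small sketch of that honeypot field (the field name and styling are illustrative; the point is that it is invisible to people but tempting to bots):

// add a field real users never see; if it comes back filled in, it was a bot
const form = document.querySelector('form'); // assumes one form on the page
const trap = document.createElement('input');
trap.type = 'text';
trap.name = 'email';              // a name bots love to fill in
trap.autocomplete = 'off';
trap.tabIndex = -1;
trap.style.position = 'absolute';
trap.style.left = '-9999px';      // visually hidden, still submitted
form.appendChild(trap);

// server side: reject the submission if post['email'] is non-empty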
The problem with this is that if it is known to be NP-Hard, it's going to be a pain in the rear for human beings to solve, as well, on non-trivial instances. Visual/auditory captchas are kind of cool in that they give people a leg up... we have very sophisticated sensory organs for processing these kinds of things, and computers are not too good at it (though they are getting better all the time!).
As such, you're probably better off coming up with a unique thing that people can do very easily, but that machines are not too good at. For instance, give some simple black and white pictures and ask the user which one doesn't belong, or show some pictures of foods and ask what kind of recipe you could make with them.
Clever approach. Whenever one-way complexity is needed, it makes me think of a hash. Simply hash some aspect of the user's account (nothing sensitive) and send the hash to the client. You would want to truncate/pad the string to get your desired complexity level. This isn't securing an account, so MD5 or any other hashing algorithm would be fine.
Here is some sample code that you might be able to leverage for the client side.

javascript find replace

I have a question about optimization, but more on the browser/client side.
I am catering to a few societies that need about 3 different languages. So I'm just putting my user's language type in the PHP session, and swapping out the text for selected areas on each page they navigate to. So, really nothing complicated.
However, I'm toying with the idea of letting javascript do the find/replace of the selected texts on each page.
There are a few ways to skin this cat, and I've done them all, and they work. However, I do have a few hundred pages, and many words to replace with the correct language text.
If I were to go the Javascript route, does anyone have an opposing view to this? And if so, why? I'm interested in letting the user's browser do the work, rather than my servers constantly finding and replacing, or creating new CONSTANTS for each language specific situation.
I'm worried about their browsers getting slower. But that could be a very small problem.
For those individuals who love to get specifics, here's what I would do with javascript.
I would load a languages.js file with the appropriate word translations for every language I implement. Instead of running a huge find/replace on each page load, I'd localize the find/replace to the specific page, or possibly mark elements with an attribute that my script would look up in the DOM and perform the find/replace on alone.
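For instance, a sketch of that approach, assuming elements are marked with a data-i18n attribute and languages.js exposes a plain dictionary (names illustrative):

// languages.js
const translations = {
  fr: { greeting: 'Bonjour', search: 'Rechercher' },
  es: { greeting: 'Hola', search: 'Buscar' }
};

// swap text only for marked elements on the current page
function translatePage(lang) {
  const dict = translations[lang] || {};
  document.querySelectorAll('[data-i18n]').forEach(el => {
    const key = el.getAttribute('data-i18n');
    if (dict[key]) el.textContent = dict[key];
  });
}

This keeps the find/replace scoped to a handful of marked nodes instead of walking the whole document.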
I'm open to better ideas.
Also, for those people who find "over-optimization" useless or "over-doing-it", please don't mention anything. This is for fun and not a critical decision item.
thanks guys!
Well, on the pro side, yes, you are offloading some of the work to the client, but I don't think that's going to make any real difference. You're probably talking about a tiny percentage of the overall performance of the site. The only way to know of course is to test it.
On the con side, you'll be increasing the bandwidth it takes to load your site, since the user will need to load the page plus the language file. It will be cached, if you set it up right, so that's probably not a huge concern either.
Another con is that this will make your site depend on javascript. A non-scripting visitor won't get the translation, and that includes search engines. Whether that matters to you depends on the nature of your site, but in general, that's a pretty big negative.
You'd also have to watch out for "flashing" of the non-localized language. It'd look horrible if the page loaded and then a split second later the language changed to something else. If you are doing the swapping from the DOM ready event ($(function() {}) in jquery, for example), it's probably too late. You could do it from a script you put at the bottom of the page, and that'd probably be ok, but even then, it may depend on the browser and the structure of the markup, not to mention the user's bandwidth and whether the server sends the content in chunks.
I think it comes down to what fits your needs best. Sorry that's not much of an answer, but it's an accurate one I think :)
I agree it is important to keep your server from overloading. I would solve the problem in one of two ways:
Use your suggested JavaScript find and replace. While the JavaScript is working, show a spinning loading.gif with a message to the effect of "translating", to explain to users why they must wait. If you are doing a word-by-word translation, you have to be careful not to make a browser like IE complain about the work ("The page has become unresponsive"); I would suggest something like setInterval(translate, 1), where translate processes a set number of words per call so the browser doesn't think your script is stuck in an endless loop (a chunked sketch follows this answer).
Provided the same sections are translated for all foreign visitors, you could make a PHP script that makes new, translated pages next to the originals. The translated pages could include the translator.php script to do a quick check to see if the original page has changed to decide whether or not to make a new translated page. This would not mean translating a page every time it needs to be viewed in a foreign language, but only a little check to see if the original had changed - putting less load on your server and none on the client side browser.
Personally, I would implement option 2 if possible, to be friendlier to low-powered browsers (such as mobile devices), but in practice either would do, and it's an interesting problem.
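For option 1, a sketch of the chunked approach hinted at above, which processes a batch of nodes per tick so the browser never flags the script as unresponsive (batch size and attribute name are illustrative):

function translateInChunks(nodes, dict, batchSize = 50) {
  let i = 0;
  function step() {
    const end = Math.min(i + batchSize, nodes.length);
    for (; i < end; i++) {
      const key = nodes[i].getAttribute('data-i18n');
      if (dict[key]) nodes[i].textContent = dict[key];
    }
    if (i < nodes.length) setTimeout(step, 0); // yield back to the browser
  }
  step();
}

// usage: translateInChunks([...document.querySelectorAll('[data-i18n]')], translations['fr']);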

Best practice for "hidden" JavaScript HTTP request?

I'm not exactly sure how to formulate the question, but I think it's more of a suggestions request, instead of a question per se.
We are building an HTML5 service on which users get credited (rewarded, in social gaming lingo) for completing a series of offers. Most of these offers are video ad watching. We already have an implementation of this built in Flash, but for HTML5 I'm running into more issues with how to make the request calls that validate legitimately watched video ads. In the Flash interface, we have a series of HTTP requests that the SWF makes, some when video playback starts, some in the middle and some at the end, and each of those requests is related to the others, meaning the response of one is needed in the next request, and so on. Most of the logic that "hides" this "algorithm" is lightly hidden in the SWF binary, and it pretty much serves its purpose.
However, for HTML5 we have to rely on world-visible JavaScript, and that "hidden" logic is wide open. So I guess this is a call for suggestions on how these cases are usually handled so that a skilled person could not (so easily) get access to it and exploit the service to get credited programmatically. Obfuscating the JavaScript seems like something that could help, but it in no way protects fully.
There is of course some extra security on the backend (like frequency capping, per-user capping, etc.), but since our capping clears every day, a skilled person could still find a way to get credit for all available offers even without completing them.
It sounds like you want to ensure that your server can distinguish requests that happened as the result of the user interacting with your UI in ways you approve of from requests that did not happen that way.
There are a number of points of attack on such a system.
Inspect the JavaScript to find the event handlers and invoke them via Firebug or another tool.
Extract any keys from your code, and generate the HTTP requests without involving the browser.
Run code in the browser to programmatically generate events.
Use a 3rd-party tool that instruments the browser to generate clicks.
If you've got reasonable solutions to the instrumentation attacks (3 and 4), then you can look at "Is there any way to hide javascript functions from end user?" for ways to get secrets into the client that allow you to sign your requests. Beyond that, obfuscation is the only (and imperfect) way to stop a not-too-determined attacker from any exploitation, and rate-limiting and UI event logging are probably your best bets for stopping determined attackers from benefiting from wide-scale fraud.
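One common shape for "signing your requests" is an HMAC over each progress event, keyed by a per-session token the server handed out at page load. A sketch (endpoint, header and field names are illustrative, and this only raises the bar, since the token is still visible to a determined client):

async function signedPing(event, sessionToken) {
  const payload = JSON.stringify({ event, ts: Date.now() });
  const key = await crypto.subtle.importKey(
    'raw', new TextEncoder().encode(sessionToken),
    { name: 'HMAC', hash: 'SHA-256' }, false, ['sign']);
  const sig = await crypto.subtle.sign('HMAC', key, new TextEncoder().encode(payload));
  const sigHex = Array.from(new Uint8Array(sig))
    .map(b => b.toString(16).padStart(2, '0')).join('');
  return fetch('/ad/progress', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'X-Signature': sigHex },
    body: payload
  });
}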
You will not be able to prevent a determined attacker (even with SWF, though it's more obfuscated). Your best bet is to make sure that:
Circumventing your measures is expensive in terms of effort, perhaps by using a computationally expensive crypto algorithm so they can't just set up a bunch of scripts to do it.
The payoff is minimal (user-capping is an example of how to reduce payoff; if you're giving out points, it's fine; if you're mailing out twenty dollar bills, you're out of luck)
Cost-benefit.

What good ways are there to prevent cheating in JavaScript multiplayer games?

Imagine a space shooter with a scrolling level. What methods are there for preventing a malicious player from modifying the game to their benefit? Things he could do that are hard to limit server-side are auto-aiming, peeking outside the visible area, speed hacking, and so on.
What ways are there of preventing this? Assume that the server is any language and that the clients are connected via WebSocket.
Always assume that the code is 100% hackable. Think of ways to prevent a client completely rewritten (for the purposes of cheating) from cheating. These can be things such as methods for writing a secure game protocol, server-side detection, etc.
The server is king. Clients are hackable.
What you want to do with your WebSocket is two things:
Send game actions to the server and receive game state from the server.
You render the game state, and you send input to the server.
Auto-aiming: this one is hard to solve. You have to go for plausibility; if a user hits 10 headshots in 10 ms, you kick him. Write a clever cheat-detection algorithm.
Peeking outside the visible area: solved by only sending the visible area to each client.
Speed hacking: solved by handling input correctly. You receive an event saying that a user moved forward, and the server controls how fast he goes.
You can NOT solve these problems by minifying code. Code on the client is ONLY there to handle input and display output. ALL logic has to be done on the server.
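As a concrete illustration of keeping the logic server-side, a minimal sketch (plain functions, a tick-based server assumed, all names illustrative):

// the client only sends an intent; the server decides how far the ship moves
const MAX_SPEED = 5; // units per tick, enforced on the server

function handleMoveIntent(player, intent) {
  // clamp whatever the client claims to a legal velocity (anti speed-hack)
  const dx = Math.max(-MAX_SPEED, Math.min(MAX_SPEED, intent.dx));
  const dy = Math.max(-MAX_SPEED, Math.min(MAX_SPEED, intent.dy));
  player.x += dx;
  player.y += dy;
}

function visibleStateFor(player, allPlayers, viewRadius = 100) {
  // only send what this player could legitimately see (anti map-hack)
  return allPlayers.filter(p =>
    Math.hypot(p.x - player.x, p.y - player.y) <= viewRadius);
}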
You simply need to write server-side validation. The only thing is that game input is significantly harder to validate than form input, due to its complexity. It's the exact same thing you would do to make forms secure.
You need to be really careful with your "input is valid" detection, though. You do not want to kick/ban highly skilled players from your game. It's very hard to strike the balance between bot detection that is too lax and bot detection that is too strict, and the whole realm of bot detection is very hard overall. For example, Quake had auto-aim detection that kicked legitimately skilled players back in the day.
As for stopping bots from connecting to your WebSocket directly, set up a separate HTTP or HTTPS verification channel in your multiplayer game for added security. Use multiple HTTP/HTTPS/WS channels to validate a client as "official", acting as a form of handshake. This will make connecting to the WebSocket directly harder.
Example:
Think of a simple multiplayer game: a 2D, room-based racing game. Up to n users are placed on a flat 2D platformer map and race to get from A to B.
Let's say, for argument's sake, that you have a foolproof system where complex authentication runs over an HTTPS channel, so that users cannot access your WebSocket channel directly and are forced to go through the browser. You might have a Chrome extension that deals with the authentication, and you force users to use it. This reduces the problem domain.
Your server is going to send all the visual data the client needs to render the screen. You cannot obscure this data away. No matter what you try, a skilled hacker can take your code and slow it down in the debugger, editing it as he goes along, until all he's left with is a primitive wrapper around your WebSocket. He lets you run the entire authentication, but there is nothing you can do to stop him from stripping out any JavaScript you write to prevent that. All you can achieve is to limit the number of hackers skilled enough to access your WebSocket.
So the hacker now has your WebSocket in a Chrome sandbox. He sees the input. Of course your race course is dynamically and uniquely generated; if you had a fixed set of them, the hacker could pre-engineer the optimum race route. The data you send to visualise the map can be processed faster than any human can interact with your game, and the optimum moves to win your race can be calculated and sent to your server.
If you were to try to ban players who react too fast to your map data and call them bots, the hacker adds a delay. If you try to ban players who play too perfectly, the hacker adjusts and plays less than perfectly using random numbers. If you place traps in your map that only algorithmic bots fall into, they can be avoided by learning about them, through trial and error or a machine-learning algorithm. There is nothing you can do to be absolutely secure.
You have only ONE option to absolutely avoid hackers: build your own browser which cannot be hacked, with the security mechanisms built into the browser itself, and do not allow users to edit JavaScript at runtime.
At the server-side, there are 2 options:
1) Full server-side game
Each client sends their "actions" to the server. The server executes them and sends the relevant data back, e.g. a ship wants to move north, so the server calculates its new position and sends it back. The server also sends each client a list of visible ships (solving map hacks), etcetera.
2) Full client-side game
Each client still sends their actions to the server. But to reduce workload on the server, the server doesn't execute the actions but forwards them to all other clients. The clients then resolve all actions simultaneously. As a result, each client should end up with an identical game. Periodically, each client sends their absolute data (ship positions, etc.) to the server and the server checks if all client data is identical. Otherwise, the games are out of sync and someone must be hacking.
The disadvantage of the second method is that some hacks remain undetected, a map hack for example: a cheater could inject code so he sees everything, but still send the server only the data he should normally be able to see.
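A sketch of the periodic sync check from option 2: each client hashes its local state the same way and reports it, and the server only compares the hashes (the serialization and checksum here are illustrative; anything stable and identical across honest clients works):

function stateChecksum(ships) {
  // stable, rounded serialization so every honest client hashes the same bytes
  const canonical = JSON.stringify(
    ships.map(s => [s.id, Math.round(s.x), Math.round(s.y), s.hp]));
  let h = 0;
  for (const ch of canonical) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h;
}

// server side: if checksums reported for the same tick disagree,
// the games are out of sync and someone is likely tampering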
--
At the client-side, there is 1 option:
A JavaScript component that scans the game code to see if anything has been modified (e.g. code modified to render objects that shouldn't be visible, or to send different validation data to the server).
Obviously, a hacker could easily disable this component. To fix that, you could force the client to periodically reload the component from the server (the server can check whether the script file was requested by the user recently). This introduces a new problem: the hacker simply requests the component periodically via AJAX but prevents it from running. To avoid that, have the component re-download itself, but as a slightly modified version of itself.
For example: have the component be located at yoursite/cheatdetect.js?control=5.
The server will generate a slightly modified cheatdetect.js so that in the next iteration, cheatdetect.js?control=22 (for example) must be downloaded. If the control mechanism is sufficiently complicated, the hacker won't be able to predict which control number to request next, and cheatdetect.js must be executed in order to continue the game.
There's nothing you can really do to prevent anyone from modifying your JS or writing a Greasemonkey script. However, you can make it hard for them by minifying your script and making your code as cryptic as possible, maybe even throwing in some fake methods or variables that do nothing but are there to throw an attacker off. But given enough time, none of these methods is foolproof: once your code goes to the client, it is no longer yours.
The only way I can even think of implementing this is by modifying your JavaScript to function as a thin client and then designing a central server mechanism to validate the data sent from that client. This is probably a big change to implement and will most likely make your project more complex. However, as was said earlier, if the application runs entirely on the client, the client can pretty much do whatever they want with your script. The only way to secure it is to use a trusted machine to handle validation.
They don't have to touch your client-side code -- they could just sniff and implement your Websocket protocol and write a tiny agent that pretends to be a human player.
Update: The problem has a few parts, and I don't have answers off the top of my head, but the various options could be evaluated with these questions in mind:
How far are you willing to go to prevent cheating? If you only care about casual cheating, how many barriers are enough to discourage the casual cheater? The intermediate Javascript programmer? A serious expert? Weighing this against the benefits of cheating, is there anything of real value at stake, like cash and prizes, or just reputation?
How do you get high confidence that a human is providing the inputs to your game? For example, with a good enough computer vision library I could model your game on a separate machine and feed inputs back while pretending to be the mouse, but this has a high relative cost (not worth my time).
How can you create a chain of trust in your protocol such that knowledge of (2) can be passed to the server, and that your server is relatively confident your client code is sending the messages?
Sure many of the roadblocks you throw up can be side-stepped, but what is the cost to the player and you? See "Attrition warfare".
Some other methods that can be implemented:
Make the target elements difficult for a script to distinguish from other elements. Avoid divs with predictable class and id names if possible. Inject styling using JavaScript instead of using classes. Think like a hacker and make it hard on yourself.
Use decoys that a script will fire on. For instance, if the threat vector is a screen scraping algorithm using pixel colors, throw some common pixel colors in non-target elements. Hits on these non-targets could seem inconsequential to the cheater, but would be detectable. You don't want the cheater to know why you know.
Limit the minimum time between actions to slightly below the best human levels. The best players will hit that plateau, so it won't matter as much who's cheating, and you will immediately be able to detect anyone scripting faster than that by calling your methods directly.
Random number generators are typically uniform; human behaviour is not. A random number generator will produce values within a set limit with an even distribution, while natural behaviour follows something closer to a Gaussian curve. If you sample the distribution and it looks like a square wave on the x and y axes, it's a cheater, 100%. This will be fairly difficult for the cheater to detect, because the threshold is derived from the randomness rather than being the random distribution itself. You're also using aggregate data rather than individual plays to detect it, so reverse-engineering the algorithm would be extremely difficult without knowing your detection algorithm (a rough sketch follows this list).
Utilize entropy whenever possible. Avoid predictable game plays. Imagine a racing game on a set collection of race tracks. Each game play could have slightly differing levels of traction, horsepower, and momentum. The script would have to be extremely good to beat it. In a scrolling game, you can alter factors that are instinctual to humans, but difficult for computers, such as wind force, changes in gravity, etc. It would also make it more fun as a side benefit.
Server-generated tokens can be used to validate that UI elements were actually used, rather than calls being made to the code directly. Validation can be handled in one call at the end of the game, comparing events to hashed codes of UI elements. The token should be a hash of a server private key and some value of the UI element.
Decoy the cheater with data they think you're using to detect cheats, such as calls to a DetectCheat method that makes dummy calls to a fake backend. It's the old magician's trick: wave your hand over here while you slip a card into the deck with the other hand. Let them waste days on end in a maze that has no exit, with lots of hair pulling.
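A rough sketch of the timing-distribution idea from the list above (the statistic and threshold are illustrative guesses, not a tuned detector): aggregate a player's delays between actions and compare the spread you observe to what a uniform random generator would produce over the same range.

function looksUniform(delaysMs) {
  const n = delaysMs.length;
  const mean = delaysMs.reduce((a, b) => a + b, 0) / n;
  const variance = delaysMs.reduce((a, b) => a + (b - mean) ** 2, 0) / n;
  // a uniform distribution over [min, max] has variance (max - min)^2 / 12;
  // human timings cluster around a mean, so their variance is much smaller
  const min = Math.min(...delaysMs);
  const max = Math.max(...delaysMs);
  const uniformVariance = ((max - min) ** 2) / 12;
  return Math.abs(variance - uniformVariance) / uniformVariance < 0.15;
}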
I'd use a combination of minification and AJAX. If all of the functions and data aren't loaded into the page, it'd be more difficult to cheat.
On the other hand, modding turned out to be a very profitable tool for companies like Id Software. Perhaps allowing the system to be modded might make the game that much more enjoyable to the community at large.
Obfuscate your client-exposed code as much as possible. Additionally, use some magic.
You can edit the JavaScript in the browser and make it do what you want.
Some people suggest making a call to check with the server: after the call is made, it is validated on the server, and once validated, control comes back to the client side to perform the actions. But I think even this is not foolproof.
For example, take a basic login action in Angular: when making a call to the server, the backend validates the username and password, and if they are valid, the response comes back to the client and lets the user log in via Angular.
When I say log in via Angular, it is going to store things in cookies, like user objects and other data. But the user can still remove the JS code that makes the call to the backend, return TRUE wherever needed, insert a dummy user object into the cookies (and whatever other objects are needed), and log in. It is a very difficult thing to do, but it is doable. In many scenarios this is not acceptable, even if it takes hours to edit/hack the code.
This is possible in single-page applications, where JS files don't get reloaded for each page. To mitigate the possibility of getting hacked, we can use minified code. And I guess that if actions like this are done in the backend (like login in Django), it is much safer.
Please correct me if I am wrong.

Ajax requests/responses: how to make them lightning fast?

I came across a site that does something very similar to Google Suggest. When you type in 2 characters in the search box (e.g. "ca" if you are searching for "canon" products), it makes 4 Ajax requests. Each request seems to get done in less than 125ms. I've casually observed Google Suggest taking 500ms or longer.
In either case, both sites are fast. What are the general concepts/strategies that should be followed in order to get super-fast requests/responses? Thanks.
EDIT 1: By the way, I plan to implement an autocomplete feature for an e-commerce site search where it 1) provides search suggestions based on what is being typed, and 2) shows a list of potential product matches based on what has been typed so far. I'm aiming for something similar to SLI Systems search (see http://www.bedbathstore.com/ for example).
This is a bit of a "how long is a piece of string" question and so I'm making this a community wiki answer — everyone feel free to jump in on it.
I'd say it's a matter of ensuring that:
The server / server farm / cloud you're querying is sized correctly according to the load you're throwing at it and/or can resize itself according to that load
The server /server farm / cloud is attached to a good quick network backbone
The data structures you're querying server-side (database tables or what-have-you) are tuned to respond to those precise requests as quickly as possible
You're not making unnecessary requests (HTTP requests can be expensive to set up; you want to avoid firing off four of them when one will do); you probably also want to throw in a bit of hysteresis management (delaying the request while people are typing, only sending it a couple of seconds after they stop, and resetting that timeout if they start again; see the debounce sketch after this list)
You're sending as little information across the wire as can reasonably be used to do the job
Your servers are configured to re-use connections (HTTP 1.1) rather than re-establishing them (this will be the default in most cases)
You're using the right kind of server; if a server has a large number of keep-alive requests, it needs to be designed to handle that gracefully (NodeJS is designed for this, as an example; Apache isn't, particularly, although it is of course an extremely capable server)
You can cache results for common queries so as to avoid going to the underlying data store unnecessarily
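For the hysteresis point above, a small debounce sketch (the element id, delay and endpoint are illustrative):

const searchBox = document.querySelector('#search'); // hypothetical input element
let pending = null;

searchBox.addEventListener('input', () => {
  clearTimeout(pending);
  pending = setTimeout(() => {
    const q = searchBox.value.trim();
    if (q.length >= 2) {
      fetch('/suggest?q=' + encodeURIComponent(q))
        .then(r => r.json())
        .then(renderSuggestions); // hypothetical rendering helper
    }
  }, 250); // wait until the user pauses typing
});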
You will need a web server that is able to respond quickly, but that is usually not the problem. You will also need a database server that is fast and can very quickly query which popular searches start with 'ca'. Google doesn't use a conventional database for this at all; it uses large clusters of servers and a Cassandra-like database, and most of that data is kept in memory as well for quicker access.
I'm not sure if you will need this, because you can probably get pretty good results using only a single server running PHP and MySQL, but you'll have to make some good choices about the way you store and retrieve the information. You won't get these fast results if you run a query like this:
select q.search
from previousqueries q
where q.search LIKE 'ca%'
group by q.search
order by count(*) DESC
limit 1
This will probably work as long as fewer than 20 people have used your search, but it will likely fail on you long before you reach 100,000.
This link explains how they made instant previews fast. The whole site highscalability.com is very informative.
Furthermore, you should keep everything in memory and avoid retrieving data from disk (slow!). Redis, for example, is lightning fast!
You could start by building a fast search engine for your products. Check out Lucene for full-text searching; it is available for PHP, Java and .NET, amongst others.
