Script to download CAPTCHA images - javascript

For completely non-nefarious purposes - machine learning, specifically - I'd like to download a huge dataset of CAPTCHA images. However, CAPTCHAs are always implemented with obfuscated JavaScript, which makes getting at the actual images without a browser a non-trivial task, at least for a JavaScript novice like me.
So, can anyone give me some helpful pointers on how to download the image of the obscured word using a script completely outside of a browser? And please don't point me to a dataset of already collected obscured words - I need to collect the images from a specific website for this particular experiment.
Thanks!
Edit: Another way to ask this question is very simple. When you click "View Source" on a website with complicated JavaScript, you see the script references, but that's all you see. However, if you click "Save Page As..." (in Firefox) and then view the source of the saved page, the JavaScript has been resolved and the new HTML and the images (at least in the case of ASIRRA and reCAPTCHA) are in the source. How can I mimic this "Save Page As..." behavior from a script? This is an important web-coding question in general, so please stop questioning my motives! This is knowledge I can use from now on in all web development involving scripting, and I'm sure other Stack Overflow visitors can use it as well!

While waiting for an answer here I kept digging and eventually figured out a sort of hacked way of getting done what I wanted.
First off, the reason this is a somewhat complicated problem (at least to a JavaScript novice like me) is that the images from ASIRRA are loaded onto the webpage via JavaScript, which is a client-side technology. This is a problem when you download the webpage using something like wget or curl, because they don't actually run the JavaScript; they just download the source HTML. Therefore, you don't get the images.
However, I realized that Firefox's "Save Page As..." did exactly what I needed. It ran the JavaScript which loaded the images, and then it saved it all into the well-known directory structure on my hard drive. That's exactly what I wanted to automate. So... I found a Firefox add-on called "iMacros" and wrote this macro:
VERSION BUILD=6240709 RECORDER=FX
TAB T=1
URL GOTO=http://www.asirra.com/examples/ExampleService.html
SAVEAS TYPE=CPL FOLDER=C:\Cat-Dog\Downloads FILE=*
Set to loop 10,000 times, it worked perfectly. In fact, since it was always saving to the same folder, duplicate images were overwritten (which is what I wanted).
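For what it's worth, the same "let the JavaScript run, then save what it produced" idea can also be scripted entirely outside the browser UI with a headless-browser library. The sketch below assumes Node.js with Puppeteer and a local downloads/ folder; only the URL and the 10,000-iteration loop come from the answer above, everything else is an assumption rather than part of the original solution.
// Hypothetical sketch: let a headless browser run the page's JavaScript, then save
// every image response it triggers. Assumes Node.js + Puppeteer.
const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  fs.mkdirSync('downloads', { recursive: true });
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Save every image the page's scripts cause the browser to fetch.
  page.on('response', async (response) => {
    if (response.request().resourceType() === 'image') {
      const name = response.url().split('/').pop().split('?')[0];
      fs.writeFileSync('downloads/' + name, await response.buffer());
    }
  });

  // Revisit the page repeatedly; duplicate images simply overwrite each other,
  // just as they did with the iMacros loop above.
  for (let i = 0; i < 10000; i++) {
    await page.goto('http://www.asirra.com/examples/ExampleService.html', {
      waitUntil: 'networkidle0',
    });
  }

  await browser.close();
})();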

Why not just set up a CAPTCHA yourself and generate the images? reCAPTCHA is free, too.
http://www.captcha.net/
Update: I see you want images from a specific site, but if you set up your own you can tweak it to produce the same kind of images as the site you're targeting.

Get in contact with the people who run the site and ask for the dataset. If you try to download many images in any suspicious way, you'll end up on their kill list rather quickly, which means you won't get anything from them anymore.
CAPTCHAs are meant to protect people against abuse and what you do will look like abuse from their point of view.

Related

Determining the Order Files Are Run in a Website Built By Someone Else

Ok, this question is going to sound pretty dumb, but I'm an absolute novice when it comes to web development and have been tasked with fixing a website for my job (that has absolutely nothing in the way of documentation).
Basically, I'm wondering if there is any tool or method for tracking the order in which a website loads files when it is used. I just want to know a very high-level order of the pipeline. The app I've been tasked with maintaining is written in a mix of Django, JavaScript, and HTML (none of which I really know, besides some basic Django). I can understand how Django works, and I kind of understand what's going on with HTML, but (for instance) I'm at a complete loss as to how the HTML code is calling JavaScript, and how that information is transferred back to the HTML. I wish I could show the code I'm using, but it can't be released publicly.
I'm looking for what amounts to a debugger that will let me step through each file of code, but I don't think it works like that for web development.
Thank you
Try opening the page in Chrome and hitting F12 - there's a tonne of developer tools and web page debuggers in there.
For your particular question about loading order, check the Network tab, then hit refresh on your page - it'll show you every file that the browser loads, starting with the HTML in your browser's address bar.
If you're trying to figure out JavaScript, check out the Sources tab. It even allows you to create breakpoints - very handy for following along with what a page is doing.

Use of JavaScript in lieu of hyperlinks

As RIAs and SPAs (or web apps with heavy JavaScript usage) have become more and more popular, I've been running into systems that, instead of using good old <a href> hyperlinks, use onclick constructs with JavaScript code that manipulates navigation. This is particularly true with images.
For example, instead of seeing something like this:
<a href="..."><img src="..."/></a>
I see something like this:
<div ... onclick='SomeJsFunctionThatNavsToAnotherPage()'><img src="..."/></div>
What is the advantage of this? It makes it incredibly hard to trace where pages transition to when debugging or trying to root-cause a bug. I can see the point when the navigation target can change (so yes, there you could use a function that computes which page to navigate to).
But I see this pattern even when the pages to navigate to are constant. I find it extremely convoluted and hard to test. Not to mention the browser-specific bugs that tend to come with it (in my sad experience, from over-complicating the front end).
But I am not a RIA/SPA developer (just backend and traditional web development). Am I missing the rationale behind this?
TO CLARIFY
My question is not for the case when we want to redraw the page or change current content without changing the current location. My question is for plain
old transitions, from page A to page B.
In such a case, why use onclick=funcToChangeLocation() over <a href="some location"/>.
This has been a pain for me when troubleshooting systems that are already written (for I wouldn't write them like that), but there could be reasons I am not aware of.
Again, my question is not for pages that redraw themselves without changing the browser location, but for navigation from one page to the next.
ALSO
If you are going to vote to close this question, at least leave a message explaining why.
If you are making a web application, sometimes you don't want to redirect the user to another page, but rather change the content of the page dynamically without refreshing it. This has some advantages: it can be faster, you can easily keep the state of the page/application, you are not obliged to make a round trip to the server, and you can update only a part of the page.
You can also dynamically request the data used to render the page. If you are displaying a user profile page, you can request only a JSON object that represents the user. This JSON object is smaller than the whole page and will be rendered dynamically. This helps reduce data transfer between users and the server when bandwidth is limited.
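As a rough illustration of that pattern (the endpoint, element id, and field names below are invented for the example), a click handler can fetch a small JSON object and redraw one region of the page instead of navigating:
// Hypothetical sketch of the pattern described above: fetch a small JSON object
// and update one part of the current page rather than loading a whole new page.
async function showUserProfile(userId) {
  const response = await fetch('/api/users/' + userId); // small JSON payload, not a full page
  const user = await response.json();

  const panel = document.getElementById('profile-panel');
  panel.textContent = user.name + ' (' + user.email + ')'; // update one region, keep page state
}

showUserProfile(42);
That is the case where an onclick handler genuinely earns its keep; for a plain page-to-page transition it buys you nothing, as the edit below notes.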
EDIT: In the case of a simple page redirection, I think it's bad practice and I cannot see an advantage. It also obscures the site's structure when the Google crawler tries to parse it.
I once had a pretty successful web directory website. One day Google decided that "directories" are competing businesses and started penalizing sites that had links on directories. I used the method you describe to cloak outgoing links to try and trick Google.

Download existing file in server root file system using HTML/JavaScript on a Lua/Luci Server

Let me preface by saying I have no idea what I'm doing. I've inherited a system from a contractor we hired to do a job. I'm not significantly familiar with web development, and I have no idea how the magic voodoo was configured or really works. If you're going to reply, be patient, and assume I don't know jack about what you're telling me - please don't leave anything "for the reader to figure out." I'm an embedded developer by trade and would rather bang bits than develop back-end code for a website.
The server is running on an embedded Linux platform (based on OpenWrt). The core is Lua/Luci, but there's a plethora of HTM files that use both HTML and JavaScript.
What I want to do seems really, really straightforward, but I can't seem to make it work: there is a file in /etc that I want to be able to download from the server to the local machine. It needs to work with IE, Firefox, and Chrome.
I would have loved something like:
<a href="../../../etc/file" download>download</a>
But it doesn't work for files outside the subdirectory area that lua/luci knows about (i.e. I can't "../../../etc/file").
I've tried several different things, but the biggest issue is I can't seem to get the lua/luci stuff to recognize anything new in the same directory that contains some of the htm files, nor anything from the server's root directory (e.g. /etc/file). Usually what I do goes back to the home page or displays:
No page is registered at '/admin/talon/file'.
If this url belongs to an extension, make sure it is properly installed.
If the extension was recently installed, try removing the /tmp/luci-indexcache file.
(And yes, I clear the cache before I reload the page).
I'm OK with creating a symlink to the /etc/, but that hasn't been fruitful, either - mainly because I really don't know what kind of magic the lua service is doing to find the existing files.
I'd prefer for the solution to be in just HTML and JavaScript.
Yes, I've looked around for a basic solution, but either the questions want to do more than just download, or there's not enough information for me to figure out what is supposed to be done.
Please post a full solution, not just snippets.
I was able to figure it out based on some other code within that same source. It worked on one page, but not another. Not sure why - just more sorcery. I had to work within the Lua scripting language to get to the file I wanted; the HTML was straightforward. If I knew what the magic thing was that made it work, I'd post the actual solution, but I think the solution is somewhat unique to how the original developer put things together, so it wouldn't be useful to anyone else.

How do I make my server do all the loading and JavaScript and then serve the page all ready

I've got a webpage that calls Oracle, then does some processing, and then runs a lot of JavaScript.
The problem is that all of this makes it slow for the user. I have to use Internet Explorer 6, so the JavaScript takes very long to run, around 15 seconds.
How can I make my server do all of this every minute, for example, and save the page, so that if a user requests it, they are served a page that is already calculated, etc.?
I'm using a Tomcat server; my webpage is mainly JavaScript and HTML.
edit:
By the way I can not rewrite my webpage, it would have to remain as it is
I'm looking for something that would give the user a snapshot of the webpage that the server loaded
YSlow recommendations would tell you that you should put all your CSS in the head of your page and all JavaScript at the bottom, just before the closing body tag. This will allow the page to fully load the DOM and render it.
You should also minify and compress your JavaScript to reduce download size.
To do that, you'd need to have your server build up the DOM, run the JavaScript in an environment that looks (enough) like a web browser, and then serialize the result as HTML.
There have been various attempts to do that; Jaxer is one of them (it was originally a product from Aptana, now an Apache project). Another related answer here on SO pointed to the jsdom project, which is a DOM implementation in JavaScript.
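To give a sense of what that looks like in practice, here is a rough sketch using jsdom's modern API (which postdates this answer); the URL, the fixed wait, and the caching strategy are assumptions for illustration, not something Jaxer or jsdom prescribe:
// Rough sketch: load the page server-side, let its scripts run against a jsdom DOM,
// wait briefly, then serialize the result to HTML that could be cached and served.
const { JSDOM } = require('jsdom');

async function prerender(url) {
  const dom = await JSDOM.fromURL(url, {
    runScripts: 'dangerously',  // execute the page's own <script> tags
    resources: 'usable',        // also fetch and run external scripts
    pretendToBeVisual: true,    // provide requestAnimationFrame and friends
  });

  // Give the page's scripts a moment to finish building the DOM.
  await new Promise((resolve) => setTimeout(resolve, 5000));

  return dom.serialize();       // the rendered HTML snapshot
}

// e.g. run this every minute and hand the cached HTML to users of the slow IE6 page
prerender('http://internal-app.example/report').then((html) => {
  console.log('pre-rendered ' + html.length + ' bytes of HTML');
});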
Re
By the way I can not rewrite my webpage, it would have to remain as it is
That's very unlikely to be successful. There is bound to be some modification involved. At the very least, you're going to have to tell your server-side framework what parts it should process and what parts should be left to the client (e.g., user-interaction code).
Edit:
You might also look for "website thumbnail" services like shrinktheweb.com and similar. Their "pro" account allows full-size thumbnails (what I don't know is whether it's an image or HTML). But I'm not specifically suggesting them, just a line you might pursue. If you can find a project that does thumbnails, you may be able to adapt it to do what you want.
But again, take a look at Jaxer; you may find that it does what you need or something very similar (and it's open-source, so you can modify it or extract the bits you want).
"How can i make my server do all of this every minute for example"
If you are asking how you can make your database server 'pre-run' a query, then look into materialized views.
If the Oracle query is responsible for (for example) 10 seconds of the delay, there may be other things you can do to speed it up, but we'd need a lot more information on what the query does.

Is there a way to stop Firebug from working on a particular site?

Is there some way to make Firebug not work at all on a website?
If the performance of your website suffers when Firebug is enabled, you may want to display a warning, asking users to switch it off. You can easily detect if Firebug is enabled through JavaScript.
WARNING: EXTREME EVIL. NEVER EVER USE THIS CODE. Also, it won't deter someone who is resourceful.
setTimeout(checkForFirebug, 100);

function checkForFirebug() {
    if (window.console && window.console.firebug) {
        while (true); // Firebug is enabled
    }
    setTimeout(checkForFirebug, 100);
}
EDIT: I figured I would provide an answer to the real question behind the question. The fact is, Javascript is an interpreted language and that interpreter is in the browser. This makes it literally impossible to provide Javascript that is both secure and runnable. The same goes for HTML and CSS. The best you can do is minify the Javascript to make it a little less easy to reuse. If the company in question really wants "secure" Javascript, you just have to tell them it's not truly possible.
Ummm....
What does using Firefox (with or without Firebug) have to do with this?
I use IE and I can just as easily view your JavaScript. Likewise with Google Chrome. Hell, I can download your JavaScript when viewing your webpage on my Palm Treo.
Anything which can be accessed directly from a browser can be downloaded and analyzed at leisure. As others have said (better than I), JavaScript which runs on your website should be considered to be "open source". Find another way to do it (i.e. processing on your server) or accept that someone will hack in and look at it.
Mind you, are your routines so obviously good (in terms of what they do to your webpages) that a user will go to your website and immediately say "Hey, this is cool, I wonder how they do it?" If not, don't worry about it - most people won't be interested enough to try to look at your JavaScript.
You could try minifying your JavaScript, but that's not 100% going to stop someone who's determined. You could try encrypting it, but I've never tried that. Or put a copyright notice in your JavaScript files, so at least someone else won't be able to subsequently pass off your work as theirs without getting into legal trouble.
No. Nobody wants your javascript routines anyway. :-)
And if you're worried about unsecure code, you should rewrite your site to be secure instead of trying to hide its problems.
If you want to hide your HTML/CSS/JavaScript from visitors, that is not possible. Even if one cannot use Firebug, one can simply view the HTML source code. Any external JavaScripts and stylesheets can be downloaded as the plain text files they are. Because HTML, CSS, and JavaScript are client-side technologies, that are downloaded as plain text and interpreted by the web browser, it is theoretically impossible to hide your code. The best thing you can do to make the code harder to understand, is to obfuscate it. See Wikipedia.
You could click on the Off button to disable it.
Or are you trying to prevent your users from running it? If so, good luck...
"My javascript routines" belong to the company I work for and my company wants the stuff we develop secured.
You do not secure stuff by lightly patting "hackers" on the fingers when they use one specific debugging tool. Try to prevent them from using the ultimate hacker tool: "View Source".
If it's out there, it's out there. "Secure" means something different in this context. It means securing whatever important data you have by employing techniques that are impenetrable* even with full knowledge of the source code. The source code itself is not securable, nor does it need to be.
*) "impenetrable" = difficult enough to subvert in a reasonable amount of time, nothing is 100% :)
You could develop your site in Flash, Silverlight, or Java. Firebug will then be limited to displaying your base HTML.
I'm assuming you're worried about reverse engineering with FireBug.
Anything you send to the client, all your javascript, is open to whoever you send it to. Don't have anything there that you don't want people to see. There is no way to prevent someone else's browser from using Firebug, or a lot of other tools, to analyze your code. You could try to make your html, css, and javascript really bad, and that might slow them down! There are obfuscation programs to make it difficult to read. If you want to hide functionality, you'll need to have it happen on the server.
No, of course not. If Firebug is revealing something that you must prevent your users from seeing, then you are approaching this problem completely wrong. I am not trying to be rude or degrading, but attempting to block one particular program in an effort to fix a bug in your web application is about as logical as a bucket of steam. Firebug does nothing magical; I can do anything it does by writing some code. Having said that, there must be an underlying issue that you should be more concerned about.
Just to share a little trick I use that helps reduce how easily people see your code.
It's one of those tricks that won't prevent the JavaScript from being found by an experienced developer or hacker, but it deters the few people playing with Firebug / the inspector.
Use jQuery or another library with a good selector engine.
The second port of call is your files: put them all into a loader file, e.g.
Loader.js
(function ($) {
    // Append a script or stylesheet to <head>, tagged with loaded="loader"
    // so the unloader below can find and remove the elements later.
    function loader(type, addr) {
        var head = $("head")[0];
        var element;
        switch (type) {
            case "script":
                element = $(document.createElement("script"));
                element.attr("type", "text/javascript");
                element.attr("src", addr);
                element.attr("loaded", "loader");
                $(head).append(element);
                break;
            case "style":
                element = $(document.createElement("link"));
                element.attr("rel", "stylesheet");
                element.attr("type", "text/css");
                element.attr("loaded", "loader");
                element.attr("href", addr);
                $(head).append(element);
                break;
        }
    }

    loader("style", "path/to/your.css");
    loader("script", "path/to/script.js");
    loader("script", "unloader.js");
})(jQuery);
To start with, we're using a closure; this prevents anyone from using the console input of the inspector to see the code that has been run.
Once this file has run, it will have loaded your CSS and JS, but they are still visible in the head element in your inspector. Thanks to the way browsers work, you can remove the elements without unloading them: the code will not be removed from execution, but it will no longer be shown in the inspector. That is what goes in the unloader.
unloader.js
(function ($) {
    $("head *[loaded=loader]").remove();
})(jQuery);
The above will remove the files loaded through the loader.
The only thing you need to remember is to add loaded="loader" to the script tag that includes the loader itself. This does not make it impossible for someone to see your files, but it stops the inspector from showing them in the HTML.
The way around this is to use "View Source", see the loader file, and read that, so make sure you minimize the code; I use the Google Closure Compiler (http://closure-compiler.appspot.com/home).
Even this does not stop them; it just makes it more difficult. One step I have tested but don't use: for the loader and the files it loads, add an .htaccess rule that checks requests have a referer from your site; this prevents people from browsing directly to your JS/CSS files.
Another tip: don't store them in the normal places and don't use common names, e.g. scripts in /scripts/, CSS in /style/ or style.css.
Here is an example of the loader, Closure-compiled and then obfuscated:
Loader.js
var _0xc596=["\x68\x65\x61\x64","\x73\x63\x72\x69\x70\x74","\x63\x72\x65\x61\x74\x65\x45\x6C\x65\x6D\x65\x6E\x74","\x74\x79\x70\x65","\x74\x65\x78\x74\x2F\x6A\x61\x76\x61\x73\x63\x72\x69\x70\x74","\x61\x74\x74\x72","\x73\x72\x63","\x6C\x6F\x61\x64\x65\x64","\x6C\x6F\x61\x64\x65\x72","\x61\x70\x70\x65\x6E\x64","\x6C\x69\x6E\x6B","\x72\x65\x6C","\x73\x74\x79\x6C\x65\x73\x68\x65\x65\x74","\x74\x65\x78\x74\x2F\x63\x73\x73","\x68\x72\x65\x66","\x73\x74\x79\x6C\x65","\x63\x73\x73","\x70\x61\x74\x68\x2F\x74\x6F\x2F\x79\x6F\x75\x72\x2E\x63\x73\x73","\x70\x61\x74\x68\x2F\x74\x6F\x2F\x73\x63\x72\x69\x70\x74\x2E\x6A\x73","\x75\x6E\x6C\x6F\x61\x64\x65\x72\x2E\x6A\x73"];(function (_0x76e5x1){function _0x76e5x2(_0x76e5x2,_0x76e5x3){var _0x76e5x4=_0x76e5x1(_0xc596[0])[0];switch(_0x76e5x2){case _0xc596[1]:var _0x76e5x5=_0x76e5x1(document[_0xc596[2]](_0xc596[1]));_0x76e5x5[_0xc596[5]](_0xc596[3],_0xc596[4]);_0x76e5x5[_0xc596[5]](_0xc596[6],_0x76e5x3);_0x76e5x5[_0xc596[5]](_0xc596[7],_0xc596[8]);_0x76e5x1(_0x76e5x4)[_0xc596[9]](_0x76e5x5);;case _0xc596[15]:_0x76e5x5=_0x76e5x1(document[_0xc596[2]](_0xc596[10]));_0x76e5x5[_0xc596[5]](_0xc596[11],_0xc596[12]);_0x76e5x5[_0xc596[5]](_0xc596[3],_0xc596[13]);_0x76e5x5[_0xc596[5]](_0xc596[7],_0xc596[8]);_0x76e5x5[_0xc596[5]](_0xc596[14],_0x76e5x3);_0x76e5x1(_0x76e5x4)[_0xc596[9]](_0x76e5x5);;} ;} ;_0x76e5x2(_0xc596[16],_0xc596[17]);_0x76e5x2(_0xc596[1],_0xc596[18]);_0x76e5x2(_0xc596[1],_0xc596[19]);} )(jQuery);
unloader.js
var _0xc2fb=["\x72\x65\x6D\x6F\x76\x65","\x68\x65\x61\x64\x20\x2A\x5B\x6C\x6F\x61\x64\x65\x64\x3D\x6C\x6F\x61\x64\x65\x72\x5D"];(function (_0x3db3x1){_0x3db3x1(_0xc2fb[1])[_0xc2fb[0]]();} )(jQuery);
To reproduce this, go to http://closure-compiler.appspot.com/home and put your code under the // ADD YOUR CODE HERE comment.
Then take the result it gives back and run it through http://www.javascriptobfuscator.com/Default.aspx to make it even more unreadable.
Hope this helps anyone else looking to make their JS as secure as possible.
But please remember, as everyone else has said, this will not stop the pro hackers; it just makes the code very difficult to read and understand.
No...............
Ultimately, no, as the browser (in this case firefox) on their machine can choose to run whatever javascript (such as firebug) it wants to. You cannot prevent users from running it along with your website.
If you want to protect your code, you could try encrypting your JavaScript source code.
Google "encrypt javascript source".
My reputation is too low to comment, but I just wanted to point out something that I noticed after learning about window.history.pushState(); it seems that you can change what is currently in the address bar, and once you do that, "view page source" doesn't work. So if there was a way to block developer tools from working, I wouldn't know how to view the source code.
EDIT: After using window.history.pushState(), when I view developer tools, it tells me to reload the page to view what is in a javascript file (but then again it does show the address to the JS file so that doesn't help much)
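For anyone unfamiliar with the API being referred to, this is roughly all pushState does (the path and state object below are arbitrary examples):
// history.pushState rewrites the address bar without loading a new page.
history.pushState({ page: 2 }, '', '/some/virtual/path');
console.log(location.pathname); // "/some/virtual/path", even though nothing was fetched from it
Note that this only changes which URL "View Page Source" requests afterwards; the scripts themselves remain fully visible in the browser's developer tools, so it hides nothing.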
