Regex from another domain? - javascript

I'm trying to make an automated process to retrieve some information from a site on my work network.
var duderegex = new RegExp("Title for Mr\\. [^\n]+", "m"); // escape the literal dot
var dude = duderegex.exec(input);
So far, so good. The problem is that I'm writing this on my work computer and probably won't be able to convince anyone to host the script on the same domain as the site it reads from, so the same-origin policy technically makes this a cross-domain (XSS-style) problem. And I'd rather not have to get approval to install anything seriously funky (so I can't guarantee jQuery, or a PowerShell console that's easier to copy/paste from, for example).
I don't have any problem downloading the files and manipulating them via a web page after download, but that adds an extra step of clicking Save As...
Does anyone have a workable solution for running regex on HTML source from a different domain? It doesn't have to be JavaScript, but getting PHP to work, for example, might require more resources than I have.
A commenter asked for clarification, so here goes. Let's say I have to contact 50 copyright holders a day (it has nothing to do with intellectual property, but the analogy works). Right now, I have a form that takes me to
(1) http://foo.bar/form.htm?action=search&type=ArtistAlbum&Artist=Beatles&Album=White
and redirects to
(2) http://foo.bar/form.htm?id=4578469
From there, I click on a dropdown (let's say track listing), and that takes me to
(3) http://foo.bar/form.htm?id=4578469&track=7
There I have an alphabetical list of everyone who worked on the track, their agents, and legal representatives. I'm only interested in three names: the person who holds the copyright for the lyrics, the person who holds the copyright for the melody, and the person who holds the copyright for the recording. So I have to search the document three times.
Since each name has a standard title, I should be able to write a script that asks for the artist and album, generates the link to (1), either copies the id param from the URL for (2) or uses a regex to pull it from the link to (3), loads page (3), and then runs a regex for each of these patterns against the source:
/Lyrics Copyright Holder [^\n]+/
/Melody Copyright Holder [^\n]+/
/Performance Copyright Holder [^\n]+/
I could download all the files (it would take a long time), but the information changes on occasion, and I want to make sure I'm always pulling the newest information.
But I can't seem to get around the XSS bit.

You don't say what problem you're really trying to solve, so it's a little hard to know which solutions make the most sense for you. But you can write JavaScript that runs on any web page via a browser extension (in Chrome or Firefox), or use a scripting language outside the browser (Python, JavaScript/Node, PHP, etc.) to load the page contents and then manipulate them with the language's tools.
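For example, a minimal Node.js sketch (assuming the work network lets the script fetch the page directly; foo.bar and the copyright-holder labels are just the placeholders from the question):

var http = require('http'); // use require('https') for https:// URLs

function fetchPage(url, callback) {
  http.get(url, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () { callback(body); });
  });
}

fetchPage('http://foo.bar/form.htm?id=4578469&track=7', function (html) {
  // Run the three searches from the question against the raw HTML source.
  [/Lyrics Copyright Holder [^\n]+/,
   /Melody Copyright Holder [^\n]+/,
   /Performance Copyright Holder [^\n]+/].forEach(function (re) {
    var match = re.exec(html);
    console.log(match ? match[0] : 'not found: ' + re);
  });
});

Because the script runs outside the browser, the same-origin policy never comes into play.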

Why does gmail use eval?

This question suggests that using eval is a bad practice and many other questions suggest that it is 'evil'.
An answer to the question suggests that using eval() could be helpful in one of these cases:
Evaluate code received from a remote server. (Say you want to make a site that can be remotely controlled by sending JavaScript code to it.)
Evaluate user-written code. Without eval, you can't program, for example, an online editor/REPL.
Creating functions of arbitrary length dynamically (function.length is read-only, so the only way is using eval).
Loading a script and returning its value (sketched below). If your script is, for example, a self-calling function, and you want to evaluate it and get its result (e.g. my_result = get_script_result("foo.js")), the only way of programming the function get_script_result is by using eval inside it.
Re-creating a function in a different closure.
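As a tiny illustration of the "load a script and return its value" case (the script text and function name below are made up for the example):

// The source would normally come from a file or a trusted server.
var scriptText = '(function () { return 2 + 2; })()';

function get_script_result(source) {
  // eval runs the self-calling function and hands back its return value,
  // which a plain <script> tag has no way of doing.
  return eval(source);
}

var my_result = get_script_result(scriptText); // 4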
While looking at the Google Accounts page source code, I found this:
(function(){eval('var f,g=this,k=void 0,p=Date.now||function(){return+new Date},q=function(a,b,c,d,e){c=a.split("."),d=g,c[0]in d||!d.execScript||d.execScript("var "+c[0]);for(;c.length&&(e=c.shift());) [a lot of code...] q("botguard.bg.prototype.invoke",K.prototype.ha);')})()</script>
I just can't see how this is helpful, as it does not match any of the above cases. A comment there says:
/* Anti-spam. Want to say hello? Contact (base64)Ym90Z3VhcmQtY29udGFjdEBnb29nbGUuY29tCg== */
I can't see how eval would be used as anti-spam. Can somebody tell me why it is used in this specific case?
Mike Hearn from plan99.net created an anti-bot JS system, and you are seeing parts of its anti-reverse-engineering methods (randomized encryption). Here is a mailing-list post of his that mentions it: https://moderncrypto.org/mail-archive/messaging/2014/000780.html
[messaging] Modern anti-spam and E2E crypto
Mike Hearn
Fri Sep 5 08:07:30 PDT 2014
There's a significant amount of magic involved in preventing bulk signups. As an example, I created a system that randomly generates encrypted JavaScripts that are designed to resist reverse engineering attempts. These programs know how to detect automated signup scripts and entirely wiped them out.
http://webcache.googleusercontent.com/search?q=cache:v6Iza2JzJCwJ:www.hackforums.net/archive/index.php/thread-2198360.html+&cd=8&hl=en&ct=clnk&gl=ch
You can google for information about the system by its base64 contact code "Ym90Z3VhcmQtY29udGFjdEBnb29nbGUuY29tCg" or by "botguard-contact".
The post http://webcache.googleusercontent.com/search?q=cache:v6Iza2JzJCwJ:www.hackforums.net/archive/index.php/thread-2198360.html+&cd=8&hl=en&ct=clnk&gl=ch says:
The reason for this is the new protection Google introduced a couple of weeks/months ago.
Let me show you a part of the new Botguard ( as google calls it )
Code:
/* Anti-spam. Want to say hello? Contact (base64) Ym90Z3VhcmQtY29udGFjdEBnb29nbGUuY29tCg== */
You will have to crack the algorithm of this javascript, to be able to create VALID tokens that allow you to register a new account.
Google still allows you to create accounts without these tokens, and you wanna know why?
It's because they wait a couple of weeks, follow up the trace you and your stupid bot leave behind, and then make a banwave.
ALL accounts you've sold, all accounts your customers created will be banned.
Your software might still be able to create accounts after the banwave, but what's the use?
So, botguard is an optional security measure. It can be computed correctly in a browser, but not in some/most of the JavaScript engines used by bots. You can bypass it by not submitting a correct token, but the created account will be marked as a bot account and disabled soon (and linked accounts will be terminated too).
There are also several epic threads on GitHub:
https://github.com/assaf/zombie/issues/336
Why does Zombie produce an improper output compared to the more basic contextify version in the following example?
Output varies depending on when document.bg is initialized to new botguard.bg(), because the botguard script mixes in a timestamp salt when encoding.
mikehearn commented on May 21, 2012
Hi there,
I work for Google on signup and login security.
Please do not attempt to automate the Google signup form. This is not a good idea and you are analyzing a system that is specifically designed to stop you.
There are no legitimate use cases for automating this form. If you do so and we detect you, the accounts you create with it will be immediately terminated. Accounts associated with the IPs you use (ie, your personal accounts) may also be terminated.
If you believe you have a legitimate use case, you may be best off exploring other alternatives.
In the https://github.com/jonatkins/ingress-intel-total-conversion/issues/864 thread there are some details:
The script in question contains heavily obfuscated code that starts with the anti-spam comment quoted above.
The code contains a lot of generic stuff: user agent sniffing (yay, Internet Explorer), object type detection, code for listening to mouse/keyboard events... So it looks like some generic library. After that there's a lot of cryptic stuff that makes absolutely no sense. The interesting bit is that it calls something labeled as "botguard.bg.prototype.invoke".
Evidently this must be Google's botguard. From what I know, it collects data about user behavior on the page and in the browser and evaluates it against other known data; this way it can detect anomalous usage and detect bots (kinda like clientBlob in the Ingress client). My guess would be it's detecting what kind of actions it takes the user to send requests (clicks and map events would be the most sensible).
So, Google uses the evil eval to fight evil users, who are unable to emulate the evaluated code fast or correctly enough.
eval() is dangerous when used on untrusted input. When used on a hardcoded string, that's not generally the case.

JavaScript : Find Captcha on a web page

I am working on a captcha decode/break Firefox extension, and I want to find the captcha field on a page if it exists. I want to make this generic, so that whenever a page is loaded I get the captcha image. In short: whenever a page is loaded, it checks for a captcha and, if one is there, gets its image.
An approach I was trying is to find the text 'captcha' on the page and then the img tag if it exists. Can anyone please suggest a better solution that works on as many sites as possible? Thanks in advance.
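A rough sketch of that heuristic, for what it's worth (the 'captcha' keyword match is an assumption about how typical pages name their images, so expect both misses and false positives):

function findCaptchaImages(doc) {
  var images = doc.getElementsByTagName('img');
  var found = [];
  for (var i = 0; i < images.length; i++) {
    var img = images[i];
    // Look for "captcha" anywhere in the src, id, class or alt text.
    var haystack = (img.src + ' ' + img.id + ' ' + img.className + ' ' + (img.alt || '')).toLowerCase();
    if (haystack.indexOf('captcha') !== -1) {
      found.push(img);
    }
  }
  return found;
}

window.addEventListener('load', function () {
  console.log('captcha candidates:', findCaptchaImages(document));
});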
Saadsaf,
In one of your comments you mentioned that:
"...may be u r right, But actually i'm not using it for any enehical purpose, I'm just collecting captchas"
Collecting Captchas is pretty much a useless task.
How Captchas Work:
The way a Captcha works is that the user enters the characters shown in the image and then submits a form, at which time the characters entered are compared to the expected ones for that image. If the two sets of characters match perfectly, a 'Pass' condition is declared and the guarded process is allowed to continue.
The characters in the Captcha image are generally not stored anywhere that the user has access to, otherwise there would be a severe security issue for Captchas. Most commonly, Captchas will be compared server-side so that the client has as little access to the character string as is possible.
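A minimal sketch of that server-side comparison, assuming a Node/Express app with session middleware (the route, field names and middleware setup are placeholders, not any particular captcha library's API):

var express = require('express');
var session = require('express-session');
var app = express();

app.use(session({ secret: 'change-me', resave: false, saveUninitialized: true }));
app.use(express.urlencoded({ extended: false }));

app.post('/verify-captcha', function (req, res) {
  // The expected text was stored in the session when the image was generated,
  // so the client never has access to it.
  if (req.body.captcha === req.session.captchaText) {
    res.send('pass');
  } else {
    res.status(403).send('fail');
  }
});

app.listen(3000);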
Why collecting Captchas is a poor idea:
If you were to "collect" captchas for use on your sites, you would have to look at each and every one (thousands, if your code collects them for you) and then somehow store the correct characters corresponding to each image for later use.
Between writing code to find the Captchas for you and then going through them all manually to correctly read the characters in each one, you will waste weeks or months of your life.
What to do instead:
If you are interested in using Captchas on your sites to protect forms and prevent spam and abusive robots, your best bet is to learn how to create your own custom Captchas. There are endless resources at your disposal for just such a thing, including YouTube, Google, Stack Overflow, and more.
Where to start:
Hop on Google and search "How to create a Captcha". That is a good start. Other useful search terms might be "Custom Captcha", "PHP Captcha", "JavaScript and PHP Captcha"... Try the same searches on YouTube. Search for "Captcha" here on Stack Overflow.
Good luck. I hope you have only the best of intentions in mind when using this site.

Equivalent of SPContext.Current.ListItem in Client Object Model (ECMAScript)

I'm integrating an external application to SharePoint 2010 by developing custom ribbon tabs, groups, controls and commands that are made available to editors of a SharePoint 2010 site. The ribbon commands use the dialog framework to open dialogs with custom application pages.
In order to pass a number of query string parameters to the custom application pages, I'm looking for the equivalent of SPContext.Current.ListItem in the Client Object Model (ECMAScript).
Regarding the available tokens (i.e. {ListItemId} or {SelectedItemId}) that can be used in the declarative XML, I am already emitting all tokens, but unfortunately the desired tokens are either not parsed or simply null when in the context of a Publishing Page (i.e. http://domain/pages/page.aspx). Thus, none of the tokens that do render are of use in establishing the context of the calling SPListItem in the application page.
Looking at SP.ClientContext.get_current() provides a lot of information about the current SPSite, SPWeb etc., but nothing about the SPListItem I'm currently positioned at (again, with the page rendered in the context of a Publishing Page).
What I've come up with so far is the idea of passing in the URL of the current page (i.e. document.location.href) and parsing that in the application page - however, it feels like I'm going in the wrong direction, and SharePoint surely should be able to provide this information.
I'm not sure this is a great answer, or even fully on-topic, but is basically something I originally intended to blog about - anyway:
It is indeed a pain that the Client OM does not seem to provide a method/property with details of the current SPListItem. However, I'd venture to say that this is a simple concept, but actually has quite wide-ranging implications in SharePoint which aren't apparent until you stop to think about it.
Consider:
Although a redirect exists, a discussion post can be surfaced on 2 or 3 different URLs (e.g. Threaded.aspx/Flat.aspx)
Similarly, a blog post can exist on a couple (Post.aspx/EditPost.aspx, maybe one other)
A list item obviously has DispForm.aspx/EditForm.aspx and (sort of) NewForm.aspx
Also, even for items with an associated SPFile (e.g. document, publishing page), consider that these URLs represent the same item:
http://mydomain/sites/someSite/someLib/Forms/DispForm.aspx?ID=x, http://mydomain/sites/someSite/someLib/Filename.aspx
Also, there could be other content types outside of this set which have a similar deal
In our case, we wanted to 'hang' data off internal and external items (e.g. likes, comments). We thought "well everything in SharePoint has a URL, so that could be a sensible way to identify an item". Big mistake, and I'm still kicking myself for falling into it. It's almost like we need some kind of 'normalizeUrl' method in the API if we wanted to use URLs in this way.
Did you ever notice the PageUrlNormalization class in Microsoft.SharePoint.Utilities? Sounds promising doesn't it? Unfortunately that appears to do something which isn't what I describe above - it doesn't work across the variations of content types etc (but does deal with extended web apps, HTTP/HTTPS etc).
To cut a long story short, we decided the best approach was to make the server emit details which allowed us to identify the current SPListItem when passed back to the server (e.g. in an AJAX request). We hide the 'canonical' list item ID in a JavaScript variable or hidden input field (whatever really), and these are evaluated when back at the server to re-obtain the list item. Not as efficient as obtaining everything from context, but for us it's OK because we only need to resolve when the user clicks something, not on every page load. By canonical, I mean:
SiteID|WebID|ListID|ListItemID
IIRC, one of the key objects has a CanonicalId property (or maybe it's internal), which may help you build such a string.
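A bare-bones sketch of the client half of that approach, assuming the server has rendered the canonical ID into a hidden field (the field ID and the handler URL below are made-up placeholders):

// Read the SiteID|WebID|ListID|ListItemID string the server emitted.
var canonicalId = document.getElementById('canonicalListItemId').value;

var xhr = new XMLHttpRequest();
xhr.open('POST', '/_layouts/MyApp/ResolveItem.ashx', true);
xhr.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded');
xhr.onreadystatechange = function () {
  if (xhr.readyState === 4 && xhr.status === 200) {
    // The handler splits the ID back into its four parts and re-opens the item.
    console.log(xhr.responseText);
  }
};
xhr.send('id=' + encodeURIComponent(canonicalId));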
So in terms of using the window.location.href, I'd avoid that if you're in vaguely the same situation as us. Suggest considering an approach similar to the one we used, but do remember that there are some locations (e.g. certain forms) where even on the server SPContext.Current.ListItem is null, despite the fact that SPContext.Current.Web (and possibly SPContext.Current.List) are populated.
In summary - IDs are your friend, URLs are not.

Run external Javascript on given URL?

I want to be able to write JavaScript that executes on a page that I visit every day. It is for my employer's blog-like "What's New" page, where there are announcements such as new hires, terminations, etc. But among these are also articles announcing grants we received and other important things.
I want to filter out the things I don't care about but keep those I do. I could do this with CSS if there were selectors I could match on, but there are none that separate those posts from the ones I don't care about.
So, is there a way in Firefox to specify a JS file that would act on a certain URL? Like the user-defined userChrome.css that applies to every page?
Thanks
Eric
You want Greasemonkey. Despite the name that makes you go ewwwww, it is a very powerful and useful tool.
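A minimal user script sketch of the filtering you describe (the @match URL, the '.post' selector and the keyword list are placeholders you would adapt to the real page's markup):

// ==UserScript==
// @name     Filter What's New
// @match    http://intranet.example.com/whats-new*
// @grant    none
// ==/UserScript==
(function () {
  var keepWords = ['grant', 'award'];              // posts mentioning these stay visible
  var posts = document.querySelectorAll('.post');  // one element per announcement
  Array.prototype.forEach.call(posts, function (post) {
    var text = post.textContent.toLowerCase();
    var keep = keepWords.some(function (w) { return text.indexOf(w) !== -1; });
    if (!keep) {
      post.style.display = 'none';                 // hide the ones you don't care about
    }
  });
})();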

What precautions should I take to prevent XSS on user submitted HTML?

I'm planning on making a web app that will allow users to post entire web pages on my website. I'm thinking of using HTML Purifier, but I'm not sure, because HTML Purifier edits the HTML and it's important that the HTML is maintained just how it was posted. So I was thinking of making some regex to get rid of all script tags and all the JavaScript attributes like onload, onclick, etc.
I saw a Google video a while ago that had a solution for this. Their solution was to host the JavaScript on another website so that it cannot access the original website. But I don't wanna purchase a new domain just for this.
Be careful with homebrew regexes for this kind of thing.
A regex like
s/(<.*?)onClick=['"].*?['"](.*?>)/$1 $2/
looks like it might get rid of onclick events, but you can circumvent it with
<a onClick<a onClick="malicious()">="malicious()">
running the regex on that will get you something like
<a onClick ="malicious()">
You can fix it by repeatedly running the regex on that string until it doesn't match, but that's just one example of how easy it is to get around simple regex sanitizers.
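For what it's worth, that repeat-until-stable loop might look like the sketch below (plain JavaScript rather than the sed-style notation above); it only handles this one trick and is no substitute for a real sanitizer:

function stripOnClick(html) {
  var re = /(<[^>]*?)onClick=['"][^'"]*['"]([^>]*>)/gi;
  var previous;
  do {
    previous = html;
    html = html.replace(re, '$1 $2');
  } while (html !== previous); // repeat until a pass changes nothing
  return html;
}

stripOnClick('<a onClick<a onClick="malicious()">="malicious()">');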
The most critical error people make when doing this is validating things on input.
Instead, you should validate on display.
The context matters when determining what is XSS and what isn't. Therefore, you can happily accept any input, as long as you pass it through appropriate cleaning functions when displaying it.
Consider that what constitutes 'XSS' will be different when the input is placed in <a href="HERE"> as opposed to <a>here!</a>.
Thus, all you need to do, is make sure that any time you write user data, you consider, very carefully, where you are displaying it, and make sure that it can't escape the context you are writing it to.
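As a bare-bones sketch of escaping at display time (these two helpers only cover the text and attribute contexts; URLs, CSS and inline-script contexts each need their own treatment):

function escapeHtml(s) {
  return String(s)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

function escapeAttribute(s) {
  return escapeHtml(s)
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Text context:      '<span>' + escapeHtml(userInput) + '</span>'
// Attribute context: '<a href="' + escapeAttribute(userUrl) + '">here!</a>'
// (href also needs a scheme check; escaping alone does not block javascript: URLs)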
If you can find any other way of letting users post content, that does not involve HTML, do that. There are plenty of user-side light markup systems you can use to generate HTML.
So I was thinking of making some regex to get rid of all script tags and all the JavaScript attributes like onload, onclick, etc.
Forget it. You cannot process HTML with regex in any useful way. Let alone when security is involved and attackers might be deliberately throwing malformed markup at you.
If you can convince your users to input XHTML, that's much easier to parse. You still can't do it with regex, but you can throw it into a simple XML parser, and walk over the resulting node tree to check that every element and attribute is known-safe, and delete any that aren't, then re-serialise.
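A rough sketch of that whitelist walk, using the browser's DOMParser for brevity (the allowed tag and attribute lists are illustrative only; a production sanitizer needs far more care with URLs, CSS, namespaces and so on):

var ALLOWED_TAGS = { P: true, A: true, EM: true, STRONG: true, UL: true, LI: true };
var ALLOWED_ATTRS = { href: true, title: true };

function sanitize(html) {
  var doc = new DOMParser().parseFromString(html, 'text/html');

  function walk(node) {
    Array.prototype.slice.call(node.children).forEach(function (child) {
      if (!ALLOWED_TAGS[child.tagName]) {
        child.parentNode.removeChild(child);       // drop unknown elements entirely
        return;
      }
      Array.prototype.slice.call(child.attributes).forEach(function (attr) {
        if (!ALLOWED_ATTRS[attr.name]) {
          child.removeAttribute(attr.name);        // drop unknown attributes
        }
      });
      walk(child);
    });
  }

  walk(doc.body);
  return doc.body.innerHTML;                       // re-serialise the cleaned tree
}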
HTML Purifier edits the HTML and it's important that the HTML is maintained just how it was posted.
Why?
If it's so they can edit it in their original form, then the answer is simply to purify it on the way out to be displayed in the browser, not on the way in at submit-time.
If you must let users input their own free-form HTML — and in general I'd advise against it — then HTML Purifier, with a whitelist approach (ban all elements/attributes that aren't known-safe) is about as good as it gets. It's very very complicated and you may have to keep it up to date when hacks are found, but it's streets ahead of anything you're going to hack up yourself with regexes.
But I don't wanna purchase a new domain just for this.
You can use a subdomain, as long as any authentication tokens (in particular, cookies) can't cross between subdomains. (Which for cookies they can't by default as the domain parameter is set to only the current hostname.)
Do you trust your users with scripting capability? If not don't let them have it, or you'll get attack scripts and iframes to Russian exploit/malware sites all over the place...
Make sure that user content doesn't contain anything that could cause JavaScript to be run on your page.
You can do this by using an HTML stripping function that gets rid of all HTML tags (like strip_tags from PHP), or by using another similar tool. There are actually many reasons besides XSS to do this. If you have user submitted content, you want to make sure that it doesn't break the site layout.
I believe you can simply use a subdomain of your current domain to host the JavaScript, and you will get the same security benefits for AJAX. Not cookies, however.
In your specific case, filtering out the <script> tag and Javascript actions is probably going to be your best bet.
1) Use clean simple directory based URIs to serve user feed data.
When you dynamically create URIs to address the user's uploaded data, service account, or anything else off your domain, make sure you don't post information as parameters in the URI. That is an extremely easy point of manipulation that could be used to expose flaws in your server security and even possibly inject code onto your server.
2) Patch your server.
Ensure you keep your server up to date on all the latest security patches for all the services running on that server.
3) Take all possible server-side protections against SQL injection.
If somebody can inject code into your SQL database that can execute from services on your box, that person will own your box. At that point they can install malware onto your web server to be fed back to your users, or simply record data from the server and send it out to a malicious party. (A small example of parameterised queries is sketched after this list.)
4) Force all new uploads into a protected sandboxed area to test for script execution.
No matter how you try to remove script tags from submitted code there will be a way to circumvent your safeguards to execute script. Browsers are sloppy and do all kinds of stupid crap they are not supposed to do. Test your submissions in a safe area before you publish them for public consumption.
5) Check for beacons in submitted code.
This step requires the previous step and can be very complicated, because a beacon can occur in script code that requires a browser plugin to execute, such as ActionScript, but it is just as much a vulnerability as allowing JavaScript to execute from user-submitted code. If a user can submit code that can beacon out to a third party, then your users, and possibly your server, are completely exposed to data loss to a malicious third party.
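The small parameterised-query sketch promised under point 3, using the Node "mysql" package as an example driver (connection details and table names are placeholders):

var mysql = require('mysql');
var connection = mysql.createConnection({
  host: 'localhost',
  user: 'app',
  password: 'secret',
  database: 'uploads'
});

function findUpload(userSuppliedId, callback) {
  // The driver binds the value for the "?" placeholder; the query text itself
  // never changes, so injected SQL is treated as data, not code.
  connection.query('SELECT * FROM uploads WHERE id = ?', [userSuppliedId], callback);
}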
You should filter ALL HTML and whitelist only the tags and attributes that are safe and semantically useful. WordPress is great at this and I assume that you will find the regular expressions used by WordPress if you search their source code.
