Regular expression assistance

I have a client's site that keeps getting hacked with XSS injections somehow. These XSS attacks are without fail in the banners section, and the banner ads need to have <script> tags to function.
I am still trying to figure out where and when this happens (it is a HUGE site, it is badly coded (sorry, previous guy...), and I am really swamped). So, in the meantime, I want to write a regular expression that deletes the partial tag that gets inserted.
So, if the script should be:
<script src="valid_script.js"></script>
The hacker simply does this:
<script src="valid_script.js"></script>
<script src="invalid_script.js"></script>
I need the regex to delete the script tag (there may be multiple matches) that contains "invalid_script.js" but leave the one that contains "valid_script.js" intact.
My question: Could you experts out there please show me how to do this regex? I am sorry, but I do not understand regex, I tried so hard to understand, but it is way over my head :-(

Taking note of all the comments, as you have: to answer your question, if the text to be output is in the $content variable (containing both the good and the bad script), then the following regular expression will strip just the bad one:
$content = preg_replace('#<script[^>]*invalid_script\.js[^>]*></script>#s', '', $content);
This says, briefly, look for the following in sequence: <script, a string of non-> characters, invalid_script.js, a string of non-> characters, and ></script>.
But to reiterate all the comments, this could be got around and is certainly only a sticking plaster of sorts.
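For anyone applying the same stopgap in JavaScript rather than PHP, a rough equivalent of the preg_replace call above might look like the sketch below. The function name stripInjectedScript is made up for illustration, and the same caveat applies: it only removes tags that look exactly like the injected one.

```javascript
// Hypothetical helper: removes every <script ...></script> tag whose
// attributes mention invalid_script.js and leaves other tags alone.
function stripInjectedScript(content) {
  // Same pattern as the PHP answer; the / in </script> is escaped because
  // this is a regex literal, and g removes all matches, not just the first.
  return content.replace(/<script[^>]*invalid_script\.js[^>]*><\/script>/g, "");
}

var page = '<script src="valid_script.js"></script>' +
           '<script src="invalid_script.js"></script>';
var cleanedPage = stripInjectedScript(page);
```

Because [^>]* cannot cross a closing >, the match cannot start in the valid tag and run on into the injected one.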

Related

Removing any script action from text

I'm writing a script that will detect and remove any potentially malicious script inserted by the user while posting something.
That was the easy part. The harder part is stopping all sneaky users by making sure all variations are detected. A simple regex can detect
<script>something</script>
but will fail on
<script>something</ script>
So I tried writing the rules as flexibly as I could, considering regex isn't my strong suit.
There are 3 rules:
Remove script tags.
Disable attributes like "onclick".
Remove prefix 'javascript:' from links.
Here:
content = content.replace(/<[\s]*script[^>]*>[\w|\t|\r\|\W]*?<[\/\s]*script[^>]*>/gi, "");
content = content.replace(/<*\s(on[A-Za-z]*[\s]*=)/gi, " ignoreme=");
content = content.replace(/<*[\s]*(href)*(javascript:)/gi, "");
Here's a working example.
I could really use expert advice on making this code more efficient, or on pointing out any errors.

You can replace your regex
/<[\s]*script[^>]*>[\w|\t|\r\|\W]*?<[\/\s]*script[^>]*>/gi
with
/<\s*script[^>]*>[\s\S]*?<[\/\s]*script[^>]*>/gi
removed unnecessary []
replaced [\w|\t|\r\|\W] with [\s\S] (both are equivalent)
You can also replace
/<*\s(on[A-Za-z]*[\s]*=)/gi
with
/<*\s(on[A-Za-z]*\s*=)/gi
and the following
/<*[\s]*(href)*(javascript:)/gi
with
/<*\s*(href)*(javascript:)/gi
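Put together, the three simplified patterns could be applied as in the sketch below. The function name stripScriptActions is hypothetical, and this is only the question's own approach tidied up; regex filtering like this can still be bypassed by inputs it does not anticipate.

```javascript
// Hypothetical wrapper around the three simplified patterns above.
function stripScriptActions(content) {
  // replace() returns a new string, so each result has to be reassigned.
  // 1. Remove whole <script>...</script> blocks, tolerating "</ script>".
  content = content.replace(/<\s*script[^>]*>[\s\S]*?<[\/\s]*script[^>]*>/gi, "");
  // 2. Rename event-handler attributes such as onclick= so they never fire.
  content = content.replace(/<*\s(on[A-Za-z]*\s*=)/gi, " ignoreme=");
  // 3. Drop the javascript: prefix from links.
  content = content.replace(/<*\s*(href)*(javascript:)/gi, "");
  return content;
}

var sample = 'a<script>something</ script>b';
var sanitized = stripScriptActions(sample);
```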

regex for background-image:url('URL');

I am trying to make a regex for finding: background-image:url('URL'); where the URL is an external link to an image.
Been trying for something like this:
/\s*?[ \t\n]background-image:url('https?:\/\/(?:[a-z\-]+\.)+[a-z]{2,6}(?:\/[^\/#?]+)+\.(?:jpe?g|gif|png)$');/i
But couldn't get it to work.
I am using this with javascript/jquery

Does this get what you want?:
/\s*?[ \t\n]background-image:url\('.+?'\);/i

I think you can simplify it to this if you know it will only change with the URL in the middle. I probably went overboard with the \ escapes but better to be safe than sorry.
/background\-image\:url\(\'.*?\'\)\;/

Epascarello hit the nail on the head. Is this source you control? Or at least a predictable website? What are multiple different examples of input and your expected results?
Will this always be inline in double quotes, and therefore your URL will always be in single quotes? Some old websites use double-quotes in their CSS Files or header CSS.
Do you want to capture the whole thing? Or are you just trying to extract the resulting URL?
SirCapsAlot brings up a good question: are you just looking for background image URLs in general? They can also be set through the background shorthand property, or even in JavaScript with .backgroundImage="url(image.jpg)".
And you definitely only want the ones that include http(s)?
With the limited requirements you gave, this is the best Regex:
background-image\s*:\s*url\('(https?://[^']+)
Comment here if you have answers to my questions which may alter your requirements, and thusly my answer.
Breakdown:
background-image\s*:\s*url //Find the literal text to begin, allowing optional whitespace around the colon
\(' //Find the literal opening parens and quote
( //Begin Capture Group 1
https?:// //Require http:// or https:// (the s is optional because of the ?)
[^']+ //Require that everything until the next quote is matched
) //Capture the result into Group 1
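Since the question mentions JavaScript/jQuery, here is a minimal sketch of using that pattern there. The helper name extractBackgroundUrl and the sample strings are assumptions for illustration; the only changes to the pattern are the escaped slashes that a regex literal requires and an i flag.

```javascript
// The pattern from the breakdown above, written as a regex literal
// (each / in https?:// has to be escaped as \/).
var BACKGROUND_URL = /background-image\s*:\s*url\('(https?:\/\/[^']+)/i;

// Hypothetical helper: returns the captured URL (group 1),
// or null when the text has no external background image.
function extractBackgroundUrl(css) {
  var match = BACKGROUND_URL.exec(css);
  return match ? match[1] : null;
}

var externalUrl = extractBackgroundUrl("background-image:url('http://example.com/a.png');");
var relativeUrl = extractBackgroundUrl("background-image:url('images/a.png');");
```

The relative path yields null because the pattern requires http:// or https:// at the start of the URL.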
A Co-Worker pointed out that I might have been downvoted for not capturing the closing tick. Note: Capturing the closing tick would be a wasted step, and is not necessary for this regex to work.
He also pointed out somebody might have downvoted me for requiring http or https in the url portion. But the user's question was specifically for external URLs, not internal ones. So this is a valid requirement and gets him closer to what he asked.
Sooo... not sure why this got a downvote.

how to prevent scripts from being run

SO kept preventing me from posting the title I wanted so finally got a title that let me post though it kind of sucks so feel free to edit/change it.
I have fields a user can fill in and in the javascript we have
'${chart.title}'
and stuff like that. Is it sufficient to just strip out the single quote character so that they cannot escape back out into JavaScript? Or are there other ways to close out the string that started with the single quote character?
${chart.title} inserts the title a user typed in on a previous page so naturally they could type something like "Title'+callMethod()+'RestOfTitle" injecting a callMethod into my javascript.
thanks,
Dean

The best way would be to restrict the input to alphanumerical and space characters.
If you want to allow anything inside the title, you can use an escaping function.
http://xkr.us/articles/javascript/encode-compare/
Just stripping the string of single quote characters is definitely not enough. Think of newlines, for one.
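As a rough illustration of what such an escaping function might do (this is not the code from the linked article, and the name escapeForJsString is made up), the sketch below turns every character that could end the string literal or the surrounding script block into a \uXXXX escape. In the setup from the question the equivalent step would run wherever ${chart.title} is rendered.

```javascript
// Hypothetical helper: makes a value safe to place between quotes in a
// JavaScript string literal that is itself embedded in an HTML page.
function escapeForJsString(value) {
  // Backslash, both quote types, <, >, &, CR, LF and the two Unicode
  // line separators are replaced with their \uXXXX form.
  return String(value).replace(/[\\'"<>&\r\n\u2028\u2029]/g, function (ch) {
    return "\\u" + ("0000" + ch.charCodeAt(0).toString(16)).slice(-4);
  });
}

var hostileTitle = "Title'+callMethod()+'RestOfTitle";
var escapedTitle = escapeForJsString(hostileTitle);
```

The escaped text contains no quote or newline characters, so it cannot terminate the literal, yet the JavaScript engine decodes it back to the original title.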

There are a couple of options.
First, go the very restrictive way and do both: apply so-called white-list validation to the input field for your title, and always encode the text that you output to the page. Validation filters out all unwanted (and potentially dangerous) characters, and if some of them get past the filter (or somebody updates the text to contain some JS code after the filters were applied), the encoding step makes any malicious script non-runnable (it turns it into plain text).
Second, let your users input whatever they want (which is strongly discouraged, but developers are sometimes asked to do it), but always encode the text that you output to the page.
You can implement white-list validation yourself using a regular expression, or you can use one of the existing libraries.
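A white-list check with a regular expression could be as small as the sketch below. The allowed character set (letters, digits, spaces), the 100-character limit, and the name isValidTitle are all assumptions; adjust them to whatever a title legitimately needs.

```javascript
// Assumed policy: 1 to 100 characters, letters, digits and spaces only.
var TITLE_PATTERN = /^[A-Za-z0-9 ]{1,100}$/;

// Hypothetical validator: true only when the whole title fits the policy.
function isValidTitle(title) {
  return typeof title === "string" && TITLE_PATTERN.test(title);
}

var plainOk = isValidTitle("Quarterly Sales 2013");
var attackOk = isValidTitle("Title'+callMethod()+'RestOfTitle");
```

Validation like this belongs on the server as well as in the browser, and it complements output encoding rather than replacing it.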

Javascript replace() function adding strange characters

Consider the following Javascript:
var previewImg = 'http://example.com/preview_img/hey.jpg';
var fullImg = previewImg.replace('preview','full');
I would expect the value of fullImg to be:
http://example.com/full_img/hey.jpg
In fact, it is... sort of. Running alert(fullImg); shows the expected url string. But when I deliver that variable to jQuery Fancybox, like this:
jQuery.fancybox.open(fullImg);
Something adds characters into the string, like this:
http://example.com/%EF%BF%BCfull_img/hey.jpg
Where is this %EF%BF%BC coming from? What is it? And most importantly, how do I get rid of it?
Some other clues: This is a Drupal 7 site, running jQuery 1.5.1. I'm using that same Fancybox script elsewhere on the site with no issues.

%EF%BF%BC is a sequence of three URL-encoded bytes that together encode a single character.
You clearly can't see any unexpected characters in the string. That's because the character behind %EF%BF%BC is usually rendered as nothing at all.
It's actually the UTF-8 encoding of U+FFFC, the object replacement character, which some applications use as a placeholder for an embedded object. (It is easy to mistake for the UTF-8 byte-order mark, but that is %EF%BB%BF.) It probably got into your code when you did a copy+paste from another file or application.
The quickest way to get rid of them is to find the bit of code that was copied+pasted, delete the characters on either side of the problem, and retype them. Depending on your editor, you may find the delete behaves strangely as it deletes the hidden characters.
Some text editors and IDEs will have an option to show hidden characters. If your editor has this, it may help you see where the mystery characters are so you can delete them.
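If hunting for the character in the editor is not practical, it can also be stripped at runtime as a stopgap. The sketch below simulates the problem by putting U+FFFC into the replacement string on purpose; the helper name stripHiddenCharacters is hypothetical, and it also removes U+FEFF (the byte-order mark) since that tends to sneak in the same way.

```javascript
// Hypothetical helper: removes U+FFFC (object replacement character)
// and U+FEFF (byte-order mark) wherever they appear.
function stripHiddenCharacters(text) {
  return text.replace(/[\uFFFC\uFEFF]/g, "");
}

var previewImg = "http://example.com/preview_img/hey.jpg";
// Simulates the pasted code: an invisible U+FFFC sits in front of "full".
var fullImg = previewImg.replace("preview", "\uFFFCfull");

var encodedBefore = encodeURI(fullImg);   // shows the hidden character as %EF%BF%BC
var cleanedImg = stripHiddenCharacters(fullImg);
```

Running encodeURI on a suspect string is a quick way to make hidden characters visible, since anything non-ASCII shows up as percent-encoded bytes.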
Hope that helps.

Regex will not match

This is my string:
<link href="/post?page=4&tags=example" rel="last" title="Last Page">
From there I am trying to obtain the 4 out of that page parameter, using this regular expression:
link href="/post?page=(.*?)&tags=(.*?)" rel="last"
I will then collect the 4 out of the first group, the tags parameter has a wildcard because the contents can change. However, I don't seem to be getting a match with this, can anyone help?
And I know I shouldn't be using regex to parse HTML, but this is just a small thing and it would be a waste to import a huge module for this.

Assuming you are using a /regex literal/, you will need to escape the / in that path as \/.
Alternatively, it depends on how you are getting this string. Is it really typed that way, or is it part of an innerHTML that you are then reading out again? If that's the case, then the innerHTML won't be what you expect it to be, because the browser will "normalise" it.
If it is an innerHTML, then it'd be far easier to get the tag, then get the tag's href attribute, then regex that.

link href="/post\?page=(.*?)&tags=(.*?)" rel="last"
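You forgot the backslash before the ?. Unescaped, ? is a quantifier that makes the preceding t optional, so the literal ? in the URL is never matched.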
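In JavaScript, with a regex literal, the corrected pattern could be used as in this sketch (the variable names are illustrative). Note that the / in the path is escaped as well, which a regex literal requires.

```javascript
var linkHtml = '<link href="/post?page=4&tags=example" rel="last" title="Last Page">';

// Both the / and the ? are escaped: \/ because this is a regex literal,
// \? so that the question mark is matched literally.
var LAST_PAGE = /link href="\/post\?page=(.*?)&tags=(.*?)" rel="last"/;

var lastPageMatch = LAST_PAGE.exec(linkHtml);
var lastPage = lastPageMatch ? lastPageMatch[1] : null; // first group: page number
var lastTags = lastPageMatch ? lastPageMatch[2] : null; // second group: tags value

// The original pattern, with the ? left unescaped, finds nothing.
var brokenMatch = /link href="\/post?page=(.*?)&tags=(.*?)" rel="last"/.exec(linkHtml);
```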

I think it might be better to change your capture groups to something a little different, but will catch everything up to the terminating character:
link href="/post\?page=([^&]+)&tags=([^\"]+)" rel="last"
Putting the caret (^) first in a character class tells the regex engine "match any character EXCEPT the ones listed here". This makes it very easy to capture everything up until it hits a terminating character, such as the ampersand or the double quote. Assuming you're using PHP or Java, this should also slightly improve regex performance.

If the page parameter always comes first, try the PCRE /\?page=(\d+)/. Match group 1 will contain the page number.