Removing any script action from text

I'm writing a script that will detect and remove any potentially malicious script inserted by the user while posting something.
That was the easy part. The harder part is stopping all sneaky users by making sure all variations are detected. A simple regex can detect
<script>something</script>
but will fail on
<script>something</ script>
So I tried writing the rules as flexibly as I could, considering regex isn't my strong suit.
There are 3 rules:
Remove script tags.
Disable attributes like "onclick".
Remove prefix 'javascript:' from links.
Here:
content = content.replace(/<[\s]*script[^>]*>[\w|\t|\r\|\W]*?<[\/\s]*script[^>]*>/gi, "");
content = content.replace(/<*\s(on[A-Za-z]*[\s]*=)/gi, " ignoreme=");
content = content.replace(/<*[\s]*(href)*(javascript:)/gi, "");
Here's a working example.
I could really use some expert advice on making this code more efficient, or on any errors in it.

You can replace your regex
/<[\s]*script[^>]*>[\w|\t|\r\|\W]*?<[\/\s]*script[^>]*>/gi
with
/<\s*script[^>]*>[\s\S]*?<[\/\s]*script[^>]*>/gi
removed unnecessary []
replaced [\w|\t|\r\|\W] with [\s\S] (both are equivalent)
You can also replace
/<*\s(on[A-Za-z]*[\s]*=)/gi
with
/<*\s(on[A-Za-z]*\s*=)/gi
and the following
/<*[\s]*(href)*(javascript:)/gi
with
/<*\s*(href)*(javascript:)/gi
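Put together, the three simplified rules might be applied as in the sketch below. The function name sanitize and the sample input are made up for illustration. Each result is assigned back because replace returns a new string rather than modifying the original.

```javascript
// Hypothetical helper applying the three rules from the question in order.
function sanitize(content) {
  // Rule 1: remove script elements, tolerating whitespace such as "</ script>".
  content = content.replace(/<\s*script[^>]*>[\s\S]*?<[\/\s]*script[^>]*>/gi, "");
  // Rule 2: rename inline event handlers (onclick, onload, ...) so they never fire.
  content = content.replace(/<*\s(on[A-Za-z]*\s*=)/gi, " ignoreme=");
  // Rule 3: strip the "javascript:" prefix wherever it appears.
  content = content.replace(/<*\s*(href)*(javascript:)/gi, "");
  return content;
}

// Made-up sample input that exercises all three rules.
const dirty = 'Hi <script>alert(1)</ script><a href="javascript:alert(2)" onclick="x()">link</a>';
console.log(sanitize(dirty));
```

This only illustrates the three rules from the question; it does not show that every variation is caught.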

Related

Regex replace with multiple wildcards works in PHP, not in JavaScript

I'm attempting to implement center alignment for two Markdown parsers:
In PHP for Parsedown (successfully)
In JavaScript for Bootstrap Markdown (without success)
The idea I'm following and finding the easiest is to work with the final HTML output, and just snap inline styling onto the tags.
The following regex does what I need, it adds style="text-align:center;" to any element so far*, as needed:
$text = preg_replace('/\<(.*?)\>\->(.*?)<\-\<\/(.*?)\>/', '<$1 style="text-align:center;">$2</$3>', $text);
That is, <p>text</p> becomes <p style="text-align:center;">text</p>.
However, when I attempted to port this into JavaScript to also make it available for previewing on client-side, the pattern does not match as it should:
content = content.replace('/\<(.*?)\>\->(.*?)<\-\<\/(.*?)\>/', '<$1 style="text-align:center;">$2</$3>');
The replacement in content does not occur.
I'm aware there are slight differences between the regex flavors of PHP and JavaScript, but I have found working examples of all the expected behavior on both sides.
*If someone is wondering by any chance, I'm also successfully adding the center alignment to tags that already have a style attribute - on server side only, so far.

You'll need to use the regular expression literal syntax in JavaScript (no surrounding quotes), like so:
content = content.replace(/\<(.*?)\>\->(.*?)<\-\<\/(.*?)\>/gi, '<$1 style="text-align:center;">$2</$3>');
Note that the gi at the end of the regular expression enables global matching (that is, every occurrence matching the pattern is replaced) and case-insensitive matching. Both are technically optional, but you will almost certainly want the g flag. The i flag has no effect on this particular pattern, because the pattern contains no literal letters, so keeping it is up to you.
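The difference between a quoted pattern and a regex literal can be seen in a short sketch. The sample input is an assumption about what the parser output looks like, and the pattern is the PHP one from the question with the unnecessary escapes dropped.

```javascript
// Assumed sample of the HTML the Markdown parser produces for centered text.
const html = '<p>->text<-</p>';
const replacement = '<$1 style="text-align:center;">$2</$3>';

// A string first argument is searched for literally, so nothing matches here.
const viaString = html.replace('/<(.*?)>->(.*?)<-<\\/(.*?)>/', replacement);

// A regex literal is treated as a pattern; g replaces every occurrence.
const viaRegex = html.replace(/<(.*?)>->(.*?)<-<\/(.*?)>/g, replacement);

console.log(viaString); // <p>->text<-</p>
console.log(viaRegex);  // <p style="text-align:center;">text</p>
```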

Regular expression assistance

I have a client's site that keeps getting hacked with XSS injections somehow. These XSS attacks are without fail in the banners section, and the banner ads need to have <script> tags to function.
I am still trying to figure out where and when this happens (it is a HUGE site, it is badly coded (sorry, previous guy...), and I am really swamped). So, in the meantime, I want a regular expression that deletes the extra tag that gets inserted.
So, if the script should be:
<script src="valid_script.js"></script>
The hacker simply does this:
<script src="valid_script.js"></script>
<script src="invalid_script.js"></script>
I need the regex to delete the script tag (there may be multiple matches) that contains "invalid_script.js" but leave the one that contains "valid_script.js" intact.
My question: Could you experts out there please show me how to do this regex? I am sorry, but I do not understand regex, I tried so hard to understand, but it is way over my head :-(

Taking note of all the comments, as you have, and to answer your question: if the text to be output is in the $content variable (containing both the good and the bad script), then the following regular expression will strip just the bad one:
$content = preg_replace('#<script[^>]*invalid_script\.js[^>]*></script>#s', '', $content);
This says, briefly, look for the following in sequence: <script, a string of non-> characters, invalid_script.js, a string of non-> characters, and ></script>.
But to reiterate all the comments, this can be circumvented and is certainly only a sticking plaster of sorts.
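For illustration only, the same pattern can be tried out in JavaScript (the answer itself is PHP). The helper name is made up, and the file names are the ones used in the question.

```javascript
// Hypothetical helper: a JavaScript port of the PHP pattern above.
function stripInjectedScript(content) {
  // <script, any non-">" characters, the bad file name, any non-">" characters, ></script>
  return content.replace(/<script[^>]*invalid_script\.js[^>]*><\/script>/g, '');
}

const banner =
  '<script src="valid_script.js"></script>\n' +
  '<script src="invalid_script.js"></script>';

// Only the tag that references invalid_script.js is removed.
console.log(stripInjectedScript(banner));
```

Because [^>]* cannot cross a closing >, a match never starts in the valid tag and runs into the invalid one.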

VB.Net remove whitespace with regex excluding content inside <script>

I want to reduce the size of my HTML output stream by removing all empty lines and whitespace. However, I'm not very good at regex, and the pattern I have seems to remove more than wanted, e.g. whole script blocks. How can I make sure that script blocks are kept intact?
This is what I have so far:
html = Regex.Replace(html, ">\s+<", "><", RegexOptions.Compiled)

I think you're looking for conditional regular expressions. Look at the examples in the Regex Tutorial article on If-Then-Else conditionals.
Note that the regex flavor differs between systems (.NET, Python, etc.).
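As an alternative to conditionals, here is a minimal sketch of another approach, written in JavaScript rather than VB.Net purely for illustration: match whole script blocks first and hand them back untouched, and strip only the whitespace that sits between a > and a <. The function name is made up, and the lookbehind (?<=>) needs a reasonably recent JavaScript engine. .NET's Regex.Replace has an overload that takes a MatchEvaluator, so the same idea should carry over, though that port is not shown here.

```javascript
// Hypothetical helper: collapse whitespace between tags, but skip <script> blocks.
function collapseWhitespaceOutsideScripts(html) {
  // Alternative 1 captures a whole script block; alternative 2 matches
  // whitespace that is preceded by ">" and followed by "<".
  const pattern = /(<script\b[^>]*>[\s\S]*?<\/script\s*>)|(?<=>)\s+(?=<)/gi;
  return html.replace(pattern, (match, scriptBlock) =>
    // Script blocks are returned unchanged; whitespace between tags is dropped.
    scriptBlock !== undefined ? scriptBlock : ''
  );
}

const page =
  '<div>\n  <p>a</p>\n</div>\n<script>\nvar s = "a>  <b";\n</script>\n<p>b</p>';
console.log(collapseWhitespaceOutsideScripts(page));
```

Since the script block is consumed as a single match, the second alternative never gets a chance to run inside it.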

regex for background-image:url('URL');

I'm trying to make a regex for finding: background-image:url('URL'); where the URL is an external link to an image.
I've been trying something like this:
/\s*?[ \t\n]background-image:url('https?:\/\/(?:[a-z\-]+\.)+[a-z]{2,6}(?:\/[^\/#?]+)+\.(?:jpe?g|gif|png)$');/i
But couldn't get it to work.
I am using this with javascript/jquery

Does this get what you want?:
/\s*?[ \t\n]background-image:url\('.+?'\);/i

I think you can simplify it to this if you know it will only change with the URL in the middle. I probably went overboard with the \ escapes but better to be safe than sorry.
/background\-image\:url\(\'.*?\'\)\;/

Epascarello hit the nail on the head. Is this source you control? Or at least a predictable website? What are multiple different examples of input and your expected results?
Will this always be inline in double quotes, and therefore your URL will always be in single quotes? Some old websites use double-quotes in their CSS Files or header CSS.
Do you want to capture the whole thing? Or are you just trying to extract the resulting URL?
SirCapsAlot brings up a good question: are you just looking for background image URLs in general? Because they can also be set with the background property, or even in JavaScript with .backgroundImage="url(image.jpg)".
And you definitely only want the ones that include http(s)?
With the limited requirements you gave, this is the best Regex:
background-image\s*:\s*url\('(https?://[^']+)
Comment here if you have answers to my questions which may alter your requirements, and thus my answer.
Breakdown:
background-image\s*:\s*url //Find the literal text to begin, allowing whitespace around the colon
\(' //Find the literal opening paren and quote
( //Begin Capture Group 1
https?:// //Require http:// or https:// (the s is optional because of the ?)
[^']+ //Require that everything until the next quote is matched
) //Capture the result into Group 1
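Since the question mentions JavaScript, here is a sketch of how the pattern might be used there. The helper name and the sample CSS are made up, the / characters are escaped because the pattern is written as a regex literal, and the g and i flags are my addition so that every declaration is found.

```javascript
// Hypothetical helper: collect every external background-image URL in a CSS string.
function extractBackgroundImageUrls(css) {
  const pattern = /background-image\s*:\s*url\('(https?:\/\/[^']+)/gi;
  // matchAll requires the g flag; m[1] is Capture Group 1 (the URL itself).
  return Array.from(css.matchAll(pattern), (m) => m[1]);
}

// Made-up sample: two external URLs and one internal path.
const css =
  "div{background-image:url('https://example.com/a.png');}" +
  "p{background-image: url('img/local.png');}" +
  "span{background-image : url('http://example.org/b.jpg');}";

console.log(extractBackgroundImageUrls(css));
```

The internal path img/local.png is skipped because the pattern requires http:// or https:// right after the opening quote.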
A Co-Worker pointed out that I might have been downvoted for not capturing the closing tick. Note: Capturing the closing tick would be a wasted step, and is not necessary for this regex to work.
He also pointed out somebody might have downvoted me for requiring http or https in the url portion. But the user's question was specifically for external URLs, not internal ones. So this is a valid requirement and gets him closer to what he asked.
Sooo... not sure why this got a downvote.

regex to change text inside a html tag

First of all I'm new to stackoverflow so I'm sorry if I posted this in the wrong section.
I need a regex to search within the html tag and replace the - with a _
e.g:
<TAG-NAME>-100</TAG-NAME>
would become
<TAG_NAME>-100</TAG_NAME>
note that the value inside the tag wasn't affected.
Can anyone help?
Thanks.

Since JavaScript is the language for DOM manipulation, you should generally consider parsing the XML properly and using JavaScript's DOM traversal functions instead of regular expressions.
Here is some example code on how to parse an XML document so that you can use the DOM traversal functions. Then you can traverse all elements and change their names. This will automatically exclude text nodes, attributes, comments and everything else you don't want to change.
If it has to be a regex, here is a makeshift solution. Note that it will fail badly if you have tags (or even only >) inside attribute values or comments (in fact it will also apply the replacement to comments):
str = str.replace(/-(?=[^<>]*>)/g, '_');
This will match a - if it is followed by a > without encountering a < first. The concept is called a (positive) lookahead. The g modifier makes sure that all occurrences are replaced.
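Applied to the example from the question, the (?=...) lookahead behaves as follows. The function name is made up for illustration.

```javascript
// Hypothetical helper wrapping the replacement shown above.
function underscoreTagNames(str) {
  // Replace a "-" only when a ">" follows before any "<" or other ">".
  return str.replace(/-(?=[^<>]*>)/g, '_');
}

console.log(underscoreTagNames('<TAG-NAME>-100</TAG-NAME>'));
// <TAG_NAME>-100</TAG_NAME>
```

The hyphen in -100 is left alone because the next angle bracket after it is a <, so the lookahead fails there.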
Note that this will apply the replacement to anything in front of a >. Even attribute values. If you don't want that you could also make sure that there is an even number of quotes between the hyphen and the closing >, like this:
str = str.replace(/-(?=[^<>"]*(?:"[^<>"]*"[^<>"]*)*>)/g, '_');
This will still change attribute names though.
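The difference between the two patterns shows up once an attribute is present. In this sketch the attribute data-id and its value are made up for illustration.

```javascript
// Made-up input with a hyphen in the tag name, the attribute name,
// the attribute value, and the text content.
const input = '<TAG-NAME data-id="a-b">-100</TAG-NAME>';

// First pattern: any "-" in front of a ">" is replaced, attribute value included.
const simple = input.replace(/-(?=[^<>]*>)/g, '_');

// Second pattern: requires an even number of quotes before the ">",
// so the hyphen inside the quoted value is skipped.
const quoteAware = input.replace(/-(?=[^<>"]*(?:"[^<>"]*"[^<>"]*)*>)/g, '_');

console.log(simple);     // <TAG_NAME data_id="a_b">-100</TAG_NAME>
console.log(quoteAware); // <TAG_NAME data_id="a-b">-100</TAG_NAME>
```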
Here is a regexpal demo that shows what works and what doesn't work. Especially the comment behavior is quite horrible. Of course this could be taken care of with an even more complex regex, but I guess you see where this is going? You should really, really use an XML parser!

s/(\<[^\>]+\>)\-([^\<]+\<\/)/\1_\2/
I am not familiar with JS libraries, but I am pretty sure there are better ones for parsing HTML.
