How can I remove escaping from a RegExp pattern? - javascript

I'm trying to simplify input for a particular regex for my users. A simple example of the regex might be
\b(C|C\+\+|Java)\b
I'm now giving the user the option of appending another branch at the end of the regex by inputting the raw string into a <input type="text"> field. The branch will be interpreted literally, so I need to escape it. I've used https://stackoverflow.com/a/2593661/785663 to get RegExp.quote to do this. I then store the complete regex in a database.
Now, when I retrieve the regex from the database and split it back up and display the branches to the user, I need to remove all the escape characters again. Is there some pre-made function for this or do I need to roll my own?
Yes, I know I ought to replace this with a list of strings to search for. But this is only a part of a larger (regex-based) picture.

The optimal solution is to change your design: store the unescaped regex, then only escape it when you actually use it. That way you don't have to worry about this messy business of converting it back and forth all the time.
If you use this regex a lot and are worried about the overhead of having to escape it all the time, then store both the unescaped and escaped versions. Update both whenever the user makes a change.
p.s. Allowing user-entered regexes may make your site vulnerable to attack. (Update: Though in this case it is less likely to be a problem, since you are only allowing literal strings)
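A minimal sketch of the "store unescaped, escape on use" design follows. The helper names (escapeRegExp, unescapeRegExp, buildPattern) are hypothetical, and escapeRegExp merely stands in for the RegExp.quote helper mentioned in the question. The unescape helper is only a valid inverse for strings produced by this particular escape function, not for arbitrary regex source.

```javascript
// Hypothetical helper: backslash-escape every regex metacharacter
// so the input is matched literally.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical inverse: drop the backslash in front of each escaped character.
// Only safe for strings that came out of escapeRegExp above.
function unescapeRegExp(escaped) {
  return escaped.replace(/\\(.)/g, "$1");
}

// Store the raw branches; build the regex only when it is needed.
function buildPattern(branches) {
  return new RegExp("\\b(" + branches.map(escapeRegExp).join("|") + ")\\b");
}

const branches = ["C", "C++", "Java"]; // what would be stored in the database
const pattern = buildPattern(branches);
console.log(pattern.source);
console.log(unescapeRegExp(escapeRegExp("C++")));
```

With this layout the branches shown to the user are simply the stored strings, so the unescape step is only needed if the escaped form is the one that was stored.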

Related

Sanitize function too sanitary?

I'm working on a webapp that sanitizes models for the view. However, it is stripping too many wanted characters, like forward slashes, semi-colons, colons, dollar signs, quote marks and accented letters from foreign languages. e.g. 3/8"W becomes 38w.
Do I need to modify the function to be less aggressive, or should I simply not use the sanitize function at all? I guess the bigger question is, what is sanitization for?
Full disclosure - I didn't write the function and I'm not fantastic with regex.
value = value.replace(/[^a-z0-9áéíóúñü .,_-]/gim, "").trim();
Sanitization is mainly about removing unwanted characters from data before it is saved in a database or used in any kind of query.
That said, you shouldn't rely on sanitizing data on the front end, because JavaScript can be disabled.
Anything on the client side can be bypassed.
You should care much more about it on the back end.
Sanitization should be done on data before saving it in the database.
Escaping should be done on data after retrieving it from the database.
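As a rough illustration of escaping on output instead of stripping characters on input, the sketch below keeps the stored value intact and escapes it only when it is about to be placed into HTML. The function name escapeHtml and the sample value are assumptions made for this example, not part of the original code.

```javascript
// Hypothetical helper: replace the characters that are special in HTML
// with their entity equivalents, leaving everything else untouched.
function escapeHtml(text) {
  const replacements = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
  };
  return String(text).replace(/[&<>"']/g, (ch) => replacements[ch]);
}

const stored = '3/8"W <b>bolt</b>'; // value as retrieved from the database
const safe = escapeHtml(stored);
console.log(safe);
```

Slashes, quote marks and accented letters all survive this way, which is the behavior the question was missing.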

Same regex for client-side and server-side validation

I am trying to find out whether my client-side Javascript regex
/^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/
for email validation (I'm using it just to make sure that the email is formatted properly, not as a primary validation method) will work on the server side with PHP.
I am not sure whether I can use the same one even though both languages use Perl-based regex syntax. Thank you for your help.
You should be able to use the same syntax.
You should use
preg_match(string $pattern, string $email[, array &$matches])
with your pattern. If $matches is given, it is filled with the full match and the text captured by each group.
It returns 1 if the pattern matches, 0 if it does not, and false on error. For emails in particular it's always a better idea to use existing, well-tested functions, because for example "$@us" is a valid email address.
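For comparison, here is a small sketch of the client-side half of the check in JavaScript. It assumes the separator in the pattern is @, and the sample addresses are made up for illustration. It also shows how a format check like this can reject addresses with a dotless domain.

```javascript
// The pattern from the question, assuming "@" is the separator.
const emailPattern = /^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;

// Hypothetical sample inputs.
const samples = ["user@example.com", "user.example.com", "user@localhost"];

for (const address of samples) {
  console.log(address, emailPattern.test(address));
}
```

The last sample fails because the domain part of the pattern requires at least one dot, so the check is stricter than the address syntax itself.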
This regex will work nearly identically in both JavaScript and PHP. There are some minuscule differences, for example \s matches the "next line" control character U+0085 in PHP, but not in JavaScript, but they are unlikely to matter in this context (it's unusual anyway to allow newlines and tabs in email addresses - why not use a simple space instead of the generic whitespace shorthand \s).
If you have to do these kinds of comparisons/conversions regularly, I heartily recommend taking a look at RegexBuddy, which can convert regexes between flavors with a single click.

Extracting data from JavaScript (Python Scraper)

I'm currently using a fusion of urllib2, pyquery, and json to scrape a site, and now I find that I need to extract some data from JavaScript. One thought would be to use a JavaScript engine (like V8), but that seems like overkill for what I need. I would use regular expressions, but the expression for this seems way too complex.
JavaScript:
(function(){DOM.appendContent(this, HTML("<html>"));;})
I need to extract the <html>, but I'm not entirely sure how to do so. The <html> itself can contain basically every character under the sun, so [^"] won't work.
Any thoughts?
Why regex? Can't you just use two substrings as you know how many characters you want to trim off the beginning and end?
string[42:-7]
As well as being quicker than a regex, it then doesn't matter if quotes inside <html> are escaped or not.
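The same fixed-offset idea can be sketched in JavaScript with String.prototype.slice, assuming the wrapper text is always exactly the one shown in the question (42 characters before the payload, 7 after it). The sample payload is made up for illustration.

```javascript
// Sample input: the wrapper from the question around a made-up payload.
// The payload contains escaped quotes (\") exactly as they would appear in the source.
const source = '(function(){DOM.appendContent(this, HTML("<p class=\\"x\\">hi</p>"));;})';

// Drop the 42-character prefix and the 7-character suffix.
const payload = source.slice(42, -7);
console.log(payload);
```

If the wrapper ever changes length, the offsets silently cut in the wrong place, so this only suits input with a fixed shape.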
If every occurrence of " inside the HTML code is escaped as \" (it is a JavaScript string after all), you could use
HTML\("((?:\\"|.)*?)"\)
to get the parameter to HTML into the first capturing group.
Note that this regex is not yet escaped for use inside a JavaScript string literal.
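A short sketch of that regex in use, written as a JavaScript regex literal so no extra string escaping is needed. The sample input is made up, and decoding the captured text with JSON.parse is an assumption that only holds when the escapes in the payload are also valid JSON escapes.

```javascript
// Sample input with escaped quotes inside the HTML payload.
const source = '(function(){DOM.appendContent(this, HTML("<p class=\\"x\\">hi</p>"));;})';

// Group 1 captures everything between HTML(" and the first unescaped ").
const match = /HTML\("((?:\\"|.)*?)"\)/.exec(source);
const raw = match ? match[1] : null;

// Assumption: the payload uses only JSON-compatible escapes,
// so wrapping it in quotes and parsing it decodes \" back to ".
const html = raw === null ? null : JSON.parse('"' + raw + '"');
console.log(raw);
console.log(html);
```

Because . does not match line terminators, a payload that spans several physical lines would need a different pattern.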

escape exactly what in javascript

Being a newbie in javascript I came to a situation where I need more information on escaping characters in a string.
Basically I know that in order to escape " I need to replace it with \", but what I don't know is which characters I need to escape in a given string. Is there a list of these "characters to escape"? Or is it any character that is not a-zA-Z0-9?
In my situation, I don't have control over the content that is being displayed on my page. Users enter some text and save it. I then use a webservice to extract them from the database, build a json array of objects, then iterate the array when I need to display them. In this case, I have - naturally - no idea of what the text the user has entered and therefore for what characters I need to escape. I also use jQuery for this specific project (just in case it has a function I am not aware of, to do what I need)
Providing examples would be appreciated but I also want to learn the theory and logic behind it.
Hope someone can be of any help.
There's no need to escape everything that's not a-zA-Z0-9; take a look at this example:
http://www.c-point.com/javascript_tutorial/special_characters.htm
You may also want to check out this site, which has information about escaping strings, especially URLs:
http://www.the-art-of-web.com/javascript/escape/
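To see which characters actually need escaping inside a string literal, one option is to let JSON.stringify produce the literal and inspect the result. This is only a sketch with a made-up sample value; it covers the string-literal context, not HTML or URLs.

```javascript
// Made-up user text containing a double quote, a newline and a backslash.
const userText = 'She said "hi"\nC:\\temp';

// JSON.stringify adds the surrounding quotes and escapes only what must be escaped.
const literal = JSON.stringify(userText);
console.log(literal);

// Parsing the literal gives back the original text unchanged.
console.log(JSON.parse(literal) === userText);
```

Only the quote, the newline and the backslash are changed here; letters, digits, spaces and other punctuation pass through as they are. When the text is going into the page rather than into a script, inserting it as text (for example with jQuery's .text()) avoids manual escaping altogether.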

Finding beginning and end quotations

I'm starting to write a code syntax highlighter in JavaScript, and I want to highlight text that is in quotes (both "s and 's) in a certain color. It also needs to cope with one type of quote appearing in the middle of a pair of the other type, but I'm really not sure where to even start. I'm not sure how I should go about finding the quotes and then finding the correct end quote.
Unless you're doing this for the challenge, have a look at Google Code Prettify.
For your problem, you could read up on parsing (and lexers) at Wikipedia. It's a huge topic and you'll find that you'll come upon bigger problems than parsing strings.
To start, you could use regular expressions (although they rarely have the accuracy of a true lexer.) A typical regular expression for matching a string is:
/"(?:[^"\\]+|\\.)*"/
And then the same for ' instead of ".
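One way the two patterns might be combined for highlighting is sketched below. The function name, the span markup and the class name are assumptions for illustration. The inner character class is left unquantified here, a small deviation from the pattern above that is meant to reduce backtracking on unterminated strings.

```javascript
// Double-quoted or single-quoted string, allowing backslash escapes inside.
const stringPattern = /"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'/g;

// Hypothetical highlighter: wrap every string literal in a span.
// HTML-escaping of the code itself is left out to keep the sketch short.
function highlightStrings(code) {
  return code.replace(stringPattern, (m) => '<span class="str">' + m + "</span>");
}

console.log(highlightStrings("x = \"a 'b'\" + 'c \"d\"';"));
```

Because both alternatives live in one pattern, a quote of the other type inside a string is consumed as ordinary content instead of starting a new match.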
Otherwise, for a character-by-character parser, you would set some kind of state to record that you're in a string once you hit ". Then, when you hit a " that is not preceded by an odd number of backslashes (an even number of backslashes would escape each other), you exit the string.
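The character-by-character approach could be sketched as follows. The function name and the shape of the returned objects are assumptions for illustration. Skipping the character after each backslash has the same effect as counting backslashes: a quote only closes the string when it is not escaped.

```javascript
// Hypothetical scanner: returns the position and text of each quoted string.
function findQuotedSpans(code) {
  const spans = [];
  let quote = null; // the quote character that opened the current string, or null
  let start = -1;   // index of that opening quote

  for (let i = 0; i < code.length; i++) {
    const ch = code[i];
    if (quote === null) {
      // Outside a string: either kind of quote opens one.
      if (ch === '"' || ch === "'") {
        quote = ch;
        start = i;
      }
    } else if (ch === "\\") {
      i++; // inside a string: skip the escaped character
    } else if (ch === quote) {
      // Only the same kind of quote closes the string.
      spans.push({ start: start, end: i + 1, text: code.slice(start, i + 1) });
      quote = null;
    }
  }
  return spans;
}

const sample = "var a = \"it's\"; var b = 'he said \\'hi\\'';";
console.log(findQuotedSpans(sample).map((s) => s.text));
```

An unterminated string simply produces no span in this sketch; a real highlighter would have to decide how to color it.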
You can find quotes using regular expressions but if you're writing a syntax highlighter then the only reliable way is to step through the code, character by character, and decide what to do from there.
An example regex:
/("|')((?:\\\1|.)+?)\1/g
(matches "this" and 'this' and "thi\"s")
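A quick way to try this pattern is with matchAll, as in the sketch below. The input is the set of examples listed above joined into one string.

```javascript
// \1 refers back to whichever quote character opened the string.
const quoted = /("|')((?:\\\1|.)+?)\1/g;

const input = '"this" and \'this\' and "thi\\"s"';

// Collect the full text of every match.
const found = Array.from(input.matchAll(quoted), (m) => m[0]);
console.log(found);
```

Since the inner group uses +?, an empty string such as "" is not matched by this pattern.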
Use a stack: when you find an unmatched quote, push it; when you find its match, pop it.
I did it with a single regular expression in PHP using backwards references. JS does not support it, and I think that's what you need if you really want to detect undefined backslashes.
