JavaScript RegExp anomaly [duplicate]

I am trying to build a regexp from static text plus a variable in JavaScript. I am obviously missing something very basic; see the comments in the code below. Help is very much appreciated:
var test_string = "goodweather";
// One regexp we just set:
var regexp1 = /goodweather/;
// The other regexp we built from a variable + static text:
var regexp_part = "good";
var regexp2 = "\/" + regexp_part + "weather\/";
// These alerts now show the 2 regexp are completely identical:
alert (regexp1);
alert (regexp2);
// But one works, the other doesn't ??
if (test_string.match(regexp1))
alert ("This is displayed.");
if (test_string.match(regexp2))
alert ("This is not displayed.");

First, the answer to the question:
The other answers are nearly correct, but fail to consider what happens when the text to be matched contains a literal backslash, (i.e. when: regexp_part contains a literal backslash). For example, what happens when regexp_part equals: "C:\Windows"? In this case the suggested methods do not work as expected (The resulting regex becomes: /C:\Windows/ where the \W is erroneously interpreted as a non-word character class). The correct solution is to first escape any backslashes in regexp_part (the needed regex is actually: /C:\\Windows/).
To illustrate the correct way of handling this, here is a function which takes a passed phrase and creates a regex with the phrase wrapped in \b word boundaries:
// Given a phrase, create a RegExp object with word boundaries.
function makeRegExp(phrase) {
// First escape any backslashes in the phrase string.
// i.e. replace each backslash with two backslashes.
phrase = phrase.replace(/\\/g, "\\\\");
// Wrap the escaped phrase with \b word boundaries.
var re_str = "\\b"+ phrase +"\\b";
// Create a new regex object with "g" and "i" flags set.
var re = new RegExp(re_str, "gi");
return re;
}
// Here is a condensed version of same function.
function makeRegExpShort(phrase) {
return new RegExp("\\b"+ phrase.replace(/\\/g, "\\\\") +"\\b", "gi");
}
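The functions above escape only backslashes. A phrase may also contain other regex metacharacters (for example . or +), so a more general variant would escape all of them. The following is a minimal sketch, not part of the original answer; the helper names escapeRegExp and makeWholeWordRegExp are hypothetical.

```javascript
// Hypothetical helper: escape every character that is special in a regex,
// so that the phrase is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical variant of makeRegExp that uses the helper above.
// The "g" flag is left out here so that repeated test() calls are stateless.
function makeWholeWordRegExp(phrase) {
  return new RegExp("\\b" + escapeRegExp(phrase) + "\\b", "i");
}

var wordRe = makeWholeWordRegExp("a.b");
console.log(wordRe.test("see a.b here")); // true: the dot is matched literally
console.log(wordRe.test("see axb here")); // false: the dot no longer matches any character
```

Backslashes are covered by the same character class, so the "C:\Windows" case from above is handled as well.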
To understand this in more depth, a discussion follows...
In-depth discussion, or "What's up with all these backslashes!?"
JavaScript has two ways to create a RegExp object:
/pattern/flags - You can specify a RegExp literal expression directly, where the pattern is delimited by a pair of forward slashes followed by any combination of pattern modifier flags, e.g. 'g' global, 'i' ignore-case, or 'm' multi-line. This type of regex cannot be created dynamically.
new RegExp("pattern", "flags") - You can create a RegExp object by calling the RegExp() constructor function and pass the pattern as a string (without forward slash delimiters) as the first parameter and the optional pattern modifier flags (also as a string) as the second (optional) parameter. This type of regex can be created dynamically.
The following example demonstrates creating a simple RegExp object using both methods. Let's say we wish to match the word "apple". The regex pattern we need is simply: apple. Additionally, we wish to set the g, i, and m modifier flags.
Example 1: Simple pattern having no special characters: apple
// A RegExp literal to match "apple" with all three flags set:
var re1 = /apple/gim;
// Create the same object using RegExp() constructor:
var re2 = new RegExp("apple", "gim");
Simple enough. However, there are significant differences between these two methods with regard to the handling of escaped characters. The regex literal syntax is quite handy because you only need to escape forward slashes - all other characters are passed directly to the regex engine unaltered. However, when using the RegExp constructor method, you pass the pattern as a string, and there are two levels of escaping to be considered; first is the interpretation of the string and the second is the interpretation of the regex engine. Several examples will illustrate these differences.
First, let's consider a pattern which contains a single literal forward slash. Let's say we wish to match the text sequence "and/or" in a case-insensitive manner. The needed pattern is: and/or.
Example 2: Pattern having one forward slash: and/or
// A RegExp literal to match "and/or":
var re3 = /and\/or/i;
// Create the same object using RegExp() :
var re4 = new RegExp("and/or", "i");
Note that with the regex literal syntax, the forward slash must be escaped (preceded with a single backslash) because with a regex literal, the forward slash has special meaning (it is a special metacharacter which is used to delimit the pattern). On the other hand, with the RegExp constructor syntax (which uses a string to store the pattern), the forward slash does NOT have any special meaning and does NOT need to be escaped.
Next, let's consider a pattern which includes the special \b word boundary metasequence. Say we wish to create a regex to match the word "apple" as a whole word only (so that it won't match "pineapple"). The pattern (as seen by the regex engine) needs to be: \bapple\b:
Example 3: Pattern having \b word boundaries: \bapple\b
// A RegExp literal to match the whole word "apple":
var re5 = /\bapple\b/;
// Create the same object using RegExp() constructor:
var re6 = new RegExp("\\bapple\\b");
In this case the backslash must be escaped when using the RegExp constructor method, because the pattern is stored in a string, and to get a literal backslash into a string, it must be escaped with another backslash. However, with a regex literal, there is no need to escape the backslash. (Remember that with a regex literal, the only special metacharacter is the forward slash.)
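To see the two escaping levels at work, the sketch below compares the string the constructor actually receives with and without the doubled backslash. In a JavaScript string literal, "\b" is the backspace character, so forgetting to double the backslash silently produces a different pattern. The variable names are illustrative only.

```javascript
// "\\b" is two characters (backslash, b); "\b" is a single backspace character.
var doubled = "\\bapple\\b";
var single = "\bapple\b";
console.log(doubled.length); // 9
console.log(single.length);  // 7

var wholeWord = new RegExp(doubled); // same pattern as /\bapple\b/
var broken = new RegExp(single);     // looks for backspace + "apple" + backspace

console.log(wholeWord.test("an apple a day")); // true
console.log(wholeWord.test("pineapple"));      // false
console.log(broken.test("an apple a day"));    // false
```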
Backslash SOUP!
Things get even more interesting when we need to match a literal backslash. Let's say we want to match the text sequence: "C:\Program Files\JGsoft\RegexBuddy3\RegexBuddy.exe". The pattern to be processed by the regex engine needs to be: C:\\Program Files\\JGsoft\\RegexBuddy3\\RegexBuddy\.exe. (Note that the regex pattern to match a single backslash is \\, i.e. each one must be escaped.) Here is how you create the needed RegExp object using the two JavaScript syntaxes:
Example 4: Pattern to match literal back slashes:
// A RegExp literal to match the ultimate Windows regex debugger app:
var re7 = /C:\\Program Files\\JGsoft\\RegexBuddy3\\RegexBuddy\.exe/;
// Create the same object using RegExp() constructor:
var re8 = new RegExp(
"C:\\\\Program Files\\\\JGsoft\\\\RegexBuddy3\\\\RegexBuddy\\.exe");
This is why the /regex literal/ syntax is generally preferred over the new RegExp("pattern", "flags") method - it completely avoids the backslash soup that can frequently arise. However, when you need to dynamically create a regex, as the OP needs to here, you are forced to use the new RegExp() syntax and deal with the backslash soup. (It's really not that bad once you get your head wrapped 'round it.)
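In environments that support ES2015 template literals (an assumption; the answer above does not rely on them), String.raw offers one way to thin out the soup: backslashes inside the template are kept as written, so the pattern text can be typed the way the regex engine should see it. A minimal sketch, with illustrative variable names:

```javascript
// String.raw keeps backslashes as typed, so no extra doubling is needed.
var rawPattern = String.raw`C:\\Program Files\\JGsoft\\RegexBuddy3\\RegexBuddy\.exe`;
var reRaw = new RegExp(rawPattern);

// The same pattern written as a regex literal, for comparison.
var reLiteral = /C:\\Program Files\\JGsoft\\RegexBuddy3\\RegexBuddy\.exe/;

var path = String.raw`C:\Program Files\JGsoft\RegexBuddy3\RegexBuddy.exe`;
console.log(reRaw.source === reLiteral.source); // true
console.log(reRaw.test(path));                  // true
```

The pattern is still an ordinary string, so dynamic parts can be concatenated onto it as before.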
RegexBuddy to the rescue!
RegexBuddy is a Windows app that can help with this backslash soup problem - it understands the regex syntaxes and escaping requirements of many languages and will automatically add and remove backslashes as required when pasting to and from the application. Inside the application you compose and debug the regex in native regex format. Once the regex works correctly, you export it using one of the many "copy as..." options to get the needed syntax. Very handy!

You should use the RegExp constructor to accomplish this:
var regexp2 = new RegExp(regexp_part + "weather");
Here's a related question that might help.

The forward slashes are just JavaScript syntax for enclosing regular expressions. If you use a normal string as the regex, you shouldn't include them, because they will be matched literally. Therefore you should just build the regex like this:
var regexp2 = regexp_part + "weather";

I would use:
var regexp2 = new RegExp(regexp_part+"weather");
What you have written is equivalent to:
var regexp2 = "/goodweather/";
so the later call becomes:
test_string.match("/goodweather/")
which calls match with a string (slashes included) rather than with the regex you wanted:
test_string.match(/goodweather/)
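The difference can be checked directly. When match receives a string, it converts it with new RegExp(string), so the slashes become part of the pattern. A small sketch using the values from the question (asString and asRegExp are illustrative names):

```javascript
var test_string = "goodweather";
var regexp_part = "good";

// A string that merely looks like a regex literal: the slashes are matched literally.
var asString = "/" + regexp_part + "weather/";
console.log(test_string.match(asString));        // null
console.log("/goodweather/".match(asString)[0]); // "/goodweather/"

// A real RegExp object built from the same parts.
var asRegExp = new RegExp(regexp_part + "weather");
console.log(test_string.match(asRegExp)[0]);     // "goodweather"
```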

While this solution may be overkill for this specific question, if you want to build RegExps programmatically, compose-regexp can come in handy.
This specific problem would be solved by using
import {sequence} from 'compose-regexp'
const weatherify = x => sequence(x, /weather/)
Strings are escaped, so
weatherify('.')
returns
/\.weather/
But it can also accept RegExps
weatherify(/./u)
returns
/.weather/u
compose-regexp supports the whole range of RegExp features and lets you build RegExps from sub-parts, which helps with code reuse and testability.

Related

Regular expression for RegExp in Dart

I have the following in JavaScript:
function escape(text)
{
var tx = text.replace(/[&<>"']/g);
}
I'm having problems trying to do the same in Dart:
var reg = new RegExp("/[&<>"']/g"); -->this throws error.
How can I get an equivalent expression?
The Dart RegExp source does not use / to delimit regular expressions, they're just strings passed to the RegExp constructor.
It's usually recommended that you use a "raw string" because backslashes mean something in RegExps as well as in non-raw string literals, and the JavaScript RegExp /\r\n/ would be RegExp("\\r\\n") in Dart without raw strings, but RegExp(r"\r\n") with a raw string, much more readable.
In this particular case, where the string contains both ' and ", that becomes harder, but you can use a "multiline string" instead - it uses triple quote characters as delimiters, so it can contain single quote characters unescaped (it doesn't have to actually span multiple lines).
Dart doesn't have something similar to the g flag of JavaScript regexps. Dart regexps are stateless, it's the functions using them which need to care about remembering where it matched, not the RegExp itself. So, no need for the g.
So:
RegExp(r"""[&<>"']""");
// or
RegExp(r'''[&<>"']''');
That gets a little crowded with all those quotes, and you can choose to use a non-raw string instead so you can escape the quote which matches the string (which is easier because your RegExp does not contain any backslashes itself):
RegExp("[&<>\"']");
// or
RegExp('[&<>"\']');
If you do that when your regexp uses a RegExp backslash, then you'll need to double the backslash, something which is easy to forget, which is why raw strings are recommended.
You forgot to escape the double quote inside the string. Also drop the / delimiters and the g flag, which Dart does not use:
new RegExp("[&<>\"']");

reactStringReplace() inconsistent regex match

I'm trying to use react-string-replace to match all $Symbols within a string of text.
Here are a few example values we'd like to be matched (stock / crypto / forex pairs): $GPRO, $AMBA, $BTC/USD, $LTC/ETH
Here is our attempted regex
/\$\S+[^\s]*/g
when passing the string
$this works great $this/works great too.
through .match() - the proper symbols are returned in an array.
0: "$this"
1: "$this/works"
When using
reactStringReplace() - each match is returning
works great
Any ideas why
reactStringReplace()
seems to be handling this regex incorrectly?
Thanks y'all!
Check the React String Replace documentation, it is written there:
reactStringReplace(string, match, func)
...
match
Type: regexp|string
The string or RegExp you would like to replace within string. Note that when using a RegExp you MUST include a matching group.
Why should you add a capturing group? See the replaceString function. It contains the line var result = str.split(re);, which uses your pattern to split the input into the parts that match the regex and the parts that do not. split keeps the matching parts in its result only when they are captured by a group.
If you pass the pattern as a string rather than a RegExp, the capturing parentheses are added automatically around the whole pattern:
if (!isRegExp(re)) {
re = new RegExp('(' + escapeRegExp(re) + ')', 'gi');
}
If you pass your regex as a RegExp without capturing parentheses, the matches will be missing from the resulting array, hence, they will disappear.
So, use
/(\$\S+)/g
if you want to keep the $ chars in the output, or
/\$(\S+)/g
if you want to omit the dollars.
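The effect of the capturing group on split can be seen in isolation, without the library. This sketch uses only String.prototype.split and the sample sentence from the question; the variable names are illustrative.

```javascript
var text = "$this works great $this/works great too.";

// Without a capturing group the matched symbols are dropped from the result.
var withoutGroup = text.split(/\$\S+/g);
console.log(withoutGroup);
// [ '', ' works great ', ' great too.' ]

// With a capturing group, split keeps the captured text between the pieces.
var withGroup = text.split(/(\$\S+)/g);
console.log(withGroup);
// [ '', '$this', ' works great ', '$this/works', ' great too.' ]
```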

Understanding some JavaScript with a RegExp

I have the following js code
var regex = new RegExp('([\'"]?)((?:\\\\\\1|.)+?)\\1(,|$)', 'g'),
key = regex.exec( m ),
val = regex.exec( m );
I would like to understand it.
In particular:
Why are there all those backslashes in the definition of the RegExp? I can clearly see that \\1 is a reference to the first saved element. Why, in a new RegExp using ' and not ", do we need to use \\1 and not simply \1?
Why is there a comma between the two definitions of key and val? I would guess that it depends on the "instances" found using "g", but it is not very clear to me.
I tried to execute the code with
m = 'batman, robin'
and the result is pretty much a mess, which I cannot really explain.
The code is taken from JQuery Cookbook, 2.12
Why are there all those backslashes in the definition of the RegExp?
"\\" is a string whose value is \. One backslash is used as an escape, the second for the value. Then, within the regex you also need to escape the backslash character again because backslash characters are used to mean special things within regex.
For example
"\\1"
is a string whose value is \1, which, in a regular expression, matches the first captured group.
"\\\\"
is a string whose value is \\, which, in a regular expression, matches a single \ character.
"\\\\\\1"
is a string whose value is \\\1, which, in a regular expression, matches a single \ followed by the first captured group.
This need to escape backslashes, and then escape them again is called "double escaping". The reason you need to double escape is so that you have the correct value within the regular expression. The first escape is to make sure that the string has the correct value, the second escape is so that the regular expression matches the correct pattern.
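The two levels can be inspected separately. The sketch below first looks at the string values produced by the literals above, then hands one of them to the regex engine:

```javascript
// Level 1: the string literal. Each "\\" in source code is one backslash in the value.
console.log("\\1".length);     // 2 (backslash, 1)
console.log("\\\\".length);    // 2 (backslash, backslash)
console.log("\\\\\\1".length); // 4 (backslash, backslash, backslash, 1)

// Level 2: the regex engine. The two-character pattern \\ matches one backslash.
var oneBackslash = new RegExp("\\\\");
console.log(oneBackslash.test("a\\b")); // true: the text a\b contains a backslash
console.log(oneBackslash.test("ab"));   // false
```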
Why is there a comma between the two definitions of key and val?
The code you posted is a variable declaration. It's easier to see when formatted:
var regex = ...,
key = ...,
val = ...;
Each of the variable names in the list is declared via the var keyword. It is the same as declaring the variables first and assigning them separately:
var regex,
key,
val;
regex = ...
key = ...
val = ...
Which is the same as declaring each var with a different var keyword:
var regex = ...
var key = ...
var val = ...
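As for what the two exec calls do: because the regex carries the g flag, each call to exec resumes where the previous match ended (tracked in lastIndex), so key and val receive successive matches. A sketch with the sample input from the question, written with a regex literal for readability:

```javascript
var m = "batman, robin";
var regex = /(['"]?)((?:\\\1|.)+?)\1(,|$)/g;

var key = regex.exec(m); // first match, starting at index 0
var val = regex.exec(m); // second match, starting at regex.lastIndex

console.log(key[2]); // "batman"
console.log(val[2]); // " robin" (note the leading space)
```

The second capture group holds the text between the optional quote and the comma or end of input, which is why index 2 is the interesting one here.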
There's a difference between writing dynamic regex objects and static regex objects. When you initialize a regex object with a string, the string needs to be transformed into a regex object. However, '\' has a special meaning not only within regex objects but also within JavaScript strings, hence the double escape.
Edit: Regarding your second question. You can do multiple declarations with comma, like so:
var one = 'one',
two = 'two',
three = 'three';
2nd Edit: Here's what happens with your string once it compiles into a RegEx object.
/(['"]?)((?:\\\1|.)+?)\1(,|$)/g
The regex is better represented as a regex literal:
var regex = /(['"]?)((?:\\\1|.)+?)\1(,|$)/g;
Backslashes are used to escape special characters. For example, if your regex needs to match a literal period, writing . will not work, since . matches any character: you need to "escape" the period with a backslash: \..
Backslashes that are not themselves part of an escape sequence must be escaped, so if you want to match just a backslash in the text, you must escape it with a backslash: \\.
The reason your regular expression is so complicated when passed into the RegExp constructor is because you are representing the above regular expression as a string, which adds another "layer" of escaping. Thus, every single backslash must be escaped by yet another backslash and because the string is enclosed in single quotes, your single quote must be escaped with yet another backslash:
var regex = new RegExp('([\'"]?)((?:\\\\\\1|.)+?)\\1(,|$)', 'g'),
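One way to confirm that the two spellings describe the same pattern is to compare the source and flags properties of the resulting objects. A short check (fromString and fromLiteral are illustrative names; flags assumes an ES2015+ engine):

```javascript
var fromString = new RegExp('([\'"]?)((?:\\\\\\1|.)+?)\\1(,|$)', 'g');
var fromLiteral = /(['"]?)((?:\\\1|.)+?)\1(,|$)/g;

// Both objects hold the same pattern text and the same flags.
console.log(fromString.source === fromLiteral.source); // true
console.log(fromString.flags === fromLiteral.flags);   // true
```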

Regular Expression to match URLs / web addresses

I have a JS function which is passed a string that a RegEx is run against, and returns any matches:
searchText= // some string which may or may not contain URLs
Rxp= new RegExp("([a-zA-Z\d]+://)?(\w+:\w+#)?([a-zA-Z\d.-]+\.[A-Za-z]{2,4})(:\d+)?(/.*)?/ig")
return searchText.match(Rxp);
The RegExp should return matches for any of the following (and similar derivations):
google.com
www.google.com
http://www.google.com
http://google.com
google.com?querystring=value
www.google.com?querystring=value
http://www.google.com?querystring=value
http://google.com?querystring=value
However, no such luck. Any suggestions?
In a string, \ has to be escaped: \\.
First, the string literal is interpreted. \w turns into w, because \w is not a recognized string escape sequence.
Then, the parsed string is turned into a RegEx. But the \ was lost during string parsing, so your RegEx breaks.
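The lost backslash is easy to observe with a smaller pattern. A minimal sketch (lost and kept are illustrative names):

```javascript
// "\d" is not a recognized string escape, so the backslash is simply dropped.
console.log("\d+" === "d+");   // true

var lost = new RegExp("\d+");  // pattern is d+  : one or more letters d
var kept = new RegExp("\\d+"); // pattern is \d+ : one or more digits

console.log(lost.test("123")); // false
console.log(kept.test("123")); // true
```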
Instead of using the RegExp constructor, use RegEx literals:
Rxp = /([a-zA-Z\d]+:\/\/)?(\w+:\w+#)?([a-zA-Z\d.-]+\.[A-Za-z]{2,4})(:\d+)?(\/.*)?/ig;
// Note: I recommend to use a different variable name. Variables starting with a
// capital usually indicate a constructor, by convention.
If you're not 100% sure that the input is a string, it's better to use the exec method, which coerces the argument to a string. Note that with the g flag, exec returns one match per call instead of an array of all matches:
return Rxp.exec(searchText);
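With the g flag, match and exec do not return the same thing: match returns every match at once, while exec returns one match per call. A sketch with a deliberately simple pattern (not the URL regex above) showing how to collect all matches with exec:

```javascript
var text = "a1 b22 c333";
var digits = /\d+/g;

console.log(text.match(digits)); // [ '1', '22', '333' ]

// exec returns one match per call; loop until it returns null.
digits.lastIndex = 0; // start from the beginning of the text
var found = [];
var hit;
while ((hit = digits.exec(text)) !== null) {
  found.push(hit[0]);
}
console.log(found); // [ '1', '22', '333' ]
```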
Here's a pattern which includes the query string and URL fragment:
/([a-zA-Z\d]+:\/\/)?(\w+:\w+#)?([a-zA-Z\d.-]+\.[A-Za-z]{2,4})(:\d+)?(\/[^?#\s]*)?(\?[^#\s]*)?(#\S*)?/ig
Firstly, there's no real need to create your pattern via the RegExp constructor since it doesn't contain anything dynamic. You can just use the literal /pattern/ instead.
If you do use the constructor, though, you have to remember that your pattern is declared as a string, not as a regex literal, so you'll need to double the backslash in escape sequences, e.g. \\d, not \d. Forward slashes need escaping only in a regex literal, where / is the delimiter; in a constructor string the escape is optional.
With the constructor, modifiers (g, i) are passed as a second argument, not appended to the pattern.
So to literally change what you have, it would be:
Rxp= new RegExp("([a-zA-Z\\d]+:\\/\\/)?(\\w+:\\w+#)?([a-zA-Z\\d.-]+\\.[A-Za-z]{2,4})(:\\d+)?(\\/.*)?", "ig")
But better would be:
Rxp = /([a-zA-Z\d]+:\/\/)?(\w+:\w+#)?([a-zA-Z\d.-]+\.[A-Za-z]{2,4})(:\d+)?(\/.*)?/gi;

Javascript: String replace problem

I've got a string which contains q="AWORD" and I want to replace q="AWORD" with q="THEWORD". However, I don't know what AWORD is. Is it possible to combine a string and a regex to allow me to replace the parameter without knowing its value? This is what I've got thus far...
globalparam.replace('q="/+./"', 'q="AWORD"');
What you have is just a string, not a regular expression. I think this is what you want:
globalparam.replace(/q=".+?"/, 'q="THEWORD"');
I don't know where you got the idea that you have to "combine" a string and a regular expression, but a regex does not need to consist of wildcards only. A regex is a pattern that can contain wildcards but otherwise tries to match the exact characters given.
The expression shown above works as follows:
q=": Match the characters q, = and ".
.+?": Match any character (.) up to (and including) the next ". There must be at least one character (+) and the match is non-greedy (?), meaning it tries to match as few characters as possible. Otherwise, if you used .+", it would match all characters up to the last quotation mark in the string.
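The difference between the greedy and non-greedy forms shows up as soon as the string contains a second quoted value. A small sketch with a made-up input string:

```javascript
// Hypothetical input with two quoted parameters.
var input = 'q="AWORD" r="OTHER"';

// Non-greedy: stops at the first closing quote.
var lazyResult = input.replace(/q=".+?"/, 'q="THEWORD"');
console.log(lazyResult);   // q="THEWORD" r="OTHER"

// Greedy: runs on to the last quotation mark in the string.
var greedyResult = input.replace(/q=".+"/, 'q="THEWORD"');
console.log(greedyResult); // q="THEWORD"
```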
Learn more about regular expressions.
Felix's answer will give you the solution, but if you actually want to construct a regular expression using a string you can do it this way:
var fullstring = 'q="AWORD"';
var sampleStrToFind = 'AWORD';
var mat = 'q="'+sampleStrToFind+'"';
var re = new RegExp(mat);
var newstr = fullstring.replace(re,'q="THEWORD"');
alert(newstr);
mat = the regex you are building, combining strings or whatever is needed.
re = the RegExp object created by the constructor; if you want flags (global, case-insensitive, etc.), pass them as the second argument here.
The last line is string.replace(RegExp,replacement);
