How to obtain index of subpattern in JavaScript regexp?

I wrote a regular expression in JavaScript for searching searchedUrl in a string:
var input = '1234 url( test ) 5678';
var searchedUrl = 'test';
var regexpStr = "url\\(\\s*"+searchedUrl+"\\s*\\)";
var regex = new RegExp(regexpStr , 'i');
var match = input.match(regex);
console.log(match); // return an array
Output:
["url( test )", index: 5, input: "1234 url( test ) 5678"]
Now I would like to obtain the position of searchedUrl (in the example above, the position of test in 1234 url( test ) 5678).
How can I do that?

As far as I could tell, it wasn't possible to get the offset of a sub-match automatically; you have to do the calculation yourself using either lastIndex of the RegExp or the index property of the match object returned by exec(). Depending on which you use, you'll have to either add or subtract the length of the groups leading up to your sub-match. This does mean you have to group the first or last part of the regular expression, up to the pattern you wish to locate.
lastIndex only seems to come into play when using the g (global) flag, and it records the index just after the entire match. So if you wish to use lastIndex, you'll need to work backwards from the end of your pattern.
For more information on the exec() method, see here:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/exec
The following succinctly shows the solution in operation:
var str = '---hello123';
var r = /([a-z]+)([0-9]+)/;
var m = r.exec( str );
alert( m.index + m[1].length ); // will give the position of 123
update
This would apply to your issue using the following:
var input = '1234 url( test ) 5678';
var searchedUrl = 'test';
var regexpStr = "(url\\(\\s*)("+searchedUrl+")\\s*\\)";
var regex = new RegExp(regexpStr , 'i');
var match = regex.exec(input);
Then to get the submatch offset you can use:
match.index + match[1].length
match[1] now contains url( plus the whitespace that follows it, thanks to the bracket grouping, which is what lets us work out the internal offset.
update 2
Things are a little more complicated if the RegExp contains other patterns that you wish to group before the actual pattern you want to locate. In that case, you simply add together the length of each preceding group.
var s = '~- [This may or may not be random|it depends on your perspective] -~';
var r = /(\[)([a-z ]+)(\|)([a-z ]+)(\])/i;
var m = r.exec( s );
To get the offset position of it depends on your perspective you would use:
m.index + m[1].length + m[2].length + m[3].length;
If you know the RegExp has portions that never change length, you can replace those with hard-coded numeric values. However, it's probably best to keep the above .length checks, just in case you, or someone else, ever change what your expression matches.
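The group-length arithmetic above can be wrapped in a small helper. This is only a sketch: groupOffset is a hypothetical name, and it assumes that groups 1 to n-1 sit side by side and together cover everything between the start of the match and group n.

```javascript
// Hypothetical helper: offset of capture group n within the input string.
// Assumes groups 1..n-1 are adjacent and span the text from the start of
// the match up to group n; otherwise the sum will be wrong.
function groupOffset(match, n) {
  var offset = match.index;
  for (var i = 1; i < n; i++) {
    // An optional group that did not participate is undefined; count it as 0.
    offset += (match[i] || '').length;
  }
  return offset;
}

var s = '~- [This may or may not be random|it depends on your perspective] -~';
var m = /(\[)([a-z ]+)(\|)([a-z ]+)(\])/i.exec(s);

// Compare against indexOf as a sanity check.
console.log(groupOffset(m, 4) === s.indexOf('it depends on your perspective'));
```

Keeping the loop over .length values means the offset stays right if someone later changes what the earlier groups match.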

JS doesn't have a direct way to get the index of a subpattern/capturing group. But you can work around that with some tricks. For example:
var reStr = "(url\\(\\s*)" + searchedUrl + "\\s*\\)";
var re = new RegExp(reStr, 'i');
var m = re.exec(input);
if (m) {
    var index = m.index + m[1].length;
    console.log("url found at " + index);
}

You can add the 'd' flag to the regex in order to generate indices for substring matches.
const input = '1234 url( test ) 5678';
const searchedUrl = 'test';
const regexpStr = "url\\(\\s*("+searchedUrl+")\\s*\\)";
const regex = new RegExp(regexpStr , 'id');
const match = regex.exec(input).indices[1];
console.log(match); // [10, 14] for the input above: start index and (exclusive) end index of the group
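A slightly more defensive variant of the same idea is sketched below. It assumes an engine that supports the d flag (ES2022); findGroupRange and the group name target are made up for illustration. It also escapes the searched text and guards against exec() returning null.

```javascript
// Sketch: locate a named group using the `d` (hasIndices) flag.
// findGroupRange and the group name "target" are illustrative, not a standard API.
function findGroupRange(input, searchedUrl) {
  // Escape regex metacharacters so the URL is matched literally.
  var escaped = searchedUrl.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  var regex = new RegExp('url\\(\\s*(?<target>' + escaped + ')\\s*\\)', 'id');
  var match = regex.exec(input);
  // exec() returns null when nothing matches, so guard before reading indices.
  return match ? match.indices.groups.target : null;
}

console.log(findGroupRange('1234 url( test ) 5678', 'test')); // [10, 14]
console.log(findGroupRange('no url here', 'test'));           // null
```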

You don't need the index.
This is a case where providing just a bit more information would have gotten a much better answer. I can't fault you for it; we're encouraged to create simple test cases and cut out irrelevant detail.
But one important item was missing: what you plan to do with that index. In the meantime, we were all chasing the wrong problem. :-)
I had a feeling something was missing; that's why I asked you about it.
As you mentioned in the comment, you want to find the URL in the input string and highlight it in some way, perhaps by wrapping it in a <b></b> tag or the like:
'1234 url( <b>test</b> ) 5678'
(Let me know if you meant something else by "highlight".)
You can use character indexes to do that, but there is a much easier way using the regular expression itself.
Getting the index
But since you asked, if you did need the index, you could get it with code like this:
var input = '1234 url( test ) 5678';
var url = 'test';
var regexpStr = "^(.*url\\(\\s*)"+ url +"\\s*\\)";
var regex = new RegExp( regexpStr , 'i' );
var match = input.match( regex );
var start = match[1].length;
This is a bit simpler than the code in the other answers, but any of them would work equally well. This approach works by anchoring the regex to the beginning of the string with ^ and putting all the characters before the URL in a group with (). The length of that group string, match[1], is your index.
Slicing and dicing
Once you know the starting index of test in your string, you could use .slice() or other string methods to cut up the string and insert the tags, perhaps with code something like this:
// Wrap url in <b></b> tag by slicing and pasting strings
var output =
input.slice( 0, start ) +
'<b>' + url + '</b>' +
input.slice( start + url.length );
console.log( output );
That will certainly work, but it is really doing things the hard way.
Also, I left out some error handling code. What if there is no matching URL? match will be null and reading match[1] will throw. But instead of worrying about that, let's see how we can do it without any character indexing at all.
The easy way
Let the regular expression do the work for you. Here's the whole thing:
var input = '1234 url( test ) 5678';
var url = 'test';
var regexpStr = "(url\\(\\s*)(" + url + ")(\\s*\\))";
var regex = new RegExp( regexpStr , 'i' );
var output = input.replace( regex, "$1<b>$2</b>$3" );
console.log( output );
This code has three groups in the regular expression, one to capture the URL itself, with groups before and after the URL to capture the other matching text so we don't lose it. Then a simple .replace() and you're done!
You don't have to worry about any string lengths or indexes this way. And the code works cleanly if the URL isn't found: it returns the input string unchanged.
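One thing the snippets above gloss over is that the URL is pasted into the pattern unescaped, so a dot in it would match any character. A minimal sketch of the same replace with escaping added follows; escapeRegExp and highlightUrl are hypothetical helper names, not part of the original answer.

```javascript
// Hypothetical helper: make arbitrary text safe to embed in a RegExp source.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Wrap the URL inside url( ... ) in <b></b>, leaving the rest untouched.
function highlightUrl(input, url) {
  var regex = new RegExp('(url\\(\\s*)(' + escapeRegExp(url) + ')(\\s*\\))', 'i');
  return input.replace(regex, '$1<b>$2</b>$3');
}

console.log(highlightUrl('1234 url( a.png ) 5678', 'a.png'));
// 1234 url( <b>a.png</b> ) 5678
console.log(highlightUrl('1234 url( aXpng ) 5678', 'a.png'));
// unchanged, because the dot is now matched literally
```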

Related

Using RegExp to substring a string at the position of a special character

Suppose I have a string like this: ABC5DEF/G, or it might be ABC5DEF-15 or even just ABC5DEF; it could be shorter, like AB7F or AB7FG/H.
I need to create a javascript variable that contains the substring only up to the '/' or the '-'. I would really like to use an array of values to break at. I thought maybe to try something like this.
...
var srcMark = array( '/', '-' );
var whereAt = new RegExp(srcMark.join('|')).test.str;
alert("whereAt= "+whereAt);
...
But this returns an error: ReferenceError: Can't find variable: array
I suspect I'm defining my array incorrectly but trying a number of other things I've been no more successful.
What am I doing wrong?

Arrays aren't defined like that in JavaScript. The easiest way to define one is with an array literal:
var srcMark = ['/','-'];
Additionally, test is a function so it must be called as such:
whereAt = new RegExp(srcMark.join('|')).test(str);
Note that test won't actually tell you where the character is, as your variable name suggests; it returns true or false. If you want to find where the character is, use String.prototype.search:
str.search(new RegExp(srcMark.join('|')));
Hope that helps.
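Putting those pieces together, a minimal sketch of the substring step might look like this. cutAtFirst is a made-up name, and escaping the marks is an addition not in the question, there so that characters such as . or | in the array are treated literally.

```javascript
// Hypothetical helper: return the part of str before the first mark found.
function cutAtFirst(str, marks) {
  // Escape each mark so regex metacharacters are taken literally.
  var pattern = marks.map(function (m) {
    return m.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  }).join('|');
  var at = str.search(new RegExp(pattern));
  // search() returns -1 when no mark is present; keep the whole string then.
  return at === -1 ? str : str.slice(0, at);
}

var srcMark = ['/', '-'];
console.log(cutAtFirst('ABC5DEF/G', srcMark));  // ABC5DEF
console.log(cutAtFirst('ABC5DEF-15', srcMark)); // ABC5DEF
console.log(cutAtFirst('AB7F', srcMark));       // AB7F
```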

You need to use the split method:
var srcMark = ['-','/'].join('|'); // "-|/"
var regEx = new RegExp(srcMark,'g'); // /-|\//g
var substring = "222-22".split(regEx)[0]; // "222"
"ABC5DEF/G".split(regEx)[0] // "ABC5DEF"

From what I could understand of your question, using the RegExp /[/-]/ in the split() function will work.
EDIT:
For splitting the string at all special characters you can use new RegExp(/[^a-zA-Z0-9]/) in split() function.
var arr = "ABC5DEF/G";
var ans = arr.split(/[/-]/);
console.log(ans[0]);
arr = "ABC5DEF-15";
ans = arr.split(/[/-]/);
console.log(ans[0]);
// For all special characters
arr = "AB7FG/H";
ans = arr.split(new RegExp(/[^a-zA-Z0-9]/));
console.log(ans[0]);

You can use regex with String.split.
It will look something like this:
var result = ['ABC5DEF/G',
'ABC5DEF-15',
'ABC5DEF',
'AB7F',
'AB7FG/H'
].map((item) => item.split(/\W+/));
console.log(result);
That will create an Array with all the parts of the string, so each item[0] will contain the text till the / or - or nothing.

If you want the position of the special character (non-alpha-numeric) you can use a Regular Expression that matches any character that is not a word character from the basic Latin alphabet. Equivalent to [^A-Za-z0-9_], that is: \W
var pattern = /\W/;
var text = 'ABC5DEF/G';
var match = pattern.exec(text);
var position = match.index;
console.log('character: ', match[0]);
console.log('position: ', position);

Regex - match the better part of a word in a search string

I am using JavaScript and am currently looking for a way to match as many of my pattern's letters as possible, maintaining the original order.
For example, the search pattern queued should return the match Queue/queue against any of the following search strings:
queueTable
scheduledQueueTable
qScheduledQueueTable
As of now I've reached as far as this:
var myregex = new RegExp("([queued])", "i");
var result = myregex.exec('queueTable');
but it doesn't seem to work correctly as it highlights the single characters q,u,e,u,e and e at the end of the word Table.
Any ideas?

Generate a regex made of nested optional non-capturing groups; the pattern can be built with the Array#reduceRight method.
var myregex = new RegExp("queued"
.split('')
.reduceRight(function(str, s) {
return '(?:' + s + str + ')?';
}, ''), "i");
var result = myregex.exec('queueTable');
console.log(result)
The method generates the regex: /(?:q(?:u(?:e(?:u(?:e(?:d)?)?)?)?)?)?/i
UPDATE: If you want to get the first longest match, use the g modifier in the regex and pick the longest result with the Array#reduce method.
var myregex = new RegExp(
"queued".split('')
.reduceRight(function(str, s) {
return '(?:' + s + str + ')?';
}, ''), "ig");
var result = 'qscheduledQueueTable'
.match(myregex)
.reduce(function(a, b) {
return a.length >= b.length ? a : b; // >= keeps the earlier match on ties
});
console.log(result);

I think the logic would have to be something like:
Match as many of these letters as possible, in this order.
The only real answer that comes to mind is to get the match to continue if possible, but allow it to bail out. In this case...
myregex = /q(?:u(?:e(?:u(?:e(?:d|)|)|)|)|)/;
You can generate this, of course:
function matchAsMuchAsPossible(word) { // name me something sensible please!
    return new RegExp(
        word.split("").join("(?:")
        + (new Array(word.length).join("|)"))
    );
}
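To see the generated pattern pick out Queue from one of the longer strings in the question, the sketch below scans all matches and keeps the first longest one. longestPartialMatch is a hypothetical name, and the code assumes the word consists of plain letters (nothing is escaped).

```javascript
// Sketch: find the longest leading portion of `word` that occurs in `text`.
// Assumes `word` contains only plain letters; metacharacters are not escaped.
function longestPartialMatch(word, text) {
  if (!word) return null; // an empty pattern would match everywhere
  // Same nested pattern as above, e.g. q(?:u(?:e(?:u(?:e(?:d|)|)|)|)|)
  var source = word.split('').join('(?:') + new Array(word.length).join('|)');
  var regex = new RegExp(source, 'gi');
  var best = null;
  var m;
  while ((m = regex.exec(text)) !== null) {
    // Strictly greater keeps the first of several equally long candidates.
    if (best === null || m[0].length > best.text.length) {
      best = { text: m[0], index: m.index };
    }
  }
  return best;
}

console.log(longestPartialMatch('queued', 'qScheduledQueueTable'));
// { text: 'Queue', index: 10 }
```

Every match starts with the first letter of the word, so matches are never empty and the exec() loop always advances.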

You are using square brackets, which means the pattern will match a single instance of any one of the characters listed inside.
There are a few ways of interpreting your intentions:
You want to match the word queue with an optional 'd' at the end:
var myregex = new RegExp("queued?", "i");
var result = myregex.exec('queueTable');
Note that this can be written more briefly:
'queueTable'.match(/queued?/i);
I also removed the brackets as these were not adding anything here.
This link provides some good examples that may help you further: https://www.w3schools.com/js/js_regexp.asp

When you use [] in a regular expression, it means you want to match any of the characters inside the brackets.
Example: if I use [abc] it means "match a single character, and this character can be 'a', 'b' or 'c'"
So in your code [queued] means "match a single character, and this character can be 'q', 'u', 'e' or 'd'" - note that 'u' and 'e' appear twice so they are redundant in this case. That's why this expression matches just one single character.
If you want to match the whole string "queued", just remove the brackets. But in this case it won't match, because queueTable doesn't have 'd'. If you want 'd' to be optional, you can use queued? as already explained in previous answers.

Try something like the following :
var myregex = /queued?\B/g;
var result = myregex.exec('queueTable');
console.log(result);

Find chars in string but prefer consecutive chars with NFA without atomic grouping

I'm trying to create a regex that will find chars anywhere in a string. I would prefer if they would first find consecutive chars though.
Let me give an example, assume s = 'this is a test test string' and I'm searching for tst I would want to find it like so:
// Correct
// v vv
s = 'this is a test test string'
And not:
// Incorrect
// v v v
s = 'this is a test test string'
Also if s = 'this is a test test tst string'
// Correct
// vvv
s = 'this is a test test tst string'
A couple of things to note:
The searching chars are user supplied (tst in this case)
I'm using JavaScript, so I can't use atomic grouping, which I suspect would make this a lot easier
My best try is something like this:
var find = 'tst';
var rStarts = [];
var rEnds = [];
for (var i = 0; i < find.length - 1; i++) {
rStarts.push( '(' + find[i] + find[i + 1] );
rEnds.push( find[i] + '[^]*?' + find[i + 1] + ')' );
}
But halfway through I realized I had no idea where I was going with it.
Any ideas how to do this?

You can do something like this:
Compute regexps for all combinations of substrings of the needle in the order you prefer and match them sequentially. So for your test, you can do the following matches:
/(tst)/
/(ts).*(t)/
/(t).*(st)/ // <- this one matches
/(t).*(s).*(t)/
Computing the regexps is tricky and making them in the right order depends on whether you prefer a 4-1-1 split over a 2-2-2 split.
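One way to compute those regexps is to enumerate every way of cutting the needle into consecutive pieces and try the patterns with the fewest pieces first. The sketch below is only one possible ordering (ties between equally sized splits are left in enumeration order). The names splitPatterns and findPreferred are invented for illustration, and the needle is assumed to be non-empty plain letters.

```javascript
// Sketch: build one regex per way of cutting `needle` into consecutive pieces.
// Assumes a non-empty needle of plain letters (nothing is escaped).
function splitPatterns(needle) {
  var results = [];
  var gaps = needle.length - 1;
  // Each bit of `mask` decides whether to cut between characters i and i + 1.
  for (var mask = 0; mask < (1 << gaps); mask++) {
    var pieces = [];
    var piece = needle[0];
    for (var i = 0; i < gaps; i++) {
      if (mask & (1 << i)) { pieces.push(piece); piece = ''; }
      piece += needle[i + 1];
    }
    pieces.push(piece);
    results.push(pieces);
  }
  // Prefer fewer pieces, i.e. longer runs of consecutive characters.
  results.sort(function (a, b) { return a.length - b.length; });
  return results.map(function (pieces) {
    return new RegExp(pieces.map(function (p) { return '(' + p + ')'; }).join('.*'));
  });
}

// Try the patterns in order and return the first one that matches.
function findPreferred(needle, haystack) {
  var patterns = splitPatterns(needle);
  for (var i = 0; i < patterns.length; i++) {
    var m = patterns[i].exec(haystack);
    if (m) return { pattern: patterns[i], match: m };
  }
  return null;
}

console.log(findPreferred('tst', 'this is a test test string').pattern.source);
// (t).*(st)
```

The number of patterns doubles with each extra character, so this only suits short needles.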

This finds the shortest collection of a supplied group of letters:
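function findChars(chars,string)
{
    var rx = new RegExp(chars.split("").join(".*?"),"g");
    var finds = [];
    var res; // declared locally to avoid leaking a global
    while(res = rx.exec(string))
    {
        finds.push(res[0]);
        // step back so the next search starts one character after this match's start
        rx.lastIndex -= res[0].length-1;
    }
    finds.sort(function(a,b) { return a.length-b.length; })
    return finds[0];
}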
var s2 = 'this is a test test tst string';
console.log(findChars('tst',s2));//"tst"
console.log(findChars('ess',s2));//"est ts"

Well, I'm still not sure what you're looking for exactly, but maybe that will do for a first try:
.*?(t)(s)(t)|.*?(t)(s).*?(t)|.*?(t).*?(s)(t)|(t).*?(s).*?(t)
I'm capturing each of the letters here, but if you don't mind grouping them...
.*?(tst)|.*?(ts).*?(t)|.*?(t).*?(st)|(t).*?(s).*?(t)
This will match the parts you mentioned in your question.

You can use lookaheads to mimic atomic groups, as discussed in this article. This regex seems to do what you want:
/^(?:(?=(.*?tst))\1|(?=(.*?ts.+?t))\2|(?=(.*?t.+?st))\3|(?=(.*?t.+?s.+?t))\4)/
...or in human-readable form:
^
(?:
(?=(.*?tst))\1
|
(?=(.*?ts.+?t))\2
|
(?=(.*?t.+?st))\3
|
(?=(.*?t.+?s.+?t))\4
)

Javascript RegExp match & Multiple backreferences

I'm having trouble trying to use multiple backreferences in a JavaScript match. So far I've got:
function newIlluminate() {
var string = "the time is a quarter to two";
var param = "time";
var re = new RegExp("(" + param + ")", "i");
var test = new RegExp("(time)(quarter)(the)", "i");
var matches = string.match(test);
$("#debug").text(matches[1]);
}
newIlluminate();
When matching against the regex 're', #debug prints 'time', which is the value of param.
I've seen match examples where multiple backreferences are used by wrapping each part of the pattern in parentheses; however, my match for (time)(quarter)... is returning null.
Where am I going wrong? Any help would be greatly appreciated!

Your regex is literally looking for timequarterthe and splitting the match (if it finds one) into the three backreferences.
I think you mean this:
var test = /time|quarter|the/ig;

Your regex test simply doesn't match the string (as it does not contain the substring timequarterthe). I guess you want alternation:
var test = /time|quarter|the/ig; // does not even need a capturing group
var matches = string.match(test);
$("#debug").text(matches!=null ? matches.join(", ") : "did not match");
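If the position of each word matters as well, a loop over exec() with the g flag reports it through the index property. A minimal sketch follows; findWords is a hypothetical name, and the words are assumed to be plain text with no regex metacharacters.

```javascript
// Sketch: list every occurrence of any of the given words, with its position.
// Assumes the words are plain text (no regex metacharacters).
function findWords(text, words) {
  if (!words.length) return []; // an empty alternation would match everywhere
  var test = new RegExp(words.join('|'), 'ig');
  var found = [];
  var m;
  while ((m = test.exec(text)) !== null) {
    found.push({ word: m[0], index: m.index });
  }
  return found;
}

console.log(findWords('the time is a quarter to two', ['time', 'quarter', 'the']));
// [ { word: 'the', index: 0 }, { word: 'time', index: 4 }, { word: 'quarter', index: 14 } ]
```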

How can I concatenate regex literals in JavaScript?

Is it possible to do something like this?
var pattern = /some regex segment/ + /* comment here */
/another segment/;
Or do I have to use new RegExp() syntax and concatenate a string? I'd prefer to use the literal as the code is both more self-evident and concise.

Here is how to create a regular expression without using the regular expression literal syntax. This lets you do arbitrary string manipulation before it becomes a regular expression object:
var segment_part = "some bit of the regexp";
var pattern = new RegExp("some regex segment" + /*comment here */
segment_part + /* that was defined just now */
"another segment");
If you have two regular expression literals, you can in fact concatenate them using this technique:
var regex1 = /foo/g;
var regex2 = /bar/y;
var flags = (regex1.flags + regex2.flags).split("").sort().join("").replace(/(.)(?=.*\1)/g, "");
var regex3 = new RegExp(regex1.source + regex2.source, flags);
// regex3 is now /foobar/gy
It's just wordier than having the two expressions as literal strings instead of literal regular expressions.
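The same technique extends to any number of pieces. The function below is a sketch under a few assumptions: concatRegExps is a made-up name, flags are simply the union of the flags of all parts, and backreferences such as \1 are not renumbered.

```javascript
// Sketch: join the sources of several RegExp objects and merge their flags.
// Backreferences are NOT renumbered, and the union of flags is assumed valid
// (for example, mixing the u and v flags would throw).
function concatRegExps() {
  var parts = Array.prototype.slice.call(arguments);
  var source = parts.map(function (r) { return r.source; }).join('');
  var flags = '';
  parts.forEach(function (r) {
    r.flags.split('').forEach(function (f) {
      if (flags.indexOf(f) === -1) flags += f; // keep each flag once
    });
  });
  return new RegExp(source, flags);
}

var combined = concatRegExps(/foo/g, /bar/i, /\d+/);
console.log(combined.source, combined.flags); // foobar\d+ gi
```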

Just randomly concatenating regular expressions objects can have some adverse side effects. Use the RegExp.source instead:
var r1 = /abc/g;
var r2 = /def/;
var r3 = new RegExp(r1.source + r2.source,
(r1.global ? 'g' : '')
+ (r1.ignoreCase ? 'i' : '') +
(r1.multiline ? 'm' : ''));
console.log(r3);
var m = 'test that abcdef and abcdef has a match?'.match(r3);
console.log(m);
// m should contain 2 matches
This will also give you the ability to retain the regular expression flags from a previous RegExp using the standard RegExp flags.

I don't quite agree with the "eval" option.
var xxx = /abcd/;
var yyy = /efgh/;
var zzz = new RegExp(eval(xxx)+eval(yyy));
will give "//abcd//efgh//" which is not the intended result.
Using source like
var zzz = new RegExp(xxx.source+yyy.source);
will give "/abcdefgh/" and that is correct.
Logically there is no need to EVALUATE; you know your EXPRESSION. You just need its SOURCE, that is, how it is written, not necessarily its value. As for the flags, you just need to use the optional second argument of RegExp.
In my situation, I do run into the issue of ^ and $ being used in several expressions I am trying to concatenate together! Those expressions are grammar filters used across the program. Now I want to use some of them together to handle the case of PREPOSITIONS.
I may have to "slice" the sources to remove the starting and ending ^( and/or )$ :)
Cheers, Alex.

There is a problem if the regexp contains backreferences like \1.
var r = /(a|b)\1/ // Matches aa, bb but nothing else.
var p = /(c|d)\1/ // Matches cc, dd but nothing else.
Then just concatenating the sources will not work. Indeed, the combination of the two is:
var rp = /(a|b)\1(c|d)\1/
rp.test("aadd") // Returns false
The solution:
First we count the number of capturing groups in the first regex. Then, for each backreference in the second, we increment its number by that count.
function concatenate(r1, r2) {
var count = function(r, str) {
// match() returns null when r1 has no groups, so fall back to an empty array
return (str.match(r) || []).length;
}
var numberGroups = /([^\\]|^)(?=\((?!\?:))/g; // Home-made regexp to count groups.
var offset = count(numberGroups, r1.source);
var escapedMatch = /[\\](?:(\d+)|.)/g; // Home-made regexp for escaped literals, greedy on numbers.
var r2newSource = r2.source.replace(escapedMatch, function(match, number) { return number?"\\"+(number-0+offset):match; });
return new RegExp(r1.source+r2newSource,
(r1.global ? 'g' : '')
+ (r1.ignoreCase ? 'i' : '')
+ (r1.multiline ? 'm' : ''));
}
Test:
var rp = concatenate(r, p) // returns /(a|b)\1(c|d)\2/
rp.test("aadd") // Returns true

Provided that:
you know what you do in your regexp;
you have many regex pieces to form a pattern and they will use same flag;
you find it more readable to separate your small pattern chunks into an array;
you also want to be able to comment each part for next dev or yourself later;
you prefer to visually simplify your regex like /this/g rather than new RegExp('this', 'g');
it's ok for you to assemble the regex in an extra step rather than having it in one piece from the start;
Then you may like to write this way:
var regexParts =
[
/\b(\d+|null)\b/,// Some comments.
/\b(true|false)\b/,
/\b(new|getElementsBy(?:Tag|Class|)Name|arguments|getElementById|if|else|do|null|return|case|default|function|typeof|undefined|instanceof|this|document|window|while|for|switch|in|break|continue|length|var|(?:clear|set)(?:Timeout|Interval))(?=\W)/,
/(\$|jQuery)/,
/many more patterns/
],
regexString = regexParts.map(function(x){return x.source}).join('|'),
regexPattern = new RegExp(regexString, 'g');
You can then do something like:
string.replace(regexPattern, function()
{
var m = arguments,
Class = '';
switch(true)
{
// Numbers and 'null'.
case (Boolean)(m[1]):
m = m[1];
Class = 'number';
break;
// True or False.
case (Boolean)(m[2]):
m = m[2];
Class = 'bool';
break;
// Keywords.
case (Boolean)(m[3]):
m = m[3];
Class = 'keyword';
break;
// $ or 'jQuery'.
case (Boolean)(m[4]):
m = m[4];
Class = 'dollar';
break;
// More cases...
}
return '<span class="' + Class + '">' + m + '</span>';
})
In my particular case (a code-mirror-like editor), it is much easier to run one big regex than a chain of replaces like the following. Each time I wrap an expression in an HTML tag, the next pattern becomes harder to target without affecting the tag itself (and lookbehind, which would help, was not supported in JavaScript when this was written):
.replace(/(\b\d+|null\b)/g, '<span class="number">$1</span>')
.replace(/(\btrue|false\b)/g, '<span class="bool">$1</span>')
.replace(/\b(new|getElementsBy(?:Tag|Class|)Name|arguments|getElementById|if|else|do|null|return|case|default|function|typeof|undefined|instanceof|this|document|window|while|for|switch|in|break|continue|var|(?:clear|set)(?:Timeout|Interval))(?=\W)/g, '<span class="keyword">$1</span>')
.replace(/\$/g, '<span class="dollar">$</span>')
.replace(/([\[\](){}.:;,+\-?=])/g, '<span class="ponctuation">$1</span>')

It would be preferable to use the literal syntax as often as possible. It's shorter, more legible, and you do not need to escape quotes or double-escape backslashes. From "JavaScript Patterns", Stoyan Stefanov, 2010.
But using new RegExp may be the only way to concatenate.
I would avoid eval. It's not safe.

You could do something like:
function concatRegex(...segments) {
return new RegExp(segments.join(''));
}
The segments would be strings (rather than regex literals) passed in as separate arguments.

You can concat regex source from both the literal and RegExp class:
var xxx = new RegExp(/abcd/);
var zzz = new RegExp(xxx.source + /efgh/.source);

Use the constructor with 2 params and avoid the problem with trailing '/':
var re_final = new RegExp("\\" + ".", "g"); // constructor can have 2 params!
console.log("...finally".replace(re_final, "!") + "\n" + re_final +
" works as expected..."); // !!!finally works as expected
// meanwhile
re_final = new RegExp("\\" + "." + "g"); // "g" becomes part of the pattern: /\.g/
console.log("... finally".replace(re_final, "!")); // "... finally" (unchanged)
console.log(re_final, "does not work!"); // /\.g/ does not work!

No, the literal way is not supported. You'll have to use RegExp.

The easiest way, to me, would be to concatenate the sources, e.g.:
a = /\d+/
b = /\w+/
c = new RegExp(a.source + b.source)
The value of c will then be:
/\d+\w+/

I prefer to use eval('your expression') because it does not add the / on each end that new RegExp does.
