Multiple nested matches in JavaScript Regular Expression

I'm trying to write a regular expression to match GS1 barcode patterns (https://en.wikipedia.org/wiki/GS1-128) that contain two or more fields, each consisting of an identifier followed by a certain number of data characters.
I need something that matches this barcode, because it contains two identifier-and-data patterns:
human readable with the identifiers in parens: (01)12345678901234(17)501200
actual data: 011234567890123417501200
but should not match this barcode, which contains only one pattern:
human readable: (01)12345678901234
actual data: 0112345678901234
It seems like the following should work:
var regex = /(?:01(\d{14})|10([^\x1D]{6,20})|11(\d{6})|17(\d{6})){2,}/g;
var str = "011234567890123417501200";
console.log(str.replace(regex, "$4"));
// matches 501200
console.log(str.replace(regex, "$1"));
// no match? why?
For some strange reason as soon as I remove the {2,} it works, but I need the {2,} so that it only returns matches if there is more than one match.
// Remove {2,} and it will return the first match
var regex = /(?:01(\d{14})|10([^\x1D]{6,20})|11(\d{6})|17(\d{6}))/g;
var str = "011234567890123417501200";
console.log(str.replace(regex, "$4"));
// matches 501200
console.log(str.replace(regex, "$1"));
// matches 12345678901234
// but then the problem is it would also match single identifiers such as
var str2 = "0112345678901234";
console.log(str2.replace(regex, "$1"));
How do I make this work so it will only match and pull the data if there is more than 1 set of match groups?
Thanks!

Your RegEx is logically and syntactically correct for Perl-Compatible Regular Expressions (PCRE). The issue I believe you are facing is that JavaScript handles repeated capture groups differently: each time a quantified group repeats, the captures inside it are reset, so only the captures from the final repetition survive. This is why the RegEx works fine once you take out the {2,}. By adding the quantifier, JavaScript will be sure to return only the last match.
What I would recommend is removing the {2,} quantifier and then programmatically checking for matches. I know it's not ideal for those who are big fans of RegEx, but c'est la vie.
Please see the snippet below:
var regex = /(?:01(\d{14})|10([^\x1D]{6,20})|11(\d{6})|17(\d{6}))/g;
var str = "011234567890123417501200";
// Check to see if we have at least 2 matches.
var m = str.match(regex) || []; // match() returns null when nothing matches
console.log("Matches list: " + JSON.stringify(m));
if (m.length < 2) {
console.log("We only received " + m.length + " matches.");
} else {
console.log("We received " + m.length + " matches.");
console.log("We have achieved the minimum!");
}
// If we exec the regex, what would we get?
console.log("** Method 1 **");
var n;
while (n = regex.exec(str)) {
console.log(JSON.stringify(n));
}
// That's not going to work. Let's try using a second regex.
console.log("** Method 2 **");
var regex2 = /^(\d{2})([^\x1D]{6,})$/; // data for identifier 10 may be non-numeric
var arr = [];
var obj = {};
for (var i = 0, len = m.length; i < len; i++) {
arr = m[i].match(regex2);
obj[arr[1]] = arr[2];
}
console.log(JSON.stringify(obj));
// EOF
I hope this helps.

The reason is that the capture groups only give the last match by that particular group. Imagine that you would have two barcodes in your sequence that have both the same identifier 01... now it becomes clear that $1 cannot refer to both at the same time. The capture group only retains the second occurrence.
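To see this behaviour in isolation, here is a minimal sketch using a shortened version of the pattern (only identifiers 01 and 17, an assumption made purely to keep the example small). In JavaScript the captures inside a quantified group are reset on every repetition, so after the second repetition group 1 is undefined and only group 2 is set:

```javascript
// Shortened pattern: only identifiers 01 and 17 (illustrative assumption).
const re = /(?:01(\d{14})|17(\d{6})){2,}/;
const m = "011234567890123417501200".match(re);

// m[0] is the whole match, m[1] and m[2] are what the groups hold
// after the LAST repetition of the quantified group.
const captured = { whole: m[0], first: m[1], second: m[2] };
console.log(captured);
// { whole: '011234567890123417501200', first: undefined, second: '501200' }
```

This is why "$1" produces an empty string in the replace call from the question.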
A straightforward, though not very elegant, way is to drop the {2,} and instead repeat the whole regular expression pattern to match the second barcode sequence. I think you also need to use ^ (the start-of-string anchor) to be sure the match is at the start of the string; otherwise you might pick up an identifier halfway through an invalid sequence. After the repeated pattern you should also add .* if you want to ignore anything that follows the second sequence, so that it does not come back to you when using replace.
Finally, as you don't know which identifier will be found for the first and second match, you need to reproduce $1$2$3$4 in your replace, knowing that only one of those four will be a non-empty string. Same for the second match: $5$6$7$8.
Here is the improved code applied to your example string:
var regex = /^(?:01(\d{14})|10([^\x1D]{6,20})|11(\d{6})|17(\d{6}))(?:01(\d{14})|10([^\x1D]{6,20})|11(\d{6})|17(\d{6})).*/;
var str = "011234567890123417501200";
console.log(str.replace(regex, "$1$2$3$4")); // 12345678901234
console.log(str.replace(regex, "$5$6$7$8")); // 501200
If you also need to match the barcodes that follow the second one, then you cannot avoid writing a loop. You cannot do that with just a regular-expression-based replace.
With a loop
If a loop is allowed, then you can use the regex#exec method. I would then suggest to add in your regular expression a kind of "catch all", which will match one character if none of the other identifiers match. If in the loop you detect such a "catch all" match, you exit:
var str = "011234567890123417501200";
var regex = /(?:01(\d{14})|10([^\x1D]{6,20})|11(\d{6})|17(\d{6})|(.))/g;
// Groups 1-4 hold the data for identifiers 01, 10, 11 and 17; group 5 is the catch-all (= failure)
var result = [], grp;
while ((grp = regex.exec(str)) && !grp[5]) result.push(grp.slice(1).join(''));
// Consider it a failure when not at least 2 matched.
if (result.length < 2) result = [];
console.log(result);
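As a variation on the loop idea, here is a rough sketch that uses the sticky flag (y) so that every match has to start exactly where the previous one ended. The function name parseGs1 and the shape of the returned objects are hypothetical, and the sketch assumes the data contains no \x1D separator characters; treat it as an illustration rather than a complete GS1 parser.

```javascript
// Hypothetical helper: returns [{ ai, data }, ...] when the whole string
// parses into at least two fields, otherwise an empty array.
function parseGs1(str) {
  // Sticky flag: exec() only matches at re.lastIndex, never further along.
  const re = /01(\d{14})|10([^\x1D]{6,20})|11(\d{6})|17(\d{6})/y;
  const fields = [];
  let pos = 0;
  while (pos < str.length) {
    re.lastIndex = pos;
    const m = re.exec(str);
    if (m === null) return []; // the remainder is not a known field
    fields.push({ ai: m[0].slice(0, 2), data: m[0].slice(2) });
    pos = re.lastIndex;
  }
  return fields.length >= 2 ? fields : [];
}

console.log(parseGs1("011234567890123417501200"));
// [ { ai: '01', data: '12345678901234' }, { ai: '17', data: '501200' } ]
console.log(parseGs1("0112345678901234"));
// []
```

Tracking the position in a separate variable matters here, because a failed exec() on a sticky regex resets lastIndex to 0.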

update
1st example
Example with $1, $2, $3 and $4. Each replace call runs once per match, which is why the output looks like a matrix,
but you can see that $1 -> abc,
$2 -> def, $3 -> ghi and $4 -> jkl.
// $1 $2 $3 $4
var regex = /(abc)|(def)|(ghi)|(jkl)/g;
var str = "abcdefghijkl";
// test
console.log(str.replace(regex, "$1 1st "));
console.log(str.replace(regex, "$2 2nd "));
console.log(str.replace(regex, "$3 3rd "));
console.log(str.replace(regex, "$4 4th "));
2nd example
Something gets mixed up here:
// $1 $2 $3 $4
var regex = /((abc)|(def)|(ghi)|(jkl)){2,}/g;
var str = "abcdefghijkl";
// test
console.log(str.replace(regex, "$1 1st "));
console.log(str.replace(regex, "$2 2nd "));
console.log(str.replace(regex, "$3 3rd "));
console.log(str.replace(regex, "$4 4th "));
As you can see, the first line shows jkl (the fourth alternative) instead of abc, and the other lines show nothing.
If I understand correctly, the problem is the outer parentheses: they form a capture group of their own, so the outer group is $1 and the inner groups shift to $2 through $5. With {2,} applied to the outer group, $1 holds whatever the last repetition matched, which here is jkl, and the inner groups that did not take part in the last repetition are empty. In other words you have ($1 ($2)($3)($4)($5)) rather than four top-level groups. The pattern in the question uses a non-capturing (?:...) as the outer group, so the numbering does not shift there, but the same rule applies: only the values from the last repetition are remembered. That's why $1 doesn't display.
I add the picture to show what I mean.
As @Damian said:
"By adding the quantifier, JavaScript will be sure to return only the last match."
So $4 is the last match.
end update
I added a useful little test:
var regex = /(?:01(\d{14})|10([^\x1D]{6,20})|11(\d{6})|17(\d{6})){2,}/g;
var str = "011234567890123417501200";
// test
console.log(str.replace(regex, "$1 1st "));
console.log(str.replace(regex, "$2 2nd "));
console.log(str.replace(regex, "$3 3rd "));
console.log(str.replace(regex, "$4 4th "));

Related

Regex match all punctuations except 'D.C.'

I'm trying to write a regex that finds all punctuation marks [.!?] in order to capitalize the next word. However, if the period is part of the string 'D.C.' it should be ignored. So far I have the first part working, but I'm not sure how to ignore 'D.C.':
const punctuationCaps = /(^|[.!?]\s+)([a-z])/g;

You can match the D.C. part and use an alternation using the 2 capturing groups that you already have.
In the replacement check for one of the groups. If it is present, concatenate them making group 2 toUpperCase(), else return the match keeping D.C. in the string.
const regex = /D\.C\.|(^|[.!?]\s+)([a-z])/g;
let s = "this it D.C. test. and? another test! it is.";
s = s.replace(regex, (m, g1, g2) => g2 ? g1 + g2.toUpperCase() : m);
console.log(s);

Use a negative lookbehind:
var str = 'is D.C. a capital? i don\'t know about X.Y. stuff.';
var result = str.replace(/(^|[.!?](?<![A-Z]\.[A-Z]\.)\s+)([a-z])/g, (m, c1, c2) => { return c1 + c2.toUpperCase(); });
console.log('in: '+str);
console.log('out: '+result);
Console output:
in: is D.C. a capital? i don't know about X.Y. stuff.
out: Is D.C. a capital? I don't know about X.Y. stuff.
Explanation:
(^|[.!?]) - expect start of string, or a punctuation char
(?<![A-Z]\.[A-Z]\.) - negative lookbehind: but not when the punctuation char ends a sequence of upper case char and dot, repeated twice
\s+ - expect one or more whitespace chars
all of the above is captured because of the parentheses
([a-z]) - expect a lower case char, in parenthesis for second capture group
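For reuse, the same pattern can be wrapped in a small function. This is only a sketch: the name capitalizeSentences is made up for illustration, and it assumes an engine with lookbehind support (ES2018 or later).

```javascript
// Hypothetical wrapper around the lookbehind-based pattern from this answer.
function capitalizeSentences(text) {
  const re = /(^|[.!?](?<![A-Z]\.[A-Z]\.)\s+)([a-z])/g;
  // lead = start of string or punctuation plus whitespace, letter = char to upper-case
  return text.replace(re, (match, lead, letter) => lead + letter.toUpperCase());
}

const sample = capitalizeSentences("we met in D.C. last year. it rained! was it cold?");
console.log(sample);
// We met in D.C. last year. It rained! Was it cold?
```

One consequence of this approach is that a sentence which genuinely ends in an abbreviation such as D.C. will not have its following word capitalized.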

regular expression replacement in JavaScript with some part remaining intact

I need to parse a string that comes like this:
-38419-indices-foo-7119-attributes-10073-bar
Where there are numbers followed by one or more words all joined by dashes. I need to get this:
[
0 => '38419-indices-foo',
1 => '7119-attributes',
2 => '10073-bar',
]
I had thought of attempting to replace only the dash before a number with a : and then using .split(':') - how would I do this? I don't want to replace the other dashes.

IMO, the pattern is straightforward:
\d+\D+
To even get rid of the trailing -, you could go for
(\d+\D+)(?:-|$)
Or
\d+(?:(?!-\d|$).)+
You can see it here:
var myString = "-38419-indices-foo-7119-attributes-10073-bar";
var myRegexp = /(\d+\D+)(?:-|$)/g;
var result = [];
var match = myRegexp.exec(myString);
while (match != null) {
// matched text: match[0]
// match start: match.index
// capturing group n: match[n]
result.push(match[1]);
match = myRegexp.exec(myString);
}
console.log(result);
// alternative 2
let alternative_results = myString.match(/\d+(?:(?!-\d|$).)+/g);
console.log(alternative_results);
Or a demo on regex101.com.
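The idea from the question, touching only the dashes that sit directly before a number, can also be sketched without the intermediate ":" by splitting on a lookahead. This assumes none of the words themselves start with a digit; the variable names are just for illustration.

```javascript
const input = "-38419-indices-foo-7119-attributes-10073-bar";

// Split on each dash that is immediately followed by a digit. The dash is
// consumed, the digit is not. The leading dash yields an empty first item,
// which filter(Boolean) drops.
const parts = input.split(/-(?=\d)/).filter(Boolean);
console.log(parts);
// [ '38419-indices-foo', '7119-attributes', '10073-bar' ]
```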

Logic
lazy matching using quantifier .*?
Regex
.*?((\d+)\D*)(?!-)
https://regex101.com/r/WeTzF0/1
Test string
-38419-indices-foo-7119-attributes-10073-bar-333333-dfdfdfdf-dfdfdfdf-dfdfdfdfdfdf-123232323-dfsdfsfsdfdf
Further steps
You need to split from the matches and insert into your desired array.

javascript - regexp exec internal index doesn't progress if first char is not a match

I need to match numbers that are not preceded by "/" in a group.
In order to do this I made the following regex:
var reg = /(^|[^,\/])([0-9]*\.?[0-9]*)/g;
First part matches start of the string and anything else except "/", second part matches a number. Everything works ok regarding the regex (it matches what I need). I use https://regex101.com/ for testing. Example here: https://regex101.com/r/7UwEUn/1
The problem is that when I use it in js (script below) it goes into an infinite loop if first character of the string is not a number. At a closer look it seems to keep matching the start of the string, never progressing further.
var reg = /(^|[^,\/])([0-9]*\.?[0-9]*)/g;
var text = "a 1 b";
var numbers = [], match;
while (match = reg.exec(text)) {
if (typeof match[2] != 'undefined' && match[2] != '') {
numbers.push({'index': match.index + match[1].length, 'value': match[2]});
}
}
If the string starts with a number ("1 a b") all is fine.
The problem appears to be here (^|[^,/]) - removing ^| will fix the issue with infinite loop but it will not match what I need in strings starting with numbers.
Any idea why the internal index is not progressing?

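The infinite loop is caused by the fact that your regex can match an empty string: after a zero-length match, exec leaves lastIndex unchanged, so the next call matches at the same position again. You are not likely to need empty strings (even judging by your code), so make it match at least one digit by replacing the last * with +: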
var reg = /(^|[^,\/])([0-9]*\.?[0-9]+)/g;
var text = "a 1 b a 2 ana 1/2 are mere (55";
var numbers = [], match;
while (match = reg.exec(text)) {
numbers.push({'index': match.index + match[1].length, 'value': match[2]});
}
console.log(numbers);
Note that this regex will not match numbers like 34. and in that case you may use /(^|[^,\/])([0-9]*\.?[0-9]+|[0-9]*\.)/g, see this regex demo.
Alternatively, you may use another "trick", advance the regex lastIndex manually upon no match:
var reg = /(^|[^,\/])([0-9]*\.?[0-9]+)/g;
var text = "a 1 b a 2 ana 1/2 are mere (55";
var numbers = [], match;
while (match = reg.exec(text)) {
if (match.index === reg.lastIndex) {
reg.lastIndex++;
}
if (match[2]) numbers.push({'index': match.index + match[1].length, 'value': match[2]});
}
console.log(numbers);
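If String.prototype.matchAll is available (ES2020), a third option is to let it do the advancing: its iterator steps past zero-length matches on its own, so even the original pattern from the question terminates. A minimal sketch, reusing the question's regex and test string:

```javascript
// Original pattern from the question, which can match an empty string.
const reg = /(^|[^,\/])([0-9]*\.?[0-9]*)/g;
const text = "a 1 b";
const numbers = [];

// matchAll advances past empty matches, so this loop cannot get stuck.
for (const match of text.matchAll(reg)) {
  if (match[2] !== "") {
    numbers.push({ index: match.index + match[1].length, value: match[2] });
  }
}
console.log(numbers);
// [ { index: 2, value: '1' } ]
```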

Javascript Regular Expression to search a multiline textarea for an expression

Let's say there is a textarea with the following value ('*' being used as a bullet point):
*south
*north
*west
I want to be able to automatically generate an array of these words using Regular Expression, like this.
["south","north","west"]
Below is the expression I tried.
/\*.*/gm.exec(text)
Unfortunately it returns this instead.
["*south"]
Apparently, RegExp recognizes there is a line break such that it only returns the first item, yet it doesn't pick up the 2nd and 3rd lines.
/\*.*/gm.exec('*south \n *north')
This also has the same result.

You need to tell the regex engine to match at the beginning of a line with ^, and capture the part after the first * with a pair of unescaped parentheses. Then, you can use RegExp#exec() in a loop, and get the value you need in Group 1. The ^\s*\*\s*(.*) regex matches:
^ - start of a line (due to /m multiline modifier)
\s* - zero or more whitespace symbols
\* - a literal asterisk
\s* - again, optional whitespace(s)
(.*) - zero or more characters other than a newline.
var re = /^\s*\*\s*(.*)/gm;
var str = '*south\n *north\n* west ';
var res = [], m;
while ((m = re.exec(str)) !== null) {
res.push(m[1]);
}
document.write("<pre>" + JSON.stringify(res, 0, 4) + "</pre>");
Another solution:
Split with newline (a regex is possible here if there can be \r or \n) and then get rid of the initial *:
var str = '*south\n*north\n*west ';
var res = [];
str.split(/[\r\n]+/).forEach(function(e) {
res.push(e.replace(/^\s*\*\s*/, ''));
});
document.write("<pre>" + JSON.stringify(res, 0, 4) + "</pre>");
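A more compact sketch of the first approach, assuming an environment with String.prototype.matchAll (ES2020). The trim() call is an addition that strips the trailing space on the last line; the variable names are illustrative.

```javascript
const text = "*south\n *north\n* west ";

// One match per bulleted line; group 1 is the text after the asterisk.
const words = Array.from(
  text.matchAll(/^\s*\*\s*(.*)$/gm),
  (m) => m[1].trim()
);
console.log(words);
// [ 'south', 'north', 'west' ]
```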

@VKS's solution works, but if using a regex is not mandatory, then try this fiddle
<textarea id="textA1"></textarea>
$( "#textA1" ).blur( function(){
var value = $( this ).val();
console.log( value.split( "\n" ) );
} )

You will have to run a loop.
var re = /\*(.*)/gm;
var str = '*south\n*north\n*west ';
var m;
while ((m = re.exec(str)) !== null) {
// View your result using the m-variable.
// eg m[0] etc.
}
See demo.
https://regex101.com/r/iJ7bT6/11
Or you can split by (?=\*). See demo.
https://regex101.com/r/iJ7bT6/12

How to split this string into an array

I have a string like the one below, and it is not the same every time.
var str = `#a
b
c
_
ele1
ele2
#d
e
f
_
ele3
`;
From the above string I want to retrieve an array like the one below:
arr = [ "#a
b
c
_",
"ele1",
"ele2",
"#d
e
f
_",
"ele3"
]
The criterion is: everything between # and _ as a single item; every line outside those delimiters is a separate item.
How can I do that? Any ideas? Please use this fiddle.

Again, given the criteria in the comment this works
var arr = str.match(/(?:#([^_]*)_|([^#_\s])+)/g)
http://jsfiddle.net/fhDPj/1/
And to explain the regex
#([^_]*)_ - find anything that isn't _ that falls between a # and a _ ( * means even empty strings are captured)
([^#_\s])+ - find anything that isn't #, _ or whitespace ( + means only non-empty strings are captured)
(?: | ) - find either of the above (but non-capturing as the above expressions already capture the strings needed)
/ /g - global match, to return all matches in the string rather than just the first one
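Here is a quick sketch of that pattern run against the string from the question, written as a template literal so the line breaks are valid JavaScript. The variable names are illustrative.

```javascript
const str = `#a
b
c
_
ele1
ele2
#d
e
f
_
ele3
`;

// Either a "#..._" block (which may span lines) or a run of characters
// that are not "#", "_" or whitespace.
const items = str.match(/(?:#([^_]*)_|([^#_\s])+)/g);
console.log(JSON.stringify(items));
// ["#a\nb\nc\n_","ele1","ele2","#d\ne\nf\n_","ele3"]
```

Note that match() with the g flag returns only the full matches, so the inner capture groups play no role in the result.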

Is the whitespace intentional?
Try this instead:
<div id ="a">#abc_ele1ele2#def_ele34</div>
script:
var str = $('#a').text();
var result = str.match(/(#[^_]*_)|([^#][^\d]*\d{1,})/g)
console.log(result)
EXPLANATION:
string.match() - returns an array of matches
#[^_]*_ - finds anything that begins with # and ends with _ and with anything but _ in between
[^#][^\d]*\d{1,} - finds anything that does NOT start with #, followed by 0 or more non-numeric characters, and ends with at least one digit
DEMO: check your console
This will still run with all that whitespace, but you MUST be clear about your split rules.

Given the criteria in my comment under the question:
var str = "#a\nb\nc\n_\nfoo\nbar\n#d\ne\nf\n_";
var re = /((?:#[^_]*_)|(?:^.*$))/mg;
var result = str.match(re);
console.log(result);
// [ '#a\nb\nc\n_', 'foo', 'bar', '#d\ne\nf\n_' ]
Regexp explanation: a match is either everything from # to _ - (?:#[^_]*_) - or everything on a single line - (?:^.*$).
EDIT: due to whitespace... a bit different strategy:
var str = $('#a').text();
var re = /^\s*((?:#[^_]*_)|(?:.*?$))/mg;
var result = [], match;
while ((match = re.exec(str))) {
result.push(match[1]);
}
console.log(result);

var x = str.match(/(#?[a-z]+[0-9_]+?)/g);

try split:
arr = str.split("_");
