JavaScript regex, determining which group was matched

I have the following regex in JavaScript for matching strings similar to book[n], book[1,2,3,4,5,...,n], book[author="Kristian"] and book[id=n] (n is an arbitrary number):
var opRegex = /\[[0-9]+\]|\[[0-9]+,.*\]|\[[a-zA-Z]+="*.+"*\]/gi;
I can use this in the following way:
// If there is no match in any of the groups hasOp will be null
hasOp = opRegex.exec('books[0]');
/*
Result: ["[0]", index: 5, input: "books[0]"]
*/
As shown above I not only get the value but also the [ and ]. I can avoid this by using groups. So I changed the regex to:
var opRegex = /\[([0-9]+)\]|\[([0-9]+,.*)\]|\[([a-zA-Z]+=".+")\]/gi;
Running the same as above the results will instead be:
["[0]", "0", undefined, undefined, index: 5, input: "books[0]"]
Above I get the groups at index 1, 2 and 3 of the array. In this example the match is in the first group, but if the second alternative matches, the value will be at index 2 of the array.
Can I change my first regex to get the value without the brackets or do I go with the grouped approach and a while loop to get the first defined value?
Anything else I'm missing? Is it greedy?
Let me know if you need more information and I'll be happy to provide it.

I have a few suggestions. First, especially since you are looking for literal brackets, avoid the regex brackets when you can (replace [0-9] with \d, for example). Also, you were allowing multiple quotes with the *, so I changed it to "?. But most importantly, I moved the match for the brackets outside the alternation, since they should be in every alternate match. That way, you have the same group no matter which part matches.
/\[(\d+(,\d+)*|[a-zA-Z]+="?[^\]]+"?)\]/gi
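A minimal sketch of how the consolidated pattern might be used. The helper name extractOp is hypothetical, and dropping the g flag is my own choice here (it avoids lastIndex state carrying over between calls to exec on different inputs); the pattern itself is the one from the answer.

```javascript
// Same pattern as above, without the g flag so exec() always starts at index 0.
const opRegex = /\[(\d+(,\d+)*|[a-zA-Z]+="?[^\]]+"?)\]/i;

// Hypothetical helper: returns the text between the brackets, or null if nothing matches.
function extractOp(input) {
  const m = opRegex.exec(input);
  // Group 1 holds the bracket contents no matter which alternative matched.
  return m ? m[1] : null;
}

console.log(extractOp('books[0]'));                // "0"
console.log(extractOp('book[1,2,3]'));             // "1,2,3"
console.log(extractOp('book[author="Kristian"]')); // 'author="Kristian"'
console.log(extractOp('book'));                    // null
```

Because the brackets sit outside the alternation, there is no need to loop over the result array looking for the first defined group.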

How can I include the delimiter with regex String.split()?

I need to parse the tokens from a GS1 UDI format string:
"(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"
I would like to split that string with a regex on the "(nnn)" and have the delimiter included with the split values, like this:
[ "(20)987111", "(240)A", "(10)ABC123", "(17)2022-04-01", "(21)888888888888888" ]
Below is a JSFiddle with examples, but in case you want to see it right here:
// This includes the delimiter match in the results, but I want the delimiter included WITH the value
// after it, e.g.: ["(20)987111", ...]
str = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888";
console.log(str.split(/(\(\d{2,}\))/).filter(Boolean))
// Result: ["(20)", "987111", "(240)", "A", "(10)", "ABC123", "(17)", "2022-04-01", "(21)", "888888888888888"]
// If I include a pattern that should (I think) match the content following the delimiter I will
// only get a single result that is the full string:
str = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888";
console.log(str.split(/(\(\d{2,}\)\W+)/).filter(Boolean))
// Result: ["(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"]
// I think this is because I'm effectively matching the entire string, hence a single result.
// So now I'll try to match only up to the start of the next "(":
str = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888";
console.log(str.split(/(\(\d{2,}\)(^\())/).filter(Boolean))
// Result: ["(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"]
I've found and read this question, however the examples there are matching literals and I'm using character classes and getting different results.
I'm failing to create a regex pattern that will provide what I'm after. Here's a JSFiddle of some of the things I've tried: https://jsfiddle.net/6bogpqLy/
I can't guarantee the order of the "application identifiers" in the input string and as such, match with named captures isn't an attractive option.
You can split at the positions where a parenthesised element follows, by using a zero-length lookahead assertion:
const text = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"
const parts = text.split(/(?=\(\d+\))/)
console.log(parts)
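Since the order of the application identifiers is not guaranteed, the split parts can be turned into a lookup keyed by identifier. This is only a sketch: the function name parseUdi is hypothetical, and it assumes that a value never itself contains a "(digits)" sequence.

```javascript
const udi = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888";

// Hypothetical helper: maps each application identifier to its value.
function parseUdi(input) {
  const result = {};
  // Split before every "(digits)" delimiter; the lookahead consumes nothing.
  for (const part of input.split(/(?=\(\d+\))/)) {
    // Separate the identifier inside the parentheses from the value after it.
    const m = /^\((\d+)\)(.*)$/.exec(part);
    if (m) result[m[1]] = m[2];
  }
  return result;
}

const fields = parseUdi(udi);
console.log(fields["10"]); // "ABC123"
console.log(fields["17"]); // "2022-04-01"
```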
Instead of split, use match to create the array. The pattern matches 1) digits in parentheses, followed by 2) one or more characters that may be digits, uppercase letters, or hyphens; the outer group wraps that whole token.
(PS. I often find a site like Regex101 really helps when it comes to testing out expressions outside of a development environment.)
const re = /(\(\d+\)[\d\-A-Z]+)/g;
const str = '(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888';
console.log(str.match(re));

Regexp group not excluding dots

Let's say I have the following string: div.classOneA.classOneB#idOne
Trying to write a regexp which extracts the classes (classOneA, classOneB) from it. I was able to do this but with Lookbehind assertion only.
It looks like this:
'div.classOneA.classOneB#idOne'.match(/(?<=\.)([^.#]+)/g)
> (2) ["classOneA", "classOneB"]
Now I would like to achieve this without the lookbehind approach, and I do not really understand why my solution is not working.
'div.classOneA.classOneB#idOne'.match(/\.([^.#]+)/g)
> (2) [".classOneA", ".classOneB"]
I thought the grouping would solve my problem, but every matched item still contains the dot.
There isn't a good way in Javascript to both match multiple times (/g option) and pick up capture groups (in the parens). Try this:
var input = "div.classOneA.classOneB#idOne";
var regex = /\.([^.#]+)/g;
var matches, output = [];
while (matches = regex.exec(input)) {
output.push(matches[1]);
}
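In engines that implement ES2020, String.prototype.matchAll offers a more compact alternative to the exec loop, because it yields a full match array (including capture groups) for every match. A short sketch, assuming such an engine is available:

```javascript
const selector = "div.classOneA.classOneB#idOne";

// matchAll requires the g flag and returns an iterator of match arrays.
// Array.from's second argument maps each match array to its first capture group.
const classNames = Array.from(selector.matchAll(/\.([^.#]+)/g), m => m[1]);

console.log(classNames); // [ 'classOneA', 'classOneB' ]
```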
This is because with the g modifier you get all matching substrings but not their capture groups (that is, as if the (...) pairs worked just like (?:...) ones).
Compare for yourself. Without the g modifier:
> 'div.classOneA.classOneB#idOne'.match(/\.([^.#]+)/)
[ '.classOneA',
'classOneA',
index: 3,
input: 'div.classOneA.classOneB#idOne',
groups: undefined ]
With g modifier:
> 'div.classOneA.classOneB#idOne'.match(/\.([^.#]+)/g)
[ '.classOneA', '.classOneB' ]
In other words: you obtain all matches, but only the whole match (item 0) for each.
There are several solutions:
1. Use lookbehind assertions, as you pointed out yourself.
2. Fix each result afterwards by adding .map(x=>x.replace(/^\./, ""))
3. Or, if your input structure won't be much more complicated than the example you provide, simply use a cheaper approach:
> 'div.classOneA.classOneB#idOne'.replace(/#.*/, "").split(".").slice(1)
[ 'classOneA', 'classOneB' ]
4. Use .replace() + callback instead of .match() in order to be able to access the capture groups of every match:
const str = 'div.classOneA.classOneB#idOne';
const matches = [];
str.replace(/\.([^.#]+)/g, (...args)=>matches.push(args[1]))
console.log(matches); // [ 'classOneA', 'classOneB' ]
I would recommend the third one (if there aren't other possible inputs that could eventually break it) because it is much more efficient (actual regular expressions are used only once to trim the '#idOne' part).
If you want to keep your regex, you can simply map over the results and replace the . with an empty string:
let op = 'div.classOneA.classOneB#idOne'.match(/\.([^.#]+)/g)
.map(e=> e.replace(/\./g,''))
console.log(op)
If you know you are searching for a text containing class, then you can use something like
'div.classOneA.classOneB#idOne'.match(/class[^.#]+/g)
If the only thing you know is that the text is preceded by a dot, then you must use lookbehind.
This regex will work without lookbehind assertion:
'div.classOneA.classOneB#idOne'.match(/\.[^\.#]+/g).map(item => item.substring(1));
Note that lookbehind assertions were only added to JavaScript recently (ES2018), so older engines do not support them.
I'm not an expert on using regex, particularly in JavaScript, but after some research on MDN I've figured out why your attempt wasn't working and how to fix it.
The problem is that using .match with a regexp with the /g flag will ignore capturing groups. So instead you have to use the .exec method on the regexp object, using a loop to execute it multiple times to get all the results.
So the following code is what works, and can be adapted for similar cases. (Note the grp[1] - this is because the first element of the array returned by .exec is the entire match, the groups are the subsequent elements.)
var regExp = /\.([^.#]+)/g
var result = [];
var grp;
while ((grp = regExp.exec('div.classOneA.classOneB#idOne')) !== null) {
result.push(grp[1]);
}
console.log(result)

Javascript regex match returning a string with comma at the end

Just as the title says: I'm trying to parse a string, for example
2x + 3y
and I'm trying to get only the coefficients (i.e. 2 and 3).
I first tokenized it with the space character as the delimiter, giving me "2x" "+" "3y".
Then I ran each token through this statement to get only the coefficients:
var number = eqTokens[i].match(/(\-)?\d+/);
I tried printing the output, but it gave me "2,"
Why is it printing like this and how do I fix it? I tried using:
number = number.replace(/[,]/, "");
but this just gives me an error saying that number.replace is not a function.
What's wrong with this?
> "2x + 3y".match(/-?\d+(?=[A-Za-z]+)/g)
[ '2', '3' ]
The above regex matches a number only if it is followed by one or more letters.
Without the g flag, match returns an array containing the whole match followed by one entry per capture group. Since you put the optional negative sign in parentheses, it is a capture group of its own. That group is optional, so when there is no sign it contributes an undefined entry in addition to your actual match.
Input 2x -> Your output: ["2", undefined], which prints as "2,"
Input -2x -> Your output: ["-2", "-"], which prints as "-2,-"
Remove the parentheses around the negative.
This is just for the sake of explaining why your case is breaking but personally I'd use Avinash's answer.
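To make the explanation concrete, the sketch below shows where the trailing comma comes from and one way to collect the coefficients as numbers. The function name coefficients is hypothetical, and it assumes tokens are separated by single spaces as in the question.

```javascript
// With the capture group: the result is [whole match, group 1], and group 1 is undefined.
const withGroup = "2x".match(/(\-)?\d+/);
const printed = String(withGroup); // arrays print as comma-joined items: "2,"
console.log(printed);

// Hypothetical helper: tokenize on spaces and keep the numeric prefix of each term.
function coefficients(equation) {
  return equation
    .split(" ")
    .map(token => token.match(/-?\d+/)) // no capture group, so item 0 is all we need
    .filter(m => m !== null)            // drops tokens such as "+"
    .map(m => Number(m[0]));
}

console.log(coefficients("2x + 3y"));  // [ 2, 3 ]
console.log(coefficients("-2x + 3y")); // [ -2, 3 ]
```

The original number.replace call failed because match returns an array (or null), not a string.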

Javascript skip double pipes in a string

I have the following string:
var test = "test|2014-07-22 12:13:47||ASD|\|nameOfSomething123\||anothersmt";
var s = test.split('|');
console.log(s);
//outputs
[ 'test',
'2014-07-22 12:13:47',
'',
'ASD',
'',
'nameOfSomething123',
'',
'anothersmt' ]
Because \|nameOfSomething123\| also contains pipes, split('|') does not give a good result: I need to get rid of the extra empty entries around nameOfSomething123.
I would like to split the string while skipping \|nameOfSomething123\|
Does anyone know how to solve it ?
Thank you.
First, I'm going to assume that your test string actually contains \| sequences. If you were to write the string literal as you've shown, \| would be interpreted as an escape sequence for |. For this script to work as you've shown, you'd need to write test like this:
var test = "test|2014-07-22 12:13:47||ASD|\\|nameOfSomething123\\||anothersmt";
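A quick way to see the difference between the two literals (the variable names here are only for illustration):

```javascript
// In a string literal, \| is just |, so the backslash never reaches the string.
const singleBackslash = "a\|b";
// Writing \\ produces one literal backslash in the resulting string.
const doubleBackslash = "a\\|b";

console.log(singleBackslash === "a|b"); // true
console.log(singleBackslash.length);    // 3
console.log(doubleBackslash.length);    // 4
console.log(doubleBackslash.split("|")); // [ 'a\\', 'b' ]
```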
You can accomplish this pretty easily using match instead of split:
test.match(/(\\\||[^|])+/g);
// outputs
[ "test",
"2014-07-22 12:13:47",
"ASD",
"\|nameOfSomething123\|",
"anothersmt" ]
This pattern matches one or more sequences of either \| or any character other than |. Note that the \ and the | need to be escaped to refer to literal \ and | characters. Given your sample input, this should accomplish the goal. (Of course, if the \ can be escaped too, that complicates it a bit.)
If you need to capture empty strings between two pipes like ||, then you can use split around the matched values and filter out the separators. For example:
test.split(/((?:\\\||[^|])*)/g).filter(function(x, i) { return i % 2 });
// outputs
[ "test",
"2014-07-22 12:13:47",
"",
"ASD",
"\|nameOfSomething123\|",
"anothersmt" ]
This works because split will return any captured substrings as a separate entry in the result array. Then filter just picks every other element from the result. Note that filter requires ECMAScript 5.1 or later, so it may not work in older browsers. If this is a problem, see the polyfill option described in the linked documentation.
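The split behaviour this relies on can be seen in a smaller, simplified example. Here the capture group matches the separator rather than the values, so the values sit at the even indices instead of the odd ones:

```javascript
// A capture group in the split pattern makes split include the captured text.
const withSeparators = "a,b,c".split(/(,)/);
console.log(withSeparators); // [ 'a', ',', 'b', ',', 'c' ]

// Picking every other element keeps the values and drops the captured separators.
const valuesOnly = withSeparators.filter((x, i) => i % 2 === 0);
console.log(valuesOnly); // [ 'a', 'b', 'c' ]
```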
I don't see why this is a hard problem. If your separator is always |, then the only case when you get an empty string from .split is going to be when you have a double | (or triple or quadruple). As long as the double pipes have no semantic purpose for you, all you need to do is get rid of the empty strings:
function check_for_empty_string(element){
if (element.length != 0) return element;
}
s = s.filter(check_for_empty_string);
Now s should only contain non-empty strings and you're done. Array.filter is a javascript built-in that takes a callback that checks an element. Whatever you return from the callback passes through the filter and into the new array. Here I've used the old array as the target, for brevity, but .filter returns a new array so you can keep the old one if you want.

capture with regex in javascript

I have a string like "ListUI_col_order[01234567][5]". I'd like to capture the two numeric sequences from the string. The last part between the square brackets may contain up to 2 digits, while the first numeric sequence always contains 8 digits (and the numbers change dynamically, of course). I'm doing this in JavaScript, and the code for the first part is simple: I get the only 8-digit sequence from the string:
var str = $(this).attr('id');
var unique = str.match(/([0-9]){8}/g);
Getting the second part is a bit complicated to me. I cannot simply use:
var column = str.match(/[0-9]{1,2}/g)
Because this will match '01', '23', '45', '67', '5' in our example. I am able to get the information I need as column[4], because the first part always contains 8 digits, but I'd like a nicer way to retrieve the last number.
So I define the context and tell the regex that I'm looking for a 1 or 2 digit number which has square brackets directly before and after it:
var column = str.match(/\[[0-9]{1,2}\]/g)
// this will return [5]. which is nearly what I want
So to get only the numeric data, I use parentheses to capture only the numbers, like:
var column = str.match(/\[([0-9]){1,2}\]/g)
// this will result in:
// column[0] = '[5]'
// column[1] = [5]
So my question is how to match the '[5]' but only capture the '5'? I have only the [0-9] between the parenthesis, but this will still capture the square brackets as well
You can get both numbers in one go :
var m = str.match(/\[(\d{8})\]\[(\d{1,2})\]$/)
For your example, this makes ["[01234567][5]", "01234567", "5"]
To get both matches as numbers, you can then do
if (m) return m.slice(1).map(Number)
which builds [1234567, 5]
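Wrapped into a function, the same idea might look like the sketch below. The name parseListId and the shape of the returned object are my own assumptions; keeping the 8-digit part as a string is a deliberate choice so that a leading zero is not lost.

```javascript
// Hypothetical helper around the pattern above.
function parseListId(id) {
  const m = id.match(/\[(\d{8})\]\[(\d{1,2})\]$/);
  if (!m) return null;
  // m[1] and m[2] are the capture groups; m[0] is the whole "[...][...]" match.
  return { unique: m[1], column: Number(m[2]) };
}

console.log(parseListId("ListUI_col_order[01234567][5]"));
// { unique: '01234567', column: 5 }
console.log(parseListId("ListUI_col_order")); // null
```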
When this was written, JavaScript did not support the lookbehind needed to do this. In other languages such as PHP it is as simple as /(?<=\[)\d{1,2}(?=\])/, and engines that implement ES2018 accept the same pattern. Without lookbehind, I am not aware of any way to do this other than using a capturing subpattern, as you are here, and reading that index from the result array.
Side-note, it's usually better to put the quantifier inside the capturing group - otherwise you're repeating the group itself, not its contents!
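The difference is easiest to see with a two-digit column. In this small illustration (the input string is made up for the example), a repeated group only remembers its last repetition:

```javascript
const sample = "ListUI_col_order[01234567][42]";

// Quantifier outside the group: the group is repeated and keeps only the last digit.
const outside = sample.match(/\[([0-9]){1,2}\]/);
console.log(outside[1]); // "2"

// Quantifier inside the group: the group captures both digits.
const inside = sample.match(/\[([0-9]{1,2})\]/);
console.log(inside[1]); // "42"
```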
