Regex to extract search terms is not working as expected - javascript

I have the test string
ti: harry Potter OR kw: magic AND sprint: title OR ti: HARRY
and want the output as
["ti: harry Potter OR kw:", "kw: magic AND sprint:", "sprint: title OR ti:", "ti: HARRY"]
but the output I am getting is
["ti: harry Potter OR kw:", "kw: magic AND sprint:", "nt: title OR ti:", "ti: HARRY"]
It is taking only 2 characters before the colon
The regex I am using is
const match = /[a-z0-9]{2}:.*?($|[a-z0-9]{2}:)/g;
and I am extracting it and putting it in an array
I tried replacing it with /[a-z0-9]+:.*?($|[a-z0-9]+:)/g; but when I increase index and add the strings to parsed, it does it weirdly (This is included in code as well)
I tried changing the {2} to n and that is also not working as expected.
const parsed = [];
const match = /[a-z0-9]{2}:.*?($|[a-z0-9]{2}:)/g;
const message = "ti: harry Potter OR kw: magic AND sprint: title OR ti: HARRY";
let next = match.exec(message);
while (next) {
parsed.push(next[0]);
match.lastIndex = next.index + 1;
next = match.exec(message);
console.log("next again", next);
}
console.log("parsed", parsed);
https://codesandbox.io/s/regex-forked-6op514?file=/src/index.js

For the desired matches, you might use a pattern that also matches AND or OR before the next key, and read the value from capture group 1, which is denoted by m[1] in the example code.
\b(?=([a-z0-9]+:.*?(?: (?:AND|OR) [a-z0-9]+:|$)))
In parts, the pattern matches:
\b A word boundary to prevent a partial match
(?= Positive lookahead to assert what is on the right is
( Capture group 1
[a-z0-9]+:
.*? Match any char except a newline, as few as possible
(?: Non capture group
(?:AND|OR) [a-z0-9]+: Match either AND or OR followed by a space and 1+ times a char a-z0-9 and :
| Or
$ Assert the end of the string
) Close non capture group
) Close group 1
) Close the lookahead
const regex = /\b(?=([a-z0-9]+:.*?(?: (?:AND|OR) [a-z0-9]+:|$)))/gm;
const str = `ti: harry Potter OR kw: magic AND sprint: title OR ti: HARRY`;
const result = Array.from(str.matchAll(regex), m => m[1]);
console.log(result);
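If you would rather keep the exec loop from the question, one option is to resume the search at the start of the trailing key instead of at next.index + 1. With [a-z0-9]+ in place of {2}, restarting one character later lets the pattern match again inside the same key (for example at "i:"), which likely explains the odd results. A minimal sketch, where the function name extractTerms is only illustrative:

```javascript
// Sketch: same pattern idea as in the question, but with "+" instead of "{2}".
// extractTerms is a hypothetical helper name, not part of the original code.
function extractTerms(message) {
  const pattern = /[a-z0-9]+:.*?($|[a-z0-9]+:)/g;
  const parsed = [];
  let next = pattern.exec(message);
  while (next) {
    parsed.push(next[0]);
    const trailingKey = next[1];
    // An empty group 1 means the "$" branch matched: nothing left to scan.
    if (!trailingKey) break;
    // Resume exactly where the trailing key starts, so that key opens the next match.
    pattern.lastIndex = next.index + next[0].length - trailingKey.length;
    next = pattern.exec(message);
  }
  return parsed;
}

console.log(
  extractTerms("ti: harry Potter OR kw: magic AND sprint: title OR ti: HARRY")
);
// ["ti: harry Potter OR kw:", "kw: magic AND sprint:", "sprint: title OR ti:", "ti: HARRY"]
```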

Related

Can someone help me with this javascript regex to turn a string of list items into an array with the item texts only (no list item # or new line)

I have 2 kinds of strings coming back from an API. They look as follows:
let string1 = "\n1. foo\n2. bar\n3. foobar"
let string2 = "\n\n1. foo.\n\n2. bar.\n\n3. foobar."
To be clear, string 1 will always have 3 items coming back, string 2 has an unknown number of items coming back. But very similar patterns.
What I want to do is pull out the text only from each item into an array.
So from string1 I want ["foo", "bar", "foobar"]
From string2 I want ["foo.", "bar.", "foobar."]
I'm awful at regex but I somehow stumbled myself into an expression that accomplishes this for both string types, however, it uses regex's lookbehind which I'm trying to avoid as it isn't supported in all browsers:
let regex = /(?<=\. )(.*[a-zA-Z])/g;
let resultArray = str.match(regex);
Would someone be able to help me refactor this regex into something that doesn't use lookbehind?
Some notes about the pattern (?<=\. )(.*[a-zA-Z]) that you tried:
The first part asserts a dot and space to the left, but does not take any newlines into account so it could possibly also match on other positions
It does not take the digits into account of matching a list item
The second part of your pattern matches till the last occurrence of a character A-Za-z which does not match ending dot in the examples in string2
All your example strings start with a newline, a number, a dot and 1 or more spaces.
You can get the matches without using a lookbehind, and make the pattern more specific by starting the match with the list item format.
Then capture the rest of the line after it in a capture group.
\n\d+\.[^\S\r\n]+(.+)
Explanation
\n Match a newline
\d+\. Match 1+ digits and a dot
[^\S\r\n]+ Match 1 or more spaces without newlines
(.+) Capture group 1, match 1 or more chars
const regex = /\n\d+\.[^\S\r\n]+(.+)/g;
[
"\n1. foo\n2. bar\n3. foobar",
"\n\n1. foo.\n\n2. bar.\n\n3. foobar."
].forEach(s => {
console.log(Array.from(s.matchAll(regex), m => m[1]))
});
If the string should also match without a leading newline, you can use an anchor ^ to assert the start of the string and use the multiline flag /m
const regex = /^\d+\.[^\S\r\n]+(.+)/gm;
[
"1. foo\n2. bar\n3. foobar",
"1. foo.\n\n2. bar.\n\n3. foobar."
].forEach(s => {
console.log(Array.from(s.matchAll(regex), m => m[1]))
});
OP's code, which uses String.match, is actually better than the proposed solutions. It only needed a minor tweak to make it work, which is what the question asked for:
string1.match(/([A-Za-z].+)/g)
// TEST
[
"\n1. foo\n2. bar\n3. foobar",
"\n\n1. foo.\n\n2. bar.\n\n3. foobar."
].forEach(p => {
console.log( p.match(/([A-Za-z].+)/g) )
});
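A note on character ranges: a range written as A-z spans every code point between "A" and "z", which also includes [, \, ], ^, _ and the backtick, so A-Za-z is the safer spelling when only letters are meant. A short check (the sample input is made up for illustration):

```javascript
const wideRange = /[A-z]/;      // also covers [ \ ] ^ _ and the backtick
const lettersOnly = /[A-Za-z]/; // ASCII letters only

const sample = "_"; // hypothetical input, not taken from the question

const wideMatches = wideRange.test(sample);    // true: "_" sits between "Z" and "a"
const lettersMatch = lettersOnly.test(sample); // false
console.log(wideMatches, lettersMatch);
```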
If I understood you correctly, then perhaps this solution can help you:
const string1 = "\n1. foo\n2. bar\n3. foobar";
const string2 = "\n\n1. foo.\n\n2. bar.\n\n3. foobar.";
const parse = (str) => {
return str.split(/\n[0-9.\s]*/g).filter((item) => item !== "");
}
console.log('Example 1:', parse(string1));
console.log('Example 2:', parse(string2));
EDIT: see @Oleg-Barabanov's solution, it's technically a bit quicker.
string.replace(/\n[0-9.]*/g, "").split(" ").slice(1)
\n for the new line
0-9 for digits (could also use \d)
. for the dot after the number
g to replace all
.split(" ") to chop it up wherever there's a space (\s)
Demo:
let string1 = "\n1. foo\n2. bar\n3. foobar"
let string2 = "\n\n1. foo.\n\n2. bar.\n\n3. foobar."
const parse = (str) => str.replace(/\n[0-9.]*/g, "").split(" ").slice(1)
console.log(parse(string1))
console.log(parse(string2))
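Note that splitting on spaces assumes every item is a single word. If items may contain spaces, a line-based variant could look like the sketch below; parseItems and the multi-word sample are assumptions for illustration, not part of the question.

```javascript
// Sketch: split into lines, strip the "N. " prefix, drop empty lines.
// parseItems is a hypothetical name.
const parseItems = (str) =>
  str
    .split(/\r?\n/)
    .map((line) => line.replace(/^\s*\d+\.\s*/, "").trim())
    .filter((line) => line !== "");

console.log(parseItems("\n1. foo\n2. bar\n3. foobar"));          // ["foo", "bar", "foobar"]
console.log(parseItems("\n\n1. foo.\n\n2. bar.\n\n3. foobar.")); // ["foo.", "bar.", "foobar."]
console.log(parseItems("\n1. foo bar\n2. baz"));                 // ["foo bar", "baz"]
```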

Regex match all punctuations except 'D.C.'

I'm trying to write a regex that finds all punctuation marks [.!?] in order to capitalize the next word, however if the period is part of the string 'D.C.' it should be ignored, so far I have the first part working, but not sure about how to ignore 'D.C.'
const punctuationCaps = /(^|[.!?]\s+)([a-z])/g;
You can match the D.C. part and use an alternation using the 2 capturing groups that you already have.
In the replacement check for one of the groups. If it is present, concatenate them making group 2 toUpperCase(), else return the match keeping D.C. in the string.
const regex = /D\.C\.|(^|[.!?]\s+)([a-z])/g;
let s = "this it D.C. test. and? another test! it is.";
s = s.replace(regex, (m, g1, g2) => g2 ? g1 + g2.toUpperCase() : m);
console.log(s);
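The same alternation idea extends to more than one protected abbreviation. The sketch below builds the pattern from a list; capitalizeSentences, escapeRegExp and the "U.S." sample are hypothetical and only illustrate the approach.

```javascript
// Escape characters that have a special meaning inside a regular expression.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Sketch: abbreviations are matched first and returned untouched;
// otherwise group 2 (the first letter of a sentence) is upper-cased.
function capitalizeSentences(text, abbreviations = ["D.C."]) {
  const alternatives = [
    ...abbreviations.map(escapeRegExp),
    "(^|[.!?]\\s+)([a-z])",
  ];
  const regex = new RegExp(alternatives.join("|"), "g");
  return text.replace(regex, (m, g1, g2) =>
    g2 ? g1 + g2.toUpperCase() : m
  );
}

console.log(capitalizeSentences("this it D.C. test. and? another test! it is."));
// This it D.C. test. And? Another test! It is.
console.log(capitalizeSentences("i live in the U.S. now. really.", ["D.C.", "U.S."]));
// I live in the U.S. now. Really.
```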
Use a negative lookbehind:
var str = 'is D.C. a capital? i don\'t know about X.Y. stuff.';
var result = str.replace(/(^|[.!?](?<![A-Z]\.[A-Z]\.)\s+)([a-z])/g, (m, c1, c2) => { return c1 + c2.toUpperCase(); });
console.log('in: '+str);
console.log('out: '+result);
Console output:
in: is D.C. a capital? i don't know about X.Y. stuff.
out: Is D.C. a capital? I don't know about X.Y. stuff.
Explanation:
(^|[.!?]) - expect start of string, or a punctuation char
(?<![A-Z]\.[A-Z]\.) - negative lookbehind: but not if the punctuation ends a sequence of upper char and dot, repeated twice
\s+ - expect one or more whitespace chars
all of the above is captured because of the parentheses
([a-z]) - expect a lower case char, in parenthesis for second capture group
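Since (?<!...) is not available in some older browsers, the same check can also be done inside the replace callback, which receives the match offset and the whole string. A minimal sketch under that assumption; capitalizeAfterPunctuation is a hypothetical name:

```javascript
// Sketch: no lookbehind in the pattern; the abbreviation test runs in the callback.
function capitalizeAfterPunctuation(str) {
  return str.replace(/(^|[.!?]\s+)([a-z])/g, (m, c1, c2, offset, whole) => {
    // Text up to and including the first character of the match
    // (the punctuation mark, when the match does not start the string).
    const before = whole.slice(0, offset + 1);
    // Leave the match alone if that text ends in something like "D.C." or "X.Y."
    return /[A-Z]\.[A-Z]\.$/.test(before) ? m : c1 + c2.toUpperCase();
  });
}

console.log(capitalizeAfterPunctuation("is D.C. a capital? i don't know about X.Y. stuff."));
// Is D.C. a capital? I don't know about X.Y. stuff.
```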

Javascript regex to always find a match from the end?

I have a string which looks like below
str = "hey there = pola"
Now I need to check whether there is an equals sign (=) and get the first word to the left of it. This is what I do:
str.match(/\w+(?= *=)/)[0]
So I get the desired result
But say I have a string like this
str = "hey there= pola so = boba"
Now I have two = signs. But the above regex will only give me the result for the first = sign.
Is there any regex that can always look for the first instance of = from the end of the string?
You can assert what is on the right is an equals sign followed by matching any char except an equals sign until the end of the string
\w+(?= *=[^=]*$)
In parts:
\w+
(?= Positive lookahead
*= Match 0+ occurrences of a space followed by =
[^=]* Match 0+ occurrences of any char except = ( Use [^=\r\n]* to not cross line breaks)
$ End of string
) Close lookahead
const regex = /\w+(?= *=[^=]*$)/;
const str = `hey there= pola so = boba`;
console.log(str.match(regex)[0]);
Without using a lookahead, you could use a capturing group:
^.*\b(\w+) *=[^=]*$
const regex = /^.*\b(\w+) *=[^=]*$/m;
const str = `hey there= pola so = boba`;
console.log(str.match(regex)[1]);
I'm not much of an expert on regex, but for your requirement I think split and pop should work:
let str = "hey there= pola so = boba";
let endres = str.split('=').pop(); // gives the last element in the split array
Hope this helps.
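Note that split('=').pop() returns the text after the last = (here " boba"). If the goal is the word to the left of the last =, as in the question, a variant based on lastIndexOf might look like this sketch; wordBeforeLastEquals is a hypothetical name.

```javascript
// Sketch: cut the string at the last "=" and take the word that ends the left part.
function wordBeforeLastEquals(str) {
  const lastEq = str.lastIndexOf("=");
  if (lastEq === -1) return null; // no "=" at all
  // Only spaces may sit between the word and the "=".
  const m = str.slice(0, lastEq).match(/(\w+) *$/);
  return m ? m[1] : null;
}

console.log(wordBeforeLastEquals("hey there= pola so = boba")); // "so"
console.log(wordBeforeLastEquals("hey there = pola"));          // "there"
```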

How to change given string to regex modified string using javascript

Example strings :
2222
333333
12345
111
123456789
12345678
Expected result:
2#222
333#333
12#345
111
123#456#789
12#345#678
i.e. a '#' should be inserted at the 4th, 8th, 12th, etc. position counting from the end of the string.
I believe this can be done using replace and some other methods in JavaScript.
To validate the output string I have made this regex:
^(\d{1,3})(\.\d{3})*?$
You can use this regular expression:
/(\d)(\d{3})$/
This matches a single digit \d in group 1, followed by the last three digits \d{3} in group 2, anchored to the end of the string. You can then reference both groups in the replacement string using $1 and $2.
See example below:
const transform = str => str.replace(/(\d)(\d{3})$/, '$1#$2');
console.log(transform("2222")); // 2#222
console.log(transform("333333")); // 333#333
console.log(transform("12345")); // 12#345
console.log(transform("111")); // 111
For larger strings of size N, you could use other methods such as .match() and reverse the string like so:
const reverse = str => Array.from(str).reverse().join('');
const transform = str => {
return reverse(reverse(str).match(/(\d{1,3})/g).join('#'));
}
console.log(transform("2222")); // 2#222
console.log(transform("333333")); // 333#333
console.log(transform("12345")); // 12#345
console.log(transform("111")); // 111
console.log(transform("123456789")); // 123#456#789
console.log(transform("12345678")); // 12#345#678
var test = [
'111',
'2222',
'333333',
'12345',
'123456789',
'1234567890123456'
];
console.log(test.map(function (a) {
return a.replace(/(?=(?:\B\d{3})+$)/g, '#');
}));
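You could match the positions where a separator belongs instead of the digits themselves. A positive lookahead asserts that a multiple of three digits follows up to the end of the string, and the replacement inserts a # at each such position.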
(?=(?:\B\d{3})+$)
(?= Positive lookahead, what is on the right is
(?:\B\d{3})+ Repeat 1+ times not a word boundary and 3 digits
$ Assert end of string
) Close lookahead
const regex = /^\d+$/;
["2222",
"333333",
"12345",
"111",
"123456789",
"12345678"
].forEach(s => console.log(
s.replace(/(?=(?:\B\d{3})+$)/g, "#")
));
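The lookahead approach can be wrapped in a small helper with a configurable separator and group size. This is only a sketch: groupFromRight and its parameters are hypothetical names, and it assumes the input consists of digits only and that size is a positive integer.

```javascript
// Sketch: insert `separator` before every block of `size` digits, counted from the end.
function groupFromRight(digits, separator = "#", size = 3) {
  // \B keeps the separator away from the very start of the string;
  // the lookahead requires a whole number of size-digit blocks up to the end.
  const regex = new RegExp(`\\B(?=(?:\\d{${size}})+$)`, "g");
  // A replacer function is used so that "$" in the separator is taken literally.
  return digits.replace(regex, () => separator);
}

console.log(groupFromRight("2222"));      // 2#222
console.log(groupFromRight("111"));       // 111
console.log(groupFromRight("123456789")); // 123#456#789
console.log(groupFromRight("12345678"));  // 12#345#678
```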

Lazy match front part in JavaScript regex

I have a string
Steve Jobs steve.jobs#example.com somethingElse
I hope to match steve.jobs#example.com somethingElse (a space in front)
This is my regular expression (JavaScript)
\s.+?#.+
But now it matches Jobs steve.jobs#example.com somethingElse (a space in front)
I know I can use ? to lazy match the following part, but how to lazy match front part?
A . can be any character, including the whitespaces.
Normally e-mails don't contain whitespaces.
(although it's actually allowed between 2 ")
So you could change the regex so that it looks for non-whitespaces \S before and after the #.
It can be greedy.
A whitespace followed by 1 or more non-whitespaces, a #, and 1 or more non-whitespaces, optionally followed by whitespace(s) and something else:
\s(\S+#\S+)(?:\s+(\S+))?
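Applied to the string from the question (with # exactly as it appears there), the pattern can be used like this; the variable names are only illustrative:

```javascript
const addressPattern = /\s(\S+#\S+)(?:\s+(\S+))?/;
const sampleInput = "Steve Jobs steve.jobs#example.com somethingElse";

const found = sampleInput.match(addressPattern);
console.log(found[0]); // " steve.jobs#example.com somethingElse" (with the leading space)
console.log(found[1]); // "steve.jobs#example.com"
console.log(found[2]); // "somethingElse"
```

Group 2 is undefined when nothing follows the address, so check it before use.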
You can also use trim, split and pop
var output = "Steve Jobs steve.jobs#example.com ".trim().split(" ").pop();
Regex solution
You can use trim and match
var output = "Steve Jobs steve.jobs#example.com ".trim().match( /[\w.]+#[\w.]+/g )
Regex - /[\w.]+#[\w.]+$/gi
Edit
var output = "Steve Jobs steve.jobs#example.com somethingelse ".trim().match( /[\w.]+#[\w.]+/g )
Demo
var regex = /[\w.]+#[\w.]+/g;
var input1 = "Steve Jobs steve.jobs#example.com ";
var input2 = "Steve Jobs steve.jobs#example.com somethingelse ";
var fn = (str) => str.trim().match(regex);
console.log( fn(input1) );
console.log( fn(input2) );
The commonly allowed characters in an email are:
0-9 | a-z | . - _
And it must have a # symbol too.
So our regex must start with the allowed characters:
[a-zA-Z0-9._-]
It must continue with the # symbol:
[a-zA-Z0-9._-]+#
Then it can end with .com or anything else that includes a dot:
[a-zA-Z0-9._-]+#[a-zA-Z0-9.]+
