Regex match all punctuations except 'D.C.'


I'm trying to write a regex that finds all punctuation marks [.!?] in order to capitalize the next word. However, if the period is part of the string 'D.C.', it should be ignored. So far I have the first part working, but I'm not sure how to ignore 'D.C.'
const punctuationCaps = /(^|[.!?]\s+)([a-z])/g;

You can match the D.C. part as the first branch of an alternation, followed by the 2 capturing groups that you already have.
In the replacement callback, check whether group 2 is present. If it is, return group 1 concatenated with group 2 in upper case via toUpperCase(); otherwise return the whole match, which keeps D.C. unchanged in the string.
const regex = /D\.C\.|(^|[.!?]\s+)([a-z])/g;
let s = "this it D.C. test. and? another test! it is.";
s = s.replace(regex, (m, g1, g2) => g2 ? g1 + g2.toUpperCase() : m);
console.log(s);
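If more abbreviations than D.C. need protecting, the same alternation idea can be extended to a list. The following is a minimal sketch, not part of the original answer: the names capitalizeSentences, exceptions and escapeRegExp are hypothetical and chosen only for illustration.

```javascript
// Hypothetical helper: escape regex metacharacters so "D.C." is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Hypothetical function: capitalize the first letter at the start of the string
// or after [.!?], skipping every abbreviation listed in `exceptions`.
function capitalizeSentences(text, exceptions = ["D.C."]) {
  // Exceptions come first so they are consumed before the punctuation branch.
  const parts = [...exceptions.map(escapeRegExp), "(^|[.!?]\\s+)([a-z])"];
  const rx = new RegExp(parts.join("|"), "g");
  // g2 is only set when the punctuation branch matched.
  return text.replace(rx, (m, g1, g2) => (g2 ? g1 + g2.toUpperCase() : m));
}

console.log(capitalizeSentences("this is D.C. test. and? another test! it is."));
// "This is D.C. test. And? Another test! It is."
```

Because an exception is consumed as a whole match, its trailing period is no longer available to the punctuation branch, so the word after it stays in lower case.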

Use a negative lookbehind:
var str = 'is D.C. a capital? i don\'t know about X.Y. stuff.';
var result = str.replace(/(^|[.!?](?<![A-Z]\.[A-Z]\.)\s+)([a-z])/g, (m, c1, c2) => { return c1 + c2.toUpperCase(); });
console.log('in: '+str);
console.log('out: '+result);
Console output:
in: is D.C. a capital? i don't know about X.Y. stuff.
out: Is D.C. a capital? I don't know about X.Y. stuff.
Explanation:
^|[.!?] - expect start of string, or a punctuation char
(?<![A-Z]\.[A-Z]\.) - negative lookbehind: but the text ending at that punctuation char must not be an upper case char and a dot, repeated twice
\s+ - expect one or more whitespace chars
all of the above is captured in the first group because of the parentheses
([a-z]) - expect a lower case char, in parenthesis for second capture group
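Lookbehind assertions were added to JavaScript in ES2018, and engines without them reject the pattern when it is compiled. The sketch below is one possible way to guard against that; the function names supportsLookbehind and capitalizeAfterPunctuation are hypothetical, and the fallback branch reuses the alternation technique rather than the lookbehind.

```javascript
// Returns true when the engine can compile a lookbehind assertion.
function supportsLookbehind() {
  try {
    new RegExp("(?<!a)b");
    return true;
  } catch (e) {
    return false;
  }
}

// Hypothetical wrapper that picks a pattern the current engine understands.
function capitalizeAfterPunctuation(str) {
  if (supportsLookbehind()) {
    // Built from a string so that older engines can still parse this file.
    const rx = new RegExp("(^|[.!?](?<![A-Z]\\.[A-Z]\\.)\\s+)([a-z])", "g");
    return str.replace(rx, (m, c1, c2) => c1 + c2.toUpperCase());
  }
  // Fallback: match the abbreviation itself first and return it unchanged.
  const rx = /[A-Z]\.[A-Z]\.|(^|[.!?]\s+)([a-z])/g;
  return str.replace(rx, (m, c1, c2) => (c2 ? c1 + c2.toUpperCase() : m));
}

console.log(capitalizeAfterPunctuation("is D.C. a capital? i don't know about X.Y. stuff."));
// "Is D.C. a capital? I don't know about X.Y. stuff."
```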

Related

Replace not numbers or words to underscore but leave dash and remove spaces around it

So I got this string
'word word - word word 24/03/21'
And I would like to convert it to
'word_word-word_word_24_03_21'
I have tried this
replace(/[^aA-zZ0-9]/g, '_')
But I get this instead
word_word___word_word_24_03_21

You can use 2 .replace() calls:
const s = 'word word - word word 24/03/21'
var r = s.replace(/\s*-\s*/g, '-').replace(/[^-\w]+/g, '_')
console.log(r)
//=> "word_word-word_word_24_03_21"
Explanation:
.replace(/\s*-\s*/g, '-'): Remove surrounding spaces of a hyphen
.replace(/[^-\w]+/g, '_'): Replace all character that are not a hyphen and not a word character with an underscore

You can use
console.log(
'word word - word word 24/03/21'.replace(/\s*(-)\s*|[^\w-]+/g, (x,y) => y || "_")
)
Here,
/\s*(-)\s*|[^\w-]+/g - matches and captures into Group 1 a - enclosed with zero or more whitespaces, and just matches any non-word char excluding -
(x,y) => y || "_") - replaces with Group 1 if it was matched, and if not, replacement is a _ char.

With a function for replace and an alternation in the pattern, you could also match:
(\s*-\s*) Match a - between optional whitespace chars
| Or
[^a-zA-Z0-9-]+ Match 1+ times any char that is not in the listed ranges and not a -
In the callback, check if group 1 exists. If it does, return only a -, else return _
Note that the notation [^aA-zZ0-9] is not the same as [^a-zA-Z0-9], see what [A-z] matches.
let s = "word word - word word 24/03/21";
s = s.replace(/(\s*-\s*)|[^a-zA-Z0-9-]+/g, (_, g1) => g1 ? "-" : "_");
console.log(s);
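To see what [A-z] matches beyond the letters, the ASCII range can simply be enumerated. This is a small illustrative sketch; the variable name extras is arbitrary.

```javascript
// Collect every ASCII character that [A-z] accepts but [A-Za-z] does not.
const extras = [];
for (let code = 0; code < 128; code++) {
  const ch = String.fromCharCode(code);
  if (/[A-z]/.test(ch) && !/[A-Za-z]/.test(ch)) {
    extras.push(ch);
  }
}

console.log(extras.join(" "));
// [ \ ] ^ _ `
```

These six characters sit between Z and a in ASCII, which is why a class such as [^aA-zZ0-9] silently leaves them untouched.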

You can use the + regex operator to replace 1 or more continuous matches at once.
let s = 'word word - word word 24/03/21';
let r = s
.replace(/[^a-zA-Z0-9]*-[^a-zA-Z0-9]*/g, '-')
.replace(/[^a-zA-Z0-9-]+/g, '_');
console.log(r);
// 'word_word-word_word_24_03_21'

Is it possible to have one regex that solves this task?

string = '1,23'
When a comma is present in the string, I want the regex to match the first digit (\d) after the comma, e.g. 2.
Sometimes the comma will not be there. When it's not present, I want the regex to match the first digit of the string, e.g. 1.
Also, we can't reverse the order of the string to solve this task.
I am genuinely stuck. The only idea I had was prepending this: [,|nothing]. I tried '' to mean nothing but that didn't work.

You can match an optional sequence of chars other than a comma and then a comma at the start of a string, and then match and capture the digit with
/^(?:[^,]*,)?(\d)/
Details
^ - start of string
(?:[^,]*,)? - an optional sequence of
[^,]* - zero or more chars other than a comma
, - a comma
(\d) - Capturing group 1: any digit
See the JavaScript demo:
const strs = ['123', '1,23'];
const rx = /^(?:[^,]*,)?(\d)/;
for (const s of strs) {
const result = (s.match(rx) || ['',''])[1];
// Or, const result = s.match(rx)?.[1] || "";
console.log(s, '=>', result);
}
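A few extra inputs show how the optional group behaves at the edges. The wrapper name digitAfterComma is hypothetical; the pattern is the one from the answer above.

```javascript
// Hypothetical wrapper: returns the captured digit, or "" when nothing matches.
function digitAfterComma(s) {
  const m = s.match(/^(?:[^,]*,)?(\d)/);
  return m ? m[1] : "";
}

for (const s of ["1,23", "123", "1,2,3", "1,a2", "a,b"]) {
  console.log(JSON.stringify(s), "=>", JSON.stringify(digitAfterComma(s)));
}
// "1,23" => "2"
// "123" => "1"
// "1,2,3" => "2"
// "1,a2" => "1"
// "a,b" => ""
```

Note the "1,a2" case: because the comma prefix is optional, the engine backtracks and falls back to the first character of the string when no digit directly follows the comma. If that is not wanted, the result needs an additional check.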

Javascript Regex: Capture between two asterisks with multiple asterisks in comma delimited string

I am trying to capture all characters between multiple instances of asterisks, which are comma delimited in a string. Here's an example of the string:
checkboxID0*,*checkboxID1*,&checkboxID2&,*checkboxID3*,!checkboxID4!,checkboxID5*
The caveat is that the phrase must start and end with an asterisk. I have been able to come close by using the following regex, however, it won't discard any matches when the captured string is missing the starting asterisk(*):
let str = "checkboxID0*,*checkboxID1*,&checkboxID2&,*checkboxID3*,!checkboxID4!,checkboxID5*"
const regex = /[^\,\*]+(?=\*)/gi;
var a = str.match(regex)
console.log(a) // answer should exclude checkboxID0 and checkboxID5
The answer returns the following, however, "checkboxID0 and checkboxID5" should be excluded as it doesn't start with an asterisk.
[
"checkboxID0",
"checkboxID1",
"checkboxID3",
"checkboxID5"
]
Thanks, in advance!

You need to use asterisks on both ends of the pattern and capture all 1 or more chars other than commas and asterisks in between:
/\*([^,*]+)\*/g
Pattern details
\* - an asterisk
([^,*]+) - Capturing group 1: one or more chars other than , and *
\* - an asterisk
JS demo:
var regex = /\*([^,*]+)\*/g;
var str = "checkboxID0*,*checkboxID1*,&checkboxID2&,*checkboxID3*,!checkboxID4!,checkboxID5*";
var m, res = [];
while (m = regex.exec(str)) {
res.push(m[1]);
}
console.log(res);
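In engines that support String.prototype.matchAll (ES2020), the exec loop can be written more compactly. This is only an alternative sketch of the same pattern; the variable name ids is arbitrary.

```javascript
const str = "checkboxID0*,*checkboxID1*,&checkboxID2&,*checkboxID3*,!checkboxID4!,checkboxID5*";

// matchAll requires the g flag; the mapping function keeps only capture group 1.
const ids = Array.from(str.matchAll(/\*([^,*]+)\*/g), (m) => m[1]);

console.log(ids);
// ["checkboxID1", "checkboxID3"]
```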

Non-capturing group matching whitespace boundaries in JavaScript regex

I have this function that finds whole words and should replace them. It identifies spaces but should not replace them, ie, not capture them.
function asd (sentence, word) {
str = sentence.replace(new RegExp('(?:^|\\s)' + word + '(?:$|\\s)'), "*****");
return str;
};
Then I have the following strings:
var sentence = "ich mag Äpfel";
var word = "Äpfel";
The result should be something like:
"ich mag *****"
and NOT:
"ich mag*****"
I'm getting the latter.
How can I make it so that it identifies the space but ignores it when replacing the word?
At first this may seem like a duplicate but I did not find an answer to this question, that's why I'm asking it.
Thank you

You should put back the matched whitespace by using a capturing group (rather than a non-capturing one) together with a backreference in the replacement pattern. You may also use a lookahead for the right-hand whitespace boundary, which is handy in case of consecutive matches:
function asd (sentence, word) {
return sentence.replace(new RegExp('(^|\\s)' + word + '(?=$|\\s)'), "$1*****");
};
var sentence = "ich mag Äpfel";
var word = "Äpfel";
console.log(asd(sentence, word));
Details
(^|\s) - Group 1 (later referred to with the help of a $1 placeholder in the replacement pattern): a capturing group that matches either start of string or a whitespace
Äpfel - a search word
(?=$|\s) - a positive lookahead that requires the end of string or whitespace immediately to the right of the current location.
NOTE: If the word can contain special regex metacharacters, escape them:
function asd (sentence, word) {
return sentence.replace(new RegExp('(^|\\s)' + word.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&') + '(?=$|\\s)'), "$1*****");
};
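The remark about consecutive matches becomes visible once the g flag is added. The sketch below compares the lookahead boundary with a boundary that consumes the trailing whitespace; the names escapeRegExp, maskWord and maskWordConsuming are hypothetical, and the g flag is an addition that the original function does not use.

```javascript
// Hypothetical helper: escape regex metacharacters in the search word.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Right boundary is a lookahead, so the whitespace stays available
// as the left boundary of the next match.
function maskWord(sentence, word) {
  const rx = new RegExp("(^|\\s)" + escapeRegExp(word) + "(?=$|\\s)", "g");
  return sentence.replace(rx, "$1*****");
}

// Right boundary is consumed, so a directly following word loses its left boundary.
function maskWordConsuming(sentence, word) {
  const rx = new RegExp("(^|\\s)" + escapeRegExp(word) + "($|\\s)", "g");
  return sentence.replace(rx, "$1*****$2");
}

console.log(maskWord("Äpfel Äpfel Äpfel", "Äpfel"));
// "***** ***** *****"
console.log(maskWordConsuming("Äpfel Äpfel Äpfel", "Äpfel"));
// "***** Äpfel *****"
```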

Multiple nested matches in JavaScript Regular Expression

Trying to write a regular expression to match GS1 barcode patterns ( https://en.wikipedia.org/wiki/GS1-128 ), that contain 2 or more of these patterns that have an identifier followed by a certain number of characters of data.
I need something that matches this barcode because it contains 2 of the identifier and data patterns:
human readable with the identifiers in parens: (01)12345678901234(17)501200
actual data: 011234567890123417501200
but should not match this barcode, where there is only one pattern:
human readable: (01)12345678901234
actual data: 0112345678901234
It seems like the following should work:
var regex = /(?:01(\d{14})|10([^\x1D]{6,20})|11(\d{6})|17(\d{6})){2,}/g;
var str = "011234567890123417501200";
console.log(str.replace(regex, "$4"));
// matches 501200
console.log(str.replace(regex, "$1"));
// no match? why?
For some strange reason as soon as I remove the {2,} it works, but I need the {2,} so that it only returns matches if there is more than one match.
// Remove {2,} and it will return the first match
var regex = /(?:01(\d{14})|10([^\x1D]{6,20})|11(\d{6})|17(\d{6}))/g;
var str = "011234567890123417501200";
console.log(str.replace(regex, "$4"));
// matches 501200
console.log(str.replace(regex, "$1"));
// matches 12345678901234
// but then the problem is it would also match single identifiers such as
var str2 = "0112345678901234";
console.log(str2.replace(regex, "$1"));
How do I make this work so it will only match and pull the data if there is more than 1 set of match groups?
Thanks!

Your RegEx is logically and syntactically correct for Perl-Compatible Regular Expressions (PCRE). The issue I believe you are facing is that JavaScript handles repeated capture groups differently: each repetition of the quantified group resets the capture groups inside it, so after the last repetition only the group that matched last still holds a value. This is why the RegEx works fine once you take out the {2,}. By adding the quantifier, JavaScript will be sure to return only the last match.
What I would recommend is removing the {2,} quantifier and then programmatically checking for matches. I know it's not ideal for those who are big fans of RegEx, but c'est la vie.
Please see the snippet below:
var regex = /(?:01(\d{14})|10([^\x1D]{6,20})|11(\d{6})|17(\d{6}))/g;
var str = "011234567890123417501200";
// Check to see if we have at least 2 matches.
var m = str.match(regex) || []; // match() returns null when nothing matches
console.log("Matches list: " + JSON.stringify(m));
if (m.length < 2) {
console.log("We only received " + m.length + " matches.");
} else {
console.log("We received " + m.length + " matches.");
console.log("We have achieved the minimum!");
}
// If we exec the regex, what would we get?
console.log("** Method 1 **");
var n;
while (n = regex.exec(str)) {
console.log(JSON.stringify(n));
}
// That's not going to work. Let's try using a second regex.
console.log("** Method 2 **");
var regex2 = /^(\d{2})(\d{6,})$/;
var arr = [];
var obj = {};
for (var i = 0, len = m.length; i < len; i++) {
arr = m[i].match(regex2);
obj[arr[1]] = arr[2];
}
console.log(JSON.stringify(obj));
// EOF
I hope this helps.

The reason is that the capture groups only give the last match by that particular group. Imagine that you would have two barcodes in your sequence that have both the same identifier 01... now it becomes clear that $1 cannot refer to both at the same time. The capture group only retains the second occurrence.
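The behavior can be reproduced with a much smaller pattern. This is a minimal sketch with made-up inputs, not the barcode pattern itself:

```javascript
// A repeated group keeps only the text of its last repetition.
const single = /(\d)+/.exec("12");
console.log(single[1]); // "2"

// With an alternation inside a quantified group, the group that matched
// in an earlier repetition is cleared again in JavaScript.
const alt = /(?:(a)|(b))+/.exec("ab");
console.log(alt[1], alt[2]); // undefined "b"

// In a replacement string, an unset group is substituted as an empty string.
const replaced = "ab".replace(/(?:(a)|(b))+/, "[$1][$2]");
console.log(replaced); // "[][b]"
```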
A straightforward way, but not so elegant, is to drop the {2,} and instead repeat the whole regular expression pattern for matching the second barcode sequence. I think you also need to use the ^ (start of string anchor) to be sure the match is at the start of the string, otherwise you might pick up an identifier halfway through an invalid sequence. After the repeated regular expression pattern you should also add .* if you want to ignore anything that follows the second sequence, and not have it come back to you when using replace.
Finally, as you don't know which identifier will be found for the first and second match, you need to reproduce $1$2$3$4 in your replace, knowing that only one of those four will be a non-empty string. Same for the second match: $5$6$7$8.
Here is the improved code applied to your example string:
var regex = /^(?:01(\d{14})|10([^\x1D]{6,20})|11(\d{6})|17(\d{6}))(?:01(\d{14})|10([^\x1D]{6,20})|11(\d{6})|17(\d{6})).*/;
var str = "011234567890123417501200";
console.log(str.replace(regex, "$1$2$3$4")); // 12345678901234
console.log(str.replace(regex, "$5$6$7$8")); // 501200
If you need to also match the barcodes that follow the second, then you cannot escape from writing a loop. You cannot do that with just a regular expression based replace.
With a loop
If a loop is allowed, then you can use the regex#exec method. I would then suggest to add in your regular expression a kind of "catch all", which will match one character if none of the other identifiers match. If in the loop you detect such a "catch all" match, you exit:
var str = "011234567890123417501200";
var regex = /(?:01(\d{14})|10([^\x1D]{6,20})|11(\d{6})|17(\d{6})|(.))/g;
// 1: ^^^^^^ 2: ^^^^^^^^^^^^^ 3: ^^^^^ 4: ^^^^^ 5:^ (=failure)
var result = [], grp;
while ((grp = regex.exec(str)) && !grp[5]) result.push(grp.slice(1).join(''));
// Consider it a failure when not at least 2 matched.
if (result.length < 2) result = [];
console.log(result);
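A variation on the loop is to use the sticky flag (y, ES2015) so that every element must start exactly where the previous one ended, which removes the need for a catch-all group. This is only a sketch under assumptions: the function name parseElements and the shape of the returned objects are hypothetical, and the patterns are copied from the question, so only the identifiers 01, 10, 11 and 17 are covered.

```javascript
// Hypothetical parser: returns [] unless the whole string consists of
// at least two identifier + data elements.
function parseElements(data) {
  // Sticky flag: a match must begin exactly at rx.lastIndex.
  const rx = /(01)(\d{14})|(10)([^\x1D]{6,20})|(11|17)(\d{6})/y;
  const elements = [];
  let pos = 0;
  while (pos < data.length) {
    rx.lastIndex = pos;
    const m = rx.exec(data);
    if (!m) return []; // the remainder is not a known element
    elements.push({ id: m[1] || m[3] || m[5], value: m[2] || m[4] || m[6] });
    pos = rx.lastIndex;
  }
  return elements.length >= 2 ? elements : [];
}

console.log(parseElements("011234567890123417501200"));
// [ { id: '01', value: '12345678901234' }, { id: '17', value: '501200' } ]
console.log(parseElements("0112345678901234"));
// []
```

As in the question's pattern, the 10 branch is greedy and takes up to 20 characters, so it can swallow data that was meant for a following element.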

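Update
1st example
A simplified example with $1, $2, $3 and $4. The output looks like a matrix because every call replaces all four matches, and in each match only one group is non-empty. Still, you can see that
$1 -> abc, $2 -> def, $3 -> ghi, $4 -> jkl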
// $1 $2 $3 $4
var regex = /(abc)|(def)|(ghi)|(jkl)/g;
var str = "abcdefghijkl";
// test
console.log(str.replace(regex, "$1 1st "));
console.log(str.replace(regex, "$2 2nd "));
console.log(str.replace(regex, "$3 3rd "));
console.log(str.replace(regex, "$4 4th "));
2nd example
Here the groups get mixed up:
// $1 $2 $3 $4
var regex = /((abc)|(def)|(ghi)|(jkl)){2,}/g;
var str = "abcdefghijkl";
// test
console.log(str.replace(regex, "$1 1st "));
console.log(str.replace(regex, "$2 2nd "));
console.log(str.replace(regex, "$3 3rd "));
console.log(str.replace(regex, "$4 4th "));
As you can see, the first line prints jkl and the other lines print nothing: $1 now holds what used to be $4.
If I understand correctly, the cause is the outer parentheses. They form a capture group of their own, so the outer group becomes $1 and the inner groups shift to $2 through $5. Because of the {2,} quantifier, the outer group keeps only the text of its last repetition, jkl. In other words, you have ($1 ($2)($3)($4)($5)){2,}. Your pattern uses a non-capturing (?:...) on the outside, so there the numbering does not shift, but the inner groups are still filled only by the last repetition. That is why $1 is empty and $4 is not.
As @Damian said:
By adding the quantifier, JavaScript will be sure to return only the last match.
So $4 is the last match.
End of update
I added a useful little test:
var regex = /(?:01(\d{14})|10([^\x1D]{6,20})|11(\d{6})|17(\d{6})){2,}/g;
var str = "011234567890123417501200";
// test
console.log(str.replace(regex, "$1 1st "));
console.log(str.replace(regex, "$2 2nd "));
console.log(str.replace(regex, "$3 3rd "));
console.log(str.replace(regex, "$4 4th "));
