Regex javascript tuning

I'm taking my first steps in regex and have been trying to get some values out of emails that are being fetched. So far I've got some of them (even if not with the best approach), but I need a few more values and can't figure out how to get them.
This is the email template that is more or less always the same:
London, 06/20/09
Mr. Tom Waits
Process ref.: CR // 1943061
Your reference: 338256
Clients' names: Mary Lamb, John Snow
We return to your contact regarding the complaint on behalf of the clients mentioned above.
We inform you that the refund process has already started, so you should receive the respective amount (375EUR) within 4/6 weeks.
Payment ref.: 2500062960.
Our compliments,(...)
WHAT I NEED:
Date after "London,"
Process ref. (2 letters + // + digits)
Your ref. number
Clients' names
Amount
Payment reference number
Notes: the amount does not always come between "( )"; sometimes it's preceded by "amount", other times by "amount of", and sometimes "EUR" is separated by a space, but the needed value is always the first combination of digits in the paragraph
WHAT I HAVE SO FAR:
(?:London,)(.*)\s+(?:Mr. Tom Waits)\s+(?:Process ref.:)\s+(....\d+)\s+(?:Your reference:)\s+(\d+)\s+(?:Clients' names:)\s+(.*)\s+
WHAT IT RETRIEVES:
Date after "London,"
Process ref. (2 letters + // + digits)
Your reference number
Clients' names
WHAT'S MISSING:
Amount
Payment reference number
Other Concerns:
I tried to exclude the "Mr" paragraph, but I think there may be a problem when they write it a little differently, like with an extra space or something
The same problem may arise if they also write the items a little different like, for example, "Process reference" instead of "Process ref."
Thanks in advance.

This one's gonna be one hefty regex. I do hope you don't try this regex on a massive file but rather one singular entry (like the one you've shown in your post).
Anyway, here's the regex-
London, ([\d\/]+)[\n\w\W]*?Process ref(?:\.|erence): ([A-Z \/\d]+)[\n\w\W]*?Your ref(?:\.|erence): (\d+)[\n\w\W]*?Client(?:s'|'s) name(?:s)?: ([\w ,]+)[\n\w\W]*?amount.*?(\d+)[\n\w\W]*?Payment ref(?:\.|erence): (\d+)
This should be quite permissive, it doesn't depend on many variable (seemingly) things, apart from London, that one's a bit of a hard coding but I'm assuming it's always London.
Now, let's walk through this-
London, ([\d\/]+) - This basically matches London, DATE - where date is, well, a date, where each element of the date is separated by a /.
In this case, it matches 06/20/09 from London, 06/20/09
[\n\w\W]*? - Try to keep up with this one - I'm using it A LOT.
This will match pretty much all new lines, word characters and non word characters in a non-greedy way. In this particular case, this will match pretty much everything and that includes the newlines. This is used to just skip over anything and everything until we reach the desired spot.
Process ref(?:\.|erence): ([A-Z \/\d]+) - Captures the process reference, which can consist of capital letters (I assume, you can change that), spaces, slashes (/), and digits
Works with both ref. and reference
In this case, it matches CR // 1943061, from Process ref.: CR // 1943061
[\n\w\W]*? - Ignore everything up until the next token
Your ref(?:\.|erence): (\d+) - Captures "your reference", which can consist of digits
Works with both ref. and reference
In this case, it matches 338256
[\n\w\W]*? - Ignore everything up until the next token
Client(?:s'|'s) name(?:s)?: ([\w ,]+) - Captures the client name(s) - modified so it supports single client names too. (check the demo). The name list can consist of word characters, spaces and a comma.
In this case, it captures Mary Lamb, John Snow, from Clients' names: Mary Lamb, John Snow
(\d+) - Capture the digits - this is a big assumption, I'm assuming that the only digits that appear after client name list, are the ones for the amount. If they aren't
[\n\w\W]*? - Ignore everything up until the next token
amount.*?(\d+) - Captures the first group of digits that appear after amount. This is a bit of an assumption, I'm assuming that amount word is actually present in that paragraph.
In this case, it captures 375, from amount (375EUR)
[\n\w\W]*? - Ignore everything up until the next token
Payment ref(?:\.|erence): (\d+) - Capture the Payment reference number, which can consist of digits
Works with both ref. and reference
In this case, it captures 2500062960, from Payment ref.: 2500062960.
Check out the demo!
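As a quick check, the pattern above can be run against the sample email from the question. This is only a sketch: the names `sampleEmail`, `emailPattern` and `fields` are made up for illustration, and it assumes the email arrives as a single string with newline-separated lines.

```javascript
// Sample email from the question, as one string (assumed input shape).
const sampleEmail = `London, 06/20/09
Mr. Tom Waits
Process ref.: CR // 1943061
Your reference: 338256
Clients' names: Mary Lamb, John Snow
We return to your contact regarding the complaint on behalf of the clients mentioned above.
We inform you that the refund process has already started, so you should receive the respective amount (375EUR) within 4/6 weeks.
Payment ref.: 2500062960.
Our compliments,(...)`;

// Same pattern as in the answer, written as a regex literal.
const emailPattern = /London, ([\d\/]+)[\n\w\W]*?Process ref(?:\.|erence): ([A-Z \/\d]+)[\n\w\W]*?Your ref(?:\.|erence): (\d+)[\n\w\W]*?Client(?:s'|'s) name(?:s)?: ([\w ,]+)[\n\w\W]*?amount.*?(\d+)[\n\w\W]*?Payment ref(?:\.|erence): (\d+)/;

const emailMatch = sampleEmail.match(emailPattern);

// Map the six capture groups to readable names (null if the email does not match).
const fields = emailMatch && {
  date: emailMatch[1],
  processRef: emailMatch[2],
  yourRef: emailMatch[3],
  clients: emailMatch[4],
  amount: emailMatch[5],
  paymentRef: emailMatch[6],
};

console.log(fields);
```

If the template drifts further than the variants handled here (for example, the word "amount" is missing), `match` returns null, so the null check matters.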

Related

Regular expression to group prices and item name

Given the following examples:
100k melon 8
200 knife 7k
1.2m maple logs 19
I need to be able to take the first string as one group, the middle parts as another group, and the last part as the final group.
The current expression I have is this but regular expressions really throw me for a whirl:
([\d+|k])
Do any of you veterans have an idea of where I should go next or an explanation of how I can reach a solution to this?
Unfortunately, Reference - What does this regex mean? doesn't really solve my problem since it's just a dump of all of the different tokens you can use. Where I'm having trouble is putting all of that together in a meaningful way.

Here's what I came up with:
([0-9\.]+[a-z]?)\s([a-z\ ]+)\s([0-9\.]+[a-z]?)
and a quick overview of the groups:
([0-9\.]+[a-z]?)
matches one or more digits or dots, plus an optional 1-character unit, like "k" or "m"
([a-z\ ]+)
matches letter and/or spaces. This may include a trailing or leading space, which is annoying, but I figured this was good enough.
([0-9\.]+[a-z]?)
same as the first group.
The three groups are separated by a single whitespace character each.

Solution
This regex does the job
^(\S+)\s+(.*)\s+(\S+)$
See a demo here (Regex101)
Explanation
Explicitly match non-whitespace sequences at the start and the end of the string, resp.
The remainder are the middle parts surrounded by ws sequences.
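A minimal sketch of how this pattern splits the three sample lines from the question. The helper name `splitListing` and the property names are hypothetical; the sketch assumes each line is tested on its own.

```javascript
// First token, everything in the middle, last token.
const listingPattern = /^(\S+)\s+(.*)\s+(\S+)$/;

// Hypothetical helper: returns the three parts, or null if the line does not fit.
function splitListing(line) {
  const m = line.match(listingPattern);
  return m ? { first: m[1], middle: m[2], last: m[3] } : null;
}

for (const line of ['100k melon 8', '200 knife 7k', '1.2m maple logs 19']) {
  console.log(splitListing(line));
}
```

Because the middle group is greedy and then backtracks to the last whitespace run, multi-word names such as "maple logs" stay together in the middle group.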

Okay, if I understand your question correctly, you have the following String '100k melon 8 200 knife 7k 1.2m maple logs 19'. You should make a function that returns a .match():
function thrice(str){
return str.match(/(\d+\.\d+|\d+)\w?\D+(\d+\.\d+|\d+)\w?/g);
}
console.log(thrice('100k melon 8 200 knife 7k 1.2m maple logs 19'));
Otherwise, if you just want to test each String separately, you may consider:
function goodCat(str){
let g = str.match(/^(\d+\.\d+|\d+)\w?\D+(\d+\.\d+|\d+)\w?$/) ? true : false;
return g;
}
console.log(goodCat('100000k is 777 not enough 5'));
console.log(goodCat('100k melon 8'));
console.log(goodCat('200 knife 7k'));
console.log(goodCat('1.2m maple logs 19'));

Javascript - get first RegExp match with matchAll()

I'm not sure what I'm doing wrong, and I'm happy to admit that javascript isn't my strongest language. I test my regexs in a little .net tester I wrote years ago, and I see no problems there. I know different languages implement regex a little differently, but I don't think that's the issue here.
My app has a textarea where I can paste in data from an industry-specific spreadsheet and I use regexp matchAll() to parse. I am looping through the matchAll-returned iterable with a for/of loop, pretty basic stuff, and noticed that I can't seem to get the first match. So If my spreadsheet has 15 lines of data, my javascript parsing handles lines 2-15 ignoring #1. If I copy any lines in the block and paste them to the start then the new line #1 is ignored and the old #1 which is now #2 gets parsed, always ignoring the first line. So the issue is apparently not the RegExp pattern. I googled and found this passage from developer.mozilla.org:
matchAll only returns the first match if the /g flag is missing
this says to me that if I take out the /g I will only get the first match, but I guess this sentence could also be read to mean
unless the /g flag is missing, matchAll will not return the first match
but that would be ridiculous, right? if I take out the /g then I get the first match, and only the first match. if I use /g I get matches 2-15. Why can't I get 1-15? I copied some of my code from a different app I made a few months ago that doesn't have this issue.
working code:
var patt = /(?<invoiceNumber>INV \d+)\t(?<vendor>[\w  .,&-]+)\t(?<vendInvNum>[ ()\w\d\/.-]+)\t(20)?(?<yr>\d{2})[-\/](?<mn>\d{1,2})[-\/](?<dd>\d{1,2})\tInvoice\sUSD\s+(?<invAmt>[\d,.-]+)/g
for (let result of objInp.value.matchAll(patt)){
//loops thru iterable
}
example of data pasted in, finds 3 for 3 matches:
INV 006015 VENDOR 1 1025702 26/08/2019 Invoice USD 580.69
INV 006019 VENDOR 2 STORE/090919 09/09/2019 Invoice USD 38.71
INV 006021 Vendor 3 170241569 10/09/2019 Invoice USD 1,080.64
Code that doesn't pull in first match:
var patt = /\s(?<actID>[\w\d-]+)\t(?<actDesc>[\w\d  .,&\(\)-]+)?\t(?<origDur>[\d]+)?\t(?<start>[-\d\w]+)?\t(?<end>[-\d\w]+)?/g
for (let result of objInp.value.matchAll(patt[x])){
//loops through but always misses the first match
}
example of pasted data, finds 2 out of 3 matches:
Activity ID Activity Name Original Duration Start Finish Variance - BL1 Finish Date BL1 Finish Total Float
S600-20-21 Executive Steering Committee 5 06-Jan-20 13-Jan-20 0 13-Jan-20 0
S600-20-31 Steering Committee - Option Selection Meeting 2 13-Jan-20 15-Jan-20 0 15-Jan-20 0
S600-20-019b10 Resource Center of Excellence- Review 20 15-Jan-20 12-Feb-20 0 12-Feb-20 0
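One possible cause, offered as a guess rather than a confirmed diagnosis: the second pattern begins with \s, so it can only match where a whitespace character precedes the activity ID. If the pasted text starts directly with the first data row, nothing precedes that row and it is skipped. The sketch below uses hypothetical, shortened rows (only the five captured columns, tab-separated, no header line) and a simplified copy of the pattern. Anchoring with ^ and the m flag is a swapped-in alternative, not the asker's code.

```javascript
// Hypothetical shortened rows: tab-separated, no header, no leading whitespace.
const sampleRows = [
  'S600-20-21\tExecutive Steering Committee\t5\t06-Jan-20\t13-Jan-20',
  'S600-20-31\tSteering Committee - Option Selection Meeting\t2\t13-Jan-20\t15-Jan-20',
  'S600-20-019b10\tResource Center of Excellence- Review\t20\t15-Jan-20\t12-Feb-20',
].join('\n');

// Shape of the pattern from the question: the leading \s requires a
// whitespace character (here, the previous line's newline) before the ID.
const leadingWhitespacePattern =
  /\s(?<actID>[\w-]+)\t(?<actDesc>[\w .,&()-]+)?\t(?<origDur>\d+)?\t(?<start>[-\w]+)?\t(?<end>[-\w]+)?/g;

// Alternative: anchor each match at the start of a line (m flag) instead.
const lineStartPattern =
  /^(?<actID>[\w-]+)\t(?<actDesc>[\w .,&()-]+)?\t(?<origDur>\d+)?\t(?<start>[-\w]+)?\t(?<end>[-\w]+)?/gm;

const idsWithLeadingWhitespace = [...sampleRows.matchAll(leadingWhitespacePattern)]
  .map((m) => m.groups.actID);
const idsWithLineStart = [...sampleRows.matchAll(lineStartPattern)]
  .map((m) => m.groups.actID);

console.log(idsWithLeadingWhitespace); // the first row is missing
console.log(idsWithLineStart);         // all three rows
```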

How to filter out characters that aren't letters, numbers or punctuation

I have a string that will have a lot of formatting things like bullet points or arrows or whatever. I want to clean this string so that it only contains letters, numbers and punctuation. Multiple spaces should be replaced by a single space too.
Allowed punctuation: , . : ; [ ] ( ) / \ ! @ # $ % ^ & * + - _ { } < > = ? ~ | "
Basically anything allowed in this ASCII table.
This is what I have so far:
let asciiOnly = y.replace(/[^a-zA-Z0-9\s]+/gm, '')
let withoutSpacing = asciiOnly.replace(/\s{2,}/gm, ' ')
Regex101: https://regex101.com/r/0DC1tz/2
I also tried the [:punct:] tag but apparently it's not supported by javascript. Is there a better way I can clean this string other than regex? A library or something maybe (I didn't find any). If not, how would I do this with regex? Would I have to edit the first regex to add every single character of punctuation?
EDIT: I'm trying to paste an example string in the question but SO just removes characters it doesn't recognize so it looks like a normal string. Heres a paste.
EDIT2: I think this is what I needed:
let asciiOnly = x.replace(/[^\x20-\x7E]+/gm, '')
let withoutSpacing = asciiOnly.replace(/\s{2,}/gm, ' ')
I'm testing it with different cases to make sure.

You can achieve this using the regex below, which matches any non-ASCII characters (as well as non-printable ASCII and extended ASCII characters) so they can be replaced with an empty string.
[^ -~]+
This assumes you want to retain only printable ASCII characters, which range from space (ASCII value 32) to tilde (~, ASCII value 126), hence the character set [^ -~]
Then replace each run of one or more whitespace characters with a single space
var str = `Determine the values of P∞ and E∞ for each of the following signals: b.
d.
f.
Periodic and aperiodic signals Determine whether or not each of the following signals is periodic:
b.
Determine whether or not each of the following signals is periodic. If a signal is periodic, specify its fundamental period.
b.
d.
Transformation of Independent variables A continuous-time signal x(t) is shown in Figure 1. Sketch and label carefully each of the following signals:
b. c.
d. e. f. Figure 1: Problem Set 1.4
Even and Odd Signals
For each signal given below, determine all the values of the independent variable at which the even part of the signal is guaranteed to be zero.
b.
d. -------------------------`;
console.log(str.replace(/[^ -~]+/g,'').replace(/\s+/g, ' '));
Also, if you only want to allow alphanumeric characters and the special characters you mentioned, you can use this regex to match everything else:
[^ a-zA-Z0-9,.:;[\]()/\!@#$%^&*+_{}<>=?~|"-]+
Replace this with empty string and then replace one or more white spaces with just a single space.
var str = `Determine the values of P∞ and E∞ for each of the following signals: b.
d.
f.
Periodic and aperiodic signals Determine whether or not each of the following signals is periodic:
b.
Determine whether or not each of the following signals is periodic. If a signal is periodic, specify its fundamental period.
b.
d.
Transformation of Independent variables A continuous-time signal x(t) is shown in Figure 1. Sketch and label carefully each of the following signals:
b. c.
d. e. f. Figure 1: Problem Set 1.4
Even and Odd Signals
For each signal given below, determine all the values of the independent variable at which the even part of the signal is guaranteed to be zero.
b.
d. -------------------------`;
console.log(str.replace(/[^ a-zA-Z0-9,.:;[\]()/\!@#$%^&*+_{}<>=?~|"-]+/g,'').replace(/\s+/g, ' '));

This is how I would do it: remove all the disallowed characters first, and then replace multiple spaces with a single space.
let str = `Determine the values of P∞ and E∞ for each of the following signals: b.
d.
f.
Periodic and aperiodic signals Determine whether or not each of the following signals is periodic:!!!23
b.
Determine whether or not each of the following signals is periodic. If a signal is periodic, specify its fundamental period.
b.
d.
Transformation of Independent variables A continuous-time signal x(t) is shown in Figure 1. Sketch and label carefully each of the following signals:
b. c.
d. e. f. Figure 1: Problem Set 1.4
Even and Odd Signals
For each signal given below, determine all the values of the independent variable at which the even part of the signal is guaranteed to be zero.
b.
d. ------------------------- `
const op = str.replace(/[^\w,.:;\[\]()/\!@#$%^&*+{}<>=?~|" -]/g, '').replace(/\s+/g, " ")
console.log(op)
EDIT : In case you want to keep \n or \t as it is use (\s)\1+, "$1" in second regex.
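A small sketch of that variant, with a made-up input string (`messy`). The backreference \1 only collapses runs of the same whitespace character, so a newline stays a newline and a tab stays a tab.

```javascript
// Made-up input: a double space, a triple newline and a double tab.
const messy = 'a  b\n\n\nc\t\td';

// (\s) captures one whitespace character; \1+ matches further copies of that
// same character; the whole run is replaced by the single captured character.
const collapsed = messy.replace(/(\s)\1+/g, '$1');

console.log(JSON.stringify(collapsed)); // "a b\nc\td"
```

Note that a mixed run such as a space followed by a newline is left untouched by this pattern, since the two characters differ.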

There probably isn't a better solution than a regex. The under-the-hood implementation of regex actions is usually well optimized by virtue of age and ubiquity.
You may be able to explicitly tell the regex handler to "compile" the regex. This is usually a good idea if you know the regex is going to be used a lot within a program, and may help with performance here. But I don't know if javascript exposes such an option.
The idea of "normal punctuation" doesn't have an excellent foundation. There are some common marks like "90°" that aren't ASCII, and some ASCII marks like "" () that you almost certainly don't want. I would expect you to find similar edge cases with any pre-made list. In any case, just explicitly listing all the punctuation you want to allow is better in general, because then no one will ever have to look up what's in the list you chose.
You may be able to perform both substitutions in a single pass, but it's unclear if that will perform better and it almost certainly won't be clearer to any co-workers (including yourself-from-the-future). There will be a lot of finicky details to work out such as whether " ° " should be replaced with "", " ", or " ".

Matching multiple quotes in a sentence

I am trying to match multiple quotes inside of a single sentence, for example the line:
Hello "this" is a "test" example.
This is the regex that I am using, but I am having some problems with it:
/[^\.\?\!\'\"]{1,}[\"\'\“][^\"\'\“\”]{1,}[\"\'\“\”][^\.\?\!]{1,}[\.\?\!]/g
What I am trying to achieve with this regex is to find everything from the start of the last sentence until I hit quotes, then find the closing set and continue until I reach one of . ? !
The sample text that I am using to test with is from Call of Cthulhu:
What seemed to be the main document was headed “CTHULHU CULT” in characters painstakingly printed to avoid the erroneous reading of a word so unheard-of. The manuscript was divided into two sections, the first of which was headed “1925—Dream and Dream Work of H. A. Wilcox, 7 Thomas St., Providence, R.I.”, and the second, “Narrative of Inspector John R. Legrasse, 121 Bienville St., New Orleans, La., at 1908 A. A. S. Mtg.—Notes on Same, & Prof. Webb’s Acct.” The other manuscript papers were all brief notes, some of them accounts of the queer dreams of different persons, some of them citations from theosophical books and magazines.
The issue comes on the line The manuscript was.... Does anyone know how to account for repeats like this? Or is there a better way?

This one ignores [.?!] inside quotes. But cases like Acct.” The nth will be considered as a single sentence in this case. Probably a . is missing over there.
var r = 'What seemed to be the main document was headed “CTHULHU.?! CULT” in characters painstakingly printed to avoid the erroneous reading of a word so unheard-of. The manuscript was divided into two sections, the first of which was headed “1925—Dream and Dream Work of H. A. Wilcox, 7 Thomas St., Providence, R.I.”, and the second, “Narrative of Inspector John R. Legrasse, 121 Bienville St., New Orleans, La., at 1908 A. A. S. Mtg.—Notes on Same, & Prof. Webb’s Acct.” The other manuscript papers were all brief notes, some of them accounts of the queer dreams of different persons, some of them citations from theosophical books and magazines.'
.split(/[“”]/g)
.map((x,i)=>(i%2)?x.replace(/[.?!]/g,''):x)
.join("'")
.split(/[.?!]/g)
.filter(x => x.trim()).map(x => ({
sentence: x,
quotescount: x.split("'").length - 1
}));
console.log(r);

You can use this naive pattern:
/[^"'“.!?]*(?:"[^"]*"[^"'“.!?]*|'[^']*'[^"'“.!?]*|“[^”]*”[^"'“.!?]*)*[.!?]/
details:
/
[^"'“.!?]* # all that isn't a quote or a punct that ends the sentence
(?:
"[^"]*" [^"'“.!?]*
|
'[^']*' [^"'“.!?]*
|
“[^”]*” [^"'“.!?]*
)*
[.!?]
/
If you want something more robust, you can emulate the "atomic grouping" feature, in particular if you are not sure that each opening quote has a closing quote (to prevent catastrophic backtracking):
/(?=([^"'“.!?]*))\1(?:"(?=([^"]*))\2"[^"'“.!?]*|'(?=([^']*))\3'[^"'“.!?]*|“(?=([^”]*))\4”[^"'“.!?]*)*[.!?]/
An atomic group forbids backtracking once closed. Unfortunately this feature doesn't exist in Javascript. But there's a way to emulate it using a lookahead that is naturally atomic, a capture group and a backreference:
(?>expr) => (?=(expr))\1
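To see the emulation at work on something smaller than the sentence pattern, here is a toy comparison. The digit patterns are made up for illustration and are not part of the answer above.

```javascript
// Plain version: \d+ first takes all digits, then gives one back so the
// trailing \d can match.
const plainPattern = /\d+\d/;

// Emulated atomic group: the lookahead captures the whole digit run, \1
// consumes exactly that text, and the engine cannot give a digit back.
const atomicLikePattern = /(?=(\d+))\1\d/;

const plainResult = plainPattern.test('1234');
const atomicLikeResult = atomicLikePattern.test('1234');

console.log(plainResult);      // true
console.log(atomicLikeResult); // false
```

The second pattern fails because nothing is left for the final \d once the captured run has been consumed, which is the "no backtracking into the group" behaviour an atomic group would give.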

regex to validate intl phone number

Can anyone help me write a regex that satisfies these conditions to validate an international phone number:
it must start with +, 00 or 011.
the only allowed characters are [0-9],-,.,space,(,)
length is not important
so these tests should pass:
+1 703 335 65123
001 (703) 332-6261
+1703.338.6512
This is my attempt ^\+?(\d|\s|\(|\)|\.|\-)+$ but it's not working properly.

To clean up the regexp use square-brackets to define "OR" situations of characters, instead of |.
Below is a rewritten version of your regular-expression, matching the provided description.
/^(?:\+|00|011)[0-9 ().-]+$/
What is the use of ?:?
When doing ?: directly inside a parenthesis it's for telling the regular-expression engine that you'd want to group something, but not store away the information for later use.
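A short sketch that runs the rewritten expression against the three test numbers from the question, plus one made-up counterexample without a valid prefix. The names `intlPhonePattern` and `phoneResults` are hypothetical.

```javascript
// Prefix (+, 00 or 011), then one or more allowed characters.
const intlPhonePattern = /^(?:\+|00|011)[0-9 ().-]+$/;

const phoneSamples = [
  '+1 703 335 65123',
  '001 (703) 332-6261',
  '+1703.338.6512',
  '1 703 335 65123', // made-up counterexample: no +, 00 or 011 prefix
];

const phoneResults = phoneSamples.map((s) => intlPhonePattern.test(s));
console.log(phoneResults); // [ true, true, true, false ]
```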

This version allows only a single space between groups of characters; consecutive spaces are not allowed (note the " ?" at the end of the second group):
(\+|00|011)([\d-.()]+ ?)+$
It may be faster (I guess) with non-capturing groups, written by adding ?: at the beginning of each group:
(?:\+|00|011)(?:[\d-.()]+ ?)+$
you can use some regex cheat sheets like this one and Linqpad for faster tuning this regex to your needs.
in case you are not familiar with Linqpad, you should just copy & paste this next block to it and change language to C# statements and press F5
string pattern = @"^(?:\+|00|011)(?:[\d-.()]+ ?)+$";
Regex.IsMatch("+1 703 335 65123", pattern).Dump();
Regex.IsMatch("001 (703) 332-6261",pattern).Dump();
Regex.IsMatch("+1703.338.6512",pattern).Dump();

^(?:\+|00|011)[\d. ()-]*$
To specify a length (in case you do care about length later on), use the following:
^(?:\+|00|011)(?:[. ()-]*\d){11,12}[. ()-]*$
And you could obviously change the 11,12 to whatever you want. And just for fun, this also does the same exact thing as the one above:
^(?:\+|00|011)[. ()-]*(?:\d[. ()-]*){11,12}$
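A minimal sketch of the length-limited variant, assuming the 11 to 12 digit range shown above. The short and long sample numbers are made up to show the bounds. Note that the digit count excludes the prefix itself, so "001 ..." is read as prefix 00 followed by eleven digits.

```javascript
// Each repetition is "optional separators, then one digit", so {11,12}
// counts digits only, wherever the separators happen to fall.
const lengthLimitedPattern = /^(?:\+|00|011)(?:[. ()-]*\d){11,12}[. ()-]*$/;

const lengthResults = {
  twelveDigits: lengthLimitedPattern.test('+1 703 335 65123'),
  elevenDigits: lengthLimitedPattern.test('+1703.338.6512'),
  elevenAfter00: lengthLimitedPattern.test('001 (703) 332-6261'),
  tooShort: lengthLimitedPattern.test('+1 703 335'),    // made-up, 7 digits
  tooLong: lengthLimitedPattern.test('+1234567890123'), // made-up, 13 digits
};

console.log(lengthResults);
```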

I'd go for a completely different route (in fact I had the same problem as you at one point, except I did it in Java).
The plan here is to take the input, make replacements on it and check that the input is empty:
first substitute \s* with nothing, globally;
then substitute \(\d+\) by nothing, globally;
then substitute ^(\+|00|011)\d+([-.]\d+)*$ by nothing.
after these, if the result string is empty, you have a match, otherwise you don't.
Since I did it in Java, I found Google's libphonenumber since then and have dropped that. But it still works:
fge@erwin ~ $ perl -ne '
> s,\s*,,g;
> s,\(\d+\),,g;
> s,^(\+|00|011)\d+([-.]\d+)*$,,;
> printf("%smatch\n", $_ ? "no " : "");
> '
+1 703 335 65123
match
001 (703) 332-6261
match
+1703.338.6512
match
+33209283892
match
22989018293
no match
Note that a further test is required to see if the input string is at least of length 1.
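The same idea can be sketched in JavaScript. This is a loose translation, not the Perl above: the function name `looksLikeIntlNumber` is hypothetical, and the last step uses `test` instead of substituting and checking for an empty string, which also makes empty input fail without an extra length check.

```javascript
// Hypothetical helper following the same strip-then-check steps.
function looksLikeIntlNumber(input) {
  const stripped = input
    .replace(/\s+/g, '')      // step 1: drop all whitespace
    .replace(/\(\d+\)/g, ''); // step 2: drop parenthesised digit groups
  // step 3: prefix, digits, then optional "-digits" or ".digits" groups
  return /^(?:\+|00|011)\d+(?:[-.]\d+)*$/.test(stripped);
}

const stripResults = [
  '+1 703 335 65123',
  '001 (703) 332-6261',
  '+1703.338.6512',
  '+33209283892',
  '22989018293',
].map(looksLikeIntlNumber);

console.log(stripResults); // [ true, true, true, true, false ]
```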

Try this:
^(\([+]?\d{1,3}\)|([+0]?\d{1,3}))?( |-)?(\(\d{1,3}\)|\d{1,3})( |-)?\d{3}( |-)?\d{4}$
It is compatible with E164 standard along with some combinations of brackets, space and hyphen.
