Javascript - get first RegExp match with matchAll() - javascript

I'm not sure what I'm doing wrong, and I'm happy to admit that JavaScript isn't my strongest language. I test my regexes in a little .NET tester I wrote years ago, and I see no problems there. I know different languages implement regex a little differently, but I don't think that's the issue here.
My app has a textarea where I paste in data from an industry-specific spreadsheet, and I use RegExp matchAll() to parse it. I loop through the iterable that matchAll returns with a for/of loop, pretty basic stuff, and noticed that I can't seem to get the first match. If my spreadsheet has 15 lines of data, my JavaScript parsing handles lines 2-15 and ignores line 1. If I copy any line in the block and paste it at the start, the new line 1 is ignored and the old line 1, which is now line 2, gets parsed; the first line is always ignored. So the issue is apparently not the RegExp pattern. I googled and found this passage from developer.mozilla.org:
matchAll only returns the first match if the /g flag is missing
This says to me that if I take out the /g I will only get the first match, but I guess this sentence could also be read to mean
unless the /g flag is missing, matchAll will not return the first match
But that would be ridiculous, right? If I take out the /g then I get the first match, and only the first match. If I use /g I get matches 2-15. Why can't I get 1-15? I copied some of my code from a different app I made a few months ago that doesn't have this issue.
Working code:
var patt = /(?<invoiceNumber>INV \d+)\t(?<vendor>[\w  .,&-]+)\t(?<vendInvNum>[ ()\w\d\/.-]+)\t(20)?(?<yr>\d{2})[-\/](?<mn>\d{1,2})[-\/](?<dd>\d{1,2})\tInvoice\sUSD\s+(?<invAmt>[\d,.-]+)/g
for (let result of objInp.value.matchAll(patt)){
//loops thru iterable
}
Example of data pasted in, finds 3 for 3 matches:
INV 006015 VENDOR 1 1025702 26/08/2019 Invoice USD 580.69
INV 006019 VENDOR 2 STORE/090919 09/09/2019 Invoice USD 38.71
INV 006021 Vendor 3 170241569 10/09/2019 Invoice USD 1,080.64
Code that doesn't pull in first match:
var patt = /\s(?<actID>[\w\d-]+)\t(?<actDesc>[\w\d  .,&\(\)-]+)?\t(?<origDur>[\d]+)?\t(?<start>[-\d\w]+)?\t(?<end>[-\d\w]+)?/g
for (let result of objInp.value.matchAll(patt[x])){
//loops through but always misses the first match
}
Example of pasted data, finds 2 out of 3 matches:
Activity ID Activity Name Original Duration Start Finish Variance - BL1 Finish Date BL1 Finish Total Float
S600-20-21 Executive Steering Committee 5 06-Jan-20 13-Jan-20 0 13-Jan-20 0
S600-20-31 Steering Committee - Option Selection Meeting 2 13-Jan-20 15-Jan-20 0 15-Jan-20 0
S600-20-019b10 Resource Center of Excellence- Review 20 15-Jan-20 12-Feb-20 0 12-Feb-20 0
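One possible cause, which the post itself does not confirm: the second pattern begins with \s, so every match must be preceded by a whitespace character. A record at the very start of the pasted text has nothing before it and therefore cannot match, while each later record is preceded by a newline. The sketch below uses a simplified, hypothetical pattern and made-up two-column data (not the original ones) to show the effect, and one way around it: anchoring with ^ and the m flag instead of a leading \s.

```javascript
// Simplified stand-in for the pasted data: ID, tab, duration (hypothetical values).
const pasted = 'A-1\t5\nA-2\t7\nA-3\t9';

// A leading \s demands a whitespace character before each ID,
// so the record at the start of the string cannot match.
const withLeadingSpace = /\s(?<id>[\w-]+)\t(?<dur>\d+)/g;

// ^ with the m flag matches at the start of every line, including the first.
const lineAnchored = /^(?<id>[\w-]+)\t(?<dur>\d+)/gm;

const idsA = [...pasted.matchAll(withLeadingSpace)].map(m => m.groups.id);
const idsB = [...pasted.matchAll(lineAnchored)].map(m => m.groups.id);

console.log(idsA); // [ 'A-2', 'A-3' ]
console.log(idsB); // [ 'A-1', 'A-2', 'A-3' ]
```

Both patterns keep the g flag. In current engines, matchAll throws a TypeError when it is given a regular expression without g.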

Related

Regex javascript tuning

First steps in regex and been trying to get some values out of emails that are being fetched. So far I've achieved some (even if it's not with the best approach) but in need of some more values and can't figure out how to get them.
This is the email template that is more or less always the same:
London, 06/20/09
Mr. Tom Waits
Process ref.: CR // 1943061
Your reference: 338256
Clients' names: Mary Lamb, John Snow
We return to your contact regarding the complaint on behalf of the clients mentioned above.
We inform you that the refund process has already started, so you should receive the respective amount (375EUR) within 4/6 weeks.
Payment ref.: 2500062960.
Our compliments,(...)
WHAT I NEED:
Date after "London,"
Process ref. (2 letters + // + digits)
Your ref. number
Clients' names
Amount
Payment reference number
Note: the amount does not always come between "( )". Sometimes it's preceded by "amount", other times by "amount of", and sometimes "EUR" is separated by a space, but the needed value is always the first combination of digits in the paragraph.
WHAT I HAVE SO FAR:
(?:London,)(.*)\s+(?:Mr. Tom Waits)\s+(?:Process ref.:)\s+(....\d+)\s+(?:Your reference:)\s+(\d+)\s+(?:Clients' names:)\s+(.*)\s+
WHAT IT RETRIEVES:
Date after "London,"
Process ref. (2 letters + // + digits)
Your reference number
Clients' names
WHAT'S MISSING:
Amount
Payment reference number
Other Concerns:
I tried to exclude the "Mr." paragraph, but I think there may be a problem when they write it a little differently, for example with an extra space.
The same problem may arise if they write the items a little differently, for example "Process reference" instead of "Process ref."
Thanks in advance.
This one's gonna be one hefty regex. I do hope you don't try this regex on a massive file but rather one singular entry (like the one you've shown in your post).
Anyway, here's the regex-
London, ([\d\/]+)[\n\w\W]*?Process ref(?:\.|erence): ([A-Z \/\d]+)[\n\w\W]*?Your ref(?:\.|erence): (\d+)[\n\w\W]*?Client(?:s'|'s) name(?:s)?: ([\w ,]+)[\n\w\W]*?amount.*?(\d+)[\n\w\W]*?Payment ref(?:\.|erence): (\d+)
This should be quite permissive: it doesn't depend on many things that seem to vary, apart from "London,". That one is hard-coded, but I'm assuming it's always London.
Now, let's walk through this-
London, ([\d\/]+) - This basically matches London, DATE - where date is, well, a date, where each element of the date is separated by a /.
In this case, it matches 06/20/09 from London, 06/20/09
[\n\w\W]*? - Try to keep up with this one, I'm using it A LOT.
Since \w and \W together cover every character, this class matches anything, newlines included, and the *? makes it non-greedy. It is used to skip over anything and everything until we reach the desired spot.
Process ref(?:\.|erence): ([A-Z \/\d]+) - Captures the process reference, which can consist of capital letters (an assumption on my part, you can change that), spaces, slashes (/), and digits
Works with both ref. and reference
In this case, it matches CR // 1943061, from Process ref.: CR // 1943061
[\n\w\W]*? - Ignore everything up until the next token
Your ref(?:\.|erence): (\d+) - Captures "your reference", which can consist of digits
Works with both ref. and reference
In this case, it matches 338256
[\n\w\W]*? - Ignore everything up until the next token
Client(?:s'|'s) name(?:s)?: ([\w ,]+) - Captures the client name(s) - modified so it supports single client names too. (check the demo). The name list can consist of word characters, spaces and a comma.
In this case, it captures Mary Lamb, John Snow, from Clients' names: Mary Lamb, John Snow
(\d+) - Capture the digits - this is a big assumption, I'm assuming that the only digits that appear after client name list, are the ones for the amount. If they aren't
[\n\w\W]*? - Ignore everything up until the next token
amount.*?(\d+) - Captures the first group of digits that appear after amount. This is a bit of an assumption, I'm assuming that amount word is actually present in that paragraph.
In this case, it captures 375, from amount (375EUR)
[\n\w\W]*? - Ignore everything up until the next token
Payment ref(?:\.|erence): (\d+) - Capture the Payment reference number, which can consist of digits
Works with both ref. and reference
In this case, it captures 2500062960, from Payment ref.: 2500062960.
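As a quick check of the walkthrough above, here is a minimal sketch that runs the regex against the sample email from the question. The property names in `fields` are labels chosen for this illustration, not something the original post defines.

```javascript
// Sample email copied from the question, one array entry per line.
const email = [
  'London, 06/20/09',
  'Mr. Tom Waits',
  'Process ref.: CR // 1943061',
  'Your reference: 338256',
  "Clients' names: Mary Lamb, John Snow",
  'We return to your contact regarding the complaint on behalf of the clients mentioned above.',
  'We inform you that the refund process has already started, so you should receive the respective amount (375EUR) within 4/6 weeks.',
  'Payment ref.: 2500062960.',
  'Our compliments,(...)',
].join('\n');

// The regex from the answer, written as a literal.
const pattern = /London, ([\d\/]+)[\n\w\W]*?Process ref(?:\.|erence): ([A-Z \/\d]+)[\n\w\W]*?Your ref(?:\.|erence): (\d+)[\n\w\W]*?Client(?:s'|'s) name(?:s)?: ([\w ,]+)[\n\w\W]*?amount.*?(\d+)[\n\w\W]*?Payment ref(?:\.|erence): (\d+)/;

const m = email.match(pattern);

// Map the six capture groups to illustrative names; null if the email does not match.
const fields = m && {
  date: m[1],
  processRef: m[2],
  yourRef: m[3],
  clients: m[4],
  amount: m[5],
  paymentRef: m[6],
};

console.log(fields);
```

Checking `m` for null before reading the groups matters here, because an email that deviates from the template makes the whole pattern fail rather than just one field.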

How to remove all of string up to and including hyphen

I am using javascript in a Mirth transformer. I apologize for my ignorance but I have no javascript training and have a hard time successfully utilizing info from similar threads here or elsewhere. I am trying to trim a string from 'Room-Bed' to be just 'Bed'. The "Room" and "Bed" values will fluctuate. All associated data is coming in from an ADT interface where our client is sending both the room and bed values, separated by a hyphen, in the bed field creating unnecessary redundancy. Please help me with the code needed to produce the result of 'Bed' from the received 'Room-Bed'.
There are many ways to reduce the string you have to the string you want. Which you choose will depend on your inputs and the output you want. For your simple example, all will work. But if you have strings come in with multiple hyphens, they'll render different results. They'll also have different performance characteristics. Balance the performance of it with how often it will be called, and whichever you find to be most readable.
// Turns the string into an array, then pops the last element off: 'Bed'!
'Room-Bed'.split('-').pop() === 'Bed'
// Looks for the given pattern,
// replacing the match with everything after the hyphen.
'Room-Bed'.replace(/.+-(.+)/, '$1') === 'Bed'
// Finds the first index of -,
// and creates a substring based on it.
'Room-Bed'.substr('Room-Bed'.indexOf('-') + 1) === 'Bed'
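To make the point about multiple hyphens concrete, here is a small sketch using a hypothetical value with two hyphens. It uses slice in place of substr, which is a legacy method; for this call the result is the same.

```javascript
// Hypothetical input with two hyphens (not from the original question).
const input = 'Wing-Room-Bed';

// Keeps only the text after the LAST hyphen.
const lastSegment = input.split('-').pop();

// The greedy .+ also runs up to the LAST hyphen, so this keeps the same text.
const viaReplace = input.replace(/.+-(.+)/, '$1');

// Keeps everything after the FIRST hyphen.
const afterFirst = input.slice(input.indexOf('-') + 1);

console.log(lastSegment); // Bed
console.log(viaReplace);  // Bed
console.log(afterFirst);  // Room-Bed
```

When the input contains no hyphen at all, all three expressions return the input unchanged.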

Javascript Substring comparison; am I crazy?

From my understanding, JavaScript's substring method takes two parameters: the first is the index to start from, and the second is the end point (counted as a number of characters, i.e. 1-based rather than 0-based), and the character at the end point is not included. Following this logic, I ran two examples:
The first example below worked as expected. I counted 13 chars because the 13th char was the end point and was not going to be included.
// goal was to print "Melbourne is"
alert("Melbourne is great".substring(0,13));
This example however failed. In this example I also stopped my count at the end point because I expected it not to be counted.
//goal was to print "Jan"
alert("January".substring(0,4));
Where is my understanding flawed?
JavaScript's substring method takes two parameters: the first is the index to start from, and the second is the end point (counted as a number of characters, i.e. 1-based rather than 0-based), and the character at the end point is not included
You are incorrect in this understanding. You are describing the functionality of .substr(). .substring() uses two indexes.
The reason why you are going crazy is because the function
alert("Melbourne is great".substring(0,13));
//prints Melbourne_is_ not Melbourne_is
Notice the space after
Don't go crazy! :)
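A short sketch of the difference, using the strings from the question. substring(start, end) takes two 0-based indexes and excludes the character at end, while substr(start, length) takes a start index and a character count. The two only disagree when start is not 0.

```javascript
const s = 'Melbourne is great';

// JSON.stringify adds quotes so the trailing space is visible.
console.log(JSON.stringify(s.substring(0, 13))); // "Melbourne is "
console.log(JSON.stringify(s.substring(0, 12))); // "Melbourne is"

console.log('January'.substring(0, 4)); // Janu
console.log('January'.substring(0, 3)); // Jan

// With a non-zero start the two methods diverge.
console.log('January'.substring(2, 4)); // nu   (indexes 2 and 3)
console.log('January'.substr(2, 4));    // nuar (4 characters starting at index 2)
```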

It seems that JavaScript RegExp isn't finding "leftmost longest"

I observe these results:
// Test 1:
var re = /a|ab/;
"ab".match(re); // returns ["a"] <--- Unexpected
// Test 2:
re = /ab|a/;
"ab".match(re); // returns ["ab"]
I would expect tests 1 and 2 to both return ["ab"], due to the principle of "leftmost longest". I don't understand why the order of the 2 alternatives in the regex should change the results.
Find the reason below:
Note that alternatives are considered left to right until a match is
found. If the left alternative matches, the right alternative is
ignored, even if it would have produced a “better” match. Thus, when
the pattern /a|ab/ is applied to the string “ab,” it matches only the
first letter.
(source: Oreilly - Javascript Pocket Reference - Chapter 9 Regular Expressions)
Thanks.
This is because JavaScript doesn't implement the POSIX engine.
POSIX NFA Engines work similarly to Traditional NFAs with one
exception: a POSIX engine always picks the longest of the leftmost
matches. For example, the alternation cat|category would match the full word "category" whenever possible, even if the first alternative ("cat") matched and appeared earlier in the alternation. (SEE MRE 153-154)
Source: Oreilly - Javascript Pocket Reference, p.4
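The ordered-alternation behaviour described above can be reproduced directly. The sketch also shows one common workaround for literal alternatives: sort them longest first before building the pattern. The helper name `longestFirst` is made up for this example, and it only approximates leftmost-longest for plain strings, not for arbitrary sub-patterns.

```javascript
// The first alternative that matches wins, regardless of length.
console.log('ab'.match(/a|ab/)[0]); // a
console.log('ab'.match(/ab|a/)[0]); // ab

// Hypothetical helper: build an alternation of literal strings, longest first.
function longestFirst(alternatives) {
  const escaped = [...alternatives]
    .sort((x, y) => y.length - x.length)                  // longest literal first
    .map(s => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));  // escape regex metacharacters
  return new RegExp(escaped.join('|'));
}

console.log('category'.match(/cat|category/)[0]);                       // cat
console.log('category'.match(longestFirst(['cat', 'category']))[0]);    // category
```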

basic search ranking with regex in javascript

Currently I am using the below for search.
I assume each and every term the user types must appear at least once in the article.
I use the match method with regex
^(?=.*one)(?=.*two)(?=.*three).*$
with g, i, and m
At the moment I use matches.length to count the number of matches, but the behavior is not as expected.
example:
"one two three. one two three"
would give me 2 matches, but it should really be 6.
If I do something like
(one|two|three)
then I do get 6 matches, but if I have the data:
"one two. one two"
I get 4 matches, when in reality I want it to be 0, since not every word appears at least once.
I could do the first regex to check if there's at least one "match". If there is, I would subsequently use the second regex to count the real number of matches, but this would make my program run much slower than it already is. Doing this regex against 2500 json articles takes anywhere from 60 to 120 seconds as it is.
Any ideas on how to make this faster or better? Change the regex? Use search or indexOf instead of matches?
note:
I'm using Lawnchair DB for local persistence and jQuery. I package the code for PhoneGap and as a Chrome packaged app.
var input = '...';
var match = [];
if (input.match(/^(?=.*\bone\b)(?=.*\btwo\b)(?=.*\bthree\b)/i)) {
match = input.match(/\b(one|two|three)\b/ig);
}
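The same two-step idea can be wrapped in a function that takes the search terms as an array. This is a sketch under a few assumptions: the names `countMatches` and `escapeRegExp` are invented here, terms are matched as whole words and case-insensitively as in the code above, and the "every term present" check uses one small regex per term rather than a single lookahead pattern.

```javascript
// Hypothetical helper: escape regex metacharacters in a user-typed term.
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Returns 0 unless every term occurs at least once;
// otherwise returns the total number of occurrences of all terms.
function countMatches(text, terms) {
  const escaped = terms.map(escapeRegExp);

  // Step 1: cheap presence check, one whole-word test per term.
  const allPresent = escaped.every(t => new RegExp('\\b' + t + '\\b', 'i').test(text));
  if (!allPresent) return 0;

  // Step 2: count every occurrence of any term.
  const anyTerm = new RegExp('\\b(?:' + escaped.join('|') + ')\\b', 'gi');
  return (text.match(anyTerm) || []).length;
}

console.log(countMatches('one two three. one two three', ['one', 'two', 'three'])); // 6
console.log(countMatches('one two. one two', ['one', 'two', 'three']));             // 0
```

The count is only computed for articles that pass the presence check, so articles missing a term cost one short scan per term at most.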
