basic search ranking with regex in javascript - javascript

Currently I am using the below for search.
I assume each and every term the user types must appear at least once in the article.
I use the match method with regex
^(?=.*one)(?=.*two)(?=.*three).*$
with g, i, and m
At the moment I use matches.length to count the number of matches, but the behavior is not as expected.
example:
"one two three. one two three"
would give me 2 matches, but it should really be 6.
If I do something like
(one|two|three)
then I do get 6 matches, but if I have the data:
"one two. one two"
I get 4 matches, when in reality I want it to be 0, since not every word appears at least once.
I could do the first regex to check if there's at least one "match". If there is, I would subsequently use the second regex to count the real number of matches, but this would make my program run much slower than it already is. Doing this regex against 2500 json articles takes anywhere from 60 to 120 seconds as it is.
Any ideas on how to make this faster or better? Change the regex? Use search or indexOf instead of matches?
note:
I'm using lawnchair db for local persistence and jquery. I package the code for phonegap and as a chrome packaged app.

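var input = '...';
var match = [];
// First check that every term occurs at least once (one lookahead per term).
if (/^(?=.*\bone\b)(?=.*\btwo\b)(?=.*\bthree\b)/i.test(input)) {
  // Only then collect every occurrence; match.length is the ranking score.
  match = input.match(/\b(one|two|three)\b/ig);
}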
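The two-step check above can also be folded into a single pass over the article. The following is only a sketch under assumptions: `scoreArticle` and `escapeRegExp` are hypothetical names, terms are assumed to be plain words (so `\b` boundaries behave sensibly), and matching is case-insensitive. It counts hits per term and returns 0 as soon as any term turns out to be missing.

```javascript
// Hypothetical helper: escape regex metacharacters in a user-typed term.
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical scorer: total number of term hits, or 0 if any term is absent.
function scoreArticle(text, terms) {
  if (terms.length === 0) return 0;
  const counts = new Map(terms.map(t => [t.toLowerCase(), 0]));
  const pattern = new RegExp(
    '\\b(' + terms.map(escapeRegExp).join('|') + ')\\b', 'gi');
  let m;
  while ((m = pattern.exec(text)) !== null) {
    const key = m[1].toLowerCase();
    counts.set(key, counts.get(key) + 1);
  }
  let total = 0;
  for (const n of counts.values()) {
    if (n === 0) return 0; // a required term never appeared
    total += n;
  }
  return total;
}
```

With the data from the question, `scoreArticle('one two three. one two three', ['one', 'two', 'three'])` gives 6 and `scoreArticle('one two. one two', ['one', 'two', 'three'])` gives 0. Whether this is actually faster than two regex calls on 2500 articles would need measuring.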

Related

Javascript - get first RegExp match with matchAll()

I'm not sure what I'm doing wrong, and I'm happy to admit that JavaScript isn't my strongest language. I test my regexes in a little .NET tester I wrote years ago, and I see no problems there. I know different languages implement regex a little differently, but I don't think that's the issue here.
My app has a textarea where I can paste in data from an industry-specific spreadsheet, and I use regexp matchAll() to parse it. I am looping through the matchAll-returned iterable with a for/of loop, pretty basic stuff, and noticed that I can't seem to get the first match. So if my spreadsheet has 15 lines of data, my JavaScript parsing handles lines 2-15, ignoring line 1. If I copy any line in the block and paste it at the start, then the new line 1 is ignored and the old line 1, which is now line 2, gets parsed; the first line is always ignored. So the issue is apparently not the RegExp pattern. I googled and found this passage from developer.mozilla.org:
matchAll only returns the first match if the /g flag is missing
This says to me that if I take out the /g I will only get the first match, but I guess this sentence could also be read to mean
unless the /g flag is missing, matchAll will not return the first match
but that would be ridiculous, right? If I take out the /g then I get the first match, and only the first match. If I use /g I get matches 2-15. Why can't I get 1-15? I copied some of my code from a different app I made a few months ago that doesn't have this issue.
working code:
var patt = /(?<invoiceNumber>INV \d+)\t(?<vendor>[\w  .,&-]+)\t(?<vendInvNum>[ ()\w\d\/.-]+)\t(20)?(?<yr>\d{2})[-\/](?<mn>\d{1,2})[-\/](?<dd>\d{1,2})\tInvoice\sUSD\s+(?<invAmt>[\d,.-]+)/g
for (let result of objInp.value.matchAll(patt)){
//loops thru iterable
}
example of data pasted in, finds 3 for 3 matches:
INV 006015 VENDOR 1 1025702 26/08/2019 Invoice USD 580.69
INV 006019 VENDOR 2 STORE/090919 09/09/2019 Invoice USD 38.71
INV 006021 Vendor 3 170241569 10/09/2019 Invoice USD 1,080.64
Code that doesn't pull in first match:
var patt = /\s(?<actID>[\w\d-]+)\t(?<actDesc>[\w\d  .,&\(\)-]+)?\t(?<origDur>[\d]+)?\t(?<start>[-\d\w]+)?\t(?<end>[-\d\w]+)?/g
for (let result of objInp.value.matchAll(patt[x])){
//loops through but always misses the first match
}
example of pasted data, finds 2 out of 3 matches:
Activity ID Activity Name Original Duration Start Finish Variance - BL1 Finish Date BL1 Finish Total Float
S600-20-21 Executive Steering Committee 5 06-Jan-20 13-Jan-20 0 13-Jan-20 0
S600-20-31 Steering Committee - Option Selection Meeting 2 13-Jan-20 15-Jan-20 0 15-Jan-20 0
S600-20-019b10 Resource Center of Excellence- Review 20 15-Jan-20 12-Feb-20 0 12-Feb-20 0

How to find key words in paragraph of text?

I'm trying to find a fast (milliseconds to seconds) way to take an inputted block of text and test it against a large list (11 million) of specific words/phrases. In other words, I would like to see which of those words/phrases occur in the inputted paragraph.
We use Javascript and have SQL, MongoDB & DynamoDB as existing data stores that we can integrate this solution into.
I've searched for this problem but can only find solutions that check whether given words exist in a text, not the other way around.
All ideas are welcome!
In cases like these you want to eliminate as much unnecessary data as possible. Assuming that order matters:
1. First things first, make sure you have a B tree index built on your phrases database clustered on the phrase. This will speed up range lookup times.
2. Let n = 2 (or 1, if you're into that).
3. Split the text block into phrases of length n and perform a query for phrases in the dictionary that begin with any of the phrase pairs ('My Phrase%'). This won't perform 4521 million string comparisons thanks to the index.
4. Remember the phrases that are an exact match.
5. Let n = n + 1.
6. Repeat from step 3 using the reduced dictionary, until the reduced dictionary is empty.
You can also make small optimizations here and there depending on the kind of matches you're looking for, such as, not matching across punctuation, only phrases of a certain word length, etc. In any case, I'd expect the time bottleneck here to be on disk access, rather than actual comparisons.
Also, I'm pretty sure I based this algorithm off of an existing one but I don't remember its name so bonus points to anyone who can name it. I think it had something to do with data warehousing/mining and calculating frequencies and patterns?
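To make the narrowing loop concrete, here is a rough in-memory sketch. It is not the database version described above: a plain array stands in for the indexed phrase table, the prefix test is done per word rather than with a `LIKE 'My Phrase%'` query, and `findPhrases` is a hypothetical name. Dictionary phrases are assumed to be lowercase with single spaces between words.

```javascript
// Hypothetical in-memory stand-in for the indexed prefix query.
function findPhrases(text, dictionary) {
  // Tokenize the input into lowercase words, dropping punctuation.
  const words = text.toLowerCase().match(/[a-z0-9']+/g) || [];
  let candidates = dictionary.slice();
  const found = new Set();

  for (let n = 1; candidates.length > 0 && n <= words.length; n++) {
    // All n-word phrases that occur in the input text.
    const grams = new Set();
    for (let i = 0; i + n <= words.length; i++) {
      grams.add(words.slice(i, i + n).join(' '));
    }
    // Keep only dictionary phrases whose first n words occur in the text.
    const next = [];
    for (const phrase of candidates) {
      const phraseWords = phrase.split(' ');
      if (!grams.has(phraseWords.slice(0, n).join(' '))) continue;
      if (phraseWords.length === n) found.add(phrase); // exact match
      else next.push(phrase);                          // still a candidate
    }
    candidates = next; // the reduced dictionary for the next round
  }
  return Array.from(found);
}
```

For example, `findPhrases('The quick brown fox jumps', ['quick brown', 'brown fox jumps', 'lazy dog', 'fox'])` drops 'lazy dog' in the first round and ends up reporting the other three phrases. With 11 million phrases the filtering step would have to be the indexed query, not an array scan.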

How to remove all of string up to and including hyphen

I am using javascript in a Mirth transformer. I apologize for my ignorance but I have no javascript training and have a hard time successfully utilizing info from similar threads here or elsewhere. I am trying to trim a string from 'Room-Bed' to be just 'Bed'. The "Room" and "Bed" values will fluctuate. All associated data is coming in from an ADT interface where our client is sending both the room and bed values, separated by a hyphen, in the bed field creating unnecessary redundancy. Please help me with the code needed to produce the result of 'Bed' from the received 'Room-Bed'.
There are many ways to reduce the string you have to the string you want. Which you choose will depend on your inputs and the output you want. For your simple example, all will work. But if you have strings come in with multiple hyphens, they'll render different results. They'll also have different performance characteristics. Balance performance against how often the code will be called, and pick whichever you find most readable.
// Turns the string into an array, then pops the last element out: 'Bed'!
'Room-Bed'.split('-').pop() === 'Bed'
// Looks for the given pattern,
// replacing the match with everything after the last hyphen.
'Room-Bed'.replace(/.+-(.+)/, '$1') === 'Bed'
// Finds the first index of -,
// and creates a substring based on it
// (substring is used here because substr is deprecated).
'Room-Bed'.substring('Room-Bed'.indexOf('-') + 1) === 'Bed'
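To illustrate how these approaches diverge once a value contains more than one hyphen, here is a small sketch. The function names and the sample value '2-West-B' are made up for illustration; which behavior is right depends on what the ADT feed actually sends.

```javascript
// Hypothetical helpers; plain ES5 so they should also run in older engines.

// Keeps only what follows the LAST hyphen (same result as split/pop).
function afterLastHyphen(value) {
  return value.split('-').pop();
}

// Keeps everything that follows the FIRST hyphen.
function afterFirstHyphen(value) {
  return value.substring(value.indexOf('-') + 1);
}
```

For 'Room-Bed' both return 'Bed'. For '2-West-B', `afterLastHyphen` returns 'B' while `afterFirstHyphen` returns 'West-B'. A value with no hyphen at all comes back unchanged from both.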

How can I match strings precisely and avoid false matches?

Background - web app back end javascript/dojo code.
I need to match a user input string to a list of possible vehicle models and I am having challenges with incorrect matches.
Say a user enters:
Ford Fusion, S 60, and Volks Wagen
Currently, I would read that in as
FORDFUSIONS60VOLKSWAGEN
and in that, I would match against a list of makes and models.
Problem is, in this case and in many others, you get things like "S6" (Audi) and "S60" (Volvo), or "Accord" (Honda) and "CC" (Volkswagen).
Any idea how it would be possible (if at all) to avoid these ambiguous matches?
Since this question is tagged regex, I think you are looking for the word boundary metacharacter:
/\bS6\b/
will match "S6" and "… S6 …", but not "S60", just as
/\bCC\b/i
will match "CC" and "cc", but not "Accord".
To avoid at least these two cases, first match against the longer names (e.g. "s60" before "s6" and "accord" before "cc") and only fall back to the shorter name if there is no match. Otherwise, stop with the longer one.
Since you're looking for the longest matches, you could also check whether one of the resulting names is contained within another and skip the shorter one.
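Both suggestions (word boundaries and longest name first) can be combined into one pattern. This is a sketch under assumptions: `findModels` is a hypothetical name, the model list is made up, and the input is matched with its spaces and punctuation intact rather than collapsed into one string, since `\b` needs those separators.

```javascript
// Hypothetical helper: find known model names as whole words in user input.
function findModels(input, models) {
  const escaped = models
    .slice()
    .sort((a, b) => b.length - a.length) // try longer names first
    .map(m => m.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
  const pattern = new RegExp('\\b(?:' + escaped.join('|') + ')\\b', 'gi');
  return (input.match(pattern) || []).map(m => m.toUpperCase());
}
```

With the list `['S6', 'S60', 'CC', 'Accord', 'Fusion']`, the input 'Ford Fusion, S60 and Accord' yields FUSION, S60 and ACCORD, with no stray S6 or CC. An input written as "S 60" would not match, so spacing variants would still need separate handling.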
This is how I would go about it:
Run checks with the name, model and company and if they trace back to the same reference, then you know you have what you want. However, if you get different results keep trying combinations of all search results until they match up to a single reference.
For example:
model traces back to honda and ford,
number traces back to ford and bentley,
and
company gives ford
then you can try combinations of list_1, list_2, and list_3 where:
list_1 = ['honda','ford']
list_2 = ['ford','bentley']
list_3 = ['ford']
Then when you try all combinations (I recommend itertools.combinations) you will end up with one valid result that is common in all lists: ford
I hope that is clear. I know I'm blabbing a bit.
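One way to read the example above is as an intersection of the three lists. Since itertools is a Python module and the question is about JavaScript, here is a minimal JavaScript sketch of that reading; `commonToAll` is a hypothetical name, and this swaps the "try all combinations" step for a plain intersection.

```javascript
// Hypothetical helper: values that appear in every one of the given lists.
function commonToAll(lists) {
  if (lists.length === 0) return [];
  return lists.reduce((acc, list) => acc.filter(x => list.includes(x)));
}

const list_1 = ['honda', 'ford'];
const list_2 = ['ford', 'bentley'];
const list_3 = ['ford'];
const result = commonToAll([list_1, list_2, list_3]);
```

Here `result` holds the single entry 'ford'. An empty result would mean the three lookups do not agree on any make.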

Regular expression for Phone Numbers with different lengths

I searched on Google for phone number regex validations but haven't been able to make it work based on my requirements.
Basically, I have three separate sets of rules for the prefix:
For 10 digit numbers I need to make sure the first 3 are numbers starting from 2-9.
For 11 digit numbers I need to make sure the first 4 are numbers starting from 1-9.
For anything greater than 12 digits I need to make sure the first 7 are numbers from 0-9.
After that I can allow letters like 1888GOSUPER or something like that (this would fall under the second condition)
This is what I have so far but I am not certain if I have covered everything:
var reg10 = /^[2-9]{3}[a-z0-9]+$/i;
var reg11 = /^[1-9]{4}[a-z0-9]+$/i;
var reg12plus = /^[0-9]{7}[a-z0-9]+$/i;
This can be handled by one regex (including your check for length, as suggested by others). Probably can be done more succinctly than this, but I feel this is more readable in the context of your 3 specifically separate prefix requirements:
^(?:[2-9]{3}[a-z0-9]{7})$|^(?:[1-9]{4}[a-z0-9]{7})$|^(?:[0-9]{7}[a-z0-9]{5,})$
This basically combines your three separate cases via "alternation" (|).
This can be "normalised" slightly, without "breaking" the clarity of intent, by grouping the entire expression and then surrounding with start/end anchors (rather than repeating these in each option, as above). Although this results in a similar length rule overall, by the time we add our additional non-capturing group:
^(?:(?:[2-9]{3}[a-z0-9]{7})|(?:[1-9]{4}[a-z0-9]{7})|(?:[0-9]{7}[a-z0-9]{5,}))$
