I have two arrays that contain names of Premier League players.
I want to match them by name since the player objects don't have unique ids.
How can I make a string comparison that will match Zlatan Ibrahimovic with Zlatan Ibrahimović for example? (notice the last character of both strings)
It is not a trivial problem.
You should look into the Levenshtein distance:
https://en.wikipedia.org/wiki/Levenshtein_distance
You can search in google for different implementations or use a library like:
https://www.npmjs.com/package/levenshtein
Example:
var l = new Levenshtein('Zlatan Ibrahimovic', 'Zlatan Ibrahimović');
// l.distance === 1
I have used it already and liked it. In my code, I used it for an experimental purpose.
I don't care about the raw result, because for a long string a distance of 4 can be very good, while for a short one a distance of 2 is very bad.
I do something like l / Math.max(str1.length, str2.length); that gives you a relative number, and you can decide which threshold is interesting for you.
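To make the normalisation concrete, here is a minimal sketch that does not depend on the library. The function names and the 0.2 threshold are my own assumptions for illustration; pick a threshold that suits your data.

```javascript
// Plain dynamic-programming Levenshtein distance (two-row variant).
function levenshtein(a, b) {
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Distance relative to the longer string: 0 means identical, 1 means nothing in common.
function normalizedDistance(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 0 : levenshtein(a, b) / longest;
}

// Hypothetical threshold, chosen only for this example.
function isSamePlayer(a, b, threshold = 0.2) {
  return normalizedDistance(a, b) <= threshold;
}

// 'Zlatan Ibrahimovi\u0107' is 'Zlatan Ibrahimović' with a precomposed ć.
isSamePlayer('Zlatan Ibrahimovic', 'Zlatan Ibrahimovi\u0107'); // true (1 / 18)
```

Note that a relative threshold is still a heuristic: two different short names can end up closer than two spellings of the same long name.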
I'm trying to find a fast (milliseconds or seconds) solution for taking an inputted block of text and testing it against a large list (11 million) of specific words/phrases. In other words, I would like to see which of the words/phrases appear in the inputted paragraph.
We use Javascript and have SQL, MongoDB & DynamoDB as existing data stores that we can integrate this solution into.
I've searched for this problem but can only find ways to check whether given words exist in a text, not the other way around.
All ideas are welcome!
In cases like these you want to eliminate as much unnecessary data as possible. Assuming that order matters:
1. First things first, make sure you have a B-tree index built on your phrases database, clustered on the phrase. This will speed up range lookup times.
2. Let n = 2 (or 1, if you're into that).
3. Split the text block into phrases of length n and perform a query for phrases in the dictionary that begin with any of the phrase pairs ('My Phrase%'). This won't perform 4521 million string comparisons thanks to the index.
4. Remember the phrases that are an exact match.
5. Let n = n + 1.
6. Repeat from step 3 using the reduced dictionary, until the reduced dictionary is empty.
You can also make small optimizations here and there depending on the kind of matches you're looking for, such as, not matching across punctuation, only phrases of a certain word length, etc. In any case, I'd expect the time bottleneck here to be on disk access, rather than actual comparisons.
Also, I'm pretty sure I based this algorithm off of an existing one but I don't remember its name so bonus points to anyone who can name it. I think it had something to do with data warehousing/mining and calculating frequencies and patterns?
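For illustration only, here is a simplified in-memory variant of the idea. It is not the index-backed procedure described above: instead of prefix queries against a database, it generates every word n-gram of the text up to a maximum length and looks each one up in a Set that stands in for the phrase table. The function name, the maxWords parameter and the tokenising regex are assumptions of mine.

```javascript
// text: the inputted paragraph
// dictionary: a Set of lower-cased phrases, words separated by single spaces
// maxWords: the longest phrase length (in words) worth checking
function findPhrases(text, dictionary, maxWords = 3) {
  // Split into words, dropping punctuation; letters, digits and apostrophes are kept.
  const words = text.toLowerCase().match(/[\p{L}\p{N}']+/gu) || [];
  const found = new Set();
  for (let start = 0; start < words.length; start++) {
    for (let n = 1; n <= maxWords && start + n <= words.length; n++) {
      const candidate = words.slice(start, start + n).join(' ');
      if (dictionary.has(candidate)) found.add(candidate);
    }
  }
  return [...found];
}

const phrases = new Set(['premier league', 'league', 'zlatan']);
findPhrases('The Premier League is fun.', phrases); // ['premier league', 'league']
```

The number of lookups depends on the length of the text and maxWords, not on the 11 million entries. Whether a Set of that size fits comfortably in memory is something you would have to measure; against a database, each lookup would become an indexed equality or prefix query instead.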
I am using javascript in a Mirth transformer. I apologize for my ignorance but I have no javascript training and have a hard time successfully utilizing info from similar threads here or elsewhere. I am trying to trim a string from 'Room-Bed' to be just 'Bed'. The "Room" and "Bed" values will fluctuate. All associated data is coming in from an ADT interface where our client is sending both the room and bed values, separated by a hyphen, in the bed field creating unnecessary redundancy. Please help me with the code needed to produce the result of 'Bed' from the received 'Room-Bed'.
There are many ways to reduce the string you have to the string you want. Which you choose will depend on your inputs and the output you want. For your simple example, all will work. But if you have strings come in with multiple hyphens, they'll render different results. They'll also have different performance characteristics. Balance the performance of it with how often it will be called, and whichever you find to be most readable.
// Turns the string into an array, then pops the last element off: 'Bed'!
'Room-Bed'.split('-').pop() === 'Bed'
// Looks for the given pattern,
// replacing the match with everything after the hyphen.
'Room-Bed'.replace(/.+-(.+)/, '$1') === 'Bed'
// Finds the first index of -,
// and creates a substring based on it.
'Room-Bed'.substr('Room-Bed'.indexOf('-') + 1) === 'Bed'
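To show how the three approaches diverge when there is more than one hyphen, here is a small sketch. The wrapper names and the sample value are made up for the example, and I use substring in place of substr (for this call they behave the same). It sticks to var and plain functions in case the Mirth JavaScript engine is an older one.

```javascript
// Keeps only what follows the LAST hyphen.
function lastSegment(value) {
  return value.split('-').pop();
}

// Also keeps what follows the LAST hyphen, because the leading .+ is greedy.
function afterLastHyphenRegex(value) {
  return value.replace(/.+-(.+)/, '$1');
}

// Keeps everything after the FIRST hyphen.
function afterFirstHyphen(value) {
  return value.substring(value.indexOf('-') + 1);
}

var sample = '12-A-3';
lastSegment(sample);          // '3'
afterLastHyphenRegex(sample); // '3'
afterFirstHyphen(sample);     // 'A-3'
```

If a value arrives with no hyphen at all, all three return it unchanged.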
Background - web app back end javascript/dojo code.
I need to match a user input string to a list of possible vehicle models and I am having challenges with incorrect matches.
Say a user enters:
Ford Fusion, S 60, and Volks Wagen
Currently, I would read that in as
FORDFUSIONS60VOLKSWAGEN
and in that, I would match against a list of makes and models.
Problem is, in this case and in many others, you get things like "S6" (Audi) and "S60" (Volvo), or "Accord" (Honda) and "CC" (Volkswagen).
Any idea how it would be possible (if at all) to avoid these ambiguous matches?
Since this question is tagged regex, I think you are looking for the word boundary metacharacter:
/\bS6\b/
will match "S6" and "… S6 …", but not "S60", just as
/\bCC\b/i
will match "CC" and "cc", but not "Accord".
To avoid at least these two examples, first match against the longer names (e.g. "s60" before "s6" and "accord" before "cc"), and only if there is no match try the shorter one. Otherwise stop with the longer one.
Since you're looking for the longest matches, you could also check whether one of the resulting names is contained within another and skip the shorter one.
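Both suggestions (word boundaries and longest names first) can be combined. This is a rough sketch that assumes the input is matched as typed rather than collapsed into one string; findModels and escapeRegExp are names I made up, and the model list is only sample data.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function findModels(input, models) {
  // Longest names first, so that within the alternation
  // "Fusion Hybrid" is tried before "Fusion".
  const sorted = [...models].sort((a, b) => b.length - a.length);
  const pattern = new RegExp(
    '\\b(?:' + sorted.map(escapeRegExp).join('|') + ')\\b',
    'gi'
  );
  return input.match(pattern) || [];
}

findModels('I drive an S60 and an Accord', ['S6', 'S60', 'CC', 'Accord']);
// ['S60', 'Accord']
```

This does not handle input such as "S 60" or "Volks Wagen", where the user inserts spaces inside a name. For those you would need an extra normalisation step, which is a separate problem.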
This is how I would go about it:
Run checks with the name, model and company and if they trace back to the same reference, then you know you have what you want. However, if you get different results keep trying combinations of all search results until they match up to a single reference.
For example:
model traces back to honda and ford,
number traces back to ford and bentley,
and
company gives ford
then you can try combinations of list_1, list_2, and list_3 where:
list_1 = ['honda','ford']
list_2 = ['ford','bentley']
list_3 = ['ford']
Then when you try all combinations (I recommend itertools.combinations) you will end up with one valid result that is common in all lists: ford
I hope that is clear. I know I'm blabbing a bit.
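Note that itertools.combinations is a Python helper. Since the question is about JavaScript, here is a sketch of a simpler route to the same result: rather than enumerating combinations, intersect the candidate lists and keep what is common to all of them. The function name intersectAll is my own.

```javascript
// Returns the items that appear in every list, in the order of the first list.
function intersectAll(lists) {
  if (lists.length === 0) return [];
  return lists.reduce((common, list) => {
    const lookup = new Set(list);
    return common.filter((item) => lookup.has(item));
  });
}

const list_1 = ['honda', 'ford'];
const list_2 = ['ford', 'bentley'];
const list_3 = ['ford'];
intersectAll([list_1, list_2, list_3]); // ['ford']
```

An empty result means the three lookups disagree, and more than one item means the input is still ambiguous.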
I searched on Google for phone number regex validations but haven't been able to make it work based on my requirements.
Basically, I have three separate sets of rules for the prefix:
For 10 digit numbers I need to make sure the first 3 are numbers starting from 2-9.
For 11 digit numbers I need to make sure the first 4 are numbers starting from 1-9.
For anything greater than 12 digits I need to make sure the first 7 are numbers from 0-9.
After that I can allow letters like 1888GOSUPER or something like that (this would fall under the second condition)
This is what I have so far but I am not certain if I have covered everything:
var reg10 = /^[2-9]{3}[a-z0-9]+$/i;
var reg11 = /^[1-9]{4}[a-z0-9]+$/i;
var reg12plus = /^[0-9]{7}[a-z0-9]+$/i;
This can be handled by one regex (including your check for length, as suggested by others). Probably can be done more succinctly than this, but I feel this is more readable in the context of your 3 specifically separate prefix requirements:
^(?:[2-9]{3}[a-z0-9]{7})$|^(?:[1-9]{4}[a-z0-9]{7})$|^(?:[0-9]{7}[a-z0-9]{5,})$
This basically combines your three separate cases via "alternation" (|).
This can be "normalised" slightly, without losing clarity of intent, by grouping the entire expression and surrounding it with a single pair of start/end anchors rather than repeating them in each option as above. The rule ends up about the same length overall once the extra non-capturing group is added:
^(?:(?:[2-9]{3}[a-z0-9]{7})|(?:[1-9]{4}[a-z0-9]{7})|(?:[0-9]{7}[a-z0-9]{5,}))$
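As a usage sketch in JavaScript, the grouped pattern can be wrapped in a small helper. The i flag is carried over from the regexes in the question so that letters match in either case; the names phonePattern and isValidPhone are mine.

```javascript
// The grouped pattern from above, with the case-insensitive flag from the question.
const phonePattern =
  /^(?:(?:[2-9]{3}[a-z0-9]{7})|(?:[1-9]{4}[a-z0-9]{7})|(?:[0-9]{7}[a-z0-9]{5,}))$/i;

function isValidPhone(value) {
  return phonePattern.test(value);
}

isValidPhone('2345551234');   // true  (10 characters, first 3 are 2-9)
isValidPhone('1888GOSUPER');  // true  (11 characters, first 4 are 1-9)
isValidPhone('123456789012'); // true  (12 characters, first 7 are digits)
isValidPhone('0123456789');   // false (10 characters, but starts with 0)
```

Keep in mind that [2-9]{3} requires each of the first three characters to be 2-9, so a number like 2125551234 is rejected. If only the very first digit should be restricted, the prefix would need to change.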
Currently I am using the below for search.
I assume each and every term the user types must appear at least once in the article.
I use the match method with regex
^(?=.*one)(?=.*two)(?=.*three).*$
with g, i, and m
At the moment I use matches.length to count the number of matches, but the behavior is not as expected.
example:
"one two three. one two three"
would give me 2 matches, but it should really be 6.
If I do something like
(one|two|three)
then I do get 6 matches, but if I have the data:
"one two. one two"
I get 4 matches, when in reality I want it to be 0, since not every word appears at least once.
I could do the first regex to check if there's at least one "match". If there is, I would subsequently use the second regex to count the real number of matches, but this would make my program run much slower than it already is. Doing this regex against 2500 json articles takes anywhere from 60 to 120 seconds as it is.
Any ideas on how to make this faster or better? Change the regex? Use search or indexOf instead of matches?
note:
I'm using Lawnchair db for local persistence and jQuery. I package the code for PhoneGap and as a Chrome packaged app.
var input = '...';
var match = [];
if (input.match(/^(?=.*\bone\b)(?=.*\btwo\b)(?=.*\bthree\b)/i)) {
match = input.match(/\b(one|two|three)\b/ig);
}
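The same two-step idea can be generalised to whatever terms the user types. This is only a sketch: countTermMatches and escapeRegExp are names I made up. It tests each term with its own small regex instead of chained lookaheads, which also means newlines in the article do not get in the way (the . in .* does not match a line break).

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Returns 0 unless every term occurs at least once;
// otherwise returns the total number of occurrences of all terms.
function countTermMatches(text, terms) {
  const perTerm = terms.map(
    (term) => new RegExp('\\b' + escapeRegExp(term) + '\\b', 'i')
  );
  // Step 1: bail out as soon as one term is missing.
  if (!perTerm.every((pattern) => pattern.test(text))) return 0;
  // Step 2: count all occurrences with a single alternation.
  const all = new RegExp(
    '\\b(?:' + terms.map(escapeRegExp).join('|') + ')\\b',
    'gi'
  );
  return (text.match(all) || []).length;
}

countTermMatches('one two three. one two three', ['one', 'two', 'three']); // 6
countTermMatches('one two. one two', ['one', 'two', 'three']);             // 0
```

Whether this is fast enough for 2500 articles is something to measure. Since the per-term patterns do not depend on the article, building them once outside the loop over articles would avoid recompiling them each time.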