How to find key words in paragraph of text? - javascript

I'm trying to find a fast (milliseconds to seconds) way to take an inputted block of text and test it against a large list (11 million entries) of specific words/phrases, so that I can see which of those words/phrases appear in the inputted paragraph.
We use JavaScript and have SQL, MongoDB & DynamoDB as existing data stores that we can integrate this solution into.
I've searched for this problem but can only find ways to check whether given words exist in a text, not the other way around.
All ideas are welcome!

In cases like these you want to eliminate as much unnecessary data as possible. Assuming that order matters:
1. First things first, make sure you have a B-tree index built on your phrases database, clustered on the phrase. This will speed up range lookups.
2. Let n = 2 (or 1, if you're into that).
3. Split the text block into phrases of n words and query the dictionary for phrases that begin with any of those n-word phrases ('My Phrase%'). Thanks to the index, this won't perform 11 million string comparisons.
4. Remember the phrases that are an exact match.
5. Let n = n + 1.
6. Repeat from step 3 using the reduced dictionary, until the reduced dictionary is empty.
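The steps above can be sketched in memory as follows. This is only an illustration under assumptions: the function name `findPhrases` is made up, phrases are assumed to be words separated by single spaces, the sketch starts at n = 1 so that single-word phrases are found, and a plain array filter stands in for the indexed prefix query, which a real setup with 11 million phrases would run against the database instead.

```javascript
// Hypothetical in-memory sketch of the n-gram narrowing loop.
// A real implementation would replace the filter below with an indexed
// prefix query (LIKE 'my phrase%') against the phrase table.
function findPhrases(text, phrases) {
  // Split the text into lowercase words (simple ASCII tokenizer, an assumption).
  const words = text.toLowerCase().match(/[a-z0-9']+/g) || [];
  // The "reduced dictionary": phrases that can still match, pre-split into words.
  let candidates = phrases.map(p => p.toLowerCase().split(' '));
  const found = [];

  for (let n = 1; candidates.length > 0 && n <= words.length; n++) {
    // All n-word sequences that occur in the text.
    const grams = new Set();
    for (let i = 0; i + n <= words.length; i++) {
      grams.add(words.slice(i, i + n).join(' '));
    }
    candidates = candidates.filter(phraseWords => {
      const head = phraseWords.slice(0, n).join(' ');
      if (!grams.has(head)) return false; // cannot match, drop it
      if (phraseWords.length === n) {
        found.push(head); // exact match, remember it
        return false;     // nothing left to extend
      }
      return true; // still a prefix match, try again with n + 1
    });
  }
  return found;
}

console.log(findPhrases(
  'The quick brown fox jumps over the lazy dog',
  ['quick brown', 'lazy dog', 'brown bear', 'fox']
));
```

Each pass only looks at phrases that survived the previous pass, which is the point of the reduced dictionary: the expensive first lookup is the only one that touches the full list.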
You can also make small optimizations here and there depending on the kind of matches you're looking for, such as, not matching across punctuation, only phrases of a certain word length, etc. In any case, I'd expect the time bottleneck here to be on disk access, rather than actual comparisons.
Also, I'm pretty sure I based this algorithm off of an existing one but I don't remember its name so bonus points to anyone who can name it. I think it had something to do with data warehousing/mining and calculating frequencies and patterns?

Related

How to remove all of string up to and including hyphen

I am using javascript in a Mirth transformer. I apologize for my ignorance but I have no javascript training and have a hard time successfully utilizing info from similar threads here or elsewhere. I am trying to trim a string from 'Room-Bed' to be just 'Bed'. The "Room" and "Bed" values will fluctuate. All associated data is coming in from an ADT interface where our client is sending both the room and bed values, separated by a hyphen, in the bed field creating unnecessary redundancy. Please help me with the code needed to produce the result of 'Bed' from the received 'Room-Bed'.
There are many ways to reduce the string you have to the string you want. Which you choose will depend on your inputs and the output you want. For your simple example, all will work. But if you have strings come in with multiple hyphens, they'll render different results. They'll also have different performance characteristics. Balance the performance of it with how often it will be called, and whichever you find to be most readable.
// Turns the string into an array, then pops the last element off: 'Bed'!
'Room-Bed'.split('-').pop() === 'Bed'
// Looks for the given pattern,
// replacing the match with everything after the hyphen.
'Room-Bed'.replace(/.+-(.+)/, '$1') === 'Bed'
// Finds the first index of -,
// and creates a substring based on it
// (substring is used here because substr is deprecated).
'Room-Bed'.substring('Room-Bed'.indexOf('-') + 1) === 'Bed'
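To make the multiple-hyphen difference concrete, here is a small sketch that wraps the three approaches in functions. The function names and the sample value 'ICU-2-B' are made up for illustration, and `var` with plain function declarations is used on the assumption that the Mirth JavaScript engine may not support newer syntax.

```javascript
// Hypothetical helpers wrapping the three approaches shown above.
function afterLastHyphenSplit(value) {
  // Keeps only the text after the LAST hyphen.
  return value.split('-').pop();
}

function afterLastHyphenRegex(value) {
  // The greedy .+ also ends up keeping the text after the LAST hyphen.
  return value.replace(/.+-(.+)/, '$1');
}

function afterFirstHyphen(value) {
  // Keeps everything after the FIRST hyphen.
  return value.substring(value.indexOf('-') + 1);
}

var sample = 'ICU-2-B';
console.log(afterLastHyphenSplit(sample)); // 'B'
console.log(afterLastHyphenRegex(sample)); // 'B'
console.log(afterFirstHyphen(sample));     // '2-B'
```

If a value arrives with no hyphen at all, all three return it unchanged, so none of them needs a separate guard for that case.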

How to match similar strings in javascript

I have two arrays that contain the names of Premier League players.
I want to match them by name since the player objects don't have unique ids.
How can I make a string comparison that will match Zlatan Ibrahimovic with Zlatan Ibrahimović for example? (notice the last character of both strings)
It is not a trivial problem.
You should look into Levenshtein distance problem
https://en.wikipedia.org/wiki/Levenshtein_distance
You can search in google for different implementations or use a library like:
https://www.npmjs.com/package/levenshtein
Example:
var l = new Levenshtein('Zlatan Ibrahimovic', 'Zlatan Ibrahimović')
// l.distance === 1
I have used it already and liked it; in my code I used it for an experimental purpose.
I don't rely on the raw distance, because in a long string a distance of 4 can be a very good result, while in a short string a distance of 2 can be very bad.
Instead I do something like l / Math.max(str1.length, str2.length), which gives a relative number, and then you can decide which threshold is interesting for you.
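As a rough sketch of that normalization without any library, the following implements the standard dynamic-programming Levenshtein distance and divides it by the longer length. The function names and the 0.1 threshold are assumptions chosen for illustration, not values from the answer.

```javascript
// Classic dynamic-programming edit distance, keeping only two rows.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Distance relative to the longer string: 0 means identical, 1 means nothing in common.
function normalizedDistance(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 0 : levenshtein(a, b) / longest;
}

// Hypothetical threshold; tune it against your own data.
function isSamePlayer(a, b, threshold = 0.1) {
  return normalizedDistance(a, b) <= threshold;
}

console.log(isSamePlayer('Zlatan Ibrahimovic', 'Zlatan Ibrahimovi\u0107'));
```

Because the score is relative, one differing character in an 18-character name counts as a close match, while one differing character in a 3-character name would not.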

How to detect and remove unwanted lines from a string?

I am working on a project in which I have to extract text data from a PDF.
I am able to extract text from the PDF, but the extracted text sometimes contains lines which I would like to strip out.
Here's an example of unwanted lines -
ISBN 0-7225-3293-8. = CONTENTS = Part One Part Two Epilogue
Page 1 / 94
And here's an example of good lines (which I'd like to keep) -
Dusk was falling as the boy arrived with his herd at an abandoned church.
I wanted to sleep a little longer, he thought. He had had the same dream that night as a week ago
Different PDFs can give out different unwanted lines.
How can I detect them?
Option 1 - Give the computer a rule: If you are able to narrow down what content you would like to keep, you can filter your results based on that. The obvious criterion that sticks out to me is the exclusion of special characters.
So let's say you agree that all "good lines" will be free of special characters ('/', '-', and '=', for example). If a line DOES contain one of these characters, you know you can remove it from the content you are keeping. This could be done in a for loop containing an if condition that looks something like this:
// Assumes extractedText holds the text extracted from the PDF.
var lineArray = extractedText.split("\n");
for (var cnt = 0; cnt < lineArray.length; cnt++)
{
    var line = lineArray[cnt];
    // Blank out any line containing one of the special characters.
    if (line.includes("/") || line.includes("-") || line.includes("="))
        lineArray[cnt] = "";
}
At the end of this code you could simply get all the text within the array and it would no longer contain the unwanted lines. If there are unwanted lines however, that are virtually indistinguishable by characters, length, positioning etc. the previous approach begins to break down on some of the trickier lines.
This is because there is no rule you can give the computer to distinguish between the good and the bad without giving it a brain such as yours that recognizes parts of speech and sentence structure. In which case you might consider option 2, which is just that.
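Before moving on to option 2, the rule from option 1 can also be written as a single filter. This is a minimal sketch: the function name `dropSuspectLines` is made up, and the character set is just the three characters mentioned above, so ordinary prose that happens to contain a hyphen would be dropped too.

```javascript
// Hypothetical helper: removes every line that contains '/', '-' or '='.
function dropSuspectLines(text) {
  const unwanted = /[\/\-=]/;
  return text
    .split('\n')
    .filter(line => !unwanted.test(line))
    .join('\n');
}

const extracted = [
  'ISBN 0-7225-3293-8. = CONTENTS = Part One Part Two Epilogue',
  'Page 1 / 94',
  'Dusk was falling as the boy arrived with his herd at an abandoned church.',
  'I wanted to sleep a little longer, he thought. He had had the same dream that night as a week ago'
].join('\n');

console.log(dropSuspectLines(extracted));
```

With the sample lines from the question, only the two prose lines survive, because the ISBN line contains '-' and '=' and the page line contains '/'.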
Option 2 - Give the computer a brain: Given that the text you want to remove will more or less be incoherent documentation based on what you have shown us, an open source (or purchased) natural language processor may be what you are looking for.
I found a good beginner's intro at http://myreaders.info/10_Natural_Language_Processing.pdf with some information that might be of use to you. From the source,
"Linguistics is the science of language. Its study includes:
sounds (phonology),
word formation (morphology),
sentence structure (syntax),
meaning (semantics), and understanding (pragmatics) etc.
Syntactic Analysis : Here the analysis is of words in a sentence to know the grammatical structure of the sentence. The words are transformed into structures that show how the words relate to each others. Some word sequences may be rejected if they violate the rules of the language for how words may be combined. Example: An English syntactic analyzer would reject the sentence say : 'Boy the go the to store.' "
Using some sort of NLP, you can discover whether a given section of text contains a sentence or some incoherent rambling. This test could then be used as a filter in your program for what you would like to keep or remove.
Side note- As it appears your sample text is not just sentences but literature, sometimes characters will speak in sentence fragments as part of their nature given by the author. In this case, you could add a separate condition that if the text is contained within two quotations and has no special characters, you want to keep the text regardless.
In the end NLP may be more work than you require or that you want to do, in which case Option 1 is likely going to be your best bet. On the other hand, it may be just the thing you are looking for. Whatever the case or if you decide you need some combination of the two, best of luck! I hope this answer helps.

How can I match strings precisely and avoid false matches?

Background - web app back end javascript/dojo code.
I need to match a user input string to a list of possible vehicle models and I am having challenges with incorrect matches.
Say a user enters:
Ford Fusion, S 60, and Volks Wagen
Currently, I would read that in as
FORDFUSIONS60VOLKSWAGEN
and in that, I would match against a list of makes and models.
The problem is that, in this case and in many others, you get ambiguous matches such as "S6" (Audi) inside "S60" (Volvo), or "CC" (Volkswagen) inside "Accord" (Honda).
Any idea how it would be possible (if at all) to avoid these ambiguous matches?
Since this question is tagged regex, I think you are looking for the word boundary metacharacter:
/\bS6\b/
will match "S6" and "… S6 …", but not "S60", just as
/\bCC\b/i
will match "CC" and "cc", but not "Accord".
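A minimal sketch of applying word boundaries to a whole list of models might look like this. The function names and the sample list are assumptions; note that it only works on input that still has its separators, so it would not help once the input has been collapsed into a single string such as FORDFUSIONS60VOLKSWAGEN.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: returns the models that occur as whole words in the input.
function findModels(input, models) {
  return models.filter(model =>
    new RegExp('\\b' + escapeRegExp(model) + '\\b', 'i').test(input)
  );
}

console.log(findModels(
  'Ford Fusion, S60 and Accord',
  ['S6', 'S60', 'CC', 'Accord', 'Fusion']
));
```

Here "S6" and "CC" are not reported, because neither appears as a standalone word in the input.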
To avoid at least these two examples, you would first match against the longer names (e.g. "s60" before "s6" and "accord" before "cc") and only try the shorter name if the longer one does not match. Otherwise, stop with the longer one.
Since you're looking for the longest matches, you could also check whether one of the resulting names is contained within another and skip the shorter one.
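One way to sketch the longest-first idea on the collapsed input from the question is shown below. The function name is hypothetical, and blanking out the matched span is just one possible way to stop shorter names from matching inside a longer one; only the first occurrence of each name is handled.

```javascript
// Hypothetical helper: try longer names first and consume what they match.
function matchLongestFirst(input, names) {
  let remaining = input.toUpperCase();
  const found = [];
  // Copy before sorting so the caller's array is left untouched.
  const byLength = [...names].sort((a, b) => b.length - a.length);

  for (const name of byLength) {
    const needle = name.toUpperCase();
    const at = remaining.indexOf(needle);
    if (at !== -1) {
      found.push(name);
      // Blank out the matched span so shorter names cannot match inside it.
      remaining =
        remaining.slice(0, at) +
        ' '.repeat(needle.length) +
        remaining.slice(at + needle.length);
    }
  }
  return found;
}

console.log(matchLongestFirst(
  'FORDFUSIONS60VOLKSWAGEN',
  ['S6', 'S60', 'Fusion', 'Volkswagen', 'Ford']
));
```

In this run "S60" is consumed before "S6" is tried, so the Audi model is not reported.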
This is how I would go about it:
Run checks with the name, model and company and if they trace back to the same reference, then you know you have what you want. However, if you get different results keep trying combinations of all search results until they match up to a single reference.
For example:
model traces back to honda and ford,
number traces back to ford and bentley,
and
company gives ford
then you can try combinations of list_1, list_2, and list_3 where:
list_1 = ['honda','ford']
list_2 = ['ford','bentley']
list_3 = ['ford']
Then when you try all combinations (I recommend itertools.combinations) you will end up with one valid result that is common in all lists: ford
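Since the goal is the value common to all lists, the same result can also be obtained with a plain intersection. The sketch below is in JavaScript rather than Python's itertools, and the function name is made up for illustration.

```javascript
// Hypothetical helper: keeps only the values present in every list.
function commonCandidates(lists) {
  if (lists.length === 0) return [];
  // Start from the first list and keep narrowing it with each following list.
  return lists.reduce((common, list) =>
    common.filter(item => list.includes(item))
  );
}

const list_1 = ['honda', 'ford'];
const list_2 = ['ford', 'bentley'];
const list_3 = ['ford'];

console.log(commonCandidates([list_1, list_2, list_3]));
```

If the result holds exactly one entry, the three lookups agree on a single reference; an empty result means they do not agree at all.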
I hope that is clear. I know I'm blabbing a bit.

basic search ranking with regex in javascript

Currently I am using the below for search.
I assume each and every term the user types must appear at least once in the article.
I use the match method with regex
^(?=.*one)(?=.*two)(?=.*three).*$
with g, i, and m
At the moment I use matches.length to count the number of matches, but the behavior is not as expected.
example:
"one two three. one two three"
would give me 2 matches, but it should really be 6.
If I do something like
(one|two|three)
then I do get 6 matches, but if I have the data:
"one two. one two"
I get 4 matches, when in reality I want it to be 0, since not every word appears at least once.
I could do the first regex to check if there's at least one "match". If there is, I would subsequently use the second regex to count the real number of matches, but this would make my program run much slower than it already is. Doing this regex against 2500 json articles takes anywhere from 60 to 120 seconds as it is.
Any ideas on how to make this faster or better? Change the regex? Use search or indexOf instead of matches?
note:
I'm using lawnchair db for local persistence and jquery. I package the code for phonegap and as a chrome packaged app.
var input = '...';
var match = [];
if (input.match(/^(?=.*\bone\b)(?=.*\btwo\b)(?=.*\bthree\b)/i)) {
match = input.match(/\b(one|two|three)\b/ig);
}
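The same two-stage idea can be sketched without lookaheads for an arbitrary list of search terms: count each term separately and give up as soon as one term is missing. The function names are assumptions, and whether this is actually faster on 2500 articles would need to be measured.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical ranking helper: total number of whole-word hits,
// or 0 if any one of the terms does not occur at all.
function rankArticle(text, terms) {
  let total = 0;
  for (const term of terms) {
    const pattern = new RegExp('\\b' + escapeRegExp(term) + '\\b', 'gi');
    const hits = text.match(pattern);
    if (!hits) return 0; // a required term is missing
    total += hits.length;
  }
  return total;
}

console.log(rankArticle('one two three. one two three', ['one', 'two', 'three'])); // 6
console.log(rankArticle('one two. one two', ['one', 'two', 'three']));             // 0
```

Returning early on the first missing term means articles that cannot match are rejected without scanning for the remaining terms.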
