Fast method to find regex matches in a large document using javascript? - javascript

I need to search the text in an HTML document for regexes (emails, phone numbers, etc.) and words. The matches need to be highlighted and made anchorable so that a link can be generated to jump to the location of each match. So it needs not only to find matches using patterns, it also needs to do a replace to add the proper HTML code.
I am currently using jQuery, but I am not very happy with the speed. In a 1.5 MB file it takes about 5 seconds to match 2 regexes, and the time increases as I add more search criteria.
Does anyone know of a fast method to find regex matches in a large document using javascript?

You say you're "using jQuery" but you don't say how. Have you tried a "highlight" plugin (or, as it sounds like you'd need, a derivation of one)? I've used this one: http://johannburkard.de/blog/programming/javascript/highlight-javascript-text-higlighting-jquery-plugin.html and it doesn't seem slow to me. Again, you'd have to work on it to make it add the markup you need, but that should be pretty clear - it's not very big.
It seems like what you'd want to do for performance is take your regular expressions and combine them into what amounts to a "token grammar". In other words, you don't want to start from scratch looking for each regex individually throughout the entire document. Instead, you'd want to proceed through it with a regex that matches each possible target (one at a time of course), and each time it finds one you'd replace it with whatever's appropriate. That way you could make just one pass over the document, no matter how big it is and no matter how many patterns you're looking for.
Edit: Mr. Burkard's plugin doesn't let you search with regexes; it uses "indexOf" internally. Hmm.
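The single-pass idea described in the answer above could be sketched roughly as follows. This is only an illustration under some assumptions: the `patterns` object, the `highlightAll` name, the `match-N` ids and the `hl-` class prefix are all made up for the example, the email and phone patterns are deliberately simplistic, and the input is treated as plain text rather than HTML source (running it over raw markup could also hit tag names and attribute values).

```javascript
// Hypothetical patterns; each one must avoid capturing groups of its own,
// because the combined regex relies on exactly one group per pattern.
const patterns = {
  email: "[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+",
  phone: "\\d{3}-\\d{3}-\\d{4}",
};

function highlightAll(text, patterns) {
  const names = Object.keys(patterns);
  // Build one alternation: (email)|(phone)|...
  const combined = new RegExp(
    names.map((name) => `(${patterns[name]})`).join("|"),
    "g"
  );
  let count = 0;
  // A single pass over the text; the group that matched tells us the kind.
  return text.replace(combined, (match, ...rest) => {
    const groups = rest.slice(0, names.length);
    const kind = names[groups.findIndex((g) => g !== undefined)];
    count += 1;
    return `<a id="match-${count}" class="hl-${kind}">${match}</a>`;
  });
}
```

Each match gets its own id, so a link such as `#match-2` can jump to it. Adding another search criterion only adds one more alternative to the combined regex; the document is still scanned once.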

Related

Hiding information in HTML for JavaScript to find

I want to add information to an HTML page that will be visible to JavaScript but not to the end-user. I want to keep the original HTML page as simple as possible. One solution would be to use non-standard tags, such as <custom> ... </custom>
I am aware of the official way of adding custom elements, but my purpose is not to show anything on-screen, so using the CustomElementRegistry seems overkill.
Here's my use case. I am creating an "aided reader" web application for people who are learning English. My JavaScript code adds <span> elements to an ordinary HTML page, enclosing words which are new to the reader. For example, JavaScript code will change the plain HTML ...
<p>The word "thought" may be new to elementary learners.</p>
... to:
<p>The word "<span data-info="think">thought</span>" may be new to elementary learners.</p>
(The span's data-info attribute provides information which is used later — when the user hovers the mouse over the word — to display images, definitions and examples, but that is not important here.)
The text comes from non-web-developer authors, and it contains no mark-up at all at the beginning. I am writing two tools: one offline and one online. The offline tool compares the vocabulary with lists of words that students are expected to know at different levels, and allows a non-tech-savvy editor to collect different inflections of the same word (for example: lose|loses|lost|losing) that appear in the given text, so that they can be treated as the same root word. This generates an array of terms that the student might want to learn more about. Each term is stored as a string that can be converted to a regular expression. For example "los(?:e|es|t|ing)".
The online web page will receive:
The raw text from the author
The array of search terms
Some more information about what reference sites to use. This information will be added to the data-info attribute of the enclosing span, but it is not important here.
The online code will work through the array, looking for matches for each regular expression (/thought|thinks?/, for example) in the raw text and add the same span to all the occurrences it finds. It will also add <p> tags where necessary.
However, the word "thought" can be either a verb or a noun: "Yesterday I thought..." (verb) or "Yesterday I had a thought... " (noun). In the second case, I need to use a different regular expression: /thoughts?/, to allow for both singular and plural forms.
However, both these regular expressions will find a match for "thought", which is the problem I need to solve.
This is where the "information hiding" comes in. One solution would be for my offline tool to add tags to the raw text like this...
Yesterday I <verb>thought</verb> ... Yesterday I had a <noun>thought</noun> ...
I can then use different regular expressions for each case, and there would be no clash.
/<verb>(thought|thinks?)<\/verb>/
/<noun>(thoughts?)<\/noun>/
Since these tags will not be displayed in the browser, they can remain in place ... or can they?
Is there any danger in using non-standard and non-declared tags in this way?
Why not wrap it in a span, as you did at the beginning, and add an attribute called "data-type"?
That would give:
<p>Yesterday I <span data-info="think" data-type="verb">thought</span> ... Yesterday I had a <span data-info="think" data-type="noun">thought</span> ... </p>
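One way to combine the two ideas, offered only as a sketch: let the offline tool emit the temporary <verb>/<noun> tags from the question, then have the online code convert them into standard spans with a data-type attribute, so no non-standard elements remain in the page. The `terms` array, its field names and the `tagsToSpans` function are hypothetical, as are the data-info values.

```javascript
// Hypothetical term list: one entry per part of speech, using the
// regex source strings from the question.
const terms = [
  { type: "verb", pattern: "thought|thinks?", info: "think" },
  { type: "noun", pattern: "thoughts?", info: "thought" },
];

function tagsToSpans(rawText, terms) {
  return terms.reduce((text, { type, pattern, info }) => {
    // Matches e.g. <verb>thought</verb> and captures the word itself.
    const re = new RegExp(`<${type}>(${pattern})</${type}>`, "g");
    return text.replace(
      re,
      `<span data-info="${info}" data-type="${type}">$1</span>`
    );
  }, rawText);
}
```

Because each regex is anchored to its own tag pair, the verb and noun patterns no longer compete for the same bare word "thought".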

Delimiting documents with regular expressions

I'm working with several documents that are all contained in a single file, and before processing them I need to determine where each document begins and ends. For this, I am using the following regex:
MINISTÉRIO\sDO\sTRABALHO\sE\sEMPREGO(?:[^P]*(?:P(?!ÁG\s:\s\d+\/\d+)[^P]*)*)PÁG\s:\s\d+\/(\d+)\b(?:\D*(?:(?!\1\/\1)\d\D*)*)\1\/\1(?:[^Z]*(?:Z(?!6:\s\d+)[^Z]*)*)Z6:\s\d+
Example is here
It works 100% on that example. The problem is that sometimes the text does not arrive in the form I showed; it comes with extra spaces and line breaks. As you can see here, the document is the same as the previous one, but the regular expression does not match. Why does it fail, and how can I fix it?
Also, I need to modify the regex, not the text, because the regex is the only part I have access to.
Note: I'm using Node.js, which is why I'm tagging this post with JS.

Regex replace everything not starting with multiple times in line

I am making a parser of sorts in javascript that takes a mathematical expression given to the script as a string, and evaluates it and does some other things with it. If the users want to use builtin Javascript mathematical functions, they have to enter the following string e.g. "1 + Math.log(x)". That becomes very tedious when things get nested e.g "Math.abs(Math.log(Math.pow(x, 2))) + Math.log2(x)". As you can see, the "Math." part of it not only takes longer to write, but makes it less readable. I want to remove that "Math." part. The way I've done it is using simple regex that basically has a list of all Javascript Math constants and methods and simply prepends the "Math." part of it. Simple enough:
input = input.replace(/(E|PI|SQRT2|SQRT1_2|LN2|LN10|LOG2E|LOG10E)/g, "Math.$1");
The same thing happens for the methods. This works fine. But as always, that's not very flexible and leaves room for misunderstanding: somebody may come along and insist on typing Math.log(x), which will in turn be replaced with Math.Math.log(x), which won't work.
What I want to know is, is there some way to match any of these predefined strings (constants and methods) so that they will only be matched via regex if they don't have the "Math." part in front of it. I have tried this
^(?!Math\.)(log2|log|exp|abs)
but it is quite useless, as this doesn't work with nesting and even multiple operands. Is there any way to do this purely in regex, as this would make the entire process more elegant. Any help would be appreciated.
You can use the following trick: make the "Math." prefix an optional part of the match, so the name is replaced the same way whether or not the prefix is already there:
(?:Math\.)?(log2|log|exp|abs|pow)
And replace with Math.$1
See DEMO
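As a rough sketch of that replacement in JavaScript: the `addMathPrefix` name is made up for the example, and the `\b` word boundaries are an addition not present in the answer's regex, included on the assumption that names such as "explog" should not be touched.

```javascript
// The optional non-capturing group swallows an existing "Math." prefix,
// so the replacement never produces "Math.Math.".
// The \b boundaries are an assumption added for this sketch.
const MATH_NAMES = /(?:Math\.)?\b(log2|log|exp|abs|pow)\b/g;

function addMathPrefix(input) {
  return input.replace(MATH_NAMES, "Math.$1");
}

console.log(addMathPrefix("abs(log(pow(x, 2))) + Math.log2(x)"));
```

Listing "log2" before "log" in the alternation matters: the regex engine tries alternatives left to right, so the longer name has to come first.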

Detect four *different* words in JavaScript and add styling

This is my first post, but I've loved using this site as resource for quite awhile now. However, the time has now come for me to ask a question...
I have found plenty of JavaScript highlighter plugins during my research into this question, but they all focus on finding one word. For a fan-site I am creating (Mega Man Battle Network, for those interested), I would like to find a way to detect the words, "Fire", "Aqua", "Elec", and "Wood", so I can automatically add styling to them.
Any JavaScript gurus out there to help me?
If you don't care which word is found, you could use a regex like this:
/\b(?:fire|aqua|elec|wood)\b/gi
Actually, now that I think about it, I'd still use the same regex (only with capture groups) even if you did care what word you found. You could use javascript and jquery to select sections that have a word and add that word as a class name, thus applying whatever CSS you've defined as associated with that class.
That regex would look like this:
/\b(fire|aqua|elec|wood)\b/gi
The jQuery you'll be looking for will likely be the filter function: http://api.jquery.com/filter/#expr
Once you have those objects, you can apply your styles using jQuery and .addClass: http://api.jquery.com/addClass/
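The capture-group version can also do the styling in one step if the text is available as a string. A minimal sketch, assuming plain text input and CSS classes named after the lower-cased words; the `styleElements` name is hypothetical.

```javascript
const ELEMENT_WORDS = /\b(fire|aqua|elec|wood)\b/gi;

function styleElements(text) {
  // The captured word, lower-cased, becomes the class name; the original
  // casing of the matched text is preserved inside the span.
  return text.replace(
    ELEMENT_WORDS,
    (match, word) => `<span class="${word.toLowerCase()}">${match}</span>`
  );
}
```

The word boundaries keep compound words such as "firewood" from being styled.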
I figured it out!
$(document).ready(function () {
    $('div.elemental:contains("Aqua")').addClass('aqua');
    $('div.elemental:contains("Fire")').addClass('fire');
    $('div.elemental:contains("Wood")').addClass('wood');
    $('div.elemental:contains("Elec")').addClass('elec');
});
Now I just need to figure out how to create a callback to a new function so that the highlighting is applied to each new page I go to.

Fast algorithm to find keywords in an HTML page using Javascript

I have a JS specialty dictionary that finds certain keywords on a page and adds explanatory tooltips to them. Right now I'm using RegEx to find the keywords, but I suspect it will get slow very soon, when my dictionary grows bigger. I store dictionary entries in an array, so I think that can be improved as well. My site language is Vietnamese and my keywords will all be English.
Any idea on improving performance will be much appreciated. Thanks.
You could process your dictionary server side (checking the output against the keywords), then mark each matched item with a class or other HTML element that identifies the definition to use, and finally use JavaScript to bind each marked element to your dictionary. This way your server is doing the heavy lifting.
1) The server loads your dictionary file and compares it against the text you are about to output
2) Where a match is found, add
<span class="definition">yourword</span>
3) Attach a generic JavaScript event handler (this is written in jQuery, but of course you can do it any way you like)
$('.definition').on('mouseover', function () {
    // the text of the span is the keyword to look up
    var keyword = $(this).text();
    // load your definition using the keyword...
});
See my answer to a related question: javascript: find strings in dom and emphasize it
Also see the accepted answer to that question which is a jQuery plugin to do just what you want.
The problem with doing this with regexp is not speed; some people claim that the DOM parsing method can actually be slower. The problem is avoiding crazy corner cases: you don't want to replace a JavaScript string that happens to contain the keyword, you don't want to replace a CSS class name or id that happens to contain the keyword, etc.
In my experience, the DOM way is fast enough. In fact, my website has a list of more than 100 keywords and it manages to install tooltips on all of them in under half a second (certainly faster than my eye can see).
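On the question's point about storing entries in an array: one option is to compile the whole dictionary into a single regex once and keep the definitions in a Map for constant-time lookup. The sketch below is an illustration only; the sample entries and the `buildAnnotator`, `annotate` and `define` names are hypothetical. It works on a plain string, so in a page it would be applied to the text of individual text nodes rather than to the HTML source, to avoid the corner cases mentioned above.

```javascript
// Hypothetical dictionary entries.
const dictionary = [
  { keyword: "regex", definition: "a pattern describing a set of strings" },
  { keyword: "DOM", definition: "the browser's tree representation of a page" },
];

// Escape characters that have a special meaning inside a regex.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function buildAnnotator(entries) {
  // Map from lower-cased keyword to definition, built once.
  const lookup = new Map(
    entries.map((e) => [e.keyword.toLowerCase(), e.definition])
  );
  // Longest keywords first, so a longer keyword wins over its own prefix.
  const alternatives = entries
    .map((e) => e.keyword)
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp);
  const re = new RegExp(`\\b(?:${alternatives.join("|")})\\b`, "gi");
  return {
    annotate: (text) =>
      text.replace(re, (m) => `<span class="definition">${m}</span>`),
    define: (keyword) => lookup.get(keyword.toLowerCase()),
  };
}
```

With this shape, adding dictionary entries grows the alternation but the text is still scanned once, and the hover handler can call `define` with the span's text.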
