Will this regex be slow? Is there any way to optimize it? - javascript

This will be run in JavaScript multiple times on bits of HTML. Will all of the "or" alternatives make it slow? Can it be optimized?
\<[^\>]*?(abbr|acronym|address|applet|area|article|aside|audio|base|basefont|bdi|bdo|big|blockquote|body|button|canvas|caption|center|cite|code|col|colgroup|command|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frame|frameset|head|header|hr|html|iframe|img|input|ins|kbd|keygen|label|legend|link|map|mark|menu|meta|meter|nav|noframes|noscript|object|optgroup|option|output|param|pre|progress|q|rp|rt|ruby|samp|script|section|select|small|source|strike|style|sub|summary|sup|textarea|time|title|track|tt|var|video|wbr)[^\>]*?\>/g

You might try moving element names found very frequently in your source (a, div) to the front of the list:
… (a|div|abbr| …
Also, I think your pattern will match, e.g., < notanabbreviation >. If that's not what you want, try
<\b(a|abbr|…)\b[^>]*?>
The \b preceding the alternations helps because it lets the engine exit early without trying all of the alternations.
But you just have to test to see. I made a jsperf test using nytimes.com as an example.
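To see the difference in what the two shapes match, here is a rough sketch. It assumes a deliberately shortened tag list (a, div, abbr) and a made-up sample string; the names loose and anchored are only for illustration.

```javascript
// Shortened tag list for illustration; the real alternation is much longer.
const tags = ["a", "div", "abbr"].join("|");

// Original shape: the tag name may appear anywhere between < and >.
const loose = new RegExp("<[^>]*?(" + tags + ")[^>]*?>", "g");
// Anchored shape: the name must start right after < and end at a word boundary.
const anchored = new RegExp("<\\b(" + tags + ")\\b[^>]*?>", "g");

// Made-up sample input.
const sample = '< notanabbreviation ><div class="x"><abbr>';

const looseMatches = sample.match(loose);
const anchoredMatches = sample.match(anchored);

console.log(looseMatches.length); // 3: the bogus "tag" is matched too
console.log(anchoredMatches);     // only <div class="x"> and <abbr>
```

Note that the anchored form also skips closing tags such as </div>, because there is no word boundary between < and /. Whether it is actually faster on your input is something only a benchmark can tell you.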

You can use this tool to compare different regexes.
It took 2.4 seconds to execute over the source code of Yahoo's front page. This is not a scientific test, but it doesn't look very efficient.
PS: the Silverlight plugin is required.

Adding i after the g will make it case-insensitive.
Also, since this is JavaScript, maybe you could use a hash instead of a giant regex.
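One way the hash idea might look, as a sketch only: pull out each opening tag's name with a small generic regex, then check it against a Set. The names knownTags, tagPattern and findKnownTags are made up for this example, the tag list is shortened, and unlike the original pattern this only looks at the name directly after <.

```javascript
// Shortened, hypothetical list; fill in the full set of tag names as needed.
const knownTags = new Set(["abbr", "div", "img"]);

// Generic opening-tag matcher; group 1 is the tag name.
const tagPattern = /<([a-z][a-z0-9]*)\b[^>]*>/gi;

function findKnownTags(html) {
  const found = [];
  for (const match of html.matchAll(tagPattern)) {
    // Lower-case the name so the lookup is case-insensitive.
    if (knownTags.has(match[1].toLowerCase())) {
      found.push(match[0]);
    }
  }
  return found;
}

console.log(findKnownTags('<DIV id="a"><span>x</span><img src="y">'));
// the <DIV ...> and <img ...> tags; <span> is not in the set
```

The regex stays tiny no matter how many tag names there are, and a Set lookup costs about the same however long the list gets.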

Related

Regular expression matching multiple entries, spanning multiple lines [duplicate]

I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?
Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
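A quick way to compare the two in JavaScript, as a sketch. It assumes the log entries are joined by single spaces into one string (in JavaScript, . does not match line breaks unless the s flag is set); the variable names are illustrative.

```javascript
// Assumption: the log entries are joined by single spaces into one string.
const log =
  "J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM " +
  "J0000010: Project name: E:\\foo.pf " +
  "J0000011: Job name: MBiek Direct Mail Test " +
  "J0000020: Document 1 - Completed successfully";

// Greedy: .* runs to the end, then backtracks to the LAST J[0-9]{7}:
const greedy = log.match(/Project name:\s+(.*)\s+J[0-9]{7}:/);
// Non-greedy: .*? grows one character at a time and stops at the FIRST one.
const lazy = log.match(/Project name:\s+(.*?)\s+J[0-9]{7}:/);

console.log(greedy[1]); // E:\foo.pf J0000011: Job name: MBiek Direct Mail Test
console.log(lazy[1]);   // E:\foo.pf
```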
Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: greedy matches generally go as far as they can (here, until the end of the text!) and then backtrack character by character to try to match the part that comes afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except whitespace”, and this is exactly what you want.
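A small sketch of the \S* variant in JavaScript. It assumes a single-line excerpt of the log with entries separated by spaces. The second input, with a space in the path, is hypothetical and is there to show where this approach stops working.

```javascript
// Assumption: a single-line excerpt of the log, entries separated by spaces.
const excerpt =
  "J0000010: Project name: E:\\foo.pf J0000011: Job name: MBiek Direct Mail Test";
const pattern = /Project name:\s+(\S*)\s+J[0-9]{7}:/;

const hit = excerpt.match(pattern);
console.log(hit[1]); // E:\foo.pf

// Hypothetical path containing a space: \S* cannot cross it, so nothing matches.
const spaced =
  "J0000010: Project name: E:\\my docs\\foo.pf J0000011: Job name: x";
console.log(spaced.match(pattern)); // null
```

So the character class is a good fit as long as the project path never contains whitespace.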
Well, ".*" is a greedy selector. You make it non-greedy by using ".*?". With the latter construct, each time the regex engine is about to consume another character with the ".", it first attempts to match whatever may come after the ".*?". This means that if, for instance, nothing comes after the ".*?", then it matches nothing.
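The "matches nothing" case is easy to check in JavaScript. A minimal sketch with a made-up input string:

```javascript
// Made-up input for illustration.
const text = "Project name: foo";

// Nothing follows .*?, so the lazy group is satisfied with zero characters.
const nothingAfter = text.match(/Project name: (.*?)/);
console.log(JSON.stringify(nothingAfter[1])); // ""

// With $ after it, .*? has to keep growing until the end of the string.
const anchoredAtEnd = text.match(/Project name: (.*?)$/);
console.log(anchoredAtEnd[1]); // foo
```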
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, @"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
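For comparison, roughly the same thing in JavaScript, which has supported named groups since ES2018. This is a sketch, not a translation of the .NET API; s here is an assumed single-line excerpt of the log, and projectName is just an illustrative name.

```javascript
// Assumption: a single-line excerpt of the log, entries separated by spaces.
const s =
  "J0000010: Project name: E:\\foo.pf J0000011: Job name: MBiek Direct Mail Test";

// match() returns null when nothing matches, so guard before reading groups.
const m = s.match(/Project name: (?<name>.*?) J\d+/);
const projectName = m ? m.groups.name : null;

console.log(projectName); // E:\foo.pf
```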
I would also recommend you experiment with regular expressions using "Expresso" - it's a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people inexperienced with regex might not be familiar with, in a way that makes these new concepts easy to learn.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm
(Project name:\s+[A-Z]:(?:\\\w+)+\.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Using (?:\\\w+)+\.[a-zA-Z]+ instead of .* makes the pattern more restrictive.

Is there any way to get all the possible outcomes of a regular expression pattern?

Is there any way to get all the possible outcomes of a regular expression pattern? Everything I've seen refers to a pattern that is evaluated against a string, but what I need is to have a pattern like this:
^EM1650S(B{1,2}|L{1,2})?$
generate all possible matches:
EM1650S
EM1650SB
EM1650SBB
EM1650SL
EM1650SLL
In the general case, no. In this case, you have almost no solution space.
There's a section covering this in Higher Order Perl (PDF) and a Perl module. I never re-implemented it in anything else, but I had a similar problem and this solution was adequate for similarly-limited needs.
There are tools that can display all possible matches of a regex.
Here is one written in Haskell: https://github.com/audreyt/regex-genex
and here is a Perl module: http://metacpan.org/pod/Regexp::Genex
Unfortunately, I couldn't find anything for JavaScript.
In this particular case, yes. The regex generates a finite number of valid strings, so they can be enumerated.
You'll just have to parse the regex. Part of it (EM1650S) is mandatory, so focus on the rest. Split on the | (or) symbol, then enumerate the strings for both sides of it. Then you can get all possible combinations of them.
Some regexes (containing * or + symbols) can represent an infinite number of strings, so they cannot be fully enumerated.
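The enumeration step could be sketched in JavaScript like this. To keep it short, the regex is not parsed: the pattern is translated by hand into a few helper functions (lit, alt, seq, repeat, opt), all of which are made up for this example. Each helper returns a function that produces the list of strings that piece can generate.

```javascript
// Each node is a function returning every string that piece can produce.
const lit = (s) => () => [s];
const alt = (...nodes) => () => nodes.flatMap((n) => n());
const seq = (...nodes) => () =>
  nodes.reduce((acc, n) => acc.flatMap((a) => n().map((b) => a + b)), [""]);
// Bounded repetition {min,max}; unbounded * or + would never finish.
const repeat = (node, min, max) => () => {
  const out = [];
  for (let k = min; k <= max; k++) {
    out.push(...seq(...Array(k).fill(node))());
  }
  return out;
};
const opt = (node) => alt(lit(""), node);

// Hand-translated form of ^EM1650S(B{1,2}|L{1,2})?$
const pattern = seq(
  lit("EM1650S"),
  opt(alt(repeat(lit("B"), 1, 2), repeat(lit("L"), 1, 2)))
);

const results = pattern();
console.log(results);
// EM1650S, EM1650SB, EM1650SBB, EM1650SL, EM1650SLL
```

A real tool would build this tree from the regex source; that parsing step is the hard part and is left out here.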
From a computational theoretic standpoint, regular expressions are equivalent to finite state machines. This is part of "automata theory." You could create a finite state machine that is equivalent to a regular expression and then use graph traversal algorithms to traverse all paths of the FSM. In the general case a countably infinite number of strings may match a regular expression, so your program may never terminate depending on the input regular expression.
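The traversal idea could be sketched like this in JavaScript. The automaton below is built by hand for the pattern from the question, not derived from the regex, and its state names are made up. Edge labels are whole strings for brevity, where a real automaton would use single characters. A length limit keeps the walk finite even if the automaton had loops.

```javascript
// Hypothetical hand-built automaton for ^EM1650S(B{1,2}|L{1,2})?$.
// Each state maps to a list of [label, nextState] edges.
const transitions = {
  start: [["EM1650S", "s"]],
  s: [["B", "b1"], ["L", "l1"]],
  b1: [["B", "b2"]],
  l1: [["L", "l2"]],
  b2: [],
  l2: [],
};
const accepting = new Set(["s", "b1", "b2", "l1", "l2"]);

// Breadth-first walk over all paths, collecting strings up to maxLength.
function enumerate(maxLength) {
  const found = [];
  const queue = [["start", ""]];
  while (queue.length > 0) {
    const [state, prefix] = queue.shift();
    if (accepting.has(state)) found.push(prefix);
    for (const [label, next] of transitions[state]) {
      const extended = prefix + label;
      // Labels are non-empty, so this bound guarantees termination.
      if (extended.length <= maxLength) queue.push([next, extended]);
    }
  }
  return found;
}

console.log(enumerate(9)); // all five strings, shortest first
console.log(enumerate(8)); // EM1650S, EM1650SB, EM1650SL
```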

Wipe a string but keep its middle part

With a string like "HorsieDoggieBirdie", is there a non-capturing regex replace that would kill "Horsie" and "Birdie", yet keep "Doggie" intact? I can only think of a capturing solution:
s/(Horsie)(Doggie)(Birdie)/$2/g
Is there a non-capturing solution like:
s/Horsie##Doggie##Birdie//g
where ## is some combination of regex codes? The specific problem is in JavaScript (innerHTML.replace) but I'll take Perl suggestions, too.
You don't have to capture the Horsie or the Birdie.
s/Horsie(Doggie)Birdie/$1/g;
A similar thing should work for JavaScript as well. This is probably as efficient as it gets, and at least as fast as using look-around assertions, although you should benchmark it if you want to know for sure. (The results, of course, will depend on the horsies, doggies and birdies in question.)
Mandatory disclaimer: you should know what happens when you use regular expressions with HTML...
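In JavaScript the single-capture version might look like this. It is a sketch on a made-up plain string; the question mentions innerHTML, but any string works the same way.

```javascript
// Made-up input with two occurrences.
const input = "HorsieDoggieBirdie and HorsieDoggieBirdie";

// Only the middle part is captured; $1 puts it back in place of the whole match.
const output = input.replace(/Horsie(Doggie)Birdie/g, "$1");

console.log(output); // Doggie and Doggie
```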
You can use Look-Around Assertions:
s/(?:Horsie(?=Doggie))|(?:(?<=Doggie)Birdie)//g;
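A JavaScript version of the look-around approach, as a sketch. It needs an engine with lookbehind support, which was added to the language in ES2018; stripAround is a made-up helper name.

```javascript
// Remove Horsie when followed by Doggie, and Birdie when preceded by Doggie.
const stripAround = (text) =>
  text.replace(/Horsie(?=Doggie)|(?<=Doggie)Birdie/g, "");

console.log(stripAround("HorsieDoggieBirdie")); // Doggie
console.log(stripAround("HorsieDoggie"));       // Doggie
console.log(stripAround("HorsieBirdie"));       // HorsieBirdie
```

Note the second call: unlike the capturing version, the two alternatives act independently, so Horsie is removed even when no Birdie follows.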
