With a string like "HorsieDoggieBirdie", is there a non-capturing regex replace that would kill "Horsie" and "Birdie", yet keep "Doggie" intact? I can only think of a capturing solution:
s/(Horsie)(Doggie)(Birdie)/$2/g
Is there a non-capturing solution like:
s/Horsie##Doggie##Birdie//g
where ## is some combination of regex codes? The specific problem is in JavaScript (innerHTML.replace) but I'll take Perl suggestions, too.
You don't have to capture the Horsie or the Birdie.
s/Horsie(Doggie)Birdie/$1/g;
A similar thing should work for JavaScript as well. This is probably as efficient as it gets, and at least as fast as using look-around assertions, although you should benchmark it if you want to know for sure. (The results, of course, will depend on the horsies, doggies and birdies in question.)
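For the JavaScript case mentioned in the question, a minimal sketch of the same single-capture replacement could look like this. The variable names and the sample string are illustrative assumptions, not part of the original question.

```javascript
// Horsie and Birdie are matched but not captured; only Doggie is kept via $1.
const input = "HorsieDoggieBirdie and HorsieDoggieBirdie";
const output = input.replace(/Horsie(Doggie)Birdie/g, "$1");

console.log(output);
```

The g flag makes replace() act on every occurrence rather than only the first.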
Mandatory disclaimer: you should know what happens when you use regular expressions with HTML...
You can use Look-Around Assertions:
s/(?:Horsie(?=Doggie))|(?:(?<=Doggie)Birdie)//g;
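A rough JavaScript equivalent of the look-around version is sketched below. It assumes an engine with lookbehind support (ES2018 or later); the variable names are made up for illustration.

```javascript
// Delete Horsie only when followed by Doggie, and Birdie only when preceded by Doggie.
const lookaround = /Horsie(?=Doggie)|(?<=Doggie)Birdie/g;

const full = "HorsieDoggieBirdie".replace(lookaround, "");
// Unlike the capturing version, each half is removed independently of the other.
const partial = "HorsieDoggieCat".replace(lookaround, "");

console.log(full, partial);
```

Note the difference in behavior: the two alternatives match separately, so a Horsie is removed even when no Birdie follows the Doggie.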
I've seen regex patterns that use explicitly numbered repetition instead of ?, * and +, i.e.:
Explicit            Shorthand
(something){0,1}    (something)?
(something){1}      (something)
(something){0,}     (something)*
(something){1,}     (something)+
The questions are:
Are these two forms identical? What if you add possessive/reluctant modifiers?
If they are identical, which one is more idiomatic? More readable? Simply "better"?
To my knowledge they are identical. I think there may be a few engines out there that don't support the numbered syntax, but I'm not sure which. I vaguely recall a question on SO a few days ago where explicit notation wouldn't work in Notepad++.
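One way to spot-check the equivalence in JavaScript is to run both forms over the same samples. This is only a sketch: the sample strings are arbitrary, it says nothing about other engines, and JavaScript has no possessive quantifiers, so only the reluctant (lazy) variant is compared.

```javascript
// Each pair holds the explicit form and its shorthand counterpart.
const pairs = [
  [/^(ab){0,1}$/, /^(ab)?$/],
  [/^(ab){1}$/, /^(ab)$/],
  [/^(ab){0,}$/, /^(ab)*$/],
  [/^(ab){1,}$/, /^(ab)+$/],
];
const samples = ["", "ab", "abab", "ababab", "a", "abc"];

// True when every pair accepts and rejects exactly the same samples.
const allAgree = pairs.every(([explicit, shorthand]) =>
  samples.every((s) => explicit.test(s) === shorthand.test(s))
);

// The reluctant modifier applies to both forms in the same way.
const lazyExplicit = "aaa".match(/a{1,}?/)[0];
const lazyShorthand = "aaa".match(/a+?/)[0];

console.log(allAgree, lazyExplicit === lazyShorthand);
```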
The only time I would use explicitly numbered repetition is when the repetition is greater than 1:
Exactly two: {2}
Two or more: {2,}
Two to four: {2,4}
I tend to prefer these especially when the repeated pattern is more than a few characters. If you have to match 3 numbers, some people like to write: \d\d\d but I would rather write \d{3} since it emphasizes the number of repetitions involved. Furthermore, down the road if that number ever needs to change, I only need to change {3} to {n} and not re-parse the regex in my head or worry about messing it up; it requires less mental effort.
If that criterion isn't met, I prefer the shorthand. Using the "explicit" notation quickly clutters up the pattern and makes it hard to read. I've worked on a project where some developers didn't know regex too well (it's not exactly everyone's favorite topic) and I saw a lot of {1} and {0,1} occurrences. A few people would ask me to code review their pattern, and that's when I would suggest changing those occurrences to shorthand notation to save space and, IMO, improve readability.
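As a small illustration of the counted forms listed above, here is a JavaScript sketch. The digit patterns and test strings are assumptions chosen for the example.

```javascript
const exactlyTwo = /^\d{2}$/;
const twoOrMore = /^\d{2,}$/;
const twoToFour = /^\d{2,4}$/;

// Both of these accept exactly three digits; the counted form states the number outright.
const threeCounted = /^\d{3}$/;
const threeSpelledOut = /^\d\d\d$/;

console.log(exactlyTwo.test("12"), twoOrMore.test("12345"), twoToFour.test("12345"));
console.log(threeCounted.test("123") === threeSpelledOut.test("123"));
```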
I can see how, if you have a regex that does a lot of bounded repetition, you might want to use the {n,m} form consistently for readability's sake. For example:
/^
abc{2,5}
xyz{0,1}
foo{3,12}
bar{1,}
$/x
But I can't recall ever seeing such a case in real life. When I see {0,1}, {0,} or {1,} being used in a question, it's virtually always being done out of ignorance. And in the process of answering such a question, we should also suggest that they use the ?, * or + instead.
And of course, {1} is pure clutter. Some people seem to have a vague notion that it means "one and only one"--after all, it must mean something, right? Why would such a pathologically terse language support a construct that takes up a whole three characters and does nothing at all? Its only legitimate use that I know of is to isolate a backreference that's followed by a literal digit (e.g. \1{1}0), but there are other ways to do that.
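The backreference case can be sketched in JavaScript as follows. This assumes a regex without the u flag, where, as far as I know, \10 with only one capturing group is read as a legacy octal escape rather than as \1 followed by a literal 0.

```javascript
// {1} ends the backreference, so the following 0 is a literal digit.
const isolatedWithCount = /(a)\1{1}0/;
// One of the "other ways": wrap the backreference in a non-capturing group.
const isolatedWithGroup = /(a)(?:\1)0/;
// Without isolation the digits run together and the meaning changes.
const ambiguous = /(a)\10/;

console.log(isolatedWithCount.test("aa0"));
console.log(isolatedWithGroup.test("aa0"));
console.log(ambiguous.test("aa0"));
```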
They're all identical unless you're using an exceptional regex engine. However, not all regex engines support numbered repetition, ? or +.
If all of them are available, I'd use characters rather than numbers, simply because it's more intuitive for me.
They're equivalent (and you'll find out if they're available by testing your context.)
The problem I'd anticipate is when you may not be the only person ever needing to work with your code.
Regexes are difficult enough for most people. Anytime someone uses an unusual syntax, the question
arises: "Why didn't they do it the standard way? What were they thinking that I'm missing?"
It's quite a simple but, in my opinion, weird problem. I have this regex, entered a few tests, and they work.
(?=^\*)|(?=^.{1,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-\{\}]{1,63}\.?)+(?:[a-zA-Z\{\}]{1,})$)
https://regex101.com/r/hU6tP0/2
But when I try to use it in HTML it fails, whereas if I test it in JavaScript it works.
http://jsfiddle.net/ek6kby2q/9/
I don't have much knowledge of regex, so if anyone knows what's going wrong, or has any tips to make the regex better, that would be welcome.
As an HTML pattern attribute, the regex must match the whole string from beginning to end. That's why (?=^\*) fails: it matches zero characters.
Use this pattern instead:
\*.*|(?!.{255})(?:[A-Za-z_{}-][\w{}-]{0,62}\.?)+[A-Za-z{}]+
(You can omit the anchors since they are implicit)
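The implicit anchoring can be approximated in JavaScript by wrapping the pattern in ^(?:...)$. The helper name below is hypothetical, and the sketch ignores the Unicode-mode flag that browsers also apply when compiling the attribute.

```javascript
// Hypothetical helper: rough emulation of how a pattern attribute is applied to a value.
function matchesLikePatternAttribute(pattern, value) {
  return new RegExp("^(?:" + pattern + ")$").test(value);
}

const original = String.raw`(?=^\*)|(?=^.{1,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-\{\}]{1,63}\.?)+(?:[a-zA-Z\{\}]{1,})$)`;
const suggested = String.raw`\*.*|(?!.{255})(?:[A-Za-z_{}-][\w{}-]{0,62}\.?)+[A-Za-z{}]+`;

// Unanchored, the original finds a zero-width match at the start of "*".
console.log(new RegExp(original).test("*"));
// Anchored like the attribute, the zero-width branch cannot cover the whole value.
console.log(matchesLikePatternAttribute(original, "*"));
console.log(matchesLikePatternAttribute(suggested, "*"));
console.log(matchesLikePatternAttribute(suggested, "example.com"));
```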
This will be run in JavaScript multiple times on bits of HTML. Will all of the alternations (the | expressions) make it slow? Can it be optimized?
\<[^\>]*?(abbr|acronym|address|applet|area|article|aside|audio|base|basefont|bdi|bdo|big|blockquote|body|button|canvas|caption|center|cite|code|col|colgroup|command|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frame|frameset|head|header|hr|html|iframe|img|input|ins|kbd|keygen|label|legend|link|map|mark|menu|meta|meter|nav|noframes|noscript|object|optgroup|option|output|param|pre|progress|q|rp|rt|ruby|samp|script|section|select|small|source|strike|style|sub|summary|sup|textarea|time|title|track|tt|var|video|wbr)[^\>]*?\>/g
You might try moving element names found very frequently in your source (a, div) to the front of the list:
… (a|div|abbr| …
Also, I think your pattern will match, e.g., < notanabbreviation >. If that's not what you want, try
<\b(a|abbr|…)\b[^>]*?>
The \b preceding the alternations helps because it lets the engine exit early without trying all of the alternations.
But you just have to test to see. I made a jsperf test using nytimes.com as an example.
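A minimal sketch of the word-boundary variant, built from a short list with the frequent names first. The five tag names and the sample HTML are placeholders for illustration, not the full list from the question.

```javascript
// Frequent names first; the full list from the question would go here.
const tagNames = ["a", "div", "abbr", "acronym", "address"];
const tagPattern = new RegExp("<\\b(" + tagNames.join("|") + ")\\b[^>]*?>", "g");

const html = '<div class="x"><a href="#">link</a></div> < notanabbreviation > <divider>';
const found = html.match(tagPattern);

console.log(found);
```

With this shape, closing tags such as </div> are no longer matched, because there is no word boundary between < and /. Whether that matters depends on what the original pattern was meant to catch.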
You can use this tool to compare different regexes.
It took 2.4 seconds to execute over the source code of Yahoo's front page. This is not a scientific test, but it doesn't look very efficient.
PS: the Silverlight plugin is required.
Adding i after the g will make it case-insensitive.
Also, since it is JavaScript, maybe you could use a hash instead of a giant regex.
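One possible reading of the hash suggestion is to match any tag with a small regex and look the name up in a Set. This is a sketch under that assumption; knownTags, anyTag and findKnownTags are invented names, and the Set holds only a few of the names from the question.

```javascript
// A few names from the question's list; the real Set would hold all of them.
const knownTags = new Set(["abbr", "div", "img", "video"]);

// Generic tag matcher: optional slash, then the tag name captured in group 1.
const anyTag = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi;

function findKnownTags(html) {
  const hits = [];
  for (const match of html.matchAll(anyTag)) {
    // Constant-time lookup instead of a long alternation.
    if (knownTags.has(match[1].toLowerCase())) hits.push(match[0]);
  }
  return hits;
}

console.log(findKnownTags('<DIV id="a"><span>x</span><img src="y"></DIV>'));
```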
I'm talking about stuff like &amp;amp; which will then render to: &amp; when it actually should render to &. In this I asked how to match entities, but it seems that isn't really possible or realistic with regexes. What then is the best way to match double entities?
EDIT: Is this a good way to do it? .replace(/&(?=#?x?[0-9a-z]+);/i, '&');
(I'm using JavaScript)
I'd go with
pattern &([a-zA-Z0-9]+?;)\1
replacement &$1
to replace just double amps, or:
pattern &([#a-zA-Z0-9]+?;)
EDIT:
your pattern
/&(?=#?x?[0-9a-z]+);/i
also looks good to me.
Note: none of these is something you can trust
Possibly:
&[a-zA-Z]+;
Though not foolproof.
Normalize your data first. Use whatever you know about the encoding to decode the data back to a form where each character or piece of data has only one possible encoding. After that, match the normalized data against a normalized pattern.
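A rough sketch of the normalize-first idea in JavaScript: decode repeatedly until the text stops changing. The function names and the tiny entity table are assumptions for illustration, and a real decoder would need the full named-entity list. Decoding to a fixed point also unescapes text that was deliberately written as an escaped entity, so it only suits data where that cannot occur.

```javascript
// Tiny illustrative table; a real one would cover every named entity.
const NAMED = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'" };

// Decode one layer of named and numeric entities; unknown names are left alone.
function decodeOnce(text) {
  return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      // Leave out-of-range code points untouched rather than throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}

// Repeat until nothing changes, so each character ends up with a single representation.
function normalizeEntities(text) {
  let previous;
  do {
    previous = text;
    text = decodeOnce(text);
  } while (text !== previous);
  return text;
}

console.log(normalizeEntities("&amp;amp;lt;"));
```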
I am looking to find this in a string: XXXX-XXX-XXX Where the X is any number.
I need to find this in a string using JavaScript so bonus points to those who can provide me the JavaScript too. I tried to create a regex and came out with this: ^[0-9]{4}\-[0-9]{3}\-[0-9]{3}$
Also, I would love to know of any cheat sheets or programs you guys use to create your regular expressions.
I suppose this is what you want:
\d{4}-\d{3}-\d{3}
In doubt? Google for "RegEx Testers".
With your attempt:
^[0-9]{4}\-[0-9]{3}\-[0-9]{3}$
Since the - is not a metacharacter outside a character class, there is no need to escape it. (In JavaScript and Perl the escaped \- still matches a plain hyphen, so the backslashes are clutter rather than a bug.)
Also, you've anchored the match at the beginning and end of the string -- this will match only strings that consist only of your number. (Well, assuming the rest were correct.)
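The effect of the anchors can be seen in a short JavaScript sketch. The sample text and variable names are made up for illustration.

```javascript
const text = "Order ref 1234-567-890 shipped";

// Anchored: matches only when the whole string is the number.
const anchored = /^[0-9]{4}-[0-9]{3}-[0-9]{3}$/;
// Unanchored: finds the number anywhere inside the string.
const unanchored = /[0-9]{4}-[0-9]{3}-[0-9]{3}/;
// Escaping the hyphen changes nothing here.
const escaped = /[0-9]{4}\-[0-9]{3}\-[0-9]{3}/;

console.log(anchored.test(text));
console.log(unanchored.exec(text)[0]);
console.log(escaped.exec(text)[0]);
```

exec() returns null when nothing matches, so real code should check the result before indexing into it.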
I know most people like the {3} style of counting, but when the thing being matched is a single digit, I find this more legible:
\d\d\d\d-\d\d\d-\d\d\d
Obviously if you wanted to extend this to matching hexadecimal digits, extending this one would be horrible, but I think this is far more legible than alternatives:
\d{4}-\d{3}-\d{3}
[[:digit:]]{4}-[[:digit:]]{3}-[[:digit:]]{3}
[0-9]{4}-[0-9]{3}-[0-9]{3}
Go with whatever is easiest for you to read.
I tend to use the perlre(1) manpage as my main reference, knowing full well that it is far more featureful than many regexp engines. I'm prepared to handle the differences considering how conveniently available the perlre manpage is on most systems.
var result = (/\d{4}\-\d{3}\-\d{3}/).exec(myString);