RegEx breaking when string contains square brackets - javascript

I've been using this regular expression to pull out mustached {{Hello}} content:
/{{\s*[\w\.]+\s*}}/g
It fails when the mustached string contains square brackets. I've been fiddling with it for ages to no avail. Could anyone suggest an adjustment so that it also matches {{Hello[0]}}?

I'm your Huckleberry:
\{\{(.*?)\}\}
I always hack around with these using the excellent http://www.regexr.com/
So, to explain why this works for this situation:
First, consider \{\{. We escape the first character we are looking for (the curly brace) with a backslash. Escaping means the next character is not evaluated as part of the expression syntax; the regex just looks for that literal character.
We then repeat that to get the second curly brace.
Next we open a parenthesis ( to make a 'group' to capture multiple tokens – so we can grab everything inside the braces.
The . matches any character except line breaks.
The * matches zero or more of the preceding token (in this case, any character except a line break).
The ? makes the previous quantifier 'lazy' in that it will match as few as possible.
Then we close the group ).
Finally, we close with two more escaped characters: \}\}
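To see the difference in practice, here is a minimal sketch in JavaScript. The helper name extractMustaches and the sample string are made up for illustration; only the two patterns come from the posts above.

```javascript
// Pattern from the answer above, with the global flag so every tag is found.
const mustache = /\{\{(.*?)\}\}/g;

// Hypothetical helper (not from the original post): returns the trimmed
// content of every {{...}} tag in the input.
function extractMustaches(text) {
  return Array.from(text.matchAll(mustache), (m) => m[1].trim());
}

const sample = "Hi {{Hello}}, first item: {{ Hello[0] }}, nested: {{user.name}}";
console.log(extractMustaches(sample)); // [ 'Hello', 'Hello[0]', 'user.name' ]

// The pattern from the question only allows word characters and dots between
// the braces, so the tag containing square brackets is skipped.
const original = /{{\s*[\w\.]+\s*}}/g;
console.log(sample.match(original)); // [ '{{Hello}}', '{{user.name}}' ]
```

Because (.*?) accepts anything except line breaks, it also accepts content you might not want (spaces, punctuation), so widening the original character class to include [ and ] may be the stricter choice if that matters.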


Matching variants and misspellings of a word using RegEx in MS Word

I am trying to capture variants of a word using the Microsoft Word find and replace function. Here is a searchable snippet:
There are going to be 3 instances of the word successful for the purpose of Regex matching. Here is the second sucesfull and here is another succesfull , both spelt incorrectly.
This is my Regex expression used in Find and Replace with "Use Wildcards" selected (I have also tried this with replacing the braces with brackets with no joy)
<([Ss]uc[1,]es[1,]ful[1,])>
[Ss]uc{1,}es{1,}ful{1,}
Replace the [ ] with { } and it should work fine. The curly braces specify how many times you want a character to repeat. Square brackets are used to specify the acceptable characters.
So the current regular expression will match the following.
succcccesssfulll
sucesful
successful
Successsssfull
and so on.
I think this is cleaner and easier to type.
[Ss]uc+es+ful+
"+" matches one or more occurrences of the preceding character.
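As a quick check of the + version outside of Word, here is a small sketch in JavaScript regex syntax (not Word wildcard syntax). The \b word boundaries are an assumption standing in for whole-word matching; the sentence is the snippet from the question.

```javascript
const sentence =
  "There are going to be 3 instances of the word successful for the purpose " +
  "of Regex matching. Here is the second sucesfull and here is another " +
  "succesfull , both spelt incorrectly.";

// \b is assumed here to play the role of whole-word matching.
const variants = /\b[Ss]uc+es+ful+\b/g;

const found = sentence.match(variants);
console.log(found); // [ 'successful', 'sucesfull', 'succesfull' ]
```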
The search string you want would be:
<[sS]uc@es@ful@>
This searches for a word (the < and > symbols) starting with either s or S and including one or more (the @ symbol) of c, s, and l.

Regex - I keep getting "Nothing to repeat" exception

I use this regex code to parse urls:
/^(((http|https):\/\/)+[www.])?+\s*\S+\s*+(.com|.es|.net|.org|.co)$/ig
It works perfectly on https://regex101.com/r/bX5oM4/1
But in my console I keep getting:
SyntaxError: Invalid regular expression: /^(((http|https):\/\/)+[www\.])?+\s*\S+\s*+(\.com|\.es|\.net|\.org|\.co)$/: Nothing to repeat
I tried escaping the + but it doesn't work. I'm kind of new to regex, so it could be anything.
Here is your fixed regex:
^(?:https?:\/\/www\.)?[a-zA-Z0-9]\S+(\.(?:com|es|net|org|co))$
See demo
Or, to match the strings inside larger strings:
\b(?:https?:\/\/www\.)?[a-zA-Z0-9]\S+(?:\.(?:com|es|net|org|co))\b
See another demo
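In JavaScript, you cannot put a + directly after another quantifier such as ? or *. The ?+ and *+ combinations are possessive quantifiers in PCRE (the default flavor on regex101), but JavaScript does not support them and reports "Nothing to repeat".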
Also, note that [www.] matches 1 character, either w or . since it is a character class. You must have meant a group, and thus you need round brackets, not square ones.
I removed unnecessary groups, regrouped them a bit and escaped the dots. Note that unescaped dot matches any character but a newline.
So, the regex:
^ - Asserts the position at the start of the string
(?:https?:\/\/www\.)? - Optionally matches http or https, then ://www. literally
[a-zA-Z0-9]\S+ - 1 alphanumeric character, then 1 or more non-whitespace characters
(\.(?:com|es|net|org|co)) - Matches a dot and then any of the alternatives in the round brackets
$ - Asserts end of string
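A small sketch to reproduce both points in a JavaScript console: the stacked quantifiers fail to compile, and the fixed pattern behaves as described. The helper compiles and the sample strings are hypothetical, and the i flag is carried over from the question.

```javascript
// Hypothetical helper: true if the pattern source compiles in this engine.
function compiles(source) {
  try {
    new RegExp(source);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

console.log(compiles("a?+"));   // false: + has nothing to repeat after ?
console.log(compiles("\\s*+")); // false: same problem after *
console.log(compiles("a?"));    // true

// The fixed pattern from the answer above.
const urlPattern = /^(?:https?:\/\/www\.)?[a-zA-Z0-9]\S+(\.(?:com|es|net|org|co))$/i;

const samples = ["http://www.example.com", "example.org", "not a url", "example.xyz"];
for (const s of samples) {
  console.log(s, urlPattern.test(s)); // true, true, false, false
}
```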
Try this (update!)
^((http|https):\/\/)?([\w]+[.-]?)+\.(com|es|net|org|co|uk|de)$
instead of
/^(((http|https):\/\/)+[www.])?+\s*\S+\s*+(.com|.es|.net|.org|.co)$/ig
You had an extra + behind a ? and another one behind a *. And several other things were not quite OK, as stribizhev pointed out quite rightly!
This regex is looking for a limited range of TLDs (e.g. French pages would not pass). The [www.] was syntactically wrong and also superfluous, as any domain name can have subdomains (expressed by ([\w]+[.-]?)+) and 'www.' is just one of the possible ones.
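Here is a short sketch exercising this second pattern in JavaScript. The test strings are made-up examples, not from the original posts.

```javascript
// Pattern from the answer above, written as a regex literal.
const domainPattern = /^((http|https):\/\/)?([\w]+[.-]?)+\.(com|es|net|org|co|uk|de)$/;

console.log(domainPattern.test("https://sub.example.co.uk")); // true
console.log(domainPattern.test("my-site.de"));                // true
console.log(domainPattern.test("example.fr"));                // false: .fr is not listed
```

One thing to keep in mind: the nested quantifier in ([\w]+[.-]?)+ can backtrack heavily on long inputs that do not match, so it is probably best kept to short strings.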

steps to manually parse this regular expression /([.*+?^=!:${}()|\]\\])/g

/([.*+?^=!:${}()|\[\]\/\\])/g
1. / /g
2. ( )
3. [ ]
4. left: .*+?^=!:${}()
   right: \[\]\/\\
right side of the or operator:
\[ matches [
\] matches ]
\/ matches /
\\ matches \
Is my step 4 correct?
What does the left part of step 4,
.*+?^=!:${}()
match inside the square brackets?
Since step 3 is a [], it matches only one character. Is this correct?
The regular expression is copied from here:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions?redirectlocale=en-US&redirectslug=JavaScript%2FGuide%2FRegular_Expressions
There is no left or right part in step 4
[.*+?^=!:${}()|\[\]\/\\]
defines a character set that consists of:
a dot character
an asterisk
a plus character
a question mark
...
a forward slash (escaped)
a back slash (escaped)
So this part of the whole expression matches a single character that is one of those enumerated within the square brackets [ ... ]
@zerkms has a good answer. I just want to offer an alternative - by pointing you to the really useful site regex101.com. There you can enter your expression and you get a very nice explanation of how to interpret it; you can enter strings as well, and see what is matched. Putting in the above expression (see http://regex101.com/r/iG3lA0 ) confirms that everything inside the outermost brackets is treated as a single character class, with escaped values for []/\; the entire expression can be interpreted as
"Match any of the characters .*+?^=!:${}()|[]/\ anywhere in the
string, and return each of these characters as a separate match".
The rules about special characters inside the [] character class construct are a bit strange - see for example http://www.regular-expressions.info/charclass.html. And the /g flag means this matches these characters anywhere in the string that's being matched (rather than just once). Thus the answer to the last part of your question:
"While the expression inside the square brackets matches only one character at a time, the /g flag means the match is performed everywhere, and each matching character is returned as a separate match".
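The behaviour described above can be checked with a few lines of JavaScript. The escapeRegExp helper below is a hypothetical illustration of one common use for such a character class (escaping text so it can be used literally inside another regex); the sample strings are made up.

```javascript
const special = /([.*+?^=!:${}()|\[\]\/\\])/g;

// Each special character is returned as its own match; "a", "b", "c" and "0"
// are not in the class and are skipped.
console.log("a.b*c[0]".match(special)); // [ '.', '*', '[', ']' ]

// Hypothetical helper: prefix every special character with a backslash.
function escapeRegExp(text) {
  return text.replace(special, "\\$1");
}

const literal = "1+1=2 (really?)";
const escaped = escapeRegExp(literal);
console.log(escaped); // 1\+1\=2 \(really\?\)

// The escaped text now matches itself literally.
console.log(new RegExp(escaped).test("so 1+1=2 (really?) yes")); // true
```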

Could anyone explain the following JavaScript regex code?

Could anyone explain the following example code?
It's from the last example here.
I'm not sure why there's no '\' before the '.'; adding the '\' gives the same result.
JavaScript:
var url = "http://xxx.domain.com";
console.log(/[^.]+/.exec(url)[0].substr(7)); // prints "xxx"
Note the paragraph here regarding Metacharacters Inside Character Classes
Note that the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^) and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash.
Get the characters up to the first period, then remove the first 7, which is the http:// part. That leaves you with the first part of the domain, which in this case is xxx.
[^.]+ means one or more characters that are not a period, so this matches http://xxx (everything before the first period). Note that the period does not need to be escaped inside the brackets to be treated as a normal character, as it has no special meaning there.
[0] means the entire match which is http://xxx
.substr(7) means to get the characters after the first 7 which will be xxx
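Breaking the one-liner into steps makes this easier to follow. This is just a sketch; the intermediate variable names are made up, and the URL-based alternative at the end is an addition that is not part of the original example.

```javascript
const url = "http://xxx.domain.com";

// Step 1: match everything up to (not including) the first period.
const match = /[^.]+/.exec(url);
console.log(match[0]); // "http://xxx"

// Step 2: drop the first 7 characters ("http://").
console.log(match[0].substr(7)); // "xxx"

// Escaping the period inside the class is allowed but changes nothing.
const escapedMatch = /[^\.]+/.exec(url);
console.log(escapedMatch[0] === match[0]); // true

// Alternative sketch: let the built-in URL class strip the scheme instead of
// relying on the length of "http://".
console.log(new URL(url).hostname.split(".")[0]); // "xxx"
```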

Javascript Regex for Javascript Regex and Digits

The title might seem a bit recursive, and indeed it is.
I am working on a script which can highlight/color JavaScript code displayed in HTML. Thus, in the browser, comments will be turned green, definitions (for, if, while, etc.) will be turned dark blue and italic, numbers will be red, and so on for other elements. However, the coloring is not all that important.
I am trying to figure out two different regular expressions which have started to cause a minor headache.
1. Finding a regular expression using a regular expression
I want to find regular expressions within the script tags of HTML using JavaScript, such as:
match(/findthis/i);
where the regex part is, of course, "/findthis/i".
The rules are as follows:
1. Finding multiple occurrences (/g) is not important.
2. It must be on the same line (not /m).
3. Case-insensitive (/i).
4. If a backslash (the ignore character) is followed directly by a forward slash, "/", the forward slash is part of the expression, not an escape character. E.g.: /itdoesntstop\/untilnow:/
5. Two forward slashes right next to each other (//) are: (A) At the beginning: not a regex; it's a comment. (B) Later on: the first slash is the end of the regex and the second slash is nothing but a character.
6. The regex continues until a line break or the end of input (\n|$), or until the escape character (a second forward slash which complies with rule 4) is encountered. In addition, any alphabetic characters directly following the second forward slash are considered part of the regex. E.g.: /aregex/allthisispartoftheregex
So far what I've got is this:
'\\/(?:[^\\/\\\\]|\\/\\*)*\\/([a-zA-Z]*)?'
However, it isn't consistent. Any suggestions?
2. Find digits (alphanumeric, floating) using a regular expression
Finding digits on their own is simple. However, finding floating numbers (with multiple periods) and letters including underscore is more of a challenge.
All of the below are considered numbers (a new number starts after each space):
3 3.1 3.1.4 3a 3.A 3.a1 3_.1
The rules:
Finding multiple occurrences (/g) is not important.
It must be on the same line (not /m).
Case-insensitive (/i).
A number must begin with a digit. However, the number can be preceded or followed by a non-word (\W) character. E.g.: "=9.9;" where "9.9" is the actual number. "a9" is not a number. A period before the number, ".9", is not considered part of the number and thus the actual number is "9".
Allowed characters: [a-zA-Z0-9_.]
What I've got:
'(^|\\W)\\d([a-zA-Z0-9_.]*?)(?=([^a-zA-Z0-9_.]|$))'
It doesn't work quite the way I want it.
For the first part, I think you are quite close. Here is what I would use (as a regex literal, to avoid all the double escapes):
/\/(?:[^\/\\\n\r]|\\.)+\/([a-z]*)/i
I don't know what you intended with your second alternative after the character class. But here the second alternative is used to consume backslashes and anything that follows them. The last part is important, so that you can recognize the regex ending in something like this: /backslash\\/. And the ? at the end of your regex was redundant. Otherwise this should be fine.
Test it here.
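If the test link is unavailable, a rough sketch like the following can be run instead. The helper findRegexLiteral and the input lines are hypothetical; only the pattern is taken from the answer.

```javascript
// Pattern from the answer above.
const regexLiteral = /\/(?:[^\/\\\n\r]|\\.)+\/([a-z]*)/i;

// Hypothetical helper: returns the first regex-looking literal on a line,
// together with its trailing flags, or null if there is none.
function findRegexLiteral(line) {
  const m = regexLiteral.exec(line);
  return m ? { literal: m[0], flags: m[1] } : null;
}

console.log(findRegexLiteral("match(/findthis/i);"));
// { literal: '/findthis/i', flags: 'i' }

// String.raw keeps the backslashes exactly as typed.
console.log(findRegexLiteral(String.raw`x.test(/itdoesntstop\/untilnow:/)`).literal);
// /itdoesntstop\/untilnow:/

console.log(findRegexLiteral(String.raw`/backslash\\/`).literal);
// /backslash\\/

console.log(findRegexLiteral("// just a comment")); // null
```

Note that a pattern alone cannot tell a regex literal from two divisions on one line (for example a / b / c), so a real highlighter would likely need some context about what precedes the first slash.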
Your second regex is just fine for your specification. There are a few redundant elements though. The main thing you might want to do is capture everything but the possible first character:
/(?:^|\W)(\d[\w.]*)/i
Now the actual number (without the first character) will be in capturing group 1. Note that I removed the ungreediness and the lookahead, because greediness alone does exactly the same.
Test it here.
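The same kind of check for the number pattern, as a sketch. The g flag is added here only so that every number on a line can be collected in one pass (the question says /g is not important), and findNumbers is a made-up helper name.

```javascript
// The answer's pattern, plus the g flag (an addition for this demo).
const numberPattern = /(?:^|\W)(\d[\w.]*)/gi;

// Hypothetical helper: returns capturing group 1 of every match, that is,
// the number without the non-word character in front of it.
function findNumbers(line) {
  return Array.from(line.matchAll(numberPattern), (m) => m[1]);
}

console.log(findNumbers("3 3.1 3.1.4 3a 3.A 3.a1 3_.1"));
// [ '3', '3.1', '3.1.4', '3a', '3.A', '3.a1', '3_.1' ]

console.log(findNumbers("x =9.9;")); // [ '9.9' ]
console.log(findNumbers("a9"));      // []
console.log(findNumbers(".9"));      // [ '9' ]
```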
