I have a simple scenario where I want to match the following and capture the value:
stuff_in_string,
env: 'local', // want to match this and capture the content in quotes
more_stuff_in_string
I have never written a regex pattern before so excuse my attempt, I am well aware it is totally wrong.
This is what I am trying to say:
Match "env:"
Followed by none or more spaces
Followed by a single or double quote
Capture all until..
The next single or double quote
/env:*?\s+('|")+(.*?)+('|")/g
Thanks
PS here is a #failed fiddle: http://jsfiddle.net/DfHge/
Note: this is the regex I ended up using (not the answer below as it was overkill for my needs): /env:\s+(?:"|')(\w+)(?:"|')/
You can use this:
/\benv: (["'])([^"']*)\1/g
where \1 is a backreference to the first capturing group, thus your content is in the second. This is the simple way for simple cases.
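As a rough sketch of how the simple pattern could be used from JavaScript (the helper name findEnvValues and the sample text are assumptions made for illustration, not part of the question):

```javascript
// Simple case: the closing quote must be the same character as the opening one.
const simpleEnv = /\benv: (["'])([^"']*)\1/g;

// Hypothetical helper: returns every captured env value found in `text`.
function findEnvValues(text) {
  // Group 1 is the quote character, group 2 is the content between the quotes.
  return Array.from(text.matchAll(simpleEnv), (m) => m[2]);
}

const sample = "stuff_in_string,\nenv: 'local',\nmore_stuff_in_string";
console.log(findEnvValues(sample)); // [ 'local' ]
```

A mismatched pair such as env: "abc' yields no match, because \1 insists on the quote captured by group 1.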
Now, other cases like:
env: "abc\"def"
env: "abc\\"
env: "abc\\\def"
env: "abc'def"
You must use a more constraining pattern:
first: avoid the different quotes problem:
/\benv: (["'])((?:[^"']+|(?!\1)["'])*)\1/g
I put all the possible content in a non-capturing group that I can repeat at will, and I use a negative lookahead (?!\1) to check that a quote inside the content is not the same as the captured opening quote.
second: the backslash problem:
If a quote is escaped, it can't be the closing quote! Thus you must check if the quote is escaped or not and allow escaped quotes in the string.
I remove the backslashes from allowed content:
/\benv: (["'])((?:[^"'\\]+|(?!\1)["'])*)\1/g
I allow escaped characters:
/\benv: (["'])((?:[^"'\\]+|(?!\1)["']|\\[\s\S])*)\1/g
To allow a variable number of spaces before the quoted part, you can replace the colon and space with :\s*
/\benv:\s*(["'])((?:[^"'\\]+|(?!\1)["']|\\[\s\S])*)\1/g
You now have a working pattern.
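A minimal sketch that runs this pattern against the problem cases listed earlier. The helper name rawEnvValues is an assumption, and String.raw is only used so that the backslashes reach the regex exactly as typed:

```javascript
const envPattern = /\benv:\s*(["'])((?:[^"'\\]+|(?!\1)["']|\\[\s\S])*)\1/g;

// Hypothetical helper: returns the raw (still escaped) content of each env value.
function rawEnvValues(text) {
  return Array.from(text.matchAll(envPattern), (m) => m[2]);
}

// String.raw keeps every backslash in the sample lines untouched.
const lines = [
  String.raw`env: "abc\"def"`, // escaped quote inside the value
  String.raw`env: "abc\\"`,    // escaped backslash right before the closing quote
  String.raw`env: "abc'def"`,  // the other kind of quote inside the value
  String.raw`env:   'local'`,  // several spaces after the colon
];

for (const line of lines) {
  console.log(rawEnvValues(line));
}
```

Note that the captured text still contains the escape sequences; unescaping them, if needed, is a separate step.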
third: pattern optimization
a simple alternation:
Using a capture group and a backreference can be tempting for dealing with the different types of quotes, since it allows you to write the pattern concisely. However, this approach has to create a capture group and test a lookahead in the part (?!\1)["'], so it is not very efficient. Writing a simple alternation increases the pattern length and needs two capture groups for the two cases, but is more efficient:
/\benv:\s*(?:"((?:[^"\\]+|\\[\s\S])*)"|'((?:[^'\\]+|\\[\s\S])*)')/g
(Note: if you decide to do that, you must check which of the two capture groups is defined.)
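One possible way to do that check in JavaScript is sketched below; the helper name envValuesAlt is an assumption. The comparison against undefined matters because an empty value ("") is a legitimate capture but is falsy:

```javascript
const envAlt = /\benv:\s*(?:"((?:[^"\\]+|\\[\s\S])*)"|'((?:[^'\\]+|\\[\s\S])*)')/g;

// Hypothetical helper: picks whichever of the two groups took part in the match.
function envValuesAlt(text) {
  return Array.from(text.matchAll(envAlt), (m) =>
    // Group 1 is set for double quotes, group 2 for single quotes;
    // the group that did not participate is undefined.
    m[1] !== undefined ? m[1] : m[2]
  );
}

console.log(envValuesAlt(`env: "a'b", env: 'c"d', env: ""`));
```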
unrolling the loop:
To match the content inside quotes we use (?:[^"\\]+|\\[\s\S])* (for double quotes here), which works but can be improved to reduce the number of steps needed. To do that, we unroll the loop, which means rewriting the subpattern so that it avoids the alternation:
[^"\\]*(?:\\[\s\S][^"\\]*)*
finally the whole pattern can be written like this:
/\benv:\s*(?:"([^"\\]*(?:\\[\s\S][^"\\]*)*)"|'([^'\\]*(?:\\[\s\S][^'\\]*)*)')/g
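A quick sanity check, as a sketch only, that the unrolled pattern captures the same text as the alternation version on a few inputs (the helper name capture and the sample input are assumptions; this does not measure the step count):

```javascript
const looped = /\benv:\s*(?:"((?:[^"\\]+|\\[\s\S])*)"|'((?:[^'\\]+|\\[\s\S])*)')/g;
const unrolled = /\benv:\s*(?:"([^"\\]*(?:\\[\s\S][^"\\]*)*)"|'([^'\\]*(?:\\[\s\S][^'\\]*)*)')/g;

// Hypothetical helper: collects the defined capture group of every match.
function capture(pattern, text) {
  return Array.from(text.matchAll(pattern), (m) =>
    m[1] !== undefined ? m[1] : m[2]
  );
}

// String.raw keeps the backslashes in the sample as typed.
const input = String.raw`env: "a\"b", env: 'c\\', env: "plain"`;
console.log(capture(looped, input));
console.log(capture(unrolled, input));
```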
env: *('|").*?\1 is what you're looking for
the * means zero or more (here, of the preceding space)
('|") matches either a single or double quote, and also saves it into a group for backreferencing
.*? is a reluctant (lazy) match of any characters, stopping as early as possible
\1 will reference the first group, which was either a single or double quote
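A small sketch of this idea in JavaScript. Two things here are assumptions of the sketch rather than part of the answer: the pattern is written with the colon that appears in the question's input, and a capture group is added around .*? so the value can be read back. The helper name envValue is also made up:

```javascript
// Group 1 is the opening quote, group 2 is the lazily matched content.
const envLazy = /env: *('|")(.*?)\1/;

// Hypothetical helper: returns the quoted value, or null when nothing matches.
function envValue(text) {
  const m = text.match(envLazy);
  return m ? m[2] : null;
}

console.log(envValue("env: 'local', // want to match this")); // local
```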
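regex=/env: ?['"]([^'"]+)['"]/
answer=str.match(regex)[1]
even better, require the closing quote to be the same as the opening one (a backreference has no special meaning inside a character class, so a negative lookahead is used instead):
regex=/env: ?(['"])((?:(?!\1).)*)\1/
answer=str.match(regex)[2]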
Help with Regex (javascript flavor):
This first regex (I call it "quote regex") will match everything between matching quotes (single/double): /((?<quote>["']).*?\k<quote>)/i
Then I have this one (lets call it "tag regex"): /(?<=\s?)\S+:((?<quote>["']).*?\k<quote>|\(.*?\)|.*?(?=\s)|.*)/i:
This should match:
tag:something
tag:"something in double quotes"
tag:'something in single quotes'
tag:(between brackets)
[tag] -> can be any word
What I need is to ignore "tag regex" from the result of "quote regex"
I tried both negative/positive lookahead/lookbehind, but it will either match everything or nothing...
What's interesting is that using a negative lookbehind (?<!) works when there is a line break between the lines that shouldn't match and those that should.
https://regex101.com/r/1KEHfW/1
I'm sharing a link to regex101. It's "working", but I put a line break after the first line; if you delete the line break, it stops working.
You have a problem here:
tag:"something in double quotes"
tag:'something in single quotes'
You have specified a greedy wildcard quantifier *, and since you don't distinguish the kind of quote being paired, it matches from the first " up to the final ' in the line below. To match pairs of quotes, you need to specify something like this:
\"[^"]*\"|\'[^']*\'|\([^\)]*\)
which means one of three alternatives:
Either a double quote, followed by any number of characters not equal to double quote, followed by double quote.
Or a single quote, followed by any number of characters not equal to single quote, followed by a single quote.
Or a left parenthesis, followed by any number of characters not equal to a right parenthesis (see note below), followed by a right parenthesis.
If you shorten your regexp to accept any kind of quote at either end, the quotes are no longer matched in pairs, and you leave room for unintended matches.
Note: If you plan to nest parentheses, as in arithmetic expressions, there is bad news. Regular expressions can match any regular language, but a language that allows nesting of structures such as parentheses is not regular; it is context-free. Any regular expression you devise to match nested parentheses must therefore limit the nesting depth to a fixed bound. I don't recommend using regular expressions to parse expressions with bounded nesting either, because the size of the regexp grows very quickly with the maximum nesting level.
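The three-alternative pattern above can be tried out with a short sketch like this one (the helper name delimitedSpans and the sample line are assumptions; the unneeded backslashes before the quotes are dropped, which does not change the meaning):

```javascript
// Each alternative pairs an opening delimiter with its own closing delimiter.
const delimited = /"[^"]*"|'[^']*'|\([^)]*\)/g;

// Hypothetical helper: returns every delimited span, delimiters included.
function delimitedSpans(text) {
  return text.match(delimited) || [];
}

const tagLine = `tag:"it's quoted" tag:'say "hi"' tag:(between brackets)`;
console.log(delimitedSpans(tagLine));
```

Because each alternative excludes only its own closing delimiter, a single quote inside a double-quoted span (and vice versa) does not end the match early.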
Is it possible to use anchors inside a character class? This doesn't work:
analyze-string('abcd', '[\s^]abcd[\s$]')
It looks like ^ and $ are treated as literal when inside a character class; however, escaping them (\^, \$) doesn't work either.
I'm trying to use this expression to create word boundaries (\b is not available in XSLT/XQuery), but I would prefer not to use groups ((^|\s)) -- since non-capturing groups aren't available, that means in some scenarios I may end up with a large amount of unneeded capture groups, and that creates a new task of finding the "real" capture groups in the set of unneeded ones.
I believe the answer is no, you can't include ^ and $ as anchors in a [], only as literal characters. (I've wished you could do that before too.)
However, you could concat a space on the front and back of the string, then just look for \s as word boundaries and never mind the anchors. E.g.
analyze-string(concat(' ', 'abcd xyz abcd', ' '), '\sabcd\s')
You may also want + after each \s, but that's a separate issue.
If you're using analyze-string as a function, then presumably you're using a 3.0 implementation of either XSLT or XQuery.
In that case, why do you say "non-capturing groups aren't available"? The XPath Functions and Operators 3.0 spec is explicit that "Non-capturing groups are also recognized. These are indicated by the syntax (?:xxxx)."
A caret immediately after the opening square bracket negates the character class, so the class matches any character that is not listed in it; anywhere else inside the brackets, the caret is just a literal character. Negated character classes also match (invisible) line break characters.
You could possibly try a negative look-ahead, if your regex flavor supports one:
(?!\s)
I am working in Javascript and I have the following regex:
[img]([a-z0-9\-\./]+[^"\' ]*)[/img]/g
When I have the following text (with space separating between the 2 groups):
[img]http://www.bla.com[/img] [img]http://www.bla.com[/img]
the regex finds the 2 separate groups successfully.
However when given the following text (without space separating between the 2 groups):
[img]http://www.bla.com[/img][img]http://www.bla.com[/img]
the regex does not separate it into 2 groups, but rather 1 big group with http://www.bla.com[/img][img]http://www.bla.com inside it.
What am I missing in order to make the regex find the smallest groups when they are not separated by a space?
You may use this regex:
/\[img]([-a-z0-9.\/]+[^"'\s]*?)\[\/img]/g
RegEx Demo
[ and / need to be escaped: an unescaped [ would start a character class, and an unescaped / would end the regex literal.
With *? we use a lazy quantifier to match as little as possible before matching [/img]
If we are placing - at the start or end in a character class then it doesn't need escaping
dot doesn't need to be escaped in a character class
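To see the difference the lazy quantifier makes, here is a small sketch comparing it with a greedy variant of the same pattern (the helper name imgUrls is an assumption; the input is the no-space case from the question):

```javascript
const lazyImg = /\[img]([-a-z0-9.\/]+[^"'\s]*?)\[\/img]/g;
const greedyImg = /\[img]([-a-z0-9.\/]+[^"'\s]*)\[\/img]/g;

// Hypothetical helper: returns the captured URL of every [img]...[/img] pair.
function imgUrls(pattern, text) {
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

const noSpace = "[img]http://www.bla.com[/img][img]http://www.bla.com[/img]";
console.log(imgUrls(lazyImg, noSpace));   // two separate URLs
console.log(imgUrls(greedyImg, noSpace)); // one big capture spanning both tags
```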
why not just write it like this:
/\[img](.*?)\[\/img]/g
notice: the ? after * makes the quantifier lazy, which prevents greedy matching.
I want to build a localization application for my javascript (pebble.js) application. It should find all Strings between l._(" and ") or ') or ", or ',.
So for example I have the line
console.log(l._("This is a Test") + l._('%# times %# equals %#', 2, 4, (2*4)));
With the Swift application I should get an Array like this:
["This is a Test", "%# times %# equals %#"]
Right now I have no clue how I should manage it. Should I use a Regex, NSScanner or should I split the strings?
Thanks!
I'm not familiar with NSScanner, but with regexp the solution would be quite easy, with the assumption, that a string has the same delimiter at both ends. (I.e., you don't have stuff like l._("Hello world').) I think that wouldn't be valid syntax in JavaScript, so let's assume that is the case.
Also, let's assume that the strings don't contain any escaped quotes (of the same kind that is used as delimiter), i.e. there are no such strings: l._("Hello \" world").
Now you could use the following two regexps to find strings delimited by double quotes and those delimited by single quotes:
l\._\("((?:[^"])*)" -- for double quotes
l\._\('((?:[^'])*)' -- for single quotes
Then, you have to run these two regexps on your input, and get the result of the first capturing group for each match. (I'm not sure how exactly Swift uses regexps, but note that in many implementations, capturing group #0 is the whole match, and capturing group #1 is what is located between the first pair of parentheses; you need the latter.)
Also, note that you don't have to care what comes after the closing quote: whether it's a parenthesis, or a comma, it's the same for you, as you only have to look up to the quote.
EDIT: Corrected the regexp (repetition should be inside capturing group).
EDIT2: If we allow escaped quotes, then you can use this regexp for the single quoted case:
l\._\('((?:\\'|[^'])*)'. And a similar one for the double quoted.
You can play around with the regex on this link:
https://regex101.com/r/tY1jB5/1
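The question targets Swift, but the regexps themselves can be tried out in JavaScript first. The sketch below is only that: the helper name localizedStrings is an assumption, the dot after l is escaped so that it matches a literal dot, and the results of the two patterns are merged by position so they come back in source order:

```javascript
const doubleQuoted = /l\._\("((?:\\"|[^"])*)"/g;
const singleQuoted = /l\._\('((?:\\'|[^'])*)'/g;

// Hypothetical helper: runs both patterns and returns the strings in source order.
function localizedStrings(source) {
  const found = [];
  for (const pattern of [doubleQuoted, singleQuoted]) {
    for (const m of source.matchAll(pattern)) {
      // m[1] is the first capturing group: the text between the quotes.
      found.push({ index: m.index, text: m[1] });
    }
  }
  found.sort((a, b) => a.index - b.index);
  return found.map((entry) => entry.text);
}

const codeLine = `console.log(l._("This is a Test") + l._('%# times %# equals %#', 2, 4, (2*4)));`;
console.log(localizedStrings(codeLine));
```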
I am trying to develop a regular expression to match the following equations:
(Price+10%+100+200)
(Price+20%+200)
(Price+30%)
(Price+100)
(Price-10%-100-200)
(Price-20%-200)
(Price-30%)
(Price-100)
My regex so far is...
/([(])+([P])+([r])+([i])+([c])+([e])+([+]|[-]){1}([\d])+([+]|[-])?([\d])+([%])?([)])/g
..., but it only matches the following equations:
(Price+100+10%)
(Price+100+100)
(Price+200)
(Price-100-10%)
(Price-100-100)
(Price-200)
Can someone help me understand how to make my pattern match the full set of equations provided?
Note: Parentheses and 'Price' are musts in the equations that the pattern must match.
Try this, which matches all the input strings provided in the question:
/\(Price([+-]\d+%?){1,3}\)/g
You can test it in a regex fiddle.
Things to note:
Only use parentheses where you want to group. Parentheses around single-possibility, fixed-quantity matches (e.g. ([P])) provide no value.
Use character classes (opened with [ and closed with ]) for multiple characters that can match at a position in the pattern (e.g. [+-]). Single-possibility character classes (e.g. [P]) similarly provide no value.
Yes, character classes (generally) implicitly escape regex special characters within them (e.g. ( in [(] vs. equivalent \( outside a character class), but to just escape regex special characters (i.e. to match them literally), you are better off not using a character class and just escaping them (e.g. \() – unless multiple characters should match at a position in the pattern (per the previous point to note).
The quantifier {1} is (almost) always useless: drop it.
The quantifier + means "one or more" as you probably know. However, in a series of cases where you used it (i.e. ([(])+([P])+([r])+([i])+([c])+([e])+), it would match many values that I doubt you expect (e.g. ((((((PPPrriiiicccceeeeee): basically, don't overuse it. Stop to consider whether you really want to match one or more of the character (class) or group to which + applies in the pattern.
To match a literal string without any regex special characters like Price, just use the literal string at the appropriate position in the pattern – e.g. Price in \(Price.
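As a rough check of the points above, this sketch runs the suggested pattern over the equations from the question and shows what the overused + quantifiers would accept (the helper name findPriceExpressions is an assumption):

```javascript
const priceExpr = /\(Price([+-]\d+%?){1,3}\)/g;

// Hypothetical helper: returns every full match found in `text`.
function findPriceExpressions(text) {
  // With the g flag, String.prototype.match returns all full matches, or null.
  return text.match(priceExpr) || [];
}

const equations = [
  "(Price+10%+100+200)", "(Price+20%+200)", "(Price+30%)", "(Price+100)",
  "(Price-10%-100-200)", "(Price-20%-200)", "(Price-30%)", "(Price-100)",
];
console.log(findPriceExpressions(equations.join(" ")).length); // 8

// The prefix of the original attempt, with + after every single-character group.
const overused = /([(])+([P])+([r])+([i])+([c])+([e])+/;
console.log(overused.test("((((((PPPrriiiicccceeeeee")); // true
console.log(findPriceExpressions("((((((PPPrriiiicccceeeeee+100)").length); // 0
```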
/\(Price[+-](\d)+(%)?([+-]\d+%?)?([+-]\d+%?)?\)/g
works on http://www.regexr.com/
/^\(Price(?:[+-]\d+%?)+\)$/i
try at your own risk!