I'm trying to parse shorthand notation into an integer representation. This works fine for Hours, Seconds, and Minutes, but not with Milliseconds, where the regex is failing to match.
'50ms'.match(/^(\d+)([MS|S|M|H|ms|s|m|h])$/);
I wasn't sure how to phrase the question correctly, but I did perform several searches prior to asking here.
jsfiddle
If you need to match sequences of characters, you need to use alternation groups defined with (...|...) constructs.
A character class only matches a single character defined in it. See more details on character class here.
Your regex does not work with milliseconds because it requires exactly one character after the digits, immediately followed by the end of the string. Thus, there is no room for the two letters "ms".
So, the correct way is to use
'50ms'.match(/^(\d+)(MS|S|M|H|ms|s|m|h)$/);
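To see the difference concretely, here is a quick check (a sketch; the sample inputs are my own). The bracketed version is a character class that accepts exactly one character from its set, and that set includes the literal |:

```javascript
// Character class: exactly ONE of the characters M, S, |, H, m, s, h.
const charClass = /^(\d+)([MS|S|M|H|ms|s|m|h])$/;
// Alternation group: one of the listed sequences.
const alternation = /^(\d+)(MS|S|M|H|ms|s|m|h)$/;

console.log(charClass.test('50m'));    // true: a single unit letter
console.log(charClass.test('50|'));    // true: "|" is just another member of the class
console.log(charClass.test('50ms'));   // false: two characters after the digits
console.log(alternation.test('50ms')); // true
console.log(alternation.test('50|'));  // false
```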
As Tushar suggests, you can further contract the pattern using /i modifier and reducing the number of alternatives.
/^(\d+)(MS|ms|[SMH])$/i
See this demo
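Since the goal was an integer representation, the match can then be converted to a number. This is only a minimal sketch: the parseDuration name, the UNIT_MS table and the choice of milliseconds as the target unit are my assumptions, not something stated in the question.

```javascript
// Assumed conversion table: every unit expressed in milliseconds.
const UNIT_MS = { ms: 1, s: 1000, m: 60000, h: 3600000 };

// Hypothetical helper: "50ms" -> 50, "2h" -> 7200000, no match -> null.
function parseDuration(text) {
  const match = /^(\d+)(ms|[smh])$/i.exec(text);
  if (match === null) return null;
  // The /i flag accepts upper case, so normalise before the table lookup.
  return Number(match[1]) * UNIT_MS[match[2].toLowerCase()];
}

console.log(parseDuration('50ms')); // 50
console.log(parseDuration('2h'));   // 7200000
console.log(parseDuration('5x'));   // null
```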
For context I am using Mongoose and regex to match a string in a database using find().
Given an example string {W}{W}{U}{U}{B}{B}{R}{R}{G}{G} I need to match occurrences of certain letters. I'm trying to make a RegExp that will match only when I have the required number of letters.
{W}{W}{U}{U}{B}{B}{R}{R}{G}{G} => wwuubbrrgg, ggrrbbuuww, wuwubrbrgg, etc
{W}{W}{U} => wwu, wuw, uww, etc
Solutions I found were not able to account for the order of the string being somewhat random and multiple letters potentially being in the same bracket: {U/R}. Because of that I only want to take into account the actual letters and only match when it's found the sufficient number of letters and not encountered any letters that are not present.
Regex is really, really bad at counting. Wanting a specific number of a specific character in no specific order is not something regex is very good at. It can be done, but not with any reasonable measure of efficiency. As an example, here is a working regex for your scenario:
^(?=[^wW\n]*[wW][^wW\n]*[wW][^wW\n]*)(?=[^uU\n]*[uU][^uU\n]*[uU][^uU\n]*)(?=[^bB\n]*[bB][^bB\n]*[bB][^bB\n]*)(?=[^rR\n]*[rR][^rR\n]*[rR][^rR\n]*)(?=[^gG\n]*[gG][^gG\n]*[gG][^gG\n]*).{10}$
As we can see, it's very, very long for something so simple. That's because this behavior is not really what regex is designed for, as the desired functionality isn't much of a pattern. I would personally recommend going through and simply counting occurrences of each character. But, if you're dead set on regex, here's the breakdown:
^(?=[^wW\n]*[wW][^wW\n]*[wW][^wW\n]*)(?=[^uU\n]*[uU][^uU\n]*[uU][^uU\n]*)(?=[^bB\n]*[bB][^bB\n]*[bB][^bB\n]*)(?=[^rR\n]*[rR][^rR\n]*[rR][^rR\n]*)(?=[^gG\n]*[gG][^gG\n]*[gG][^gG\n]*).{10}$
^ //anchor to start of string
(?= //start lookahead
[^wW\n]* //any number of characters that aren't a 'w' or new line
[wW] //followed by the first instance of a character we're looking for
[^wW\n]* //any number of characters that aren't a 'w' or new line
[wW] //followed by the second instance of a character we're looking for
[^wW\n]* //any number of characters that aren't a 'w' or new line
) //end lookahead
... //repeat this for every character we want to be sure is in the string
.{10} //now actually match the ten characters, now that we know the number of each is correct
$ //then validate that that takes us to the end of the string
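If the required letters vary from query to query, the pattern broken down above could be generated rather than typed by hand. The following is only a sketch: buildCountRegex and its input shape (an object mapping a single ASCII letter to its required count) are hypothetical.

```javascript
// Hypothetical helper: build the lookahead-based pattern from required counts,
// e.g. { w: 2, u: 1 }. Assumes every key is a single ASCII letter.
function buildCountRegex(required) {
  let lookaheads = '';
  let total = 0;
  for (const [letter, count] of Object.entries(required)) {
    const both = letter.toLowerCase() + letter.toUpperCase(); // e.g. "wW"
    const other = '[^' + both + '\\n]*';                      // run of anything but that letter
    // "other + letter" repeated count times, then a trailing "other" run.
    lookaheads += '(?=' + (other + '[' + both + ']').repeat(count) + other + ')';
    total += count;
  }
  return new RegExp('^' + lookaheads + '.{' + total + '}$');
}

const re = buildCountRegex({ w: 2, u: 1 });
console.log(re.test('wuw'));  // true
console.log(re.test('uww'));  // true
console.log(re.test('wuu'));  // false: only one "w"
console.log(re.test('wwuu')); // false: four characters instead of three
```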
EDIT: Actually, this regex can be reduced slightly down to:
^(?=[^wW\n]*[wW][^wW\n]*[wW])(?=[^uU\n]*[uU][^uU\n]*[uU])(?=[^bB\n]*[bB][^bB\n]*[bB])(?=[^rR\n]*[rR][^rR\n]*[rR])(?=[^gG\n]*[gG][^gG\n]*[gG]).{10}$
Essentially, this just gets rid of the final negated character class in each lookahead. It is not necessary, since we constrain the total match length to the sum of the per-character requirements. That condition is enough to know that we satisfy the requirement of not having MORE than 2 of any given character. Still, I'd avoid the regex solution to this problem: in the time taken to generate and run this regex for a given combination of characters, you could already have counted the instances of each character and arrived at the same result.
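For comparison, the counting approach recommended above might look like the sketch below. The function names are hypothetical, only ASCII letters are counted (braces and slashes are ignored), and how a hybrid symbol such as {U/R} should be counted is left open here.

```javascript
// Count the ASCII letters in a string, ignoring case and every other character.
function letterCounts(text) {
  const counts = new Map();
  for (const ch of text.toLowerCase()) {
    if (ch >= 'a' && ch <= 'z') {
      counts.set(ch, (counts.get(ch) || 0) + 1);
    }
  }
  return counts;
}

// True when both strings contain exactly the same letters the same number of times.
function sameLetters(a, b) {
  const countsA = letterCounts(a);
  const countsB = letterCounts(b);
  if (countsA.size !== countsB.size) return false;
  for (const [ch, n] of countsA) {
    if (countsB.get(ch) !== n) return false;
  }
  return true;
}

console.log(sameLetters('{W}{W}{U}', 'wuw'));  // true
console.log(sameLetters('{W}{W}{U}', 'wu'));   // false: one "w" is missing
console.log(sameLetters('{W}{W}{U}', 'wwub')); // false: "b" is not allowed
```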
I've written a regular expression that matches any number of letters with any number of single spaces between the letters. I would like that regular expression to also enforce a minimum and maximum number of characters, but I'm not sure how to do that (or if it's possible).
My regular expression is:
[A-Za-z](\s?[A-Za-z])+
I realized it was only matching two sets of letters surrounding a single space, so I modified it slightly to fix that. The original question is still the same though.
Is there a way to enforce a minimum of three characters and a maximum of 30?
Yes
Just like + means one or more you can use {3,30} to match between 3 and 30
For example [a-z]{3,30} matches between 3 and 30 lowercase alphabet letters
From the documentation of the Pattern class
X{n,m} X, at least n but not more than m times
In your case, matching 3-30 letters followed by spaces could be accomplished with:
([a-zA-Z]\s){3,30}
That is if you require trailing whitespace; if you don't, you can use the following (2-29 times letter+space, then a letter):
([a-zA-Z]\s){2,29}[a-zA-Z]
If you'd like whitespaces to count as characters you need to divide that number by 2 to get
([a-zA-Z]\s){1,14}[a-zA-Z]
You can add \s? to that last one if the trailing whitespace is optional. These were all tested on RegexPlanet
If you'd like the entire string altogether to be between 3 and 30 characters, you can use a lookahead, adding (?=^.{3,30}$) at the beginning of the RegExp and removing the other size limitations.
All that said, in all honesty I'd probably just test the String's .length property. It's more readable.
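A sketch of that last suggestion, assuming JavaScript and assuming the 3-30 limit applies to the whole string, spaces included (isValidName is a made-up name, and the pattern is the one from the question with anchors added):

```javascript
// Hypothetical validator: letters separated by optional single whitespace
// characters, with the 3-30 limit applied to the whole string via .length.
function isValidName(text) {
  if (text.length < 3 || text.length > 30) return false;
  return /^[A-Za-z](\s?[A-Za-z])+$/.test(text);
}

console.log(isValidName('AAA'));          // true
console.log(isValidName('A A'));          // true
console.log(isValidName('AA'));           // false: too short
console.log(isValidName('a'.repeat(31))); // false: too long
```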
This is what you are looking for
^[a-zA-Z](\s?[a-zA-Z]){2,29}$
^ is the start of string
$ is the end of string
(\s?[a-zA-Z]){2,29} would match (\s?[a-zA-Z]) 2 to 29 times..
Actually Benjamin's answer will lead to the complete solution to the OP's question.
Using lookaheads it is possible to restrict the total number of characters AND restrict the match to a set combination of letters and (optional) single spaces.
The regex that solves the entire problem would become
(?=^.{3,30}$)^([A-Za-z][\s]?)+$
This will match AAA and A A, and it will fail to match AA followed by two spaces and then A, since there are two consecutive spaces.
I tested this at http://regexpal.com/ and it does the trick.
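The same checks can be run in JavaScript. This is just a sketch with sample inputs of my own; note that, as written, the pattern also accepts a single trailing whitespace character.

```javascript
const pattern = /(?=^.{3,30}$)^([A-Za-z][\s]?)+$/;

console.log(pattern.test('AAA'));            // true
console.log(pattern.test('A A'));            // true
console.log(pattern.test('AA' + '  ' + 'A')); // false: two consecutive spaces
console.log(pattern.test('AA'));             // false: shorter than 3 characters
console.log(pattern.test('AA '));            // true: one trailing space is allowed
```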
You should use
[a-zA-Z ]{20}
[For allowed characters]{for limiting of the number of characters}
Say I have a string which contains some units (which may or may not have prefixes) that I want to break into the individual units. For example the string may contain "Btu(th)" or "Btu(th).ft" or even "mBtu(th).ft" where mBtu(th) is the bastardised unit milli thermochemical BTU's (this is purely an example).
I currently have the following (simplified) regex however it fails for the case "mBtu(th).ft":
/(m|k)??(Btu\(th\)|ft|m)(?:\b|\s|$)/g
Currently this does not correctly detect the boundary between the end of 'Btu(th)' and the start of 'ft'. I understand javascript regex does not support look back so how do I accurately parse the string?
Additional notes
The regex presented above is greatly simplified around the prefixes and units groups. The prefixes could span multiple characters like 'Ki' and therefore character sets are not suitable.
The desire is for each match to capture the prefix as group 1 and the unit as group 2, i.e. for 'mBtu(th).ft' match one would be ['m','Btu(th)'] and match two would be ['','ft'].
The prefix match needs to be lazy so that the string 'm' would be matched as the unit metres rather than the prefix milli. Likewise the match for 'mm' would need to be the prefix milli and the unit metres.
I would try with:
/((m)|(k)|(Btu(\(th\))?)|(ft)|(m)|(?:\.))+/g
at least with example above, it matches all units merged into one string.
DEMO
EDIT
Another try (DEMO):
/(?:(m)|(k)|(Btu)|(th)|(ft)|[\.\(\)])/g
This one again matches only one part, but if you use $1, $2, $3, $4, etc. (DEMO), you can extract the other fragments. It ignores the ., ( and ) characters. The problem is counting the properly matched groups, but it works to some degree.
Or if you accept multiple separate matches I think simple alternative is:
/(m|k|Btu|th|ft)/g
A word boundary will not separate two non-word characters. So, you don't actually want a word boundary since the parentheses and period are not valid word characters. Instead, you want the string to not be followed by a word character, so you can use this instead:
[mk]??(Btu\(th\)|ft|m)(?!\w)
Demo
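To get the [prefix, unit] pairs asked for in the question, the (?!\w) check can be combined with the question's capturing prefix group. This is a sketch only: splitUnits is a hypothetical helper, it relies on String.prototype.matchAll (ES2020), and it uses the simplified prefix and unit lists from the question.

```javascript
// Prefix group from the question, end-of-unit check from the answer above.
const unitPattern = /(m|k)??(Btu\(th\)|ft|m)(?!\w)/g;

// Hypothetical helper: return [prefix, unit] pairs, using '' for "no prefix".
function splitUnits(text) {
  return Array.from(text.matchAll(unitPattern), (m) => [m[1] || '', m[2]]);
}

console.log(splitUnits('mBtu(th).ft')); // [ [ 'm', 'Btu(th)' ], [ '', 'ft' ] ]
console.log(splitUnits('m'));           // [ [ '', 'm' ] ]
console.log(splitUnits('mm'));          // [ [ 'm', 'm' ] ]
```

The lazy ?? quantifier makes the engine try "no prefix" first, so a lone m is read as metres, and the prefix is only consumed when the unit cannot otherwise be completed.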
I believe you're after something like this, if I understood correctly that you want to match any kind of element, possibly preceded by the m or k character and separated by parentheses or dots.
/[\s\.\(]*(m|k?)(\w+)[\s\.\)]*/g
https://regex101.com/r/eQ5nR4/2
If you don't care about being able to match the parentheses but just return the elements you can just do
/(m|k?)(\w+)/g
https://regex101.com/r/oC1eP5/1
I'm trying to write a lexer in JavaScript for finding tokens of a simple domain-specific language. I started with a simple implementation which just tries to match subsequent regexps from the current position in a line to find out whether it matches some token format and accept it then.
The problem is that when something doesn't match inside such regexp, the whole regexp fails, so I don't know which character exactly caused it to fail.
Is there any way to find out the position in the string which caused the regular expression to fail?
INB4: I'm not asking about debugging my regexp and verifying its correctness. It is correct already, matches correct strings and drops incorrect ones. I just want to know programmatically where exactly the regexp stopped matching, to find out the position of a character which was incorrect in the user input, and how many of them were OK.
Is there some way to do it with just simple regexps instead of going on with implementing a full-blown finite state automaton?
Short answer
There is no such thing as a "position in the string that causes the
regular expression to fail".
However, I will show you an approach to answer the reverse question:
At which token in the regex did the engine become unable to match the
string?
Discussion
In my view, the question of the position in the string which caused the regular expression to fail is upside-down. As the engine moves down the string with the left hand and the pattern with the right hand, a regex token that matches six characters one moment can later, because of quantifiers and backtracking, be reduced to matching zero characters the next—or expanded to match ten.
In my view, a more proper question would be:
At which token in the regex did the engine become unable to match the
string?
For instance, consider the regex ^\w+\d+$ and the string abc132z.
The \w+ can actually match the entire string. Yet, the entire regex fails. Does it make sense to say that the regex fails at the end of the string? I don't think so. Consider this.
Initially, \w+ will match abc132z. Then the engine advances to the next token: \d+. At this stage, the engine backtracks in the string, gradually letting the \w+ give up the 2z (so that the \w+ now only corresponds to abc13), allowing the \d+ to match 2.
At this stage, the $ assertion fails as the z is left. The engine backtracks, letting the \w+ give up the 3 character, then the 1 (so that the \w+ now only corresponds to abc), eventually allowing the \d+ to match 132. At each step, the engine tries the $ assertion and fails. Depending on engine internals, more backtracking may occur: the \d+ will give up the 2 and the 3 once again, then the \w+ will give up the c and the b. When the engine finally gives up, the \w+ only matches the initial a. Can you say that the regex fails on the "3"? On the "b"?
No. If you're looking at the regex pattern from left to right, you can argue that it fails on the $, because it's the first token we were not able to add to the match. Bear in mind that there are other ways to argue this.
Further down, I'll give you a screenshot to visualize this. But first, let's see if we can answer the other question.
The Other Question
Are there techniques that allow us to answer the other question:
At which token in the regex did the engine become unable to match the
string?
It depends on your regex. If you are able to slice your regex into clean components, then you can devise an expression with a series of optional lookaheads inside capture groups, allowing the match to always succeed. The first unset capture group is the one that caused the failure.
Javascript is a bit stingy on optional lookaheads, but you can write something like this:
^(?:(?=(\w+)))?(?:(?=(\w+\d+)))?(?:(?=(\w+\d+$)))?.
In PCRE, .NET, Python... you could write this more compactly:
^(?=(\w+))?(?=(\w+\d+))?(?=(\w+\d+$))?.
What happens here? Each lookahead builds incrementally on the last one, adding one token at a time. Therefore we can test each token separately. The dot at the end is an optional flourish for visual feedback: we can see in a debugger that at least one character is matched, but we don't care about that character, we only care about the capture groups.
Group 1 tests the \w+ token
Group 2 seems to test \w+\d+, therefore, incrementally, it tests the \d+ token
Group 3 seems to test \w+\d+$, therefore, incrementally, it tests the $ token
There are three capture groups. If all three are set, the match is a full success. If only Group 3 is not set (as with abc123a), you can say that the $ caused the failure. If Group 1 is set but not Group 2 (as with abc), you can say that the \d+ caused the failure.
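The same incremental idea can also be sketched without the combined pattern, by testing one cumulative regex per token and reporting the first one that fails. Note that this swaps the single expression with optional lookaheads for several separate tests; the stages array and firstFailingToken are hypothetical names.

```javascript
// One cumulative pattern per token; names and structure are illustrative.
const stages = [
  { token: '\\w+', pattern: /^\w+/ },
  { token: '\\d+', pattern: /^\w+\d+/ },
  { token: '$', pattern: /^\w+\d+$/ },
];

// Return the first token that could not be added to the match, or null on success.
function firstFailingToken(text) {
  const failed = stages.find((stage) => !stage.pattern.test(text));
  return failed ? failed.token : null;
}

console.log(firstFailingToken('abc123'));  // null
console.log(firstFailingToken('abc123a')); // $
console.log(firstFailingToken('abc'));     // \d+
```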
For reference: Inside View of a Failure Path
For what it's worth, here is a view of the failure path from the RegexBuddy debugger.
You can use a negated character set RegExp,
[^xyz]
[^a-c]
A negated or complemented character set. That is, it matches anything
that is not enclosed in the brackets. You can specify a range of
characters by using a hyphen, but if the hyphen appears as the first
or last character enclosed in the square brackets it is taken as a
literal hyphen to be included in the character set as a normal
character.
index property of String.prototype.match()
The returned Array has an extra input property, which contains the
original string that was parsed. In addition, it has an index
property, which represents the zero-based index of the match in the
string.
For example, to log the index where the digit is matched for RegExp /[^a-zA-Z]/ in the string aBcD7zYx:
var re = /[^a-zA-Z]/;
var str = "aBcD7zYx";
var i = str.match(re).index;
console.log(i); // 4
Is there any way to find out the position in the string which caused the regular expression to fail?
No, there isn't. A regex either matches or it doesn't; there is nothing in between.
Partial expressions can match even when the whole pattern doesn't, so the engine always needs to evaluate the whole expression:
Take the string Hello my World and the pattern /Hello World/. While each word matches individually, the whole expression fails. You cannot tell whether Hello or World was the part that failed: taken independently, both match. The whitespace between them is present as well.
Two quick questions:
What would be a RegEx string for three letters and two numbers with space before and after them (i.e. " LET 12 ")?
Would you happen to know any good RegEx resources/tools?
For a good resource, try this website and the program RegexBuddy. You may even be able to figure out the answer to your question yourself using these sites.
To start you off you want something like this:
/^[a-zA-Z]{3}\s+[0-9]{2}$/
But the exact details depend on your requirements. It's probably a better idea that you learn how to use regular expressions yourself and then write the regular expression instead of just copying the answers here. The small details make a big difference. Examples:
What is a "letter"? Just A-Z or also foreign letters? What about lower case?
What is a "number"? Just 0-9 or also foreign numerals? Only integers? Only positive integers? Can there be leading zeros?
Should there be a single space between the letters and numbers? Or any amount of any whitespace? Even none?
Do you want to search for this string in a larger text? Or match a line exactly?
etc..
The answers to these questions will change the regular expression. It would be much faster for you in the long run to learn how to create the regular expression than to completely specify your requirements and wait for other people to reply.
I forgot to mention that there will be a space before and after. How do I include that?
Again you need to consider the questions:
Do you mean just one space or any amount of spaces? Possibly not always a space but only sometimes?
Do you mean literally a space character or any whitespace characters?
My guess is:
/^\s+[a-zA-Z]{3}\s+[0-9]{2}\s+$/
/[a-z]{3} [0-9]{2}/i will match 3 letters followed by a space, and then 2 numbers. [a-z] is a character class containing the letters a through z, and the {3} means that you want exactly 3 members of that class. The space character matches a literal space (alternatively, you could use \s, which is a "shorthand" character class that matches any whitespace character). The i at the end is a pattern modifier specifying that your pattern is case-insensitive.
If you want the entire string to only be that, you need to anchor it with ^ and $:
/^[a-z]{3} [0-9]{2}$/i
Regular expression resources:
http://www.regular-expressions.info - great tutorial with a lot of information
http://rexv.org/ - online regular expression tester that supports a variety of engines.
^([A-Za-z]{3}) ([0-9]{2})$ assuming one space between the letters/numbers, as in your example. This will capture the letters and numbers separately.
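As a rough illustration of what capturing separately means in practice, here is a JavaScript sketch (parseCode is a made-up helper name, and the sample inputs are my own):

```javascript
// Hypothetical helper around the pattern above: returns the two captured parts,
// or null when the text does not match.
function parseCode(text) {
  const match = /^([A-Za-z]{3}) ([0-9]{2})$/.exec(text);
  return match ? { letters: match[1], digits: match[2] } : null;
}

console.log(parseCode('LET 12'));   // { letters: 'LET', digits: '12' }
console.log(parseCode(' LET 12 ')); // null: this pattern does not allow surrounding spaces
```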
I use http://gskinner.com/RegExr/ - it allows you to build a regex and test it with your own text.
As you can probably tell from the wide variety of answers, RegEx is a complex subject with a wide variety of opinions and preferences, and often more than one way of doing things. Here's my preferred solution.
^[a-zA-Z]{3}\s*\d{2}$
I used [a-zA-Z] instead of \w because \w also matches digits and underscores.
The \s* is to allow zero or more spaces.
I try to use character classes wherever possible, which is why I went with \d.
\w{3}\s{1}\d{2}
And I like this site.
EDIT: [a-zA-Z]{3}\s{1}\d{2} - The \w matches numeric characters too.
Try this regular expression:
[^"\r\n]{3,}