Parsing units with javascript regex


Say I have a string containing some units (which may or may not have prefixes) that I want to break into the individual units. For example, the string may contain "Btu(th)", "Btu(th).ft", or even "mBtu(th).ft", where mBtu(th) is the bastardised unit milli thermochemical BTU (this is purely an example).
I currently have the following (simplified) regex; however, it fails for the case "mBtu(th).ft":
/(m|k)??(Btu\(th\)|ft|m)(?:\b|\s|$)/g
Currently this does not correctly detect the boundary between the end of 'Btu(th)' and the start of 'ft'. I understand JavaScript regex does not support lookbehind, so how do I accurately parse the string?
Additional notes
The regex presented above is greatly simplified around the prefixes and units groups. The prefixes could span multiple characters like 'Ki' and therefore character sets are not suitable.
The desire is for each match to capture the prefix as group 1 and the unit as group 2, i.e. for 'mBtu(th).ft' match one would be ['m','Btu(th)'] and match two would be ['','ft'].
The prefix match needs to be lazy so that the string 'm' is matched as the unit metres rather than the prefix milli. Likewise, the match for 'mm' needs to be the prefix milli and the unit metres.

I would try with:
/((m)|(k)|(Btu(\(th\))?)|(ft)|(m)|(?:\.))+/g
At least with the example above, it matches all the units merged into one string.
EDIT
Another try:
/(?:(m)|(k)|(Btu)|(th)|(ft)|[\.\(\)])/g
This one again matches only one part at a time, but if you use $1, $2, $3, $4, etc., you can extract the other fragments. It ignores the ., (, and ) characters. The difficulty is keeping track of which groups matched, but it works to some degree.
Or, if you accept multiple separate matches, I think a simple alternative is:
/(m|k|Btu|th|ft)/g

A word boundary will not separate two non-word characters. So, you don't actually want a word boundary since the parentheses and period are not valid word characters. Instead, you want the string to not be followed by a word character, so you can use this instead:
[mk]??(Btu\(th\)|ft|m)(?!\w)
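As a rough check of this approach in JavaScript, the sketch below swaps the character class [mk] for the capturing group (m|k) from the question so that the prefix lands in group 1. That substitution and the helper name parseUnits are assumptions made for illustration, not part of the answer above.

```javascript
// Lazy optional prefix in group 1, unit in group 2, and a negative
// lookahead (?!\w) in place of the word boundary.
const unitPattern = /(m|k)??(Btu\(th\)|ft|m)(?!\w)/g;

// parseUnits is a hypothetical helper name.
function parseUnits(text) {
  // A skipped optional group is reported as undefined, so map it to "".
  return [...text.matchAll(unitPattern)].map((m) => [m[1] || "", m[2]]);
}

console.log(parseUnits("mBtu(th).ft")); // [["m", "Btu(th)"], ["", "ft"]]
console.log(parseUnits("m"));           // [["", "m"]]
console.log(parseUnits("mm"));          // [["m", "m"]]
```

The lazy prefix is tried only after the bare unit fails, which is why 'm' alone is read as metres while 'mm' is read as milli plus metres.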

I believe you're after something like this, if I understood correctly that you want to match any kind of element, possibly preceded by the m or k character and separated by parentheses or dots.
/[\s\.\(]*(m|k?)(\w+)[\s\.\)]*/g
https://regex101.com/r/eQ5nR4/2
If you don't care about matching the parentheses and just want to return the elements, you can simply use:
/(m|k?)(\w+)/g
https://regex101.com/r/oC1eP5/1

Related

Match exact number of occurrences of multiple letters

For context I am using Mongoose and regex to match a string in a database using find().
Given an example string {W}{W}{U}{U}{B}{B}{R}{R}{G}{G} I need to match occurrences of certain letters. I'm trying to make a RegExp that will match only when I have the required number of letters.
{W}{W}{U}{U}{B}{B}{R}{R}{G}{G} => wwuubbrrgg, ggrrbbuuww, wuwubrbrgg, etc
{W}{W}{U} => wwu, wuw, uww, etc
Solutions I found were not able to account for the order of the string being somewhat random and multiple letters potentially being in the same bracket: {U/R}. Because of that I only want to take into account the actual letters and only match when it's found the sufficient number of letters and not encountered any letters that are not present.

Regex is really, really bad at counting. Wanting a specific number of a specific character in no specific order is not something regex is very good at. It can be done, but not with any reasonable measure of efficiency. As an example, here is a working regex for your scenario:
^(?=[^wW\n]*[wW][^wW\n]*[wW][^wW\n]*)(?=[^uU\n]*[uU][^uU\n]*[uU][^uU\n]*)(?=[^bB\n]*[bB][^bB\n]*[bB][^bB\n]*)(?=[^rR\n]*[rR][^rR\n]*[rR][^rR\n]*)(?=[^gG\n]*[gG][^gG\n]*[gG][^gG\n]*).{10}$
As we can see, it's very, very long for something so simple. That's because this behavior is not really what regex is designed for, as the desired functionality isn't much of a pattern. I would personally recommend going through and simply counting occurrences of each character. But if you're dead set on regex, here's the breakdown:
^(?=[^wW\n]*[wW][^wW\n]*[wW][^wW\n]*)(?=[^uU\n]*[uU][^uU\n]*[uU][^uU\n]*)(?=[^bB\n]*[bB][^bB\n]*[bB][^bB\n]*)(?=[^rR\n]*[rR][^rR\n]*[rR][^rR\n]*)(?=[^gG\n]*[gG][^gG\n]*[gG][^gG\n]*).{10}$
^ //anchor to start of string
(?= //start lookahead
[^wW\n]* //any number of characters that aren't a 'w' or new line
[wW] //followed by the first instance of a character we're looking for
[^wW\n]* //any number of characters that aren't a 'w' or new line
[wW] //followed by the second instance of a character we're looking for
[^wW\n]* //any number of characters that aren't a 'w' or new line
) //end lookahead
... //repeat this for every character we want to be sure is in the string
.{10} //now actually match the ten characters, now that we know the number of each is correct
$ //then validate that that takes us to the end of the string
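To make the breakdown concrete, here is a minimal JavaScript sketch that assembles a pattern of this shape from a table of required counts. The function name buildCountRegex and the shape of its argument are assumptions for illustration, and it only handles single ASCII letters.

```javascript
// Hypothetical helper: `required` maps a single letter to its exact count,
// e.g. { w: 2, u: 2 }. Assumes plain ASCII letters only.
function buildCountRegex(required) {
  const entries = Object.entries(required);
  const lookaheads = entries.map(([letter, count]) => {
    const cls = letter.toLowerCase() + letter.toUpperCase();
    const notLetter = `[^${cls}\\n]*`;
    // "skip other characters, then the letter", repeated `count` times
    const step = `${notLetter}[${cls}]`;
    return `(?=${step.repeat(count)}${notLetter})`;
  });
  // The final .{n} pins the total length to the sum of all counts.
  const total = entries.reduce((sum, [, count]) => sum + count, 0);
  return new RegExp(`^${lookaheads.join("")}.{${total}}$`);
}

const pattern = buildCountRegex({ w: 2, u: 2, b: 2, r: 2, g: 2 });
console.log(pattern.test("wuwubrbrgg")); // true
console.log(pattern.test("wwwubbrrgg")); // false: only one "u"
```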
EDIT: Actually, this regex can be reduced slightly down to:
^(?=[^wW\n]*[wW][^wW\n]*[wW])(?=[^uU\n]*[uU][^uU\n]*[uU])(?=[^bB\n]*[bB][^bB\n]*[bB])(?=[^rR\n]*[rR][^rR\n]*[rR])(?=[^gG\n]*[gG][^gG\n]*[gG]).{10}$
Essentially, this just drops the trailing negated character class from each lookahead. It is not necessary because the total match length is constrained to the sum of the per-character requirements, and that condition is enough to guarantee there are no MORE than 2 of any given character. Still, I'd avoid the regex solution to this problem: in the time taken to generate and run this regex for a given combination of characters, you could already have counted the instances of each character and arrived at the same result.
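The counting approach recommended above might look like the following sketch. The function names are hypothetical, and it works on plain strings in application code; how it would be wired into a Mongoose find() call is not covered here.

```javascript
// Hypothetical helper: count letters only, ignoring braces, slashes, and case.
function countLetters(text) {
  const counts = new Map();
  for (const ch of text.toLowerCase()) {
    if (ch >= "a" && ch <= "z") {
      counts.set(ch, (counts.get(ch) || 0) + 1);
    }
  }
  return counts;
}

// True only when both strings contain exactly the same letters
// the same number of times, in any order.
function matchesLetterCounts(candidate, required) {
  const have = countLetters(candidate);
  const want = countLetters(required);
  if (have.size !== want.size) return false;
  for (const [letter, count] of want) {
    if (have.get(letter) !== count) return false;
  }
  return true;
}

console.log(matchesLetterCounts("wuwubrbrgg", "{W}{W}{U}{U}{B}{B}{R}{R}{G}{G}")); // true
console.log(matchesLetterCounts("wwx", "{W}{W}{U}")); // false
```

Because only letters are counted, a bracket such as {U/R} contributes one u and one r.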

JS regex to match two possible combinations

I need to capture a certain combination of letters followed by a number (any amount), represented in a variable called input. The letters are strict, the numbers are not. The letters are either at the beginning of a string or followed immediately after a backslash.
So for example, I would need to non-case-sensitively capture:
ab12345678google
cd4321newyorkpost
anything\here\ab1357
something\too\cd2468
For these, I have a simple rule that works (well, two rules):
input.value.match(/^(ab|cd)[0-9]+/i) || input.value.match(/\\(ab|cd)[0-9]+/i)
However, it is also possible for the string test to appear right before the set letters, which I would also need to capture (again, either at the beginning or after a backslash). So besides capturing just the given two letters, I would also need to capture occurrences where test precedes the letters, with test being strict as well, e.g.:
testcd4321newyorkpost
anything\here\testab1357
I'm quite sure it's possible to place an "optional" lookup of some sort in the match query without rewriting the rules for test separately, but as new as I am to regex, I'm not sure what I should be looking for here.

You may use this regex:
(?:^|\\)(?:test)?(?:ab|cd)\d+
Which is:
Match start or \
Match optional string test
Match ab or cd
Match 1+ digits
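A quick way to check this regex against the examples from the question is sketched below. The i flag is added here because the question asks for case-insensitive matching; the variable names are only illustrative.

```javascript
// Start of string or a backslash, optional "test", then ab/cd and digits.
const pattern = /(?:^|\\)(?:test)?(?:ab|cd)\d+/i;

const samples = [
  "ab12345678google",
  "cd4321newyorkpost",
  "anything\\here\\ab1357",
  "something\\too\\cd2468",
  "testcd4321newyorkpost",
  "anything\\here\\testab1357",
];

console.log(samples.every((s) => pattern.test(s))); // true
console.log(pattern.test("xab123")); // false: not at the start or after a backslash
```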

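Why not just make the text test optional?
(?:test)?(ab|cd)[0-9]+
should match all of your examples. Note that without the ^ or backslash condition in front, it will also match these letters anywhere in the string.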

Why in RegEx should be only dot after negative lookahead [duplicate]

I want to match all lines except those that contain a # symbol.
This is the regex for it: /^[^#]*$/gm
Now, how do I select only the words in those lines, as with \S\S*?
Ultimately, I want to combine these two regexes, /^[^#]*$/gm and \S\S*.

You probably need to make it a two-step process: First filter all lines by /^[^#]*$/, afterwards get all matches for /\S+/ from that line. You can't have an arbitrary number of matches from a single regex (e.g. all »words« individually). Unless you want all words separated by whitespace in a single match, such as /\S+(\s+\S+)*/, but even then you'd essentially just get the whole line in a single match, so there's little point to it.
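The two-step process could be sketched in JavaScript as follows; the function name and the sample text are made up for illustration.

```javascript
// Step 1: keep only lines without "#". Step 2: collect the \S+ matches per line.
function wordsFromUncommentedLines(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => /^[^#]*$/.test(line))
    .flatMap((line) => line.match(/\S+/g) || []);
}

const sample = "foo bar\n# comment\nbaz # trailing\n\nqux";
console.log(wordsFromUncommentedLines(sample)); // ["foo", "bar", "qux"]
```

The `|| []` fallback covers blank lines, where match() returns null.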

Matching variable-term equations

I am trying to develop a regular expression to match the following equations:
(Price+10%+100+200)
(Price+20%+200)
(Price+30%)
(Price+100)
(Price-10%-100-200)
(Price-20%-200)
(Price-30%)
(Price-100)
My regex so far is...
/([(])+([P])+([r])+([i])+([c])+([e])+([+]|[-]){1}([\d])+([+]|[-])?([\d])+([%])?([)])/g
..., but it only matches the following equations:
(Price+100+10%)
(Price+100+100)
(Price+200)
(Price-100-10%)
(Price-100-100)
(Price-200)
Can someone help me understand how to make my pattern match the full set of equations provided?
Note: Parentheses and 'Price' are musts in the equations that the pattern must match.

Try this, which matches all the input strings provided in the question:
/\(Price([+-]\d+%?){1,3}\)/g
You can test it in a regex fiddle.
Things to note:
Only use parentheses where you want to group. Parentheses around single-possibility, fixed-quantity matches (e.g. ([P])) provide no value.
Use character classes (opened with [ and closed with ]) for multiple characters that can match at a position in the pattern (e.g. [+-]). Single-possibility character classes (e.g. [P]) similarly provide no value.
Yes, character classes (generally) implicitly escape regex special characters within them (e.g. ( in [(] vs. equivalent \( outside a character class), but to just escape regex special characters (i.e. to match them literally), you are better off not using a character class and just escaping them (e.g. \() – unless multiple characters should match at a position in the pattern (per the previous point to note).
The quantifier {1} is (almost) always useless: drop it.
The quantifier + means "one or more" as you probably know. However, in a series of cases where you used it (i.e. ([(])+([P])+([r])+([i])+([c])+([e])+), it would match many values that I doubt you expect (e.g. ((((((PPPrriiiicccceeeeee): basically, don't overuse it. Stop to consider whether you really want to match one or more of the character (class) or group to which + applies in the pattern.
To match a literal string without any regex special characters like Price, just use the literal string at the appropriate position in the pattern – e.g. Price in \(Price.
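A short JavaScript check of the suggested pattern against the equations from the question is sketched below. The g flag is left off here because test() on a global regex keeps state between calls; the variable names are illustrative.

```javascript
// "(Price" followed by one to three signed terms, each optionally a percentage.
const equationPattern = /\(Price([+-]\d+%?){1,3}\)/;

const equations = [
  "(Price+10%+100+200)",
  "(Price+20%+200)",
  "(Price+30%)",
  "(Price+100)",
  "(Price-10%-100-200)",
  "(Price-20%-200)",
  "(Price-30%)",
  "(Price-100)",
];

console.log(equations.every((e) => equationPattern.test(e))); // true
console.log(equationPattern.test("(Price)"));  // false: at least one term is required
console.log(equationPattern.test("(Price+)")); // false: a sign needs digits after it
```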

/\(Price[+-](\d)+(%)?([+-]\d+%?)?([+-]\d+%?)?\)/g
works on http://www.regexr.com/

/^[(Price]+\d+\d+([%]|[)])&/i
try at your own risk!

PHP or Javascript - Find a specific string

Finding a specific string is relatively easy, but I am not sure where to begin on this one. I would need to extract a string that would be different every time, but with similar characteristics.
Here are some example strings I need to find in a paragraph, either at the beginning, end or somewhere in the middle.
7b.9t.7iv.4x
4ir.4i.5i.6t
7ix.7t.4t.0z
As you can see the string will always begin with a number, and would have up to 2 characters after it and will always contain 4 octets separated by dots.
Let me know if you may need more details.
EDIT:
Thanks to the answer below, I came up with this, which, while not pretty, does what I need.
$body="test 1f.9t.7iv.4x test 1a.9a.7ab.4xa test ";
$var=preg_match_all("([0-9][a-z]{1,2}\.[0-9][a-z]{1,2}\.[0-9][a-z]{1,2}\.[0-9][a-z]{1,2})",$body,$matches);
$count=count($matches[0]);
$stack = array();
while($count > 0){
    $count--;
    array_push($stack, "<span id='ip_".$matches[0][$count]."'>".$matches[0][$count]."</span>");
}
$stack=array_reverse($stack);
$body=str_replace($matches[0],$stack,$body);

You can use a regular expression.
Something like this to get you started. There may be a better way to match since it's repeated, but....
([0-9][a-z]{1,2}\.[0-9][a-z]{1,2}\.[0-9][a-z]{1,2}\.[0-9][a-z]{1,2})
( Start a capture group
[0-9] match any character 0 through 9
[a-z] match any character [a-z]
{1,2} but only match the previous 1 or 2 times
\. match a literal . the \ is needed as an escape because . is a special character
) End capture group
Both php and javascript allow for regular expression use.
For an even better visual representation you can check out this tool: http://www.debuggex.com/
If you need each octet by itself (as a match), you can add more parentheses () around each [0-9][a-z]{1,2}, which will then store those octets individually.
Also note that \d is the same as [0-9], but I prefer the latter as I find it a little more readable.
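Since the question allows JavaScript as well, here is a minimal sketch of the same pattern there, with each octet wrapped in its own group as described above. Building the pattern from a repeated string fragment is just one way to avoid typing the octet four times; the variable names are assumptions.

```javascript
// One octet: a digit followed by one or two lowercase letters.
const octet = "[0-9][a-z]{1,2}";

// Four captured octets separated by literal dots.
const addressPattern = new RegExp(
  `(${octet})\\.(${octet})\\.(${octet})\\.(${octet})`,
  "g"
);

const body = "test 1f.9t.7iv.4x test 1a.9a.7ab.4xa test ";
const found = [...body.matchAll(addressPattern)].map((m) => ({
  full: m[0],
  octets: m.slice(1, 5),
}));

console.log(found[0].full);   // "1f.9t.7iv.4x"
console.log(found[0].octets); // ["1f", "9t", "7iv", "4x"]
console.log(found.length);    // 2
```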
