How to parse and capture any measurement unit - javascript

In my application, users can customize measurement units, so if they want to work in decimeters instead of inches or in full-turns instead of degrees, they can. However, I need a way to parse a string containing multiple values and units, such as 1' 2" 3/8. I've seen a few regular expressions on SO and didn't find any which matched all cases of the imperial system, let alone allowing any kind of unit. My objective is to have the most permissive input box possible.
So my question is: how can I extract multiple value-unit pairs from a string in the most user-friendly way?
I came up with the following algorithm:
1. Check for illegal characters and throw an error if needed.
2. Trim leading and trailing spaces.
3. Split the string into parts every time there's a non-digit character followed by a digit character, except for .,/ which are used to identify decimals and fractions.
4. Remove all spaces from parts, check for character misuse (multiple decimal points or fraction bars) and replace '' with ".
5. Split value and unit-string for each part. If a part has no unit:
   - If it is the first part, use the default unit.
   - Else if it is a fraction, consider it as the same unit as the previous part.
   - Else if it isn't, consider it as in, cm or mm based on the previous part's unit.
   - If it isn't the first part and there's no way to guess the unit, throw an error.
6. Check if units mean something, are all of the same system (metric/imperial) and follow a descending order (ft > in > fraction or m > cm > mm > fraction), throw an error if not.
7. Convert and sum all parts, performing division in the process.
I guess I could use string manipulation functions to do most of this, but I feel like there must be a simpler way through regex.
I came up with a regex:
((\d+('|''|"|m|cm|mm|\s|$) *)+(\d+(\/\d+)?('|''|"|m|cm|mm|\s|$) *)?)|((\d+('|''|"|m|cm|mm|\s) *)*(\d+(\/\d+)?('|''|"|m|cm|mm|\s|$) *))
It only allows fractions at the end and allows spaces between values. I've never used regex capturing though, so I'm not so sure how I'll manage to extract the values out of this mess. I'll work again on this tomorrow.

My objective is to have the most permissive input box possible.
Careful, more permissive doesn't always mean more intuitive. An ambiguous input should warn the user, not pass silently, as that might lead them to make multiple mistakes before they realize their input wasn't interpreted like they hoped.
How can I extract multiple value-unit pairs from a string? I guess I could use string manipulation functions to do most of this, but I feel like there must be a simpler way through regex.
Regular expressions are a powerful tool, especially since they work in many programming languages, but be warned. When you're holding a hammer everything starts to look like a nail. Don't try to use a regular expression to solve every problem just because you recently learned how they work.
Looking at the pseudocode you wrote, you are trying to solve two problems at once: splitting up a string (which we call tokenization) and interpreting input according to a grammar (which we call parsing). You should try to first split up the input into a list of tokens, or maybe unit-value pairs. You can start making sense of these pairs once you're done with string manipulation. Separation of concerns will spare you a headache, and your code will be much easier to maintain as a result.
I've never used regex capturing though, so I'm not so sure how I'll manage to extract the values out of this mess.
If a regular expression has the global (g) flag, it can be used to find multiple matches in the same string. That would be useful if you had a regular expression that finds a single unit-value pair. In JavaScript, you can retrieve a list of matches using string.match(regex). However, that function ignores capture groups on global regular expressions.
If you want to use capture groups, you need to call regex.exec(string) inside a loop. For each successful match, the exec function will return an array where item 0 is the entire match and items 1 and onwards are the captured groups.
For example, /(\d+) ([a-z]+)/g will look for an integer followed by a space and a word. If you made successive calls to regex.exec("1 hour 30 minutes") you would get:
["1 hour", "1", "hour"]
["30 minutes", "30", "minutes"]
null
Successive calls work like this because the regex object keeps an internal cursor you can get or set with regex.lastIndex. You should set it back to 0 before using the regex again with a different input.
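As a rough sketch of that loop (the helper name collectPairs is made up for illustration), the snippet below resets lastIndex, calls exec until it returns null, and collects the captured groups. On engines that support it, String.prototype.matchAll yields the same match arrays without managing lastIndex by hand.

```javascript
// Hypothetical helper: collect every "<integer> <word>" pair from a string.
const pairRx = /(\d+) ([a-z]+)/g;

function collectPairs(input) {
  pairRx.lastIndex = 0; // start from the beginning for each new input
  const pairs = [];
  let match;
  while ((match = pairRx.exec(input)) !== null) {
    // match[0] is the whole match, match[1] and match[2] are the groups
    pairs.push({ value: Number(match[1]), unit: match[2] });
  }
  return pairs;
}

console.log(collectPairs("1 hour 30 minutes"));

// Same idea with matchAll (the regex must carry the g flag):
const viaMatchAll = [..."1 hour 30 minutes".matchAll(pairRx)]
  .map(([, value, unit]) => ({ value: Number(value), unit }));
console.log(viaMatchAll);
```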
You've been using parentheses to isolate OR clauses such as a|b and to apply quantifiers to a character sequence such as (abc)+. If you want to do that without creating capture groups, you can use (?: ) instead. This is called a non-capturing group. It does the same thing as regular parentheses in a regex, but what's inside it won't create an entry in the returned array.
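To make the difference concrete, here is a small comparison; the two patterns are illustrative and not taken from the question. Both match the same text, but the non-capturing version leaves the unit alternatives out of the result array.

```javascript
// Same pattern twice; only the grouping syntax around the units differs.
const capturing = /(\d+)(cm|mm)/;
const nonCapturing = /(\d+)(?:cm|mm)/;

const a = capturing.exec("25mm");
const b = nonCapturing.exec("25mm");

console.log(a.length); // 3 entries: whole match plus two groups
console.log(b.length); // 2 entries: whole match plus one group
console.log(a[0] === b[0]); // true, the matched text is identical
```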
Is there a better way to approach this?
A previous version of this answer concluded with a regular expression even more incomprehensible than the one posted in the question because I didn't know better at the time, but today this would be my recommendation. It's a regular expression that only extracts one token at a time from the input string.
/ (\s+) // 1 whitespace
| (\d+)\/(\d+) // 2,3 fraction
| (\d*)([.,])(\d+) // 4,5,6 decimal
| (\d+) // 7 integer
| (km|cm|mm|m|ft|in|pi|po|'|") // 8 unit
/gi
Sorry about the weird syntax highlighting. I used whitespace to make this more readable but properly formatted it becomes:
/(\s+)|(\d+)\/(\d+)|(\d*)([.,])(\d+)|(\d+)|(km|cm|mm|m|ft|in|pi|po|'|")/gi
This regular expression makes clever use of capture groups separated by OR clauses. Only the capture groups of one type of token will contain anything; the groups that did not take part in the match are undefined. For example, on the string "10 ft", successive calls to exec would return:
["10", undefined, undefined, undefined, undefined, undefined, undefined, "10", undefined] (because "10" is an integer)
[" ", " ", undefined, undefined, undefined, undefined, undefined, undefined, undefined] (because " " is whitespace)
["ft", undefined, undefined, undefined, undefined, undefined, undefined, undefined, "ft"] (because "ft" is a unit)
null
A tokenizer function can then do something like this to treat each individual token:
function tokenize (input) {
  // tokenRx is the token regular expression shown above
  const localTokenRx = new RegExp(tokenRx);
  return function next () {
    const startIndex = localTokenRx.lastIndex;
    if (startIndex >= input.length) {
      // end of input reached
      return undefined;
    }
    const match = localTokenRx.exec(input);
    if (!match) {
      localTokenRx.lastIndex = input.length;
      // there is leftover garbage at the end of the input
      return ["garbage", input.slice(startIndex)];
    }
    if (match.index !== startIndex) {
      localTokenRx.lastIndex = match.index;
      // the regex skipped over some garbage
      return ["garbage", input.slice(startIndex, match.index)];
    }
    const [
      text,
      whitespace,
      numerator, denominator,
      integralPart, decimalSeparator, fractionalPart,
      integer,
      unit
    ] = match;
    if (whitespace) {
      return ["whitespace", undefined];
      // or return next(); if we want to ignore it
    }
    if (denominator) {
      return ["fraction", Number(numerator) / Number(denominator)];
    }
    if (decimalSeparator) {
      return ["decimal", Number(integralPart + "." + fractionalPart)];
    }
    if (integer) {
      return ["integer", Number(integer)];
    }
    if (unit) {
      return ["unit", unit];
    }
  };
}
This function can do all the necessary string manipulation and type conversion all in one place, letting another piece of code do proper analysis of the sequence of tokens. But that would be out of scope for this Stack Overflow answer, especially since the question doesn't specify the rules of the grammar we are willing to accept.
But this is most likely too generic and complex of a solution if all you're trying to do is accept imperial lengths and metric lengths. For that, I'd probably only write a different regular expression for each acceptable format, then test the user's input to see which one matches. If two different expressions match, then the input is ambiguous and we should warn the user.
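A minimal sketch of that last idea follows. The format table, its names, the choice of millimetres as the common unit, and the parseLength function are all assumptions made for illustration; a real application would list whichever formats it actually wants to accept.

```javascript
// Hypothetical format table: each entry has a name, a pattern that must match
// the whole input, and a function converting the captured groups to millimetres.
const formats = [
  {
    name: "feet and inches",
    rx: /^\s*(\d+)\s*(?:'|ft)\s*(?:(\d+)\s*(?:"|in))?\s*$/i,
    toMm: ([, ft, inches = "0"]) => (Number(ft) * 12 + Number(inches)) * 25.4,
  },
  {
    name: "inches only",
    rx: /^\s*(\d+)\s*(?:"|in)\s*$/i,
    toMm: ([, inches]) => Number(inches) * 25.4,
  },
  {
    name: "metric",
    rx: /^\s*(\d+(?:[.,]\d+)?)\s*(mm|cm|m)\s*$/i,
    toMm: ([, value, unit]) =>
      Number(value.replace(",", ".")) * { mm: 1, cm: 10, m: 1000 }[unit.toLowerCase()],
  },
];

function parseLength(input) {
  // Try every format; none of the patterns use the g flag, so exec is stateless.
  const hits = formats
    .map((format) => ({ format, match: format.rx.exec(input) }))
    .filter((hit) => hit.match !== null);
  if (hits.length === 0) throw new Error("Unrecognized length: " + input);
  if (hits.length > 1) throw new Error("Ambiguous length: " + input);
  return hits[0].format.toMm(hits[0].match);
}

console.log(parseLength("1' 2\""));
console.log(parseLength("12 cm"));
```

Keeping one small pattern per format means an unexpected input fails loudly instead of being half-parsed, and adding a format is a matter of adding one table entry.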


algorithm to generate random string matching a regular expression

I am trying to create a method which will look at an Schema object and generate a value for that field which can be accepted. Here is one I am trying to match where the problem came around.
phoneNumber: {
type: String,
label: "Phone Number",
RegExp: /^(\([0-9]{3}\) |[0-9]{3}-)[0-9]{3}-[0-9]{4}$/
}
This will accept a string matching an American-style phone number, e.g., (111) 111-1111 or 111-111-1111.
For a solution, I thought it'd be possible to recursively build a string and test it against the regex, returning if it matches, but that results in stack overflow (and isn't a great idea to begin with)
characters = frozenset([x for x in string.printable])
def generate_matching_string(re_str, regex):
    " generates a string that matches a given regular expression "
    if re.match(re_str, regex):
        return re_str
    letter = re_str[:-1]
    if characters.index(letter) == len(characters):
        re_str += characters[0]
    else:
        re_str[:-1] = characters[characters.index(letter) + 1]
    return generate_matching_string(re_str, regex)
I imagine the first step would be something similar to a DFA.
Another concern: the result would have to at least be moderately random. That is to say, there should be some variation in results, adding another level of complexity.
How can one generate a string matching a regular expression programmatically? (language agnostic, conceptual is fine).

regular expression for finding decimal/float numbers?

I need a regular expression for decimal/float numbers like 12 12.2 1236.32 123.333 and +12.00 or -12.00 or ...123.123... for use in JavaScript and jQuery.
Thank you.
Optionally match a + or - at the beginning, followed by one or more decimal digits, optionally followed by a decimal point and one or more decimal digits until the end of the string:
/^[+-]?\d+(\.\d+)?$/
RegexPal
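A quick way to see what this pattern accepts is to run it over a few samples. The sample list below is made up for illustration; the last three entries show that a missing integer part, a trailing dot, and a second dot are all rejected.

```javascript
const floatRx = /^[+-]?\d+(\.\d+)?$/;

const samples = ["12", "12.2", "+12.00", "-12.00", ".5", "12.", "1.2.3"];
for (const sample of samples) {
  console.log(JSON.stringify(sample), floatRx.test(sample));
}
// The first four print true, the last three print false.
```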
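The right expression should be as follows:
[+-]?([0-9]*[.])?[0-9]+
This applies to the following inputs (for +1. and 1. it matches only the part before the trailing dot, because the pattern requires a digit after the decimal point):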
+1
+1.
+.1
+0.1
1
1.
.1
0.1
Here is Python example:
import re
#print if found
print(bool(re.search(r'[+-]?([0-9]*[.])?[0-9]+', '1.0')))
#print result
print(re.search(r'[+-]?([0-9]*[.])?[0-9]+', '1.0').group(0))
Output:
True
1.0
If you are using mac, you can test on command line:
python -c "import re; print(bool(re.search(r'[+-]?([0-9]*[.])?[0-9]+', '1.0')))"
python -c "import re; print(re.search(r'[+-]?([0-9]*[.])?[0-9]+', '1.0').group(0))"
You can check for text validation and also only one decimal point validation using isNaN
var val = $('#textbox').val();
var floatValues = /[+-]?([0-9]*[.])?[0-9]+/;
if (val.match(floatValues) && !isNaN(val)) {
// your function
}
This is an old post but it was the top search result for "regular expression for floating point" or something like that and doesn't quite answer _my_ question. Since I worked it out I will share my result so the next person who comes across this thread doesn't have to work it out for themselves.
All of the answers thus far accept a leading 0 on numbers with two (or more) digits on the left of the decimal point (e.g. 0123 instead of just 123) This isn't really valid and in some contexts is used to indicate the number is in octal (base-8) rather than the regular decimal (base-10) format.
Also these expressions accept a decimal with no leading zero (.14 instead of 0.14) or without a trailing fractional part (3. instead of 3.0). That is valid in some programming contexts (including JavaScript) but I want to disallow them (because for my purposes those are more likely to be an error than intentional).
Ignoring "scientific notation" like 1.234E7, here is an expression that meets my criteria:
/^((-)?(0|([1-9][0-9]*))(\.[0-9]+)?)$/
or if you really want to accept a leading +, then:
/^((\+|-)?(0|([1-9][0-9]*))(\.[0-9]+)?)$/
I believe that regular expression will perform a strict test for the typical integer or decimal-style floating point number.
When matched:
$1 contains the full number that matched
$2 contains the (possibly empty) leading sign (+/-)
$3 contains the value to the left of the decimal point
$5 contains the value to the right of the decimal point, including the leading .
By "strict" I mean that the number must be the only thing in the string you are testing.
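As an illustration of those groups, the sketch below runs the strict pattern (the variant without a leading +) against one valid number and a few inputs it is meant to reject. The variable names are arbitrary.

```javascript
const strictRx = /^((-)?(0|([1-9][0-9]*))(\.[0-9]+)?)$/;

const m = strictRx.exec("-3.14");
// Group 1: full number, 2: sign, 3: integer part, 5: fractional part with its dot
console.log(m[1], m[2], m[3], m[5]); // -3.14 - 3 .14

// Leading zero, missing integer part, missing fraction, scientific notation:
const rejected = ["0123", ".14", "3.", "1.234E7"];
for (const text of rejected) {
  console.log(text, strictRx.test(text)); // false for each
}
```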
If you want to extract just the float value out of a string that contains other content use this expression:
/((\b|\+|-)(0|([1-9][0-9]*))(\.[0-9]+)?)\b/
Which will find -3.14 in "negative pi is approximately -3.14." or in "(-3.14)" etc.
The numbered groups have the same meaning as above (except that $2 is now an empty string ("") when there is no leading sign, rather than being unset, which in JavaScript means undefined).
But be aware that it will also try to extract whatever numbers it can find. E.g., it will extract 127.0 from 127.0.0.1.
If you want something more sophisticated than that then I think you might want to look at lexical analysis instead of regular expressions. I'm guessing one could create a look-ahead-based expression that would recognize that "Pi is 3.14." contains a floating point number but Home is 127.0.0.1. does not, but it would be complex at best. If your pattern depends on the characters that come after it in non-trivial ways you're starting to venture outside of regular expressions' sweet-spot.
Paulpro and lbsweek answers led me to this:
re=/^[+-]?(?:\d*\.)?\d+$/;
>> /^[+-]?(?:\d*\.)?\d+$/
re.exec("1")
>> Array [ "1" ]
re.exec("1.5")
>> Array [ "1.5" ]
re.exec("-1")
>> Array [ "-1" ]
re.exec("-1.5")
>> Array [ "-1.5" ]
re.exec(".5")
>> Array [ ".5" ]
re.exec("")
>> null
re.exec("qsdq")
>> null
For anyone new:
I made a RegExp for the E scientific notation (without spaces).
const floatR = /^([+-]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)(?:[eE][+-]?[0-9]+)?)$/;
let str = "-2.3E23";
let m = floatR.exec(str);
parseFloat(m[1]); //=> -2.3e+23
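Note that in JavaScript \d matches the same characters as [0-9], with or without the u flag, so swapping one for the other does not change what is accepted.
Matching digits from other scripts would require \p{Nd} together with the u flag, and parseFloat would not understand those digits anyway.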
For a better understanding of the pattern see https://regexper.com/.
And for making RegExp, I can suggest https://regex101.com/.
EDIT: found another site for viewing RegExp in color: https://jex.im/regulex/.
EDIT 2: although op asks for RegExp specifically you can check a string in JS directly:
const isNum = (num)=>!Number.isNaN(Number(num));
isNum("123.12345678E+3");//=> true
isNum("80F");//=> false
converting the string to a number (or NaN) with Number()
then checking if it is NOT NaN with !Number.isNaN()
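One thing to keep in mind with this approach is that Number() is lenient about a few inputs that a regular expression would normally reject. The sketch below shows some of them and a stricter variant; isNumStrict is a hypothetical name, and what should count as "numeric" is an assumption that depends on the application.

```javascript
const isNum = (num) => !Number.isNaN(Number(num));

console.log(isNum(""));     // true, because Number("") is 0
console.log(isNum("   "));  // true, whitespace-only input also converts to 0
console.log(isNum("0x1F")); // true, hexadecimal literals are accepted

// Stricter variant: non-empty strings that convert to a finite number.
const isNumStrict = (text) =>
  typeof text === "string" && text.trim() !== "" && Number.isFinite(Number(text));

console.log(isNumStrict(""));         // false
console.log(isNumStrict("123.12e3")); // true
console.log(isNumStrict("Infinity")); // false
```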
If you want it to work with e, use this expression:
[+-]?[0-9]+([.][0-9]+)?([eE][+-]?[0-9]+)?
Here is a JavaScript example:
var re = /^[+-]?[0-9]+([.][0-9]+)?([eE][+-]?[0-9]+)?$/;
console.log(re.test('1'));
console.log(re.test('1.5'));
console.log(re.test('-1'));
console.log(re.test('-1.5'));
console.log(re.test('1E-100'));
console.log(re.test('1E+100'));
console.log(re.test('.5'));
console.log(re.test('foo'));
Here is my JS method, which handles zeros at the head of the string.
1- ^0[0-9]+\.?[0-9]*$ : finds numbers that start with 0 followed by more digits before the decimal separator ("."). I use it to distinguish strings such as "0.111" from "01.111".
2- ([1-9]{1}[0-9]*\.?[0-9]*) : if the string starts with 0, only the part beginning at the first non-zero digit is taken into account. The parentheses are there because I wanted to capture only the part that conforms to the regex.
3- ([0-9]*\.?[0-9]*) : used when there is no leading 0, to capture only the decimal number in the string.
In JavaScript, st.match(regex) returns an array whose first element contains the conforming part. I used this method in the input element's onChange event: if the user enters something that violates the regex, the violating part is not shown in the element's value at all, but any part that conforms to the regex stays in the element's value.
const floatRegexCheck = (st) => {
  const regx1 = new RegExp("^0[0-9]+\\.?[0-9]*$"); // for finding numbers starting with 0
  let regx2 = new RegExp("([1-9]{1}[0-9]*\\.?[0-9]*)"); // if regx1 matches then this will remove 0s at the head.
  if (!st.match(regx1)) {
    regx2 = new RegExp("([0-9]*\\.?[0-9]*)"); // if number does not contain 0 at the head of string then standard decimal formatting takes place
  }
  st = st.match(regx2);
  if (st?.length > 0) {
    st = st[0];
  }
  return st;
}
Here is a more rigorous answer:
^[+-]?0(?![0-9])\.[0-9]*(?![.])$|^[+-]?[1-9]{1}[0-9]*\.[0-9]*$|^[+-]?\.[0-9]+$
The following values will match (a leading + or - sign also works):
.11234
0.1143424
11.21
1.
The following values will not match:
00.1
1.0.00
12.2350.0.0.0.0.
.
....
Because every alternative requires a literal dot, plain integers such as 12 do not match either.
How it works
(?!regex) is a negative lookahead: it succeeds only if regex does not match at that position.
Let's break the regex down at each | operator, which works like a logical OR.
^[+-]?0(?![0-9])\.[0-9]*(?![.])$
This alternative handles values that start with 0.
First, match an optional + or - sign: ^[+-]?
Then match the leading zero: 0
The character after it must not be a digit, because we don't want to accept 00.123: (?![0-9])
Then match the dot exactly once, followed by any number of fraction digits: \.[0-9]*
Last, make sure no further dot follows the fraction part: (?![.])$
Now the second alternative:
^[+-]?[1-9]{1}[0-9]*\.[0-9]*$
^[+-]? same as above
If the value starts with a non-zero digit, match that digit exactly once followed by any number of digits: [1-9]{1}[0-9]* (e.g. 12.3, 1.2, 105.6)
Match the dot once, followed by any number of digits: \.[0-9]*$
Now the third alternative:
^[+-]?\.[0-9]+$
This handles values that start with a dot, e.g. .12, .34565
^[+-]? same as above
Match the dot once, followed by one or more digits: \.[0-9]+$

Regular Expression with multiple words (in any order) without repeat

I'm trying to execute a search of sorts (using JavaScript) on a list of strings. Each string in the list has multiple words.
A search query may also include multiple words, but the ordering of the words should not matter.
For example, on the string "This is a random string", the query "trin and is" should match. However, these terms cannot overlap. For example, "random random" as a query on the same string should not match.
I'm going to be sorting the results based on relevance, but I should have no problem doing that myself, I just can't figure out how to build up the regular expression(s). Any ideas?
The query trin and is becomes the following regular expression:
/trin.*(?:and.*is|is.*and)|and.*(?:trin.*is|is.*trin)|is.*(?:trin.*and|and.*trin)/
In other words, don't use regular expressions for this.
It probably isn't a good idea to do this with just a regular expression. A (pure, computer science) regular expression "can't count". The only "memory" it has at any point is the state of the DFA. To match multiple words in any order without repeat you'd need on the order of 2^n states. So probably a really horrible regex.
(Aside: I mention "pure, computer science" regular expressions because most implementations are actually an extension, and let you do things that are non-regular. I'm not aware of any extensions, certainly none in JavaScript, that make doing what you want to do any less painful with a single pattern.)
A better approach would be to keep a dictionary (Object, in JavaScript) that maps from words to counts. Initialize it to your set of words with the appropriate counts for each. You can use a regular expression to match words, and then for each word you find, decrement the corresponding entry in the dictionary. If the dictionary contains any non-0 values at the end, or if somewhere along the way you try to over-decrement a value (or decrement one that doesn't exist), then you have a failed match.
I'm totally not sure if I get you right there, so I'll just post my suggestion for it.
var query = "trin and is",
    target = "This is a random string",
    search = { },
    matches = 0;
query.split( /\s+/ ).forEach(function( word ) {
    search[ word ] = true;
});
Object.keys( search ).forEach(function( word ) {
    matches += +new RegExp( word ).test( target );
});
// do something useful with "matches" for the query, should be "3"
alert( matches );
So, the variable matches will contain the number of unique matches for the query. The first split-loop just makes sure that no "doubles" are counted since we would overwrite our search object. The second loop checks for the individual words within the target string and uses the nifty + to cast the result (either true or false) into a number, hence, +1 on a match or +0.
I was looking for a solution to this issue and none of the solutions presented here was good enough, so this is what I came up with:
function filterMatch(itemStr, keyword){
  var words = keyword.split(' '), i = 0, w, reg;
  for(; w = words[i++] ;){
    reg = new RegExp(w, 'ig');
    if (reg.test(itemStr) === false) return false; // word not found
    itemStr = itemStr.replace(reg, ''); // remove matched word from original string
  }
  return true;
}
// test
filterMatch('This is a random string', 'trin and is'); // true
filterMatch('This is a random string', 'trin not is'); // false
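A variation on the same idea, sketched under a few assumptions: it uses indexOf instead of building a RegExp (so query terms containing characters such as . or ( need no escaping), it consumes only one occurrence per term, and it tries longer terms first as a greedy heuristic. The name matchesAllTerms is made up, and the heuristic is not guaranteed to find a valid assignment in every overlapping case.

```javascript
// Hypothetical helper: every query term must occur in the text, and no two
// terms may reuse the same characters.
function matchesAllTerms(text, query) {
  let remaining = text.toLowerCase();
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  // Longer terms first, so a short term is less likely to consume characters
  // that a longer term needs.
  terms.sort((a, b) => b.length - a.length);
  for (const term of terms) {
    const at = remaining.indexOf(term);
    if (at === -1) return false; // term not found in what is left
    // Replace only this occurrence with a separator so that later terms
    // cannot match across the gap.
    remaining = remaining.slice(0, at) + "\u0000" + remaining.slice(at + term.length);
  }
  return true;
}

console.log(matchesAllTerms("This is a random string", "trin and is"));   // true
console.log(matchesAllTerms("This is a random string", "random random")); // false
console.log(matchesAllTerms("This is a random string", "is is"));         // true
```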

How to make this simple regexp?

I need to make a string starts and ends with alphanumeric range between 5 to 20 characters and it could have a space or none between characters. /^[a-z\s?A-Z0-9]{5,20}$/ but this is not working.
EDIT
test test -should pass
testtest -should pass
test test test -should not pass
You can't do this with traditional regex without writing a ridiculously long expression, so you need to use a look-ahead:
/^(?=(\w| ){5,20}$)\w+ ?\w+$/
This says, make sure there are between 5 and 20 characters in the match, then match /\w+ ?\w+/
Note I used \w for simplification. It is the same as your character class above except it also accepts underscores. If you don't want to match them you have to do:
/^(?=[a-zA-Z0-9 ]{5,20}$)[a-zA-Z0-9]+ ?[a-zA-Z0-9]+$/
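As a quick check, here is the look-ahead pattern with the 5 to 20 character range from the question, run against the three test strings from the question plus two extra samples added here for illustration (one too short, one with a leading space).

```javascript
const oneSpaceRx = /^(?=[a-zA-Z0-9 ]{5,20}$)[a-zA-Z0-9]+ ?[a-zA-Z0-9]+$/;

const cases = ["test test", "testtest", "test test test", "ab c", " testtest"];
for (const text of cases) {
  console.log(JSON.stringify(text), oneSpaceRx.test(text));
}
// Prints true, true, false, false, false in that order.
```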
You can't put a ? inside of [...]. [...] is used to specify a set of characters precisely, you can't maybe (?) have a character inside a set of characters. The occurrence of any specific characters is already optional, the ? is meaningless.
If you allow any number of spaces inside your match, just remove the question mark. If you want to allow a single space but no more, then regular expressions alone can't do that for you, you'd need something like
if (/^[a-z\sA-Z0-9]{5,20}$/.test(myString) && (myString.match(/\s/g) || []).length <= 1)
You couldn't do this with a single traditional regex without it being dozens of lines long; regexes are meant for matching simpler patterns than this.
If you only want to use regexes, you could use two instead of one. The first matches the general pattern, the second ensures that at most one whitespace character is found.
if (myString.match(/^[a-z\sA-Z0-9]{5,20}$/) && myString.match(/^[^\s]*\s?[^\s]*$/)) {
Example Usage
const inputs = ["test test", "testtest", "test test test"];
for (const myString of inputs) {
  if (myString.match(/^[a-z\sA-Z0-9]{5,20}$/) && myString.match(/^[^\s]*\s?[^\s]*$/)) {
    console.log(myString + " matches.");
  } else {
    console.log(myString + " does not match.");
  }
}
This produces the output specified in your question.
Meh, so here's the ridiculously long traditional regex for the same:
(?i)[a-z0-9]+( [a-z0-9]+)?{5,12}
JS version (without the nested quantifier):
/^([a-z0-9]( [a-z0-9])?){5,12}$/i

Javascript text manipulation

Just starting with js, decided to convert Friendfeed to a fluid app, and as part of that I need to be able to parse some numbers out of a string.
How do I complete this function?
function numMessages(text) {
MAGIC HAPPENS (POSSIBLY THE DARK ART OF THE REGEX)
return number;
}
input would be "Direct Messages (15)"
output would be 15.
Instincts tell me to find the first bracket then find the last bracket, and get the text in between but I don't know how to do that. Second instinct tells me to regex for [0-9], but I don't know how to run regexes in js. jQuery is available already if needed.
Thanks!
This should do it:
>>> 'Direct Messages (15)'.match(/[0-9]+/g);
["15"]
Just be careful if you expect more than 1 number to be in the string:
>>> 'Direct 19 Messages (15)'.match(/[0-9]+/g);
["19", "15"]
If you only wanted the first match, you could remove the g flag:
>>> 'Direct 19 Messages (15)'.match(/[0-9]+/);
["19"]
If you only wanted to match what's between the parentheses
>>> 'Direct 19 Messages (15)'.match(/\((.*?)\)/);
["(15)","15"]
// first index will always be entire match, 2nd index will be captured match
As pointed out in the comments, to get the last match:
>>> var matches = 'Direct 19 Messages (15)'.match(/[0-9]+/g);
>>> matches[matches.length-1];
"15"
Though some boundary checking would also be appropriate. :)
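A minimal sketch of that boundary check, assuming a null result is an acceptable way to signal "no number found" (lastNumber is a made-up name):

```javascript
// Hypothetical helper: return the last run of digits as a number, or null.
function lastNumber(text) {
  const matches = text.match(/[0-9]+/g); // null when nothing matches
  if (matches === null) return null;
  return parseInt(matches[matches.length - 1], 10);
}

console.log(lastNumber("Direct 19 Messages (15)")); // 15
console.log(lastNumber("Direct Messages"));         // null
```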
var reg = new RegExp('[0-9]+');
reg.exec('Direct Messages (15)');
function numMessages(text) {
return text.match(/\d+/g);
}
This will return all numbers (\d is a special character class equivalent to [0-9]) from the string. The g flag makes the regex engine do a global search, thereby returning an array of all matches; if you just want one, remove the g. Whether or not your expression is global, a successful match returns an array, so you will need to use array notation to get at the element you want.
Note that results from a regular expression match are of type string; if you want numbers, you can use parseInt to convert "15" to 15.
Putting that all together, if you just want one number, as it seems to appear from your initial question text:
function numMessages(text) {
  return parseInt(text.match(/\d+/)[0], 10);
}
str = "Direct Messages (15)";
numMessages(str); // 15
