algorithm to generate random string matching a regular expression - javascript

I am trying to create a method that looks at a Schema object and generates a value that the field will accept. Here is the field I was trying to match when the problem came up.
phoneNumber: {
type: String,
label: "Phone Number",
RegExp: /^(\([0-9]{3}\) |[0-9]{3}-)[0-9]{3}-[0-9]{4}$/
}
This will accept a string matching American-style phone numbers, e.g., (111) 111-1111 or 111-111-1111.
For a solution, I thought it'd be possible to recursively build a string and test it against the regex, returning if it matches, but that results in a stack overflow (and isn't a great idea to begin with):
import re
import string

characters = string.printable  # a str, so .index() and [] work

def generate_matching_string(re_str, regex):
    """Generates a string that matches a given regular expression."""
    if re.match(regex, re_str):
        return re_str
    letter = re_str[-1]
    if characters.index(letter) == len(characters) - 1:
        re_str += characters[0]
    else:
        re_str = re_str[:-1] + characters[characters.index(letter) + 1]
    return generate_matching_string(re_str, regex)
I imagine the first step would be something similar to a DFA.
Another concern: the result would have to at least be moderately random. That is to say, there should be some variation in results, adding another level of complexity.
How can one generate a string matching a regular expression programmatically? (language agnostic, conceptual is fine).
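One way to avoid brute force is to turn the pattern into a tree and generate text directly from it, choosing at random wherever the pattern offers a choice (an alternation branch, a character from a class, a repetition count). The sketch below skips the parsing step and builds the tree by hand for the phone-number pattern above. The helpers lit, cls, seq, alt, rep and generate are hypothetical names made up for this illustration, not part of any library.

```javascript
// Hypothetical node constructors for a tiny regex-like tree (illustration only).
const lit = (text) => ({ type: "lit", text });
const cls = (chars) => ({ type: "cls", chars });
const seq = (...parts) => ({ type: "seq", parts });
const alt = (...options) => ({ type: "alt", options });
const rep = (node, min, max = min) => ({ type: "rep", node, min, max });

// Random integer in [min, max], inclusive.
const randInt = (min, max) => min + Math.floor(Math.random() * (max - min + 1));

// Walk the tree, making a random choice wherever the pattern allows one.
function generate(node) {
  switch (node.type) {
    case "lit":
      return node.text;
    case "cls":
      return node.chars[randInt(0, node.chars.length - 1)];
    case "seq":
      return node.parts.map(generate).join("");
    case "alt":
      return generate(node.options[randInt(0, node.options.length - 1)]);
    case "rep": {
      const count = randInt(node.min, node.max);
      let out = "";
      for (let i = 0; i < count; i++) out += generate(node.node);
      return out;
    }
    default:
      throw new Error("Unknown node type: " + node.type);
  }
}

const digit = cls("0123456789");

// Hand-translated from /^(\([0-9]{3}\) |[0-9]{3}-)[0-9]{3}-[0-9]{4}$/
const phone = seq(
  alt(seq(lit("("), rep(digit, 3), lit(") ")), seq(rep(digit, 3), lit("-"))),
  rep(digit, 3),
  lit("-"),
  rep(digit, 4)
);

console.log(generate(phone)); // a different phone number on each run
```

Because every choice is made from the pattern itself, each generated string matches by construction and no retry loop is needed. Unbounded quantifiers such as * or + would need an arbitrary upper limit in a tree like this.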

Related

Check if all elements in array are present in a string as words using regular expressions in Javascript

I have used existing JavaScript functions that I found online, but I need more control, and I think regular expressions will provide it through flags (case sensitivity, multiline, etc.).
var words = ['one', 'two', 'cat', 'oranges'];
var string = 'The cat ate two oranges and one mouse.';
var pattern = ???; // what should go here?
var check = string.match(pattern);
if (check !== null) {
  // string matches all array elements
} else {
  // string does not match all array words
}
What would the pattern be, and can it be constructed in JavaScript using the array as a source?
I will need the same function on the backend in PHP, so it would be easier to create one regular expression and reuse it than to loop and then find an equivalent that works in PHP.
I would also like as many options as possible, including changing, replacing and counting matches, which is what regular expressions are meant for. As a bonus, I expect a regex to be faster than looping (searching the whole text for every element in the array), especially with a long array and a long string.
var words = ['one', 'two', 'cat', 'oranges'];
let string = 'The cat ate two oranges and one mouse.';
words = words.map(function (value, index) { return '(?=(.)*?\\b(' + value + ')\\b)'; }).join('');
let pattern = new RegExp(`${words}((.)+)`, 'g');
if (string.match(pattern) !== null) {
  // string matches all elements
} else {
  // string does not match all words
}
It looks for exact word matches, and regex flags give you the extra control you wanted, such as case-insensitive matching.
If you want to test it or adjust it, you can do it here:
doregex.com
You can use the same regex to make changes within the text using a callback function.
You can create RegExp objects from your strings; this gives you further control over case sensitivity, etc.
For example:
const patterns = [ 'ONE','two','cat','oranges'];
const string = 'The cat ate two oranges and one mouse.';
// Match in a case-insensitive way (the 'i' flag)
const result = patterns.every(pattern => new RegExp(pattern, 'i').test(string));
console.log("All patterns match:", result);
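The two answers above can be combined into one helper. This is only a sketch: containsAllWords and escapeRegExp are hypothetical names, and it assumes the array entries are literal words rather than patterns, so regex metacharacters are escaped and each word is wrapped in \b word boundaries.

```javascript
// Escape regex metacharacters so a word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True when every word occurs in the text as a whole word.
// The flags argument is passed straight to the RegExp constructor.
function containsAllWords(text, words, flags = "i") {
  return words.every((word) =>
    new RegExp("\\b" + escapeRegExp(word) + "\\b", flags).test(text)
  );
}

const sentence = "The cat ate two oranges and one mouse.";
console.log(containsAllWords(sentence, ["ONE", "two", "cat", "oranges"]));     // true
console.log(containsAllWords(sentence, ["ONE", "two", "cat", "oranges"], "")); // false: case-sensitive
console.log(containsAllWords(sentence, ["orange"]));                           // false: only "oranges" occurs
```

A fresh RegExp is created for each word, so no lastIndex state is shared between tests even if the g flag is passed.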

regex - Match a portion of text given begin and end of the portion in Javascript

I need to trim a very long string that can change over time. Since it's HTML, I can use tag and attribute names to cut it regardless of the content. Unfortunately, I can't find a way to write the regex match. Given the following example:
This is (random characters) an example (random characters)
How can I match the (random characters) and "This is" using the rest, which is always the same? I've tried something along the lines of the following:
^(This is)((.|\s)*an)$
This is^(?!.*(an))
but everything seems to fail. I think the "any character or space in between" part makes the search run right to the end of the string, so I miss the "an" part, but I can't figure out how to add an exception for that.
I don't know JavaScript, so I'll write this in loosely C-like code and assume equivalent functions exist in some form:
string input = "This is (random characters) an example (random characters)";
string pattern = "(^This is .*) an example (.*$)";
RegexMatch match = Regex.Match( input, pattern );
string group0 = match.GetGroup(0);//this should contain the whole input
string group1 = match.GetGroup(1);//this should get the first part: This is (random characters)
string group2 = match.GetGroup(2);//this should get the second part: (random characters) at the end of the input string
Note: in regular expressions, parentheses normally create capture groups.
Lookbehind would be good for this, but JavaScript did not support it when this answer was written (lookbehind assertions were added to the language in ES2018). In any case, you can use a RegExp and capturing groups to get the result you want.
let matchedGroups = new RegExp(/^This is (.+) an example (.+)\.$/, 'g')
matchedGroups.exec('This is (random characters) an example (random characters).')
This returns an array:
0:"This is (random characters) an example (random characters)."
1:"(random characters)"
2:"(random characters)"
As you can see this is a little clunky, but will get you two strings that you can use.
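If the variable parts can contain line breaks, as HTML often does, the dot will not match them by default. Here is a minimal sketch of one way around that, assuming the fixed text is exactly "This is" and " an example ": [\s\S] matches any character including newlines, and the lazy *? stops at the first occurrence of the fixed text instead of running to the end of the string. The sample input is made up for illustration.

```javascript
// Sample input with a line break inside the first variable part (made up).
const input = "This is <b>some\nmarkup</b> an example <i>more text</i>";

// [\s\S] matches any character, including newlines; *? is lazy.
const portionRx = /^(This is) ([\s\S]*?) an example ([\s\S]*)$/;
const match = portionRx.exec(input);

if (match) {
  console.log(match[1]); // "This is"
  console.log(match[2]); // the first variable part, newline included
  console.log(match[3]); // the second variable part
}
```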

How to parse and capture any measurement unit

In my application, users can customize measurement units, so if they want to work in decimeters instead of inches or in full-turns instead of degrees, they can. However, I need a way to parse a string containing multiple values and units, such as 1' 2" 3/8. I've seen a few regular expressions on SO and didn't find any which matched all cases of the imperial system, let alone allowing any kind of unit. My objective is to have the most permissive input box possible.
So my question is: how can I extract multiple value-unit pairs from a string in the most user-friendly way?
I came up with the following algorithm:
1. Check for illegal characters and throw an error if needed.
2. Trim leading and trailing spaces.
3. Split the string into parts every time there's a non-digit character followed by a digit character, except for .,/ which are used to identify decimals and fractions.
4. Remove all spaces from parts, check for character misuse (multiple decimal points or fraction bars) and replace '' with ".
5. Split value and unit-string for each part. If a part has no unit:
   - If it is the first part, use the default unit.
   - Else if it is a fraction, consider it as the same unit as the previous part.
   - Else if it isn't, consider it as in, cm or mm based on the previous part's unit.
   - If it isn't the first part and there's no way to guess the unit, throw an error.
6. Check if units mean something, are all of the same system (metric/imperial) and follow a descending order (ft > in > fraction or m > cm > mm > fraction), throw an error if not.
7. Convert and sum all parts, performing division in the process.
I guess I could use string manipulation functions to do most of this, but I feel like there must be a simpler way through regex.
I came up with a regex:
((\d+('|''|"|m|cm|mm|\s|$) *)+(\d+(\/\d+)?('|''|"|m|cm|mm|\s|$) *)?)|((\d+('|''|"|m|cm|mm|\s) *)*(\d+(\/\d+)?('|''|"|m|cm|mm|\s|$) *))
It only allows fractions at the end and allows to place spaces between values. I've never used regex capturing though, so I'm not so sure how I'll manage to extract the values out of this mess. I'll work again on this tomorrow.
"My objective is to have the most permissive input box possible."
Careful, more permissive doesn't always mean more intuitive. An ambiguous input should warn the user, not pass silently, as that might lead them to make multiple mistakes before they realize their input wasn't interpreted like they hoped.
"How can I extract multiple value-unit pairs from a string? I guess I could use string manipulation functions to do most of this, but I feel like there must be a simpler way through regex."
Regular expressions are a powerful tool, especially since they work in many programming languages, but be warned. When you're holding a hammer everything starts to look like a nail. Don't try to use a regular expression to solve every problem just because you recently learned how they work.
Looking at the pseudocode you wrote, you are trying to solve two problems at once: splitting up a string (which we call tokenization) and interpreting input according to a grammar (which we call parsing). You should first try to split up the input into a list of tokens, or maybe unit-value pairs. You can start making sense of these pairs once you're done with string manipulation. Separation of concerns will spare you a headache, and your code will be much easier to maintain as a result.
"I've never used regex capturing though, so I'm not so sure how I'll manage to extract the values out of this mess."
If a regular expression has the global (g) flag, it can be used to find multiple matches in the same string. That would be useful if you had a regular expression that finds a single unit-value pair. In JavaScript, you can retrieve a list of matches using string.match(regex). However, that function ignores capture groups on global regular expressions.
If you want to use capture groups, you need to call regex.exec(string) inside a loop. For each successful match, the exec function will return an array where item 0 is the entire match and items 1 and onwards are the captured groups.
For example, /(\d+) ([a-z]+)/g will look for an integer followed by a space and a word. If you made successive calls to regex.exec("1 hour 30 minutes") you would get:
["1 hour", "1", "hour"]
["30 minutes", "30", "minutes"]
null
Successive calls work like this because the regex object keeps an internal cursor you can get or set with regex.lastIndex. You should set it back to 0 before using the regex again with a different input.
You've been using parentheses to isolate OR clauses such as a|b and to apply quantifiers to a character sequence such as (abc)+. If you want to do that without creating capture groups, you can use (?: ) instead. This is called a non-capturing group. It does the same thing as regular parentheses in a regex, but what's inside it won't create an entry in the returned array.
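The exec loop, the lastIndex reset and a non-capturing group can be put together in a short sketch. The pattern and the extractPairs helper are made up for illustration; the pattern only knows about hours and minutes.

```javascript
// Two capture groups (value, unit); the inner (?: ) groups the alternation
// without adding a third entry to the match array.
const pairRx = /(\d+) ((?:hour|minute)s?)/g;

function extractPairs(input) {
  pairRx.lastIndex = 0; // reset the cursor before scanning a new input
  const pairs = [];
  let match;
  while ((match = pairRx.exec(input)) !== null) {
    pairs.push({ value: Number(match[1]), unit: match[2] });
  }
  return pairs;
}

console.log(extractPairs("1 hour 30 minutes"));
// [ { value: 1, unit: 'hour' }, { value: 30, unit: 'minutes' } ]
```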
"Is there a better way to approach this?"
A previous version of this answer concluded with a regular expression even more incomprehensible than the one posted in the question because I didn't know better at the time, but today this would be my recommendation. It's a regular expression that only extracts one token at a time from the input string.
/ (\s+) // 1 whitespace
| (\d+)\/(\d+) // 2,3 fraction
| (\d*)([.,])(\d+) // 4,5,6 decimal
| (\d+) // 7 integer
| (km|cm|mm|m|ft|in|pi|po|'|") // 8 unit
/gi
Sorry about the weird syntax highlighting. I used whitespace to make this more readable but properly formatted it becomes:
/(\s+)|(\d+)\/(\d+)|(\d*)([.,])(\d+)|(\d+)|(km|cm|mm|m|ft|in|pi|po|'|")/gi
This regular expression makes clever use of capture groups separated by OR clauses. Only the capture groups of one type of token will contain anything; the others will be undefined. For example, on the string "10 ft", successive calls to exec would return:
["10", undefined, undefined, undefined, undefined, undefined, undefined, "10", undefined] (because "10" is an integer)
[" ", " ", undefined, undefined, undefined, undefined, undefined, undefined, undefined] (because " " is whitespace)
["ft", undefined, undefined, undefined, undefined, undefined, undefined, undefined, "ft"] (because "ft" is a unit)
null
A tokenizer function can then do something like this to treat each individual token:
// the token regular expression shown above
const tokenRx = /(\s+)|(\d+)\/(\d+)|(\d*)([.,])(\d+)|(\d+)|(km|cm|mm|m|ft|in|pi|po|'|")/gi;

function tokenize (input) {
  // clone the regex so each tokenizer keeps its own lastIndex cursor
  const localTokenRx = new RegExp(tokenRx);
  return function next () {
    const startIndex = localTokenRx.lastIndex;
    if (startIndex >= input.length) {
      // end of input reached
      return undefined;
    }
    const match = localTokenRx.exec(input);
    if (!match) {
      localTokenRx.lastIndex = input.length;
      // there is leftover garbage at the end of the input
      return ["garbage", input.slice(startIndex)];
    }
    if (match.index !== startIndex) {
      localTokenRx.lastIndex = match.index;
      // the regex skipped over some garbage
      return ["garbage", input.slice(startIndex, match.index)];
    }
    const [
      text,
      whitespace,
      numerator, denominator,
      integralPart, decimalSeparator, fractionalPart,
      integer,
      unit
    ] = match;
    if (whitespace) {
      return ["whitespace", undefined];
      // or return next(); if we want to ignore it
    }
    if (denominator) {
      return ["fraction", Number(numerator) / Number(denominator)];
    }
    if (decimalSeparator) {
      return ["decimal", Number(integralPart + "." + fractionalPart)];
    }
    if (integer) {
      return ["integer", Number(integer)];
    }
    if (unit) {
      return ["unit", unit];
    }
  };
}
This function can do all the necessary string manipulation and type conversion all in one place, letting another piece of code do proper analysis of the sequence of tokens. But that would be out of scope for this Stack Overflow answer, especially since the question doesn't specify the rules of the grammar we are willing to accept.
But this is most likely too generic and complex of a solution if all you're trying to do is accept imperial lengths and metric lengths. For that, I'd probably only write a different regular expression for each acceptable format, then test the user's input to see which one matches. If two different expressions match, then the input is ambiguous and we should warn the user.
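The one-regex-per-format idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the lengthFormats table, its three imperial patterns, the parseLength name and the choice to return inches. A real application would build the table from the units its users have configured.

```javascript
// Hypothetical format table: one anchored pattern per accepted format.
const lengthFormats = [
  { name: "feet-inches", rx: /^(\d+)'\s*(\d+)"$/, toInches: (m) => Number(m[1]) * 12 + Number(m[2]) },
  { name: "inches", rx: /^(\d+)\s*(?:"|in)$/i, toInches: (m) => Number(m[1]) },
  { name: "feet", rx: /^(\d+)\s*(?:'|ft)$/i, toInches: (m) => Number(m[1]) * 12 },
];

// Try every format; accept the input only when exactly one of them matches.
function parseLength(input, formats = lengthFormats) {
  const text = input.trim();
  const hits = [];
  for (const format of formats) {
    const match = format.rx.exec(text);
    if (match) hits.push({ format, match });
  }
  if (hits.length === 0) {
    return { ok: false, error: "unrecognized" };
  }
  if (hits.length > 1) {
    // more than one interpretation: warn the user instead of guessing
    return { ok: false, error: "ambiguous", candidates: hits.map((h) => h.format.name) };
  }
  const { format, match } = hits[0];
  return { ok: true, format: format.name, inches: format.toInches(match) };
}

console.log(parseLength("1' 2\""));       // ok, feet-inches, 14 inches
console.log(parseLength("14 in"));        // ok, inches, 14 inches
console.log(parseLength("14 furlongs"));  // not ok, unrecognized
```

Returning the list of candidate formats in the ambiguous case gives the caller enough information to ask the user which one was meant.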

Regular Expression with multiple words (in any order) without repeat

I'm trying to execute a search of sorts (using JavaScript) on a list of strings. Each string in the list has multiple words.
A search query may also include multiple words, but the ordering of the words should not matter.
For example, on the string "This is a random string", the query "trin and is" should match. However, these terms cannot overlap. For example, "random random" as a query on the same string should not match.
I'm going to be sorting the results based on relevance, but I should have no problem doing that myself, I just can't figure out how to build up the regular expression(s). Any ideas?
The query trin and is becomes the following regular expression:
/trin.*(?:and.*is|is.*and)|and.*(?:trin.*is|is.*trin)|is.*(?:trin.*and|and.*trin)/
In other words, don't use regular expressions for this.
It probably isn't a good idea to do this with just a regular expression. A (pure, computer science) regular expression "can't count". The only "memory" it has at any point is the state of the DFA. To match multiple words in any order without repeat you'd need on the order of 2^n states. So probably a really horrible regex.
(Aside: I mention "pure, computer science" regular expressions because most implementations are actually an extension, and let you do things that are non-regular. I'm not aware of any extensions, certainly none in JavaScript, that make doing what you want to do any less painful with a single pattern.)
A better approach would be to keep a dictionary (Object, in JavaScript) that maps from words to counts. Initialize it to your set of words with the appropriate counts for each. You can use a regular expression to match words, and then for each word you find, decrement the corresponding entry in the dictionary. If the dictionary contains any non-zero values at the end, or if somewhere along the way you try to over-decrement a value (or decrement one that doesn't exist), then you have a failed match.
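A literal reading of that description could be sketched as follows. The function name is hypothetical, and the sketch makes two assumptions that go beyond the answer: words are runs of \w characters, and a Map stands in for the plain Object. Taken literally, the approach compares whole words, so unlike the question's "trin" example it does not match substrings, and any extra word in the text counts as a failure.

```javascript
// True when the text consists of exactly the given words, in any order,
// each used as many times as it appears in the list.
function matchesWordsInAnyOrder(text, words) {
  // dictionary from word to the number of occurrences still expected
  const remaining = new Map();
  for (const word of words) {
    remaining.set(word, (remaining.get(word) || 0) + 1);
  }
  for (const found of text.match(/\w+/g) || []) {
    const left = remaining.get(found);
    if (!left) {
      // unknown word, or one more occurrence than expected
      return false;
    }
    remaining.set(found, left - 1);
  }
  // every expected word must have been used up
  return [...remaining.values()].every((count) => count === 0);
}

console.log(matchesWordsInAnyOrder("is and trin", ["trin", "and", "is"])); // true
console.log(matchesWordsInAnyOrder("is and is", ["trin", "and", "is"]));   // false: "is" repeated
console.log(matchesWordsInAnyOrder("is and", ["trin", "and", "is"]));      // false: "trin" missing
```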
I'm totally not sure if I get you right there, so I'll just post my suggestion for it.
var query = "trin and is",
    target = "This is a random string",
    search = { },
    matches = 0;
query.split( /\s+/ ).forEach(function( word ) {
  search[ word ] = true;
});
Object.keys( search ).forEach(function( word ) {
  matches += +new RegExp( word ).test( target );
});
// do something useful with "matches" for the query, should be "3"
alert( matches );
So, the variable matches will contain the number of unique matches for the query. The first split-loop just makes sure that no "doubles" are counted since we would overwrite our search object. The second loop checks for the individual words within the target string and uses the nifty + to cast the result (either true or false) into a number, hence, +1 on a match or +0.
I was looking for a solution to this issue and none of the solutions presented here was good enough, so this is what I came up with:
function filterMatch(itemStr, keyword){
  var words = keyword.split(' '), i = 0, w, reg;
  for(; w = words[i++] ;){
    reg = new RegExp(w, 'ig');
    if (reg.test(itemStr) === false) return false; // word not found
    itemStr = itemStr.replace(reg, ''); // remove matched word from original string
  }
  return true;
}
// test
filterMatch('This is a random string', 'trin and is'); // true
filterMatch('This is a random string', 'trin not is'); // false
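One possible variant of the function above is sketched here under a different name (filterMatchOnce, hypothetical). It differs in three ways, all of them assumptions about what the question wants: query words are escaped so they are matched literally, only one occurrence is consumed per query word (so "is is" can match a string that contains "is" twice), and the consumed text is replaced by a space so that the pieces on either side cannot join up into a new match.

```javascript
// Escape regex metacharacters so a query word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function filterMatchOnce(itemStr, keyword) {
  let rest = itemStr;
  for (const word of keyword.split(/\s+/).filter(Boolean)) {
    const rx = new RegExp(escapeRegExp(word), "i"); // no g flag: first occurrence only
    if (!rx.test(rest)) return false;               // word not found in what is left
    // Query words never contain whitespace, so a space is a safe separator.
    rest = rest.replace(rx, " ");
  }
  return true;
}

console.log(filterMatchOnce("This is a random string", "trin and is"));   // true
console.log(filterMatchOnce("This is a random string", "is is"));         // true
console.log(filterMatchOnce("This is a random string", "random random")); // false
```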

Regular Expression to match given word in last five words of pipe-delimited string

Say we have a string
blue|blue|green|blue|blue|yellow|yellow|blue|yellow|yellow|
And we want to figure out whether the word "yellow" occurs in the last 5 words of the string, specifically by returning a capture group containing these occurrences, if any.
Is there a way to do that with a regex?
Update: I'm feeding a regex engine some rules. For various reasons I'm trying to work with the engine rather than go outside it, which would be my last resort.
/\b(yellow)\|(?=(?:\w+\|){0,4}$)/g
This will return one hit for each yellow| that's followed by fewer than five words (per your definition of "word"). This assumes the sequence always ends with a pipe; if that's not the case, you might want to change it to:
/\b(yellow)(?=(?:\|\w+){0,4}\|?$)/g
EDIT (in response to comment): The definition of a "word" in this solution is arbitrary, and doesn't really correspond to real-world usage. To allow for hyphenated words like "real-world" you could use this:
/\b(yellow)\|(?=(?:\w+(?:-\w+)*\|){0,4}$)/g
...or, for this particular job, you could define a word as one or more of any characters except pipes:
/\b(yellow)\|(?=(?:[^|]+\|){0,4}$)/g
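As a quick check, the last pattern can be run over the question's string and the hits collected. This sketch uses String.prototype.matchAll (ES2020); hitsInLastFive is a hypothetical helper name.

```javascript
// The "any characters except pipes" variant from above.
const lastFiveRx = /\b(yellow)\|(?=(?:[^|]+\|){0,4}$)/g;

// Collect the captured word and its position for every hit.
function hitsInLastFive(input) {
  return Array.from(input.matchAll(lastFiveRx), (m) => ({ word: m[1], index: m.index }));
}

const colors = "blue|blue|green|blue|blue|yellow|yellow|blue|yellow|yellow|";
console.log(hitsInLastFive(colors).length);              // 4
console.log(hitsInLastFive("yellow|a|b|c|d|e|").length); // 0: five words follow it
```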
No need to use a Regex for such a simple thing.
Simply split on the pipe, and check with indexOf:
var group = 'blue|blue|green|blue|blue|yellow|yellow|blue|yellow|yellow';
if ( group.split('|').slice(-5).indexOf('yellow') == -1 ) {
  alert('Not there :(');
} else {
  alert('Found!!!');
}
Note: indexOf is not natively supported in IE < 9, but support for it can be added very easily.
Can't think of a way to do this with a single regular expression, but you can form one for each of the last five positions and sum the matches.
var string = "blue|blue|green|blue|blue|yellow|yellow|blue|yellow|yellow|";
var regexes = [];
regexes.push(/(yellow)\|[^|]+\|[^|]+\|[^|]+\|[^|]+\|$/);
regexes.push(/(yellow)\|[^|]+\|[^|]+\|[^|]+\|$/);
regexes.push(/(yellow)\|[^|]+\|[^|]+\|$/);
regexes.push(/(yellow)\|[^|]+\|$/);
regexes.push(/(yellow)\|$/);
var count = 0;
var regex;
while (regex = regexes.shift()) {
  if (string.match(regex)) {
    count++;
  }
}
console.log(count);
Should find four matches.
