Regular Expression with multiple words (in any order) without repeat - javascript

I'm trying to execute a search of sorts (using JavaScript) on a list of strings. Each string in the list has multiple words.
A search query may also include multiple words, but the ordering of the words should not matter.
For example, on the string "This is a random string", the query "trin and is" should match. However, these terms cannot overlap. For example, "random random" as a query on the same string should not match.
I'm going to be sorting the results based on relevance, but I should have no problem doing that myself; I just can't figure out how to build up the regular expression(s). Any ideas?

The query trin and is becomes the following regular expression:
/trin.*(?:and.*is|is.*and)|and.*(?:trin.*is|is.*trin)|is.*(?:trin.*and|and.*trin)/
In other words, don't use regular expressions for this.

It probably isn't a good idea to do this with just a regular expression. A (pure, computer science) regular expression "can't count". The only "memory" it has at any point is the state of the DFA. To match multiple words in any order without repeat you'd need on the order of 2^n states. So probably a really horrible regex.
(Aside: I mention "pure, computer science" regular expressions because most implementations are actually an extension, and let you do things that are non-regular. I'm not aware of any extensions, certainly none in JavaScript, that make doing what you want to do any less painful with a single pattern.)
A better approach would be to keep a dictionary (Object, in JavaScript) that maps from words to counts. Initialize it to your set of words with the appropriate counts for each. You can use a regular expression to match words, and then for each word you find, decrement the corresponding entry in the dictionary. If the dictionary contains any non-zero values at the end, or if somewhere along the way you try to over-decrement a value (or decrement one that doesn't exist), then you have a failed match.
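A minimal sketch of the counting idea, under some assumptions that are not in the answer: here the counts are built from the target string and decremented once per query word, so leftover target words are allowed, matching is case-insensitive, and only whole words count (so "trin" would not match "string"). The function name queryWordsAvailable is made up for illustration.

```javascript
// Sketch: whole-word, case-insensitive matching with per-word counts.
function queryWordsAvailable(target, query) {
  // Count how many times each word occurs in the target string.
  const available = new Map();
  for (const word of target.toLowerCase().match(/\w+/g) || []) {
    available.set(word, (available.get(word) || 0) + 1);
  }
  // Use up one occurrence per query word.
  for (const word of query.toLowerCase().match(/\w+/g) || []) {
    const left = available.get(word) || 0;
    if (left === 0) return false; // word missing, or already used up
    available.set(word, left - 1);
  }
  return true;
}

console.log(queryWordsAvailable("This is a random string", "random is"));
console.log(queryWordsAvailable("This is a random string", "random random"));
```

Because each occurrence can only be consumed once, a repeated query word needs a repeated word in the target.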

I'm totally not sure if I get you right there, so I'll just post my suggestion for it.
var query = "trin and is",
    target = "This is a random string",
    search = { },
    matches = 0;
query.split( /\s+/ ).forEach(function( word ) {
    search[ word ] = true;
});
Object.keys( search ).forEach(function( word ) {
    matches += +new RegExp( word ).test( target );
});
// do something useful with "matches" for the query, should be "3"
alert( matches );
So, the variable matches will contain the number of unique matches for the query. The first split-loop just makes sure that no "doubles" are counted, since we would overwrite the entry in our search object. The second loop checks for the individual words within the target string and uses the nifty + to cast the result (either true or false) into a number, hence +1 on a match or +0 otherwise.

I was looking for a solution to this issue and none of the solutions presented here was good enough, so this is what I came up with:
function filterMatch(itemStr, keyword){
    var words = keyword.split(' '), i = 0, w, reg;
    for(; w = words[i++] ;){
        // no 'g' flag: only the first occurrence is removed, so a repeated
        // query word can still match a second occurrence in the string
        reg = new RegExp(w, 'i');
        if (reg.test(itemStr) === false) return false; // word not found
        itemStr = itemStr.replace(reg, ''); // remove matched word from original string
    }
    return true;
}
// test
filterMatch('This is a random string', 'trin and is'); // true
filterMatch('This is a random string', 'trin not is'); // false
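Removing matched text as you go can pick an occurrence that blocks a later term. A hedged alternative is a small backtracking search that tries every occurrence of each term and keeps track of which character positions are already taken. This is only a sketch: the name matchesWithoutOverlap is made up, terms are treated as plain substrings (not regexes), and matching is case-insensitive.

```javascript
// Sketch: every query term must occur as a substring, with no shared characters.
function matchesWithoutOverlap(target, query) {
  const text = target.toLowerCase();
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const used = new Array(text.length).fill(false); // positions already claimed

  function place(termIndex) {
    if (termIndex === terms.length) return true; // all terms placed
    const term = terms[termIndex];
    let from = text.indexOf(term);
    while (from !== -1) {
      const to = from + term.length;
      if (!used.slice(from, to).includes(true)) {
        used.fill(true, from, to);             // claim this occurrence
        if (place(termIndex + 1)) return true;
        used.fill(false, from, to);            // backtrack
      }
      from = text.indexOf(term, from + 1);     // try the next occurrence
    }
    return false;
  }

  return place(0);
}

console.log(matchesWithoutOverlap("This is a random string", "trin and is"));
console.log(matchesWithoutOverlap("This is a random string", "random random"));
```

The search is exponential in the worst case, which is usually acceptable for short queries.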

Related

Quickly check if two regular expressions share matches

Say I have a list of regular expressions, which match filepaths:
{
  "list":[
    "^/foo/bar/baz/x",
    "^/foo/bar/baz/y",
    "^/foo/mon/choo$",
    ...
    "^/foo/.*"
  ]
}
Note that at runtime, this will happen:
let regexes = list.map(function(l){
    return new RegExp(l);
});
I need to create a routine to quickly check if two or more of the regular expressions match the same input.
Is there a way to quickly check if an imaginary/potential filepath would match more than one regular expression in the list?
For example, the regular expression /foo/.* will match the first 3 items, and therefore that represents an error in my program.
Use case: the user is expected to create a list of regular expressions, but they have to be exclusive regular expressions which do not share any matches.
I could check this with actual input, but I am wondering if there is a way to check this with theoretical input as well. (I am hoping that the latter would be faster.)
The "hard" way: I have a list of files. For each file I check to see if it matches any of the regular expressions in the list. If it matches more than 1 in the list, I throw an error.
The problem with the hard way is that I would like to validate the list before using any real input data.
Since you're using an array, it's possible to have duplicates of the exact regex, so maybe you want to use an object with keys instead, or just a Set.
Along with that, you could actually use the regexes themselves to test each other. In my example below I'm only checking the patterns that contain .* or .+, but if you really want to be comprehensive, you could run each regex against every other regex. I didn't do this because it might have a really long run time, but that's up to you.
var list = [
    "/foo/bar/baz/x",
    "/foo/bar/baz/y",
    "/foo/mon/choo$",
    "/foo/.*"
];
// Same list with one duplicate entry added, to demonstrate the check
var error_list = [...list, "/foo/mon/choo$"];
let set = new Set(error_list);
// A Set exposes its element count as .size, not .length
console.log(set.size === error_list.length, "\"if false it means there's duplicates\"");
// Keep only the patterns that contain a wildcard (.* or .+)
var regexes = [];
for (var regex of list){
    if (regex.match(/\.(\*|\+)/)){
        regexes.push(regex);
    }
}
loop:
for (var regex of regexes){
    var r = new RegExp("^"+regex);
    for (let test of list){
        if (test.match(r) && regex !== test){
            console.log(test, "this matched");
            // break loop;
        }
    }
}
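For comparison, the "hard way" from the question (checking real paths against every pattern) could look roughly like the sketch below. The function name findOverlaps and the sample paths are made up for illustration, and this only detects overlaps that the supplied paths happen to trigger.

```javascript
// Sketch: report every path that is matched by more than one pattern.
function findOverlaps(patterns, paths) {
  const regexes = patterns.map((p) => new RegExp(p));
  const overlaps = [];
  for (const path of paths) {
    // Collect the source strings of all patterns that match this path.
    const hits = patterns.filter((_, i) => regexes[i].test(path));
    if (hits.length > 1) overlaps.push({ path, hits });
  }
  return overlaps;
}

const overlaps = findOverlaps(
  ["^/foo/bar/baz/x", "^/foo/mon/choo$", "^/foo/.*"],
  ["/foo/bar/baz/x/1", "/foo/other", "/bar"]
);
console.log(overlaps);
```

The regexes are created without the g flag, so calling test() repeatedly does not carry state between paths.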

How to store only the nth substring into a variable in Javascript

var a="how are you?";
In the above example I want to store the second word "are" into another variable in a single step.
I don't want to use something like below
var bigArray = a.split(" ");
var secondText = bigArray[1];
as we may need to store the entire paragraph into a big array and consume a lot of memory without any use.
I would like to know if there is some function which works as below
var secondText=specialFunction(a," ",1);
so that we will get the second substring when the paragraph is split by " "
Well, I would spend my time worrying about more important things than the size of some arrays.
Anyway, you could try using a regexp:
var secondText = (a.match(/ (\w+)/) || []) [1];
This reads as "find a space, then capture the following word".
The || [] part is meant to deal with the situation where there is no match (for example, no second word). In that case, the result will be [][1] which is undefined.
This finds only the second word. What about the more general case? We are not allowed to split the string on spaces, because that would create an array and the OP doesn't want that due to memory concerns, so we will instead build a dynamic regexp. To find the nth word, we want to skip over the first n-1 spaces. Or, to be more precise, we want to skip over the first word, some spaces, then the second word, then some more spaces, etc. So the regexp is
/(?:\w+ ){n}(\w+)/
  ^^                 NO CAPTURING GROUP
    ^^^^             WORD FOLLOWED BY SPACE
         ^^^         N TIMES
            ^^^^^    CAPTURE FOLLOWING WORD
The ?: is to avoid this being treated as a capturing group. We build the regexp using
function make_nth_word_regexp(n) {
    n--; // skip n-1 words before capturing the nth
    return new RegExp("(?:\\w+ ){" + n + "}(\\w+)");
}
Now look for your nth word:
var fifth_word = str.match(make_nth_word_regexp(5)) [1];
> "Hey there you".match(make_nth_word_regexp(3))[1]
< "you"
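An alternative to a regex is to use substring() together with indexOf(). Something like
var a = "how are you";
var start = a.indexOf(" ") + 1;  // first character after the first space
var end = a.indexOf(" ", start); // next space, or -1 if there is none
alert(a.substring(start, end === -1 ? a.length : end)); // "are"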
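The substring idea can be generalized to something like the specialFunction the question asks for. The following is a sketch only: the name nthField is made up, n is zero-based as in the question's example call, and the separator is assumed to be a non-empty plain string.

```javascript
// Sketch: return the n-th (zero-based) separator-delimited piece without
// building an array of all pieces.
function nthField(text, separator, n) {
  let start = 0;
  // Walk past n separators to find where the wanted piece begins.
  for (let i = 0; i < n; i++) {
    const next = text.indexOf(separator, start);
    if (next === -1) return undefined; // fewer than n + 1 pieces
    start = next + separator.length;
  }
  const end = text.indexOf(separator, start);
  return end === -1 ? text.slice(start) : text.slice(start, end);
}

console.log(nthField("how are you?", " ", 1));
```

Only the requested piece is ever copied out of the original string.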

Get id from url

I have the following example url: #/reports/12/expense/11.
I need to get the id just after reports -> 12. What I am asking here is the most suitable way to do this. I can search for reports in the url and get the content just after that ... but what if at some point I decide to change the url? I will have to change my algorithm.
What do you think is the best way here? Some code examples would also be very helpful.
It's hard to write code that is future-proof since it's hard to predict the crazy things we might do in the future!
However, if we assume that the id will always be the first string of consecutive digits in the URL, then you could simply look for that:
function getReportId(url) {
    var match = url.match(/\d+/);
    return (match) ? Number(match[0]) : null;
}
getReportId('#/reports/12/expense/11'); // => 12
getReportId('/some/new/url/report/12'); // => 12
You should use a regular expression to find the number inside the string. Passing the regular expression to the string's .match() method will return an array containing the matches based on the regular expression. In this case, the item of the returned array that you're interested in will be at the index of 1, assuming that the number will always be after reports/:
var text = "#/reports/12/expense/11";
var id = text.match(/reports\/(\d+)/);
alert(id[1]);
\d+ here means that you're looking for one or more consecutive digits.
var text = "#/reports/12/expense/11";
var id = text.match("#/[a-zA-Z]*/([0-9]*)/[a-zA-Z]*/")
console.log(id[1])
Regex explanation:
#/ matches the characters #/ literally
[a-zA-Z]* - matches a word
/ matches the character / literally
1st Capturing group - ([0-9]*) - this matches a number.
[a-zA-Z]* - matches a word
/ matches the character / literally
Regular expressions can be tricky (and expensive). So usually if you can efficiently do the same thing without them, you should. Looking at your URL format, you would probably want to put at least a few constraints on it, otherwise the problem will be very complex. For instance, you probably want to assume the value will always appear directly after the key, so in your sample reports=12 and expense=11, but reports and expense could be switched (ex. expense/11/reports/12) and you would get the same result.
I would just use string split:
var parts = url.split("/");
for(var i = 0; i < parts.length; i++) {
    if(parts[i] === "reports"){
        this.reportValue = parts[i+1];
        i++; // skip the value; the loop's own i++ then moves to the next key
    } else if(parts[i] === "expense"){
        this.expenseValue = parts[i+1];
        i++;
    }
}
So this way your key/value parts can appear anywhere in the array
Note: you will also want to check that i+1 is in the range of the parts array. But that would just make this sample code ugly and it is pretty easy to add in. Depending on what values you are expecting (or not expecting) you might also want to check that values are numbers using isNaN
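A sketch of the same split-based idea with the range check and the numeric check from the note added. The function name parseKeyValuePath and the idea of passing the keys as an array are assumptions made for illustration.

```javascript
// Sketch: collect numeric values that directly follow known keys in a path.
function parseKeyValuePath(url, keys) {
  const parts = url.split("/");
  const values = {};
  // i + 1 < parts.length guarantees that parts[i + 1] exists.
  for (let i = 0; i + 1 < parts.length; i++) {
    if (keys.includes(parts[i])) {
      const value = Number(parts[i + 1]);
      // Number("") is 0, so empty segments are rejected explicitly.
      if (parts[i + 1] !== "" && !Number.isNaN(value)) {
        values[parts[i]] = value;
        i++; // skip the value segment
      }
    }
  }
  return values;
}

console.log(parseKeyValuePath("#/reports/12/expense/11", ["reports", "expense"]));
```

Returning a plain object keeps the function independent of any surrounding `this`.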

Shared part in RegEx matched string

In following code:
"a sasas b".match(/sas/g) //returns ["sas"]
The string actually includes two sas strings, a [sas]as b and a sa[sas] b.
How can I modify RegEx to match both?
Another example:
"aaaa".match(/aa/g); //actually include [aa]aa,a[aa]a,aa[aa]
Please consider the issue in general not just above instances.
A pure RegEx solution is preferred.
If you want to match at least one such "merged" occurrence, then you could do something like:
"a sasas b".match(/s(as)+/g)
If you want to retrieve the matches as separate results, then you have a bit more work to do; this is not a case that regular expressions are designed to handle. The basic algorithm would be:
1. Attempt a match. If it was unsuccessful, stop.
2. Extract the match you are interested in and do whatever you want with it.
3. Take the substring of the original target string, starting from one character following the first character in your match.
4. Start over, using this substring as the new input.
(To be more efficient, you could match with an offset instead of using substrings; that technique is discussed in this question.)
For example, you would start with "a sasas b". After the first match, you have "sas". Taking the substring that starts one character after the match starts, we would have "asas b". The next match would find the "sas" here, and you would again repeat the process with "as b". This would fail to match, so you would be done.
This significantly-improved answer owes itself to @EliGassert.
String.prototype.match_overlap = function(re)
{
    if (!re.global)
        re = new RegExp(re.source,
                        'g' + (re.ignoreCase ? 'i' : '')
                            + (re.multiline ? 'm' : ''));
    var matches = [];
    var result;
    while (result = re.exec(this))
        matches.push(result),
        re.lastIndex = result.index + 1; // resume one character after the match start
    return matches.length ? matches : null;
}
@EliGassert points out that there is no need to walk through the entire string character by character; instead we can find a match anywhere (i.e. do without the anchor), and then continue one character after the index of the found match. While researching how to retrieve said index, I found that the re.lastIndex property, used by exec to keep track of where it should continue its search, is in fact settable! This works rather nicely with what we intend to do.
The only bit needing further explanation might be the beginning. In the absence of the g flag, exec may never return null (always returning its one match, if it exists), thus possibly going into an infinite loop. Since, however, match_overlap by design seeks multiple matches, we can safely recompile any non-global RegExp as a global RegExp, importing the i and m options as well if set.
Here is a new jsFiddle: http://jsfiddle.net/acheong87/h5MR5/.
document.write("<pre>");
document.write('sasas'.match_overlap(/sas/));
document.write("\n");
document.write('aaaa'.match_overlap(/aa/));
document.write("\n");
document.write('my1name2is3pilchard'.match_overlap(/[a-z]{2}[0-9][a-z]{2}/));
document.write("</pre>");
Output:
sas,sas
aa,aa,aa
my1na,me2is,is3pi
var match = "a sasas b".match(/s(?=as)/g);
for(var i = 0; i != match.length; ++i)
    alert(match[i]);
Going off of the comment by Q. Sheets and the response by cdhowie, I came up with the above solution: it consumes ONE character in the regular expression and does a lookahead for the rest of the match string. With these two pieces, you can construct all the positions and matching strings in your regular expression.
A lookahead inspects without consuming, so the rest of the matching (lookahead) string is not part of the matched text itself. It can still be retrieved by putting a capturing group inside the lookahead, e.g. /(?=(sas))/g, and reading group 1 of each match.
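One way to get the full text of each overlapping match out of a lookahead is to wrap the whole pattern in a capturing group inside the lookahead and read that group. The sketch below assumes an environment with String.prototype.matchAll (ES2020); the name overlappingMatches is made up, and the pattern is assumed to have no sticky flag and no numbered backreferences (wrapping it shifts group numbers by one).

```javascript
// Sketch: collect overlapping matches via a capturing group inside a lookahead.
function overlappingMatches(text, pattern) {
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  // The lookahead matches zero characters, so the search can restart at
  // the very next position; group 1 holds what the pattern would match.
  const re = new RegExp("(?=(" + pattern.source + "))", flags);
  return Array.from(text.matchAll(re), (m) => m[1]);
}

console.log(overlappingMatches("a sasas b", /sas/));
console.log(overlappingMatches("aaaa", /aa/));
```

At most one match is reported per starting position, which is the same granularity as the lastIndex-based loop.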
Here's a generic way to do it:
String.prototype.match_overlap = function(regexp)
{
    // Anchor the pattern at the start; the non-capturing group keeps alternations intact
    var re = new RegExp('^(?:' + regexp.source + ')');
    var matches = [];
    var result;
    for (var i = 0; i < this.length; i++)
        if (result = re.exec(this.substr(i)))
            matches.push(result);
    return matches.length ? matches : null;
}
Usage:
var results = 'sasas'.match_overlap(/sas/);
Returns:
An array of (overlapping) matches, or null.
Example:
Here's a jsFiddle in which this:
document.write("<pre>");
document.write('sasas'.match_overlap(/sas/));
document.write("\n");
document.write('aaaa'.match_overlap(/aa/));
document.write("\n");
document.write('my1name2is3pilchard'.match_overlap(/[a-z]{2}[0-9][a-z]{2}/));
document.write("</pre>");
returns this:
sas,sas
aa,aa,aa
my1na,me2is,is3pi
Explanation:
To explain a little bit, we intend for the user to pass a RegExp object to this new function, match_overlap, as he or she would do normally with match. From this we want to create a new RegExp object anchored at the beginning (to prevent duplicate overlapped matches—this part probably won't make sense unless you encounter the issue yourself—don't worry about it). Then, we simply match against each substring of the subject string this and push the results to an array, which is returned if non-empty (otherwise returning null). Note that if the user passes in an expression that is already anchored, this is inherently wrong—at first I stripped anchors out, but then I realized I was making an assumption in the user's stead, which we should avoid. Finally one could go further and somehow merge the resulting array of matches into a single match result resembling what would normally occur with the //g option; and one could go even further and make up a new flag, e.g. //o that gets parsed to do overlap-matching, but this is getting a little crazy.

Regular Expression to match given word in last five words of pipe-delimited string

Say we have a string
blue|blue|green|blue|blue|yellow|yellow|blue|yellow|yellow|
And we want to figure out whether the word "yellow" occurs in the last 5 words of the string, specifically by returning a capture group containing these occurrences, if any.
Is there a way to do that with a regex?
Update: I'm feeding a regex engine some rules. For various reasons I'm trying to work with the engine rather than go outside it, which would be my last resort.
/\b(yellow)\|(?=(?:\w+\|){0,4}$)/g
This will return one hit for each yellow| that's followed by fewer than five words (per your definition of "word"). This assumes the sequence always ends with a pipe; if that's not the case, you might want to change it to:
/\b(yellow)(?=(?:\|\w+){0,4}\|?$)/g
EDIT (in response to comment): The definition of a "word" in this solution is arbitrary, and doesn't really correspond to real-world usage. To allow for hyphenated words like "real-world" you could use this:
/\b(yellow)\|(?=(?:\w+(?:-\w+)*\|){0,4}$)/g
...or, for this particular job, you could define a word as one or more of any characters except pipes:
/\b(yellow)\|(?=(?:[^|]+\|){0,4}$)/g
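As a rough illustration of how the last pattern behaves in JavaScript, the sketch below builds it for an arbitrary word and window size and counts the hits. The helper name countInLastFields is made up; it assumes the string always ends with a pipe and that n is at least 1.

```javascript
// Sketch: count occurrences of `word` among the last n pipe-terminated fields.
function countInLastFields(text, word, n) {
  // Escape regex metacharacters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // A hit must be followed by at most n - 1 further fields before the end.
  const re = new RegExp(
    "\\b(" + escaped + ")\\|(?=(?:[^|]+\\|){0," + (n - 1) + "}$)",
    "g"
  );
  return (text.match(re) || []).length;
}

const colors = "blue|blue|green|blue|blue|yellow|yellow|blue|yellow|yellow|";
console.log(countInLastFields(colors, "yellow", 5));
```

Because the trailing fields are only inspected by the lookahead, adjacent hits such as yellow|yellow| are each counted.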
No need to use a Regex for such a simple thing.
Simply split on the pipe, and check with indexOf:
var group = 'blue|blue|green|blue|blue|yellow|yellow|blue|yellow|yellow';
if ( group.split('|').slice(-5).indexOf('yellow') == -1 ) {
    alert('Not there :(');
} else {
    alert('Found!!!');
}
Note: indexOf is not natively supported in IE < 9, but support for it can be added very easily.
Can't think of a way to do this with a single regular expression, but you can form one for each of the last five positions and sum the matches.
var string = "blue|blue|green|blue|blue|yellow|yellow|blue|yellow|yellow|";
var regexes = [];
// One pattern per position, from fifth-to-last down to last
regexes.push(/(yellow)\|[^|]+\|[^|]+\|[^|]+\|[^|]+\|$/);
regexes.push(/(yellow)\|[^|]+\|[^|]+\|[^|]+\|$/);
regexes.push(/(yellow)\|[^|]+\|[^|]+\|$/);
regexes.push(/(yellow)\|[^|]+\|$/);
regexes.push(/(yellow)\|$/);
var count = 0;
var regex;
while (regex = regexes.shift()) {
    if (string.match(regex)) {
        count++;
    }
}
console.log(count);
Should find four matches.
