How to check multiple matching words with regex in Javascript - javascript

Hey I have code like this
var text = "We are downing to earth"
var regexes = "earth|art|ear"
if (regexes.length) {
var reg = new RegExp(regexes, "ig");
console.log(reg)
while ((regsult = reg.exec(text)) !== null) {
var word = regsult[0];
console.log(word)
}
}
I want to get matching words from text. It should have "earth", "art" and "ear" as well. Because "earth" consist of those substring. Instead, it only produce "earth".
Is there any mistake with my regex pattern?
Or should I use another approach in JS?
Thanks

As discussed in another answer, a single regexp cannot match multiple overlapping alternatives. In your case, simply do a separate regexp test for each word you are looking for:
var text = "We are downing to earth"
var regexes = ["earth", "art", "ear"];
var results = [];
for (var i = 0; i < regexes.length; i++ ) {
var word = regexes[i];
if (text.match(word) results.push(word);
}
You could tighten this up a little bit by doing
regexes . filter(function(word) { return (text.match(word) || [])[0]; });
If your "regexes" are actually just strings, you could just use indexOf and keep things simpler:
regexes . filter(function(word) { return text.indexOf(word) !== -1; });

You only get earth as a match because the regex engine has matched earth as the first alternative and then moved on in the source string, oblivious to the fact that you could also have matched ear or art. This is expected behavior with all regex engines - they don't try to return all possible matches, just the first one, and matches generally can't overlap.
Whether earth or ear is returned depends on the regex engine. A POSIX ERE engine will always return the leftmost, longest match, whereas most current regex engines (including JavaScript's) will return the first possible match, depending on the order of alternation in the regex.
So art|earth|ear would return earth, whereas ear|art|earth would return ear.
You can make the regex find overlapping matches (as long as they start in different positions in the string) by using positive lookahead assertions:
(?=(ear|earth|art))
will find ear and art, but not earth because it starts at the same position as ear. Note that you must not look for the regex' entire match (regsult[0] in your code) in this case but for the content of the capturing group, in this case (regsult[1]).
The only way around this that I can think of at the moment would be to use
(?=(ear(th)?|art))
which would have a result like [["", "ear", "th"], ["", "art", undefined]].

Related

Regex to match string in a sentence

I am trying to find a strictly declared string in a sentence, the thread says:
Find the position of the string "ten" within a sentence, without using the exact string directly (this can be avoided in many ways using just a bit of RegEx). Print as many spaces as there were characters in the original sentence before the aforementioned string appeared, and then the string itself in lowercase.
I've gotten this far:
let words = 'A ton of tunas weighs more than ten kilograms.'
function findTheNumber(){
let regex=/t[a-z]*en/gi;
let output = words.match(regex)
console.log(words)
console.log(output)
}
console.log(findTheNumber())
The result should be:
input = A ton of tunas weighs more than ten kilograms.
output = ten(ENTER)
You could try a regex replacement approach, with the help of a callback function:
var input = "A ton of tunas weighs more than ten kilograms.";
var output = input.replace(/\w+/g, function(match, contents, offset, input_string)
{
if (!match.match(/^[t][e][n]$/)) {
return match.replace(/\w/g, " ");
}
else {
return match;
}
});
console.log(input);
console.log(output);
The above logic matches every word in the input sentence, and then selectively replaces every word which is not ten with an equal number of spaces.
You can use
let text = 'A ton of tunas weighs more than ten kilograms.'
function findTheNumber(words){
console.log( words.replace(/\b(t[e]n)\b|[^.]/g, (x,y) => y ?? " ") )
}
findTheNumber(text)
The \b(t[e]n)\b is basically ten whole word searching pattern.
The \b(t[e]n)\b|[^.] regex will match and capture ten into Group 1 and will match any char but . (as you need to keep it at the end). If Group 1 matches, it is kept (ten remains in the output), else the char matched is replaced with a space.
Depending on what chars you want to keep, you may adjust the [^.] pattern. For example, if you want to keep all non-word chars, you may use \w.

How to store only the nth substring into a variable in Javascript

var a="how are you?";
In the above example I want to store the second word "are" into another variable in a single step.
I don't want to use something like below
var bigArray = a.split(" ");
var secondText = bigArray[1];
as we may need to store the entire paragraph into a big array and consume a lot of memory without any use.
I would like to know if there is some function which works as below
var secondText=specialFunction(a," ",1);
so that we will get the second substring when the paragraph is split by " "
Well, I would spend my time worrying about more important things than the size of some arrays.
Anyway, you could try using a regexp:
var secondText = (a.match(/ (\w+)/) || []) [1];
This reads as "find a space, then capture the following word".
The || [] part is meant to deal with the situation where there is no match (for example, no second word). In that case, the result will be [][1] which is undefined.
This finds only the second word. What about the more general case? Since we are not allowed to split the string on spaces, because that would create an array and the OP doesn't want that due to memory concerns. So, we will instead build a dynamic regexp. To find the nth word, we want to skip over the first n-1 spaces. Or, to be more precise, we want to skip over the first word, some spaces, then the second word, then some more spaces, etc. So the regexp is
/(?:\w+ ){n}(\w+)/
^^ NO CAPTURING GROUP
^^^^ WORD FOLLOWED BY SPACE
^^^ N TIMES
^^^^^ CAPTURE FOLLOWING WORD
The ?: is to avoid this being treated as a capturing group. We build the regexp using
function make_nth_word_regexp(n) {
n--;
return new RegExp("(?:\\w+ ){" + n + "}(\\w+)");
}
Now look for your nth word:
var fifth_word = str.match(make_nth_word_regexp(5)) [1];
> "Hey there you".match(make_nth_word_regexp(3))[1]
< "you"
Alternative to regex is just to use substring(). Something like
var a="how are you";
alert(a.substring(a.indexOf(" "), a.length).substring(0, a.indexOf(" ")+1));

Shared part in RegEx matched string

In following code:
"a sasas b".match(/sas/g) //returns ["sas"]
The string actually include two sas strings, a [sas]as b and a sa[sas] b.
How can I modify RegEx to match both?
Another example:
"aaaa".match(/aa/g); //actually include [aa]aa,a[aa]a,aa[aa]
Please consider the issue in general not just above instances.
A pure RexEx solution is preferred.
If you want to match at least one such "merged" occurrence, then you could do something like:
"a sasas b".match(/s(as)+/g)
If you want to retrieve the matches as separate results, then you have a bit more work to do; this is not a case that regular expressions are designed to handle. The basic algorithm would be:
Attempt a match. If it was unsuccessful, stop.
Extract the match you are interested in and do whatever you want with it.
Take the substring of the original target string, starting from one character following the first character in your match.
Start over, using this substring as the new input.
(To be more efficient, you could match with an offset instead of using substrings; that technique is discussed in this question.)
For example, you would start with "a sasas b". After the first match, you have "sas". Taking the substring that starts one character after the match starts, we would have "asas b". The next match would find the "sas" here, and you would again repeat the process with "as b". This would fail to match, so you would be done.
This significantly-improved answer owes itself to #EliGassert.
String.prototype.match_overlap = function(re)
{
if (!re.global)
re = new RegExp(re.source,
'g' + (re.ignoreCase ? 'i' : '')
+ (re.multiline ? 'm' : ''));
var matches = [];
var result;
while (result = re.exec(this))
matches.push(result),
re.lastIndex = result.index + 1;
return matches.length ? matches : null;
}
#EliGassert points out that there is no need to walk through the entire string character by character; instead we can find a match anywhere (i.e. do without the anchor), and then continue one character after the index of the found match. While researching how to retrieve said index, I found that the re.lastIndex property, used by exec to keep track of where it should continue its search, is in fact settable! This works rather nicely with what we intend to do.
The only bit needing further explanation might be the beginning. In the absence of the g flag, exec may never return null (always returning its one match, if it exists), thus possibly going into an infinite loop. Since, however, match_overlap by design seeks multiple matches, we can safely recompile any non-global RegExp as a global RegExp, importing the i and m options as well if set.
Here is a new jsFiddle: http://jsfiddle.net/acheong87/h5MR5/.
document.write("<pre>");
document.write('sasas'.match_overlap(/sas/));
document.write("\n");
document.write('aaaa'.match_overlap(/aa/));
document.write("\n");
document.write('my1name2is3pilchard'.match_overlap(/[a-z]{2}[0-9][a-z]{2}/));
document.write("</pre>");​
Output:
sas,sas
aa,aa,aa
my1na,me2is,is3pi
var match = "a sasas b".match(/s(?=as)/g);
for(var i =0; i != match.length; ++i)
alert(match[i]);
Going off of the comment by Q. Sheets and the response by cdhowie, I came up with the above solution: it consumes ONE character in the regular expression and does a lookahead for the rest of the match string. With these two pieces, you can construct all the positions and matching strings in your regular expression.
I wish there was an "inspect but don't consume" operator that you could use to actually include the rest of the matching (lookahead) string in the results, but there unfortunately isn't -- at least not in JS.
Here's a generic way to do it:
​String.prototype.match_overlap = function(regexp)
{
regexp = regexp.toString().replace(/^\/|\/$/g, '');
var re = new RegExp('^' + regexp);
var matches = [];
var result;
for (var i = 0; i < this.length; i++)
if (result = re.exec(this.substr(i)))
matches.push(result);
return matches.length ? matches : null;
}
Usage:
var results = 'sasas'.match_overlap(/sas/);
Returns:
An array of (overlapping) matches, or null.
Example:
Here's a jsFiddle in which this:
document.write("<pre>");​
document.write('sasas'.match_overlap(/sas/));
document.write("\n");
document.write('aaaa'.match_overlap(/aa/));
document.write("\n");
document.write('my1name2is3pilchard'.match_overlap(/[a-z]{2}[0-9][a-z]{2}/));
document.write("</pre>");​
returns this:
sas,sas
aa,aa,aa
my1na,me2is,is3pi
Explanation:
To explain a little bit, we intend for the user to pass a RegExp object to this new function, match_overlap, as he or she would do normally with match. From this we want to create a new RegExp object anchored at the beginning (to prevent duplicate overlapped matches—this part probably won't make sense unless you encounter the issue yourself—don't worry about it). Then, we simply match against each substring of the subject string this and push the results to an array, which is returned if non-empty (otherwise returning null). Note that if the user passes in an expression that is already anchored, this is inherently wrong—at first I stripped anchors out, but then I realized I was making an assumption in the user's stead, which we should avoid. Finally one could go further and somehow merge the resulting array of matches into a single match result resembling what would normally occur with the //g option; and one could go even further and make up a new flag, e.g. //o that gets parsed to do overlap-matching, but this is getting a little crazy.

Regular Expression with multiple words (in any order) without repeat

I'm trying to execute a search of sorts (using JavaScript) on a list of strings. Each string in the list has multiple words.
A search query may also include multiple words, but the ordering of the words should not matter.
For example, on the string "This is a random string", the query "trin and is" should match. However, these terms cannot overlap. For example, "random random" as a query on the same string should not match.
I'm going to be sorting the results based on relevance, but I should have no problem doing that myself, I just can't figure out how to build up the regular expression(s). Any ideas?
The query trin and is becomes the following regular expression:
/trin.*(?:and.*is|is.*and)|and.*(?:trin.*is|is.*trin)|is.*(?:trin.*and|and.*trin)/
In other words, don't use regular expressions for this.
It probably isn't a good idea to do this with just a regular expression. A (pure, computer science) regular expression "can't count". The only "memory" it has at any point is the state of the DFA. To match multiple words in any order without repeat you'd need on the order of 2^n states. So probably a really horrible regex.
(Aside: I mention "pure, computer science" regular expressions because most implementations are actually an extension, and let you do things that are non-regular. I'm not aware of any extensions, certainly none in JavaScript, that make doing what you want to do any less painless with a single pattern.)
A better approach would be to keep a dictionary (Object, in JavaScript) that maps from words to counts. Initialize it to your set of words with the appropriate counts for each. You can use a regular expression to match words, and then for each word you find, decrement the corresponding entry in the dictionary. If the dictionary contains any non-0 values at the end, or if somewhere a long the way you try to over-decrement a value (or decrement one that doesn't exist), then you have a failed match.
I'm totally not sure if I get you right there, so I'll just post my suggestion for it.
var query = "trin and is",
target = "This is a random string",
search = { },
matches = 0;
query.split( /\s+/ ).forEach(function( word ) {
search[ word ] = true;
});
Object.keys( search ).forEach(function( word ) {
matches += +new RegExp( word ).test( target );
});
// do something useful with "matches" for the query, should be "3"
alert( matches );
So, the variable matches will contain the number of unique matches for the query. The first split-loop just makes sure that no "doubles" are counted since we would overwrite our search object. The second loop checks for the individuals words within the target string and uses the nifty + to cast the result (either true or false) into a number, hence, +1 on a match or +0.
I was looking for a solution to this issue and none of the solutions presented here was good enough, so this is what I came up with:
function filterMatch(itemStr, keyword){
var words = keyword.split(' '), i = 0, w, reg;
for(; w = words[i++] ;){
reg = new RegExp(w, 'ig');
if (reg.test(itemStr) === false) return false; // word not found
itemStr = itemStr.replace(reg, ''); // remove matched word from original string
}
return true;
}
// test
filterMatch('This is a random string', 'trin and is'); // true
filterMatch('This is a random string', 'trin not is'); // false

Regular Expression to match given word in last five words of pipe-delimited string

Say we have a string
blue|blue|green|blue|blue|yellow|yellow|blue|yellow|yellow|
And we want to figure out whether the word "yellow" occurs in the last 5 words of the string, specifically by returning a capture group containing these occurences if any.
Is there a way to do that with a regex?
Update: I'm feeding a regex engine some rules. For various reasons I'm trying to work with the engine rather than go outside it, which would be my last resort.
/\b(yellow)\|(?=(?:\w+\|){0,4}$)/g
This will return one hit for each yellow| that's followed by fewer than five words (per your definition of "word"). This assumes the sequence always ends with a pipe; if that's not the case, you might want to change it to:
/\b(yellow)(?=(?:\|\w+){0,4}\|?$)/g
EDIT (in response to comment): The definition of a "word" in this solution is arbitrary, and doesn't really correspond to real-world usage. To allow for hyphenated words like "real-world" you could use this:
/\b(yellow)\|(?=(?:\w+(?:-\w+)*\|){0,4}$)/g
...or, for this particular job, you could define a word as one or more of any characters except pipes:
/\b(yellow)\|(?=(?:[^|]+\|){0,4}$)/g
No need to use a Regex for such a simple thing.
Simply split on the pipe, and check with indexOf:
var group = 'blue|blue|green|blue|blue|yellow|yellow|blue|yellow|yellow';
if ( group.split('|').slice(-5).indexOf('yellow') == -1 ) {
alert('Not there :(');
} else {
alert('Found!!!');
}
Note: indexOf is not natively supported in IE < 9, but support for it can be added very easily.
Can't think of a way to do this with a single regular expression, but you can form one for each of the last five positions and sum the matches.
var string = "blue|blue|green|blue|blue|yellow|yellow|blue|yellow|yellow|";
var regexes = [];
regexes.push(/(yellow)\|[^|]+\|[^|]+\|[^|]+\|[^|]+\|$/);
regexes.push(/(yellow)\|[^|]+\|[^|]+\|[^|]+\|$/);
regexes.push(/(yellow)\|[^|]+\|[^|]+\|$/);
regexes.push(/(yellow)\|[^|]+\|$/);
regexes.push(/(yellow)\|$/);
var count = 0;
var regex;
while (regex = regexes.shift()) {
if (string.match(regex)) {
count++;
}
}
console.log(count);
Should find four matches.

Categories