In the following code:
"a sasas b".match(/sas/g) //returns ["sas"]
The string actually includes two sas strings, a [sas]as b and a sa[sas] b.
How can I modify the regex to match both?
Another example:
"aaaa".match(/aa/g); // actually includes [aa]aa, a[aa]a, aa[aa]
Please consider the issue in general, not just the instances above.
A pure RegEx solution is preferred.
If you want to match at least one such "merged" occurrence, then you could do something like:
"a sasas b".match(/s(as)+/g)
If you want to retrieve the matches as separate results, then you have a bit more work to do; this is not a case that regular expressions are designed to handle. The basic algorithm would be:
Attempt a match. If it was unsuccessful, stop.
Extract the match you are interested in and do whatever you want with it.
Take the substring of the original target string, starting from one character following the first character in your match.
Start over, using this substring as the new input.
(To be more efficient, you could match with an offset instead of using substrings; that technique is discussed in this question.)
For example, you would start with "a sasas b". After the first match, you have "sas". Taking the substring that starts one character after the match starts, we would have "asas b". The next match would find the "sas" here, and you would again repeat the process with "as b". This would fail to match, so you would be done.
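The steps above can be sketched roughly as follows. This is only an illustration of the substring variant: the function name findOverlapping is made up for this example, and it assumes the pattern is passed without the g flag so that match() reports the index of the match.

```javascript
// Hypothetical helper (name chosen for illustration only).
// Assumes `pattern` is a non-global RegExp, so match() returns a result with .index.
function findOverlapping(input, pattern) {
  const found = [];
  let rest = input;
  let match;
  while ((match = rest.match(pattern)) !== null) {
    found.push(match[0]);
    // An empty match at the very end of the string would otherwise repeat forever.
    if (match.index >= rest.length) break;
    // Continue one character after the start of this match.
    rest = rest.substring(match.index + 1);
  }
  return found;
}

console.log(findOverlapping("a sasas b", /sas/)); // [ 'sas', 'sas' ]
console.log(findOverlapping("aaaa", /aa/));       // [ 'aa', 'aa', 'aa' ]
```

Creating a new substring on every step is the simple version; matching with an offset avoids the copies.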
This significantly improved answer owes itself to @EliGassert.
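String.prototype.match_overlap = function(re)
{
    // Recompile non-global patterns as global so that exec() eventually returns null.
    if (!re.global)
        re = new RegExp(re.source,
                        'g' + (re.ignoreCase ? 'i' : '')
                            + (re.multiline  ? 'm' : ''));
    var matches = [];
    var result;
    while (result = re.exec(this)) {
        matches.push(result);
        // Resume one character after the start of this match to allow overlaps.
        re.lastIndex = result.index + 1;
    }
    return matches.length ? matches : null;
}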
@EliGassert points out that there is no need to walk through the entire string character by character; instead we can find a match anywhere (i.e. do without the anchor), and then continue one character after the index of the found match. While researching how to retrieve said index, I found that the re.lastIndex property, used by exec to keep track of where it should continue its search, is in fact settable! This works rather nicely with what we intend to do.
The only bit needing further explanation might be the beginning. In the absence of the g flag, exec may never return null (always returning its one match, if it exists), thus possibly going into an infinite loop. Since, however, match_overlap by design seeks multiple matches, we can safely recompile any non-global RegExp as a global RegExp, importing the i and m options as well if set.
Here is a new jsFiddle: http://jsfiddle.net/acheong87/h5MR5/.
document.write("<pre>");
document.write('sasas'.match_overlap(/sas/));
document.write("\n");
document.write('aaaa'.match_overlap(/aa/));
document.write("\n");
document.write('my1name2is3pilchard'.match_overlap(/[a-z]{2}[0-9][a-z]{2}/));
document.write("</pre>");
Output:
sas,sas
aa,aa,aa
my1na,me2is,is3pi
var match = "a sasas b".match(/s(?=as)/g);
for(var i =0; i != match.length; ++i)
alert(match[i]);
Going off of the comment by Q. Sheets and the response by cdhowie, I came up with the above solution: it consumes ONE character in the regular expression and does a lookahead for the rest of the match string. With these two pieces, you can construct all the positions and matching strings in your regular expression.
I wish there was an "inspect but don't consume" operator that you could use to actually include the rest of the matching (lookahead) string in the results, but there unfortunately isn't -- at least not in JS.
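One thing that comes close is a capturing group placed inside the lookahead: the overall match stays zero-width, but the group still records the text that was inspected. A minimal sketch, assuming an environment that supports String.prototype.matchAll (ES2020); the variable names are just for illustration:

```javascript
const input = "a sasas b";
const overlapping = [];

// The lookahead consumes nothing, so matches can overlap;
// the inner group (sas) captures the text that was looked at.
for (const m of input.matchAll(/(?=(sas))/g)) {
  overlapping.push({ index: m.index, text: m[1] });
}

console.log(overlapping);
// [ { index: 2, text: 'sas' }, { index: 4, text: 'sas' } ]
```

Since every match is empty, the matched text is read from m[1] rather than m[0].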
Here's a generic way to do it:
String.prototype.match_overlap = function(regexp)
{
    // Use .source rather than toString() so flags such as /sas/g do not leak into the pattern;
    // the non-capturing group keeps '^' applied to every alternative.
    var re = new RegExp('^(?:' + regexp.source + ')');
    var matches = [];
    var result;
    for (var i = 0; i < this.length; i++)
        if (result = re.exec(this.substr(i)))
            matches.push(result);
    return matches.length ? matches : null;
}
Usage:
var results = 'sasas'.match_overlap(/sas/);
Returns:
An array of (overlapping) matches, or null.
Example:
Here's a jsFiddle in which this:
document.write("<pre>");
document.write('sasas'.match_overlap(/sas/));
document.write("\n");
document.write('aaaa'.match_overlap(/aa/));
document.write("\n");
document.write('my1name2is3pilchard'.match_overlap(/[a-z]{2}[0-9][a-z]{2}/));
document.write("</pre>");
returns this:
sas,sas
aa,aa,aa
my1na,me2is,is3pi
Explanation:
To explain a little bit, we intend for the user to pass a RegExp object to this new function, match_overlap, as he or she would do normally with match. From this we want to create a new RegExp object anchored at the beginning (to prevent duplicate overlapped matches—this part probably won't make sense unless you encounter the issue yourself—don't worry about it). Then, we simply match against each substring of the subject string this and push the results to an array, which is returned if non-empty (otherwise returning null). Note that if the user passes in an expression that is already anchored, this is inherently wrong—at first I stripped anchors out, but then I realized I was making an assumption in the user's stead, which we should avoid. Finally one could go further and somehow merge the resulting array of matches into a single match result resembling what would normally occur with the //g option; and one could go even further and make up a new flag, e.g. //o that gets parsed to do overlap-matching, but this is getting a little crazy.
Okay, so I have this string "nesˈo:tkʰo:x", and I want to get the index of all the zero-width positions that don't occur after any instance of the character ˈ (the IPA primary stress symbol). So in this case, the expected output would be 0, 1, 2, and 3: the indices of the letters nes that occur before the one and only instance of ˈ, plus the ˈ itself.
I'm doing this with regex for reasons I'll get into in a bit. Regex101 confirms that /(?=.*?ˈ)/ should match all 4 of those zero-width positions with JS' regex flavor... but I can't actually get JS to return them.
A simple setup might look like this:
let teststring = "nesˈo:tkʰo:x";
let re = new RegExp("(?=.*?ˈ)", "g");
while (result = re.exec(teststring)) {
    console.log("Match found at "+result.index);
}
...except that this loops forever. It seems to get stuck on the first match, which I understand has something to do with how RegExp.exec is supposed to auto-increment RegExp.lastIndex for global regexes, or something. But I also can't make the regex not global, or it won't return all the matches for strings like this where more than one match is expected.
Okay, so what if I manually increment RegExp.lastIndex to prevent it from looping?
let teststring = "nesˈo:tkʰo:x";
let re = new RegExp("(?=.*?ˈ)", "g");
while (result = re.exec(teststring)) {
    if (result.index == re.lastIndex) {
        re.lastIndex++;
    } else {
        console.log("Match found at "+result.index);
    }
}
Now it... prints out nothing at all. Now, to be fair, if lastIndex starts at 0 by default, and the index of the first match is 0, I half expect that to be skipped over... but why isn't it at least giving me 1, 2 and 3 as matches?
Now, I can already hear the chorus of "you don't need regex for this, just do Array(teststring.indexOf("ˈ")).keys() or something to generate [0,1,2,3]". That may work for this specific example, but the actual use case is a parser function that's supposed to be a general solution for "for this input string, replace all instances of A with B, if condition C is true, unless condition D is true". Those conditions might be something like "if A is at the end of the string" or "if A is right next to another instance of A" or "if A is between 'n' and 't'". That kind of complicated string matching problem is why the parser creates and executes regexes on the fly and why regex is getting involved, and it does work for almost everything except this one annoying edge case, which I'd rather not have to refactor the entire mechanism of the parser to deal with if I don't have to.
Use String.prototype.matchAll() to get all the matches.
let teststring = "nesˈo:tkʰo:x";
let re = new RegExp("(?=.*?ˈ)", "g");
[...teststring.matchAll(re)].forEach(result =>
    console.log("Match found at " + result.index)
);
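For comparison, the manual exec() loop from the question can probably be kept with a small change. Every match of a lookahead-only pattern is empty, so result.index always equals re.lastIndex, which is why the else branch never ran. A sketch that records every match and only bumps lastIndex to get past an empty one (the indices array is an addition for illustration):

```javascript
const teststring = "nesˈo:tkʰo:x";
const re = /(?=.*?ˈ)/g;
const indices = [];
let result;

while ((result = re.exec(teststring)) !== null) {
  // Record the match first, whether or not it is empty.
  indices.push(result.index);
  // exec() does not advance past an empty match on its own, so do it here.
  if (result.index === re.lastIndex) {
    re.lastIndex++;
  }
}

console.log(indices); // [ 0, 1, 2, 3 ]
```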
.search() returns the index of the first match, and .exec() returns an array describing the match. Note that a lookahead (?=) isn't needed here; a standard capture group () suffices.
const str = "nesˈo:tkʰo:x";
const rgx = /(.*?ˈ)/;
let first = str.search(rgx);
let last = rgx.exec(str)[0].length - 1;
console.log('Indices: '+first+' - '+(first + last)+' \nLength: '+(last+1));
var a="how are you?";
In the above example I want to store the second word "are" into another variable in a single step.
I don't want to use something like below
var bigArray = a.split(" ");
var secondText = bigArray[1];
as we may need to store the entire paragraph into a big array and consume a lot of memory without any use.
I would like to know if there is some function which works as below
var secondText=specialFunction(a," ",1);
so that we will get the second substring when the paragraph is split by " "
Well, I would spend my time worrying about more important things than the size of some arrays.
Anyway, you could try using a regexp:
var secondText = (a.match(/ (\w+)/) || []) [1];
This reads as "find a space, then capture the following word".
The || [] part is meant to deal with the situation where there is no match (for example, no second word). In that case, the result will be [][1] which is undefined.
This finds only the second word. What about the more general case? We are not allowed to split the string on spaces, because that would create an array and the OP doesn't want that due to memory concerns, so we will instead build a dynamic regexp. To find the nth word, we want to skip over the first n-1 spaces. Or, to be more precise, we want to skip over the first word, some spaces, then the second word, then some more spaces, etc. So the regexp is
/(?:\w+ ){n}(\w+)/
  ^^                NO CAPTURING GROUP
    ^^^^            WORD FOLLOWED BY SPACE
         ^^^        N TIMES
            ^^^^^   CAPTURE FOLLOWING WORD
The ?: is to avoid this being treated as a capturing group. We build the regexp using
function make_nth_word_regexp(n) {
    n--; // skip n-1 words to reach the nth
    return new RegExp("(?:\\w+ ){" + n + "}(\\w+)");
}
Now look for your nth word:
var fifth_word = str.match(make_nth_word_regexp(5)) [1];
> "Hey there you".match(make_nth_word_regexp(3))[1]
< "you"
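An alternative to regex is to use substring(). Something like
var a="how are you";
var start = a.indexOf(" ") + 1; // first character after the first space
var end = a.indexOf(" ", start); // next space, or -1 if there is none
alert(a.substring(start, end === -1 ? a.length : end));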
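The same indexOf idea can be stretched into something like the specialFunction(a, " ", 1) the question asks for. This is just a sketch: nthField is a made-up name, the index is zero-based as in the question's example, and it assumes a non-empty separator.

```javascript
// Hypothetical helper: returns the n-th (zero-based) field of `text`
// without building an array of all fields. Assumes `separator` is non-empty.
function nthField(text, separator, n) {
  let start = 0;
  // Hop over n separators to find where the wanted field begins.
  for (let i = 0; i < n; i++) {
    const pos = text.indexOf(separator, start);
    if (pos === -1) return undefined; // fewer than n + 1 fields
    start = pos + separator.length;
  }
  const end = text.indexOf(separator, start);
  return end === -1 ? text.slice(start) : text.slice(start, end);
}

console.log(nthField("how are you?", " ", 1)); // "are"
console.log(nthField("how are you?", " ", 5)); // undefined
```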
I have the following example url: #/reports/12/expense/11.
I need to get the id just after reports, i.e. 12. What I am asking is the most suitable way to do this. I can search for reports in the url and get the content just after it, but if at some point I decide to change the url, I will have to change my algorithm.
What do you think is the best way here? Some code examples would also be very helpful.
It's hard to write code that is future-proof since it's hard to predict the crazy things we might do in the future!
However, if we assume that the id will always be the first string of consecutive digits in the URL then you could simply look for that:
function getReportId(url) {
    var match = url.match(/\d+/);
    return (match) ? Number(match[0]) : null;
}
getReportId('#/reports/12/expense/11'); // => 12
getReportId('/some/new/url/report/12'); // => 12
You should use a regular expression to find the number inside the string. Passing the regular expression to the string's .match() method will return an array containing the matches based on the regular expression. In this case, the item of the returned array that you're interested in will be at the index of 1, assuming that the number will always be after reports/:
var text = "#/reports/12/expense/11";
var id = text.match(/reports\/(\d+)/);
alert(id[1]);
\d+ here means that you're looking for one or more consecutive digits.
var text = "#/reports/12/expense/11";
var id = text.match("#/[a-zA-Z]*/([0-9]*)/[a-zA-Z]*/")
console.log(id[1])
Regex explanation:
#/ matches the characters #/ literally
[a-zA-Z]* - matches a word
/ matches the character / literally
1st Capturing group - ([0-9]*) - this matches a number.
[a-zA-Z]* - matches a word
/ matches the character / literally
Regular expressions can be tricky (and expensive), so if you can efficiently do the same thing without them, you usually should. Looking at your URL format, you would probably want to put at least a few constraints on it; otherwise the problem will be very complex. For instance, you probably want to assume the value will always appear directly after the key, so in your sample reports=12 and expense=11, but reports and expense could be switched (e.g. expense/11/reports/12) and you would get the same result.
I would just use string split:
var parts = url.split("/");
for (var i = 0; i < parts.length; i++) {
    if (parts[i] === "reports") {
        this.reportValue = parts[i+1];
        i++; // skip the value just consumed; the loop's own i++ moves to the next key
    } else if (parts[i] === "expense") {
        this.expenseValue = parts[i+1];
        i++;
    }
}
So this way your key/value parts can appear anywhere in the array
Note: you will also want to check that i+1 is in the range of the parts array. But that would just make this sample code ugly and it is pretty easy to add in. Depending on what values you are expecting (or not expecting) you might also want to check that values are numbers using isNaN
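For what it's worth, the range and isNaN checks mentioned in the note could look roughly like this. It is only a sketch: parseKeyValuePath and its keys parameter are invented for the example, and it assumes the values are meant to be numeric.

```javascript
// Hypothetical helper: collects numeric values that directly follow known keys.
function parseKeyValuePath(url, keys) {
  const parts = url.split("/");
  const values = {};
  // "i + 1 < parts.length" guarantees parts[i + 1] exists.
  for (let i = 0; i + 1 < parts.length; i++) {
    if (keys.includes(parts[i])) {
      const value = Number(parts[i + 1]);
      // Ignore empty or non-numeric segments instead of storing NaN.
      if (parts[i + 1] !== "" && !Number.isNaN(value)) {
        values[parts[i]] = value;
        i++; // skip the value segment
      }
    }
  }
  return values;
}

console.log(parseKeyValuePath("#/reports/12/expense/11", ["reports", "expense"]));
// { reports: 12, expense: 11 }
```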
I am writing a little app for Sharepoint. I am trying to extract some text from the middle of a field that is returned:
var ows_MetaInfo="1;#Subject:SW|NameOfADocument
vti_parservers:SR|23.0.0.6421
ContentTypeID:SW|0x0101001DB26Cf25E4F31488B7333256A77D2CA
vti_cachedtitle:SR|NameOfADocument
vti_title:SR|ATitleOfADocument
_Author:SW:|TheNameOfOurCompany
_Category:SW|
ContentType:SW|Document
vti_author::SR|mrwienerdog
_Comments:SW|This is very much the string I need extracted
vti_categories:VW|
vtiapprovallevel:SR|
vti_modifiedby:SR|mrwienerdog
vti_assignedto:SR|
Keywords:SW|Project Name
ContentType _Comments"
So, all I want returned is "This is very much the string I need extracted"
Do I need a regex and a string replace? How would you write the regex?
Yes, you can use a regular expression for this (this is the sort of thing they are good for). Assuming you always want the string after the pipe (|) on the line starting with "_Comments:SW|", here's how you can extract it:
var matchresult = ows_MetaInfo.match(/^_Comments:SW\|(.*)$/m);
var comment = (matchresult==null) ? "" : matchresult[1];
Note that the .match() method of the String object returns an array. The first (index 0) element will be the entire match (here, the entire match is the whole line, as we anchored it with ^ and $; note that adding the "m" after the regex makes this a multiline regex, allowing us to match the start and end of any line within the multi-line input), and the rest of the array holds the submatches that we capture using parentheses. Above we've captured the part of the line that you want, so that will be present in the second item in the array (index 1).
If there is no match ("_Comments:SW|" doesn't appear in ows_MetaInfo), then .match() will return null, which is why we test it before pulling out the comment.
If you need to adjust the regex for other scenarios, have a look at the Regex docs on Mozilla Dev Network: https://developer.mozilla.org/en/JavaScript/Guide/Regular_Expressions
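To see the pieces of the result array described above, here is a small self-contained sketch. sampleMetaInfo is a shortened stand-in for the real ows_MetaInfo value, built with join("\n") on the assumption that the field is newline-separated.

```javascript
// Shortened stand-in for the real field (assumption: lines are separated by "\n").
const sampleMetaInfo = [
  "vti_title:SR|ATitleOfADocument",
  "_Comments:SW|This is very much the string I need extracted",
  "vti_categories:VW|",
].join("\n");

const found = sampleMetaInfo.match(/^_Comments:SW\|(.*)$/m);

console.log(found[0]); // the whole matched line, including the "_Comments:SW|" prefix
console.log(found[1]); // only the captured text after the pipe

// Same null check as above, for input that has no _Comments line.
const missing = "vti_title:SR|ATitleOfADocument".match(/^_Comments:SW\|(.*)$/m);
const comment = (missing === null) ? "" : missing[1];
console.log(JSON.stringify(comment)); // ""
```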
You can use this code:
var match = ows_MetaInfo.match(/_Comments:SW\|([^\n]+)/);
if (match)
document.writeln(match[1]);
I'm far from competent with RegEx, so here is my RegEx-less solution. See comments for further detail.
var extractedText = ExtractText(ows_MetaInfo);
function ExtractText(arg) {
    // Use the pipe delimiter to turn the string into an array
    var aryValues = arg.split("|");
    // The comment text sits between "_Comments:SW|" and the next key, "vti_categories:VW",
    // so find the piece that contains that next key and strip the key off
    for (var i = 0; i < aryValues.length; i++) {
        if (aryValues[i].search("vti_categories:VW") != -1)
            return aryValues[i].replace("vti_categories:VW", "").trim();
    }
    return null;
}
Here's a working fiddle to demonstrate.
I'm trying to execute a search of sorts (using JavaScript) on a list of strings. Each string in the list has multiple words.
A search query may also include multiple words, but the ordering of the words should not matter.
For example, on the string "This is a random string", the query "trin and is" should match. However, these terms cannot overlap. For example, "random random" as a query on the same string should not match.
I'm going to be sorting the results based on relevance, but I should have no problem doing that myself, I just can't figure out how to build up the regular expression(s). Any ideas?
The query trin and is becomes the following regular expression:
/trin.*(?:and.*is|is.*and)|and.*(?:trin.*is|is.*trin)|is.*(?:trin.*and|and.*trin)/
In other words, don't use regular expressions for this.
It probably isn't a good idea to do this with just a regular expression. A (pure, computer science) regular expression "can't count". The only "memory" it has at any point is the state of the DFA. To match multiple words in any order without repeat you'd need on the order of 2^n states. So probably a really horrible regex.
(Aside: I mention "pure, computer science" regular expressions because most implementations are actually an extension, and let you do things that are non-regular. I'm not aware of any extensions, certainly none in JavaScript, that make doing what you want to do any less painful with a single pattern.)
A better approach would be to keep a dictionary (Object, in JavaScript) that maps from words to counts. Initialize it to your set of words with the appropriate counts for each. You can use a regular expression to match words, and then for each word you find, decrement the corresponding entry in the dictionary. If the dictionary contains any non-0 values at the end, or if somewhere along the way you try to over-decrement a value (or decrement one that doesn't exist), then you have a failed match.
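A rough sketch of the counting idea, with two deliberate simplifications that are assumptions of this example rather than part of the description above: the counts are built from the target string and decremented per query word, and matching is on whole words, case-insensitively (so a partial term like "trin" would not match "string"). The name matchesAllWords is made up.

```javascript
// Hypothetical helper: true if every query word can be paired with a
// distinct whole word of the target (no word is used twice).
function matchesAllWords(target, query) {
  const available = new Map();
  // Count how many times each word occurs in the target.
  for (const word of target.toLowerCase().match(/\w+/g) || []) {
    available.set(word, (available.get(word) || 0) + 1);
  }
  // Spend one occurrence per query word; running out means no match.
  for (const word of query.toLowerCase().match(/\w+/g) || []) {
    const remaining = available.get(word) || 0;
    if (remaining === 0) return false;
    available.set(word, remaining - 1);
  }
  return true;
}

console.log(matchesAllWords("This is a random string", "random is this")); // true
console.log(matchesAllWords("This is a random string", "random random"));  // false
```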
I'm totally not sure if I get you right there, so I'll just post my suggestion for it.
var query = "trin and is",
target = "This is a random string",
search = { },
matches = 0;
query.split( /\s+/ ).forEach(function( word ) {
search[ word ] = true;
});
Object.keys( search ).forEach(function( word ) {
matches += +new RegExp( word ).test( target );
});
// do something useful with "matches" for the query, should be "3"
alert( matches );
So, the variable matches will contain the number of unique matches for the query. The first split-loop just makes sure that no "doubles" are counted, since duplicate words simply overwrite the same key in our search object. The second loop checks for the individual words within the target string and uses the nifty + to cast the result (either true or false) into a number, hence +1 on a match or +0 otherwise.
I was looking for a solution to this issue and none of the solutions presented here was good enough, so this is what I came up with:
function filterMatch(itemStr, keyword){
    var words = keyword.split(' '), i = 0, w, reg;
    for(; w = words[i++] ;){
        reg = new RegExp(w, 'i'); // no 'g' flag: only the first occurrence should be consumed
        if (reg.test(itemStr) === false) return false; // word not found
        itemStr = itemStr.replace(reg, ''); // remove matched word from original string
    }
    return true;
}
// test
filterMatch('This is a random string', 'trin and is'); // true
filterMatch('This is a random string', 'trin not is'); // false