How to get all possible overlapping matches for a string - javascript

I'm working on the MIU system problem from "Gödel, Escher, Bach" chapter 2.
One of the rules states
Rule III: If III occurs in one of the strings in your collection, you may make a new string with U in place of III.
Which means that the string MIII can become MU, but for other, longer strings there may be multiple possibilities [matches in brackets]:
MIIII could yield
M[III]I >> MUI
MI[III] >> MIU
MUIIIUIIIU could yield
MU[III]UIIIU >> MUUUIIIU
MUIIIU[III]U >> MUIIIUUU
MUIIIIU could yield
MU[III]IU >> MUUIU
MUI[III]U >> MUIUU
Clearly regular expressions such as /(.*)III(.*)/ are helpful, but I can't seem to get them to generate every possible match, just the first one it happens to find.
Is there a way to generate every possible match?
(Note, I can think of ways to do this entirely manually, but I am hoping there is a better way using the built in tools, regex or otherwise)
(Edited to clarify overlapping needs.)

Here's the regex you need: /III/g - simple enough, right? Now here's how you use it:
var text = "MUIIIUIIIU", find = "III", replace = "U",
regex = new RegExp(find,"g"), matches = [], match;
while(match = regex.exec(text)) {
matches.push(match);
regex.lastIndex = match.index+1;
}
That regex.lastIndex... line overrides the usual regex behaviour of not matching results that overlap. Also, I'm using the RegExp constructor to make this more flexible. You could even build it into a function this way.
Now you have an array of match objects, you can do this:
matches.forEach(function(m) { // older browsers need a shim or old-fashioned for loop
console.log(text.substr(0,m.index)+replace+text.substr(m.index+find.length));
});
EDIT: Here is a JSFiddle demonstrating the above code.
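As a rough sketch of the "build it into a function" idea above, something like the following might work. The name allReplacements and the escaping step are my own assumptions, not part of the answer; the escaping is only there so that a search string containing regex metacharacters is still matched literally.

```javascript
// Hypothetical helper: returns every string obtainable by replacing exactly
// one occurrence of `find` in `text` with `replacement`, overlaps included.
function allReplacements(text, find, replacement) {
  // Escape regex metacharacters so `find` is matched literally.
  var escaped = find.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  var regex = new RegExp(escaped, "g");
  var results = [];
  var match;
  while ((match = regex.exec(text)) !== null) {
    results.push(
      text.slice(0, match.index) + replacement +
      text.slice(match.index + find.length)
    );
    // Resume one character after the start of this match, so that
    // overlapping occurrences are not skipped.
    regex.lastIndex = match.index + 1;
  }
  return results;
}

console.log(allReplacements("MIIII", "III", "U"));   // ["MUI", "MIU"]
console.log(allReplacements("MUIIIIU", "III", "U")); // ["MUUIU", "MUIUU"]
```

Returning the new strings directly (rather than the match objects) fits Rule III, where each match produces one candidate string.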

Sometimes regexes are overkill. In your case a simple indexOf might be fine too!
Here is, admittedly, a hack, but you can transform it into pretty, reusable code on your own:
var s = "MIIIIIUIUIIIUUIIUIIIIIU";
var results = [];
for (var i = 0; true; i += 1) {
i = s.indexOf("III", i);
if (i === -1) {
break;
}
results.push(i);
}
console.log("Match positions: " + JSON.stringify(results));
It takes care of overlaps just fine, and at least to me, the indexOf just looks simpler.
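If you want the hack above as reusable code, a minimal sketch could look like this. The function name findOverlappingIndexes and the empty-needle guard are my additions, not part of the original snippet.

```javascript
// Hypothetical helper: returns the start index of every occurrence of
// `needle` in `haystack`, including occurrences that overlap.
function findOverlappingIndexes(haystack, needle) {
  var positions = [];
  // An empty needle "matches" at every index and would loop forever below.
  if (needle.length === 0) {
    return positions;
  }
  var i = haystack.indexOf(needle);
  while (i !== -1) {
    positions.push(i);
    // Advance by one character only, so overlapping matches are found.
    i = haystack.indexOf(needle, i + 1);
  }
  return positions;
}

console.log(findOverlappingIndexes("MIIIIIUIUIIIUUIIUIIIIIU", "III"));
// [1, 2, 3, 9, 17, 18, 19]
```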

Related

How to make indexOf only match 'hi' as a match and not 'hirandomstuffhere'?

Basically, I was playing around with a Steam bot some time ago and made it auto-reply when you said things in an array, i.e. a 'hello-triggers' array, which would contain things like "hi", "hello" and such. Whenever it received a message, it would check for matches using indexOf(), and everything worked fine until I noticed it would treat 'hiasodkaso' or 'hidemyass' as a "hi" trigger.
So it would match anything that contained the word even if it was in the middle of a word.
How would I go about making indexOf only notice it if it's the exact word, and not something else in the same word?
I do not have the script that I use but I will make an example that is pretty much like it:
var hiTriggers = ['hi', 'hello', 'yo'];
// here goes the receiving message function and what not, then:
for(var i = 0; i < hiTriggers.length; i++) {
if(message.indexOf(hiTriggers[i]) >= 0) {
bot.sendMessage(SteamID, randomHelloMsg[Math stuff here blabla]); // randomHelloMsg is already defined
}
}
Regex wouldn't be used for this, right? As it is to be used for expressions or whatever. (my English isn't awesome, ikr)
Thanks in advance. If I wasn't clear enough on something, please let me know and I'll edit/formulate it in another way! :)
You can extend prototype:
String.prototype.regexIndexOf = function(regex, startpos) {
var indexOf = this.substring(startpos || 0).search(regex);
return (indexOf >= 0) ? (indexOf + (startpos || 0)) : indexOf;
}
and do:
var foo = "hia hi hello";
foo.regexIndexOf(/hi\b/);
Or if you don't want to extend the string object:
foo.substr(i).search(/hi\b/);
Both examples were taken from the top answers to "Is there a version of JavaScript's String.indexOf() that allows for regular expressions?"
"Regex wouldn't be used for this, right? As it is to be used for expressions or whatever. (my English isn't awesome, ikr)"
Actually, regex is for any old pattern matching. It's absolutely useful for this.
fmsf's answer should work for what you're trying to do; however, extending the prototypes of native objects is generally frowned upon, AFAIK. You can easily break libraries by doing so, so I'd avoid it when possible. In this case you could use his regexIndexOf function by itself or in concert with something like:
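//takes a string and a word and searches for the word, assuming a standalone
//regexIndexOf(str, regex, startpos) version of fmsf's function is defined
function regexIndexWord(str, word){
return regexIndexOf(str, new RegExp(word+"\\b"));
}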
Which would let you search based on your array of words without having to add the special symbols to each one.
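To tie this back to the trigger array in the question, here is a hedged sketch that avoids touching String.prototype. The names escapeRegExp, containsWord and isGreeting are hypothetical, and two choices are my own assumptions: a \b on both sides of the word (so "ohhi" is rejected as well as "hidemyass") and case-insensitive matching.

```javascript
// Hypothetical helper: escapes regex metacharacters in a literal string.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True if `word` appears in `message` as a whole word.
// Note the doubled backslashes: "\b" inside a plain string is a backspace.
function containsWord(message, word) {
  var regex = new RegExp("\\b" + escapeRegExp(word) + "\\b", "i");
  return regex.test(message);
}

var hiTriggers = ["hi", "hello", "yo"];

// True if any trigger occurs in the message as a whole word.
function isGreeting(message) {
  return hiTriggers.some(function (trigger) {
    return containsWord(message, trigger);
  });
}

console.log(isGreeting("hi there"));   // true
console.log(isGreeting("hidemyass"));  // false
console.log(isGreeting("oh, Hello!")); // true
```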

How to delete an item from a serialized list?

I make a serialized list (with JQuery) and then want to delete a Parameter/Value pair from the list. What's the best way to do this? My code seems kinda clunky to take care of edge conditions that the Parameter/Value pair might be first, last, or in the middle of the list.
function serializeDeleteItem(strSerialize, strParamName)
{
// Delete Parameter/Value pair from Serialized list
var strRegEx;
var rExp;
strRegEx = "((^[?&]?" + strParamName + "\=[^\&]*[&]?))|([&]" + strParamName + "\=[^\&]*)|(" + strParamName + "\=[^\&]*[&])";
rExp = new RegExp(strRegEx, "i");
strSerialize = strSerialize.replace(rExp, "");
return strSerialize;
}
Examples / Test rig at http://jsfiddle.net/7Awzw/
EDIT: Modified the test rig to preserve any leading "?" or "&" so that function could be used with URL Query String or fragment of serialized string
See: http://jsfiddle.net/7Awzw/5/
This version is longer than yours, but imho it's more maintainable. It will find and remove the serialized parameter regardless of where it is in the list.
Notes:
To avoid problems with removing items in the middle of an array, we iterate in reverse.
For exact matching of parameter names, we expect them to start at the beginning of the split string, and to terminate with =.
Assuming there is just one instance of the given param, we break once it's found. If there may be more, just remove that line.
Code
function serializeDeleteItem(strSerialize, strParamName)
{
var arrSerialize = strSerialize.split("&");
var i = arrSerialize.length;
while (i--) {
if (arrSerialize[i].indexOf(strParamName+"=") == 0) {
arrSerialize.splice(i,1);
break; // Found the one and only, we're outta here.
}
}
return arrSerialize.join("&");
}
This fails a few of your tests - the ones with serialized strings starting with '?' or '&'. If you feel those are valid, then you could do this at the start of the function, and all tests will pass:
if (strSerialize.length && (strSerialize[0] == '?' || strSerialize[0] == '&'))
strSerialize = strSerialize.slice(1);
Performance Comparison
I've put together a test in jsperf to compare the regex approach with this string method. It's reporting that the regex solution is 49% slower than strings, in IE10 on 32-bit Win7.
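For the case mentioned in the notes where a parameter may appear more than once, a filter-based variant of the string approach might look like the sketch below. The name removeParam is hypothetical, and keeping the leading "?" or "&" is my reading of the question's edit, not something this answer's code does.

```javascript
// Hypothetical variant: removes every "name=value" pair for `paramName`,
// keeping a leading "?" or "&" if the serialized string has one.
function removeParam(serialized, paramName) {
  var prefix = "";
  var first = serialized.charAt(0);
  if (first === "?" || first === "&") {
    prefix = first;
    serialized = serialized.slice(1);
  }
  var kept = serialized.split("&").filter(function (pair) {
    // Exact name match: the pair must start with "name=".
    return pair.indexOf(paramName + "=") !== 0;
  });
  return prefix + kept.join("&");
}

console.log(removeParam("a=1&b=2&c=3", "b"));  // "a=1&c=3"
console.log(removeParam("?a=1&b=2", "a"));     // "?b=2"
console.log(removeParam("b=1&a=2&b=3", "b"));  // "a=2"
```

Unlike the regex in the question, this comparison is case-sensitive.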

How to search DOM to count the number of $ symbol found on a product page?

I am looking for the best possible way to find how many $ symbols are on a page. Is there a better method than reading document.body.innerHTML and counting how many $ signs it contains?
Your question can be split into two parts:
How can we get the webpage text content without HTML tags?
How can we count the '$' symbols in that text? We can generalize this second question a bit: how can we find the number of occurrences of one string in another string?
And the 'best possible way to do this':
Amaan got the idea right of finding the text, but let's take it further.
var text = document.body.innerText || document.body.textContent;
Adding textContent to the code helps us cover more browsers, since innerText is not supported by all of them.
The second part is a bit trickier. It all depends on the number of '$' symbol occurrences on the page.
For example, if we know for sure, that there is at least one occurrence of the symbol on the page we would use this code:
text.match(/\$/g).length;
Which performs a global regular expression match on the given string and counts the length of the returned array. It's pretty fast and concise.
On the other hand, if we're not sure if the symbol appears on the page at least once, we should modify the code to look like this:
var match = text.match(/\$/g); // null when there is no match
if (match) {
match.length;
}
This just checks the value returned by the match function and if it's null, does nothing.
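A compact way to write the same null check is to fall back to an empty array. This is only a sketch; the function name countDollarSigns is my own, and it takes the text as an argument so it can be tried outside a browser.

```javascript
// Hypothetical helper: counts "$" characters in a string.
function countDollarSigns(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return (text.match(/\$/g) || []).length;
}

console.log(countDollarSigns("$5 now, $10 later")); // 2
console.log(countDollarSigns("no prices here"));    // 0

// In a browser, you would pass the page text, e.g.:
// countDollarSigns(document.body.innerText || document.body.textContent);
```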
I would recommend using the third option only when there is a large occurrence of the symbols in the page or you're going to perform the search many many times. This is a custom function (taken from here) to count the occurrence of the specified string in another string. It performs better than the other two, but is longer and harder to understand.
var occurrences = function(string, subString, allowOverlapping) {
string += "";
subString += "";
if (subString.length <= 0) return string.length + 1;
var n = 0,
pos = 0;
var step = (allowOverlapping) ? (1) : (subString.length);
while (true) {
pos = string.indexOf(subString, pos);
if (pos >= 0) {
n++;
pos += step;
} else break;
}
return (n);
};
occurrences(text, '$');
I'm also including a little jsfiddle 'benchmark' so you can compare these three different approaches yourself.
Also: No, there isn't a better way of doing this than just getting the body text and counting how many '$' symbols there are.
You should probably use document.body.innerText or document.body.textContent so that the HTML markup doesn't give you false positives.
Something like this should work:
document.body.innerText.match(/\$/g).length;
An alternate way I can think of, would be to use window.find like this:
var len = 0;
while(window.find('$') === true){
len++;
}
(This may be unreliable because it depends on where the user clicked last. It will work fine if you do it onload, before any user interaction.)

Which is the correct way to ensure you end up with an Array greater than 1 after Splitting?

I am currently doing a big project (by big I mean, many processes) where every millisecond I save means a lot (on the long run), so I want to make sure I am doing it the right way.
So, what is the best way to ensure you will have an array greater than 1?
a) use indexOf(), then if result is different than -1, split()
b) split (regardless if characters exist), then do stuff ONLY if the
array.length is greater than 1
c) another not listed above
Using jsPerf, it appears that omitting .indexOf() is roughly 23% more efficient than including it over 500,000 iterations (11.67 vs. 8.95 operations per second):
Without indexOf():
var str = "test";
for (var i = 0; i < 500000; i++) {
var test = str.split('.');
}
With .indexOf():
var str = "test";
for (var i = 0; i < 500000; i++) {
if (str.indexOf('.')) {
var test = str.split('.');
} else {
var test = str;
}
}
http://jsperf.com/split-and-split-indexof
EDIT
Hmm... If the following line is:
if (str.indexOf('.') > -1)
http://jsperf.com/split-and-split-indexof-with-indexof-check
Or any other comparison, it's seemingly quite a bit faster (by about 69%).
The only reason I can think this is the case is that running .split() on every variable will perform two functions on each value (find, then separate), instead of just one when necessary. Note, this last part is just a guess.
We can see that even when there is something to split, the best results come from doing the indexOf test against a value. Still, the improvement is smaller than in the cases where 100% of items don't need a split. Thus, as more of your items need to be split, the test returns less benefit (as would be expected). So it really depends on the use case, since the extra code takes up memory and uses resources.
http://jsperf.com/split-and-split-indexof/2
(b) is obviously more efficient than (a) because split uses the same logic as indexOf, and that logic will not need to be repeated if there are indeed 2 or more elements. I cannot think of a more efficient way.
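For reference, option (b) can be sketched as below. The name splitIfMultiple and the choice to return null when there is nothing to split are assumptions of mine; which option is actually faster depends on your data, as the benchmarks above suggest.

```javascript
// Hypothetical helper for option (b): always split, then only use the
// result when it actually produced more than one element.
function splitIfMultiple(str, separator) {
  var parts = str.split(separator);
  return parts.length > 1 ? parts : null;
}

console.log(splitIfMultiple("a.b.c", ".")); // ["a", "b", "c"]
console.log(splitIfMultiple("test", "."));  // null
```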

What's the best/fastest way to find the number of tabs to start a string?

I want to find the number of tabs at the beginning of a string (and of course I want it to be fast running code ;) ). This is my idea, but not sure if this is the best/fastest choice:
//The regular expression
var findBegTabs = /(^\t+)/g;
//This string has 3 tabs and 2 spaces: "<tab><tab><space>something<space><tab>"
var str = "\t\t something \t";
//Look for the tabs at the beginning
var match = findBegTabs.exec( str );
//We found...
var numOfTabs = ( match ) ? match[ 0 ].length : 0;
Another possibility is to use a loop and charAt:
//This string has 3 tabs and 2 spaces: "<tab><tab><space>something<space><tab>"
var str = "\t\t something \t";
var numOfTabs = 0;
var start = 0;
//Loop and count number of tabs at beg
while ( str.charAt( start++ ) == "\t" ) numOfTabs++;
In general if you can calculate the data by simply iterating through the string and doing a character check at every index, this will be faster than a regex/regular expression which must build up a more complex searching engine. I encourage you to profile this but I think you'll find the straight search is faster.
Note: Your search should use === instead of == here as you don't need to introduce conversions in the equality check.
function numberOfTabs(text) {
var count = 0;
var index = 0;
while (text.charAt(index++) === "\t") {
count++;
}
return count;
}
Try using a profiler (such as jsPerf or one of the many available backend profilers) to create and run benchmarks on your target systems (the browsers and/or interpreters you plan to support for your software).
It's useful to reason about which solution will perform best based on your expected data and target system(s); however, you may sometimes be surprised by which solution actually performs fastest, especially with regard to big-oh analysis and typical data sets.
In your specific case, iterating over characters in the string will likely be faster than regular expression operations.
One-liner (if you find smallest is best):
"\t\tsomething".split(/[^\t]/)[0].length;
i.e. splitting by all non-tab characters, then fetching the first element and obtaining its length.
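To compare the approaches in this thread on your own data, it may help to wrap each one in a function first. The sketch below uses hypothetical names (leadingTabsRegex, leadingTabsLoop); the regex variant uses /^\t*/, which always matches (possibly an empty string), so no null check is needed.

```javascript
// Regex variant: /^\t*/ matches zero or more tabs at the start of the string.
function leadingTabsRegex(text) {
  return text.match(/^\t*/)[0].length;
}

// Loop variant: charAt() returns "" past the end, so the loop always stops.
function leadingTabsLoop(text) {
  var count = 0;
  while (text.charAt(count) === "\t") {
    count++;
  }
  return count;
}

var sample = "\t\t something \t";
console.log(leadingTabsRegex(sample)); // 2
console.log(leadingTabsLoop(sample));  // 2
```

Both functions can then be dropped into whichever profiler you use.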
