Javascript regexes - Lookbehind and lookahead at the same time


I am trying to create a regex in JavaScript that matches the character b if it is not preceded or followed by the character a.
Apparently, JavaScript regexes don't have negative lookbehind readily implemented, making the task difficult. I came up with the following one, but it does not work.
"ddabdd".replace(new RegExp('(?:(?![a]b(?![a])))*b(?![a])', 'i'),"c");
That is the best I could come up with. Here, the b should not match because it is preceded by an a, but it does match.
Some examples of what I want to achieve:
"ddbdd" matches the b
"b" matches the b
"ddb" matches the b
"bdd" matches the b
"ddabdd" or "ddbadd" does not match the b

It seems you could use a capturing group containing either the beginning of string anchor or a negated character class preceding "b" while using Negative Lookahead to assert that "a" does not follow as well. Then you would simply reference $1 inside of the replacement call along with the rest of your replacement string.
var s = 'ddbdd b ddb bdd ddabdd ddabdd ddbadd';
var r = s.replace(/(^|[^a])b(?!a)/gi, '$1c');
console.log(r); //=> "ddcdd c ddc cdd ddabdd ddabdd ddbadd"
Edit: As @nhahtdh pointed out in a comment, the pattern above misses consecutive characters (the second b in "bb"), so you may consider a callback instead.
var s = 'ddbdd b ddb bdd ddabdd ddabdd ddbadd sdfbbfds';
var r = s.replace(/(a)?b(?!a)/gi, function($0, $1) {
return $1 ? $0 : 'c';
});
console.log(r); //=> "ddcdd c ddc cdd ddabdd ddabdd ddbadd sdfccfds"

There is no way to emulate the behavior of look-behind with regex alone in this case, since there may be consecutive b in the string, which requires the zero-width property of a look-behind to check the immediately preceding character.
Since the condition in the look-behind is quite simple, you can check for it in the replacement function:
inputString.replace(/b(?!a)/gi, function ($0, idx, str) {
if (idx == 0 || !/a/i.test(str[idx - 1])) { // Equivalent to (?<!a)
return 'c';
} else {
return $0; // $0 is the text matched by /b(?!a)/
}
});
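Engines that implement ECMAScript 2018 or later accept lookbehind assertions, so where older runtimes do not need to be supported the condition can be written directly in the pattern. A minimal sketch, assuming such an engine; the function name replaceIsolatedB is made up for illustration:

```javascript
// Assumes an ES2018+ engine with lookbehind support.
// replaceIsolatedB is a hypothetical helper name, not part of any library.
function replaceIsolatedB(input) {
  // (?<!a) : the b is not preceded by "a"
  // (?!a)  : the b is not followed by "a"
  // Both assertions are zero-width, so consecutive b's are handled.
  return input.replace(/(?<!a)b(?!a)/gi, 'c');
}

console.log(replaceIsolatedB('ddbdd b ddb bdd ddabdd ddbadd sdfbbfds'));
```

On older engines the pattern is a syntax error, so the callback approach above remains the portable option.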

What you are really trying to do here is write a parser for a tiny language. Regexp is good at some parsing tasks, but bad at many (and JS regexps are somewhat underpowered). You may be able to find a regexp to work in a particular situation, then when your syntax rules change, the regexp may be difficult or impossible to change to reflect that. The simple program below has the advantage that it is readable and maintainable. It does exactly what it says.
function find_bs(str) {
var indexes = [];
for (var i = 0; i < str.length; i++) {
if (str[i] === 'b' && str[i-1] !== 'a' && str[i+1] !== 'a')
indexes.push(i);
}
return indexes;
}
Using a regexp
If you absolutely insist on using a regexp, you can use the trick of resetting the lastIndex property on the regexp in conjunction with RegExp.exec:
function find_bs(str) {
var indexes = [];
var regexp = /[^a]b[^a]/g; // a b with a non-"a" character on both sides
var matches;
while (matches = regexp.exec(str)) {
indexes.push(matches.index + 1);
regexp.lastIndex -= 2;
}
return indexes;
}
You will need to tweak the logic to handle the beginning and end of the string.
How this works
We find the entire xbx string using the regexp. The index of b will be one plus the index of the match, so we record this. Before we do the next match, we reset lastIndex, which governs the starting point from which the search will continue, back to the b, so it serves as the first character of any following potential match.
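One way to handle the beginning and end of the string, as mentioned above, is to pad the input with a character that is not "a" before searching. The following is only a sketch under that assumption; the name findIsolatedBs and the choice of a space as padding are illustrative:

```javascript
// Hypothetical variant of find_bs that also handles a b at either edge.
function findIsolatedBs(str) {
  // Pad both ends with a non-"a" character so [^a] always has something to match.
  var padded = ' ' + str + ' ';
  var regexp = /[^a]b[^a]/g;
  var indexes = [];
  var m;
  while ((m = regexp.exec(padded)) !== null) {
    // The b sits at m.index + 1 in the padded string, i.e. m.index in str.
    indexes.push(m.index);
    // Resume at the b itself so it can serve as the [^a] before the next b.
    regexp.lastIndex = m.index + 1;
  }
  return indexes;
}

console.log(findIsolatedBs('bdd'));    // index of the leading b
console.log(findIsolatedBs('ddabdd')); // no isolated b
```

Like the original, this comparison is case-sensitive.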

Related

ES5 Regex test find a set of characters in any order

Say I want to find a word in the dictionary that contain a given set of letters in any order. For my example, my regex character set contains [jock]. I have been trying to create a function that creates a regex test like this:
/*
** Function: Take an array of 2 strings. Return TRUE if all letters
** of 2nd string are within 1st string, letters can be any order,
** case-sensitivity does not matter.
**
** @param {array} arr where arr[0] is the word to check, arr[1] the string of characters
** @returns {boolean}
*/
function matchAllLetters(arr) {
'use strict';
// flags: i - case insensitive, g - global match
var pattern = new RegExp("[" + arr[0] + "+]", "ig");
return ( pattern.test( arr[1] ) );
}
matchAllLetters(['jackal','jock']); //true, s/b false
I am aware there are ways other than using RegExp to solve this, but I would like to solve this problem with RegExp so that I can compare RegExp against another approach with JSPerf.

The problem is that the regex returns true as soon as it matches a single character of the set (you can check this on regex101). To match the whole string, anchors and a quantifier must be added, as suggested by @Ryan.
^[jackal]+$
Note that the second a is redundant, because a already appears in the character set.

You could use a regular expression with
^ start position
[] a character class, which may be empty
* a greedy quantifier matching zero to unlimited times (needed for an empty character class)
$ end position of the string
'use strict';
function matchAllLetters(arr) {
var pattern = new RegExp("^[" + arr[0] + "]*$", "ig");
return pattern.test(arr[1]);
}
console.log(matchAllLetters(['jackal', 'jock']));
console.log(matchAllLetters(['', '']));
console.log(matchAllLetters(['jackal', 'jack']));

Nahuel's answer steered me in the right direction. In my earlier attempt I tried to pass the string of characters 'jock' as a character pattern for the regex to match against a string. The letters 'j', 'o', 'c', 'k' could be found in any order, as long as all of them were present in the string. As Nahuel pointed out, I couldn't just take the 'jock' string and surround it with a character class [], because the regex returns TRUE even if only one of the characters in the set matches and the rest don't. A Java regex discussion on Stack Overflow helped me rewrite my function to build a 'positive lookahead' regex test.
function matchAllLetters(arr) {
var len= arr[1].length;
var lookahead = ""; // the lookaheads are accumulated as a string
for (var i = 0; i < len; i++) {
lookahead += "(?=.*" + arr[1].charAt(i) + ")";
}
var pattern = new RegExp(lookahead, "i");
return ( pattern.test( arr[0] ) );
}
The function works as expected. But as mentioned, the entire purpose was to see
if Regex was faster than other methods at pattern matching in this manner. For those interested you can find my JSPerf tests here
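When the letters come from arbitrary input, characters such as "." or "(" would be interpreted as regex syntax inside the lookaheads. The sketch below escapes each character first; escapeRegExp and containsAllLetters are illustrative names, not part of the original answer, and the argument order (word first, letters second) is an assumption:

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Returns true if every character of `letters` occurs somewhere in `word`,
// in any order, ignoring case. A repeated letter only has to occur once.
function containsAllLetters(word, letters) {
  var lookaheads = '';
  for (var i = 0; i < letters.length; i++) {
    // [\s\S]* is used instead of .* so that line breaks do not stop the search.
    lookaheads += '(?=[\\s\\S]*' + escapeRegExp(letters.charAt(i)) + ')';
  }
  return new RegExp('^' + lookaheads, 'i').test(word);
}

console.log(containsAllLetters('jackal', 'jock'));
console.log(containsAllLetters('jockey', 'jock'));
```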

Regular expression that remove second occurrence of a character in a string

I'm trying to write a JavaScript function that removes any second occurrence of a character using the regular expression. Here is my function
var removeSecondOccurrence = function(string) {
return string.replace(/(.*)\1/gi, '');
}
It only removes consecutive occurrences. I'd like it to remove non-consecutive ones as well. For example, papirana should become pairn.
Please help

A non-regexp solution:
"papirana".split("").filter(function(x, n, self) { return self.indexOf(x) == n }).join("")
Regexp code is complicated, because JS doesn't support lookbehinds:
str = "papirana";
re = /(.)(.*?)\1/;
while(str.match(re)) str = str.replace(re, "$1$2")
or a variation of the first method:
"papirana".replace(/./g, function(a, n, str) { return str.indexOf(a) == n ? a : "" })

Using a zero-width lookahead assertion you can do something similar
"papirana".replace(/(.)(?=.*\1)/g, "")
returns
"pirna"
The letters are of course the same, just in a different order.
Passing the reverse of the string and using the reverse of the result you can get what you're asking for.

This is how you would do it with a loop:
var removeSecondOccurrence = function(string) {
var results = "";
for (var i = 0; i < string.length; i++)
if (results.indexOf(string.charAt(i)) === -1) // character not seen yet
results += string.charAt(i);
return results;
}
Basically: for each character in the input, if you haven't seen that character already, add it to the results. Clear and readable, at least.

What Michelle said.
In fact, I strongly suspect it cannot be done using regular expressions. Or rather, you can if you reverse the string, remove all but the first occurrences, then reverse again, but it's a dirty trick and what Michelle suggests is way better (and probably faster).
If you're still hot on regular expressions...
"papirana".
split("").
reverse().
join("").
replace(/(.)(?=.*\1)/g, '').
split("").
reverse().
join("")
// => "pairn"
The reason why you can't find all but the first occurrence without all the flippage is twofold:
JavaScript does not have lookbehinds, only lookaheads
Even if it did, I don't think any regexp flavour allows variable-length lookbehinds

JavaScript: How can I remove any words containing (or directly preceding) capital letters, numbers, or commas, from a string?

I'm trying to write the code so it removes the "bad" words from the string (the text).
The word is "bad" if it has comma or any special sign thereafter. The word is not "bad" if it contains only a to z (small letters).
So, the result I'm trying to achieve is:
<script>
String.prototype.azwords = function() {
return this.replace(/[^a-z]+/g, "0");
}
var res = "good Remove remove1 remove, ### rem0ve? RemoVE gooood remove.".azwords();//should be "good gooood"
//Remove has a capital letter
//remove1 has 1
//remove, has comma
//### has three #
//rem0ve? has 0 and ?
//RemoVE has R and V and E
//remove. has .
alert(res);//should alert "good gooood"
</script>

Try this:
return this.replace(/(^|\s+)[a-z]*[^a-z\s]\S*(?!\S)/g, "");
It tries to match a word (that is surrounded by whitespaces / string ends) and contains any (non-whitespace) character but at least one that is not a-z. However, this is quite complicated and unmaintainable. Maybe you should try a more functional approach:
return this.split(/\s+/).filter(function(word) {
return word && !/[^a-z]/.test(word);
}).join(" ");

Okay, first off you probably want to use the word boundary escape \b in your regex. Also, it's a bit tricky if you match the bad words, because a bad word might contain lowercase chars, so your current regex will exclude anything which does have lowercase letters.
I'd be tempted to pick out the good words and put them in a new string. It's a much easier regex.
/\b[a-z]+\b/g
NB: I'm not totally sure that it'll work for the first and last words in the string so you might need to account for that as well. http://www.regextester.com/ is exceptionally useful.
EDIT: as you want punctuation after the word to be 'bad', this will actually do what I was suggesting
(^|\s)[a-z]+(\s|$)
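Because that pattern consumes the whitespace after each word, two good words in a row would compete for the same space. A possible sketch of the "pick out the good words" idea that checks the trailing boundary with a lookahead instead; the function name keepLowercaseWords is hypothetical:

```javascript
// Collect words made only of a-z that are delimited by whitespace or string ends.
function keepLowercaseWords(text) {
  // (?:^|\s)   : start of string or one whitespace character before the word
  // ([a-z]+)   : the word itself, captured
  // (?=\s|$)   : whitespace or end of string must follow, but is not consumed
  var re = /(?:^|\s)([a-z]+)(?=\s|$)/g;
  var words = [];
  var m;
  while ((m = re.exec(text)) !== null) {
    words.push(m[1]);
  }
  return words.join(' ');
}

console.log(keepLowercaseWords('good Remove remove1 remove, ### rem0ve? RemoVE gooood remove.'));
```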

Firstly I wouldn't recommend changing the prototype of String (or of any native object) if you can avoid because you leave yourself open to conflicts with other code that might define the same property in different ways. Much better to put custom methods like this on a namespaced object, though I'm sure some will disagree.
Second, is there any need to use RegEx completely? (Genuine question; not trying to be facetious.)
Here is an example of the function with plain old JS using a little bit of RegEx here and there. Easier to comment, debug, and reuse.
Here is the code:
var azwords = function(str) {
var arr = str.split(/\s+/),
len = arr.length,
i = 0,
res = "";
for (i; i < len; i += 1) {
if (!(arr[i].match(/[^a-z]/))) {
res += (!res) ? arr[i] : " " + arr[i];
}
}
return res;
}
var res = "good Remove remove1 remove, ### rem0ve? RemoVE gooood remove."; //should be "good gooood"
//Remove has a capital letter
//remove1 has 1
//remove, has comma
//### has three #
//rem0ve? has 0 and ?
//RemoVE has R and V and E
//remove. has .
alert(azwords(res));//should alert "good gooood";

Try this one:
var res = "good Remove remove1 remove, ### rem0ve? RemoVE gooood remove.";
var new_one = res.replace(/\s*\w*[#A-Z0-9,.?\xA1-\xFF]\w*/g,'');
//Output `good gooood`
Description:
\s* # zero-or-more spaces
\w* # zero-or-more alphanumeric characters
[#A-Z0-9,.?\xA1-\xFF] # matches any one of the listed characters
\w* # zero-or-more alphanumeric characters
/g - global (run over all string)

This will find all the words you want /^[a-z]+\s|\s[a-z]+$|\s[a-z]+\s/g so you could use match.
this.match(/^[a-z]+\s|\s[a-z]+$|\s[a-z]+\s/g).join(" "); should return the list of valid words.
Note that this took some time to run in JSFiddle, so it may be more efficient to split the string and iterate over the words.

Shared part in RegEx matched string

In following code:
"a sasas b".match(/sas/g) //returns ["sas"]
The string actually includes two sas strings, a [sas]as b and a sa[sas] b.
How can I modify RegEx to match both?
Another example:
"aaaa".match(/aa/g); //actually include [aa]aa,a[aa]a,aa[aa]
Please consider the issue in general not just above instances.
A pure RegEx solution is preferred.

If you want to match at least one such "merged" occurrence, then you could do something like:
"a sasas b".match(/s(as)+/g)
If you want to retrieve the matches as separate results, then you have a bit more work to do; this is not a case that regular expressions are designed to handle. The basic algorithm would be:
Attempt a match. If it was unsuccessful, stop.
Extract the match you are interested in and do whatever you want with it.
Take the substring of the original target string, starting from one character following the first character in your match.
Start over, using this substring as the new input.
(To be more efficient, you could match with an offset instead of using substrings; that technique is discussed in this question.)
For example, you would start with "a sasas b". After the first match, you have "sas". Taking the substring that starts one character after the match starts, we would have "asas b". The next match would find the "sas" here, and you would again repeat the process with "as b". This would fail to match, so you would be done.
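The steps above can be sketched roughly as follows. This is an illustration of the substring approach only, and the function name matchOverlappingBySubstring is made up:

```javascript
// Repeatedly match, then restart one character after the start of the last match.
function matchOverlappingBySubstring(input, pattern) {
  var results = [];
  var rest = input;
  var m;
  while (rest.length > 0) {
    // Reset lastIndex so a pattern with the g flag always searches from the start.
    pattern.lastIndex = 0;
    m = pattern.exec(rest);
    if (m === null) {
      break; // step 1: no match, stop
    }
    results.push(m[0]); // step 2: keep the match
    // Steps 3 and 4: continue with the substring that starts one character
    // after the first character of this match.
    rest = rest.slice(m.index + 1);
  }
  return results;
}

console.log(matchOverlappingBySubstring('a sasas b', /sas/));
console.log(matchOverlappingBySubstring('aaaa', /aa/));
```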

This significantly improved answer owes itself to @EliGassert.
String.prototype.match_overlap = function(re)
{
if (!re.global)
re = new RegExp(re.source,
'g' + (re.ignoreCase ? 'i' : '')
+ (re.multiline ? 'm' : ''));
var matches = [];
var result;
while (result = re.exec(this))
matches.push(result),
re.lastIndex = result.index + 1;
return matches.length ? matches : null;
}
@EliGassert points out that there is no need to walk through the entire string character by character; instead we can find a match anywhere (i.e. do without the anchor), and then continue one character after the index of the found match. While researching how to retrieve said index, I found that the re.lastIndex property, used by exec to keep track of where it should continue its search, is in fact settable! This works rather nicely with what we intend to do.
The only bit needing further explanation might be the beginning. In the absence of the g flag, exec may never return null (always returning its one match, if it exists), thus possibly going into an infinite loop. Since, however, match_overlap by design seeks multiple matches, we can safely recompile any non-global RegExp as a global RegExp, importing the i and m options as well if set.
Here is a new jsFiddle: http://jsfiddle.net/acheong87/h5MR5/.
document.write("<pre>");
document.write('sasas'.match_overlap(/sas/));
document.write("\n");
document.write('aaaa'.match_overlap(/aa/));
document.write("\n");
document.write('my1name2is3pilchard'.match_overlap(/[a-z]{2}[0-9][a-z]{2}/));
document.write("</pre>");
Output:
sas,sas
aa,aa,aa
my1na,me2is,is3pi

var match = "a sasas b".match(/s(?=as)/g);
for(var i =0; i != match.length; ++i)
alert(match[i]);
Going off of the comment by Q. Sheets and the response by cdhowie, I came up with the above solution: it consumes ONE character in the regular expression and does a lookahead for the rest of the match string. With these two pieces, you can construct all the positions and matching strings in your regular expression.
I wish there was an "inspect but don't consume" operator that you could use to actually include the rest of the matching (lookahead) string in the results, but there unfortunately isn't -- at least not in JS.
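A capturing group placed inside the lookahead does expose the looked-at text without consuming it, which gets close to that "inspect but don't consume" idea. A rough sketch, with the helper name matchOverlappingWithLookahead and its string-source parameter being assumptions made for this example:

```javascript
// `source` is the pattern as a string, e.g. 'sas' or '[a-z]{2}[0-9][a-z]{2}'.
function matchOverlappingWithLookahead(input, source) {
  // The whole pattern is wrapped in a lookahead, so each match is zero-width,
  // while group 1 still records the text that the lookahead saw.
  var re = new RegExp('(?=(' + source + '))', 'g');
  var results = [];
  var m;
  while ((m = re.exec(input)) !== null) {
    results.push(m[1]);
    // A zero-width match does not advance lastIndex, so move it forward by hand.
    re.lastIndex = m.index + 1;
  }
  return results;
}

console.log(matchOverlappingWithLookahead('a sasas b', 'sas'));
console.log(matchOverlappingWithLookahead('aaaa', 'aa'));
```

If the pattern itself contains capturing groups, they are shifted by one, because group 1 is taken by the wrapper.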

Here's a generic way to do it:
String.prototype.match_overlap = function(regexp)
{
regexp = regexp.toString().replace(/^\/|\/$/g, '');
var re = new RegExp('^' + regexp);
var matches = [];
var result;
for (var i = 0; i < this.length; i++)
if (result = re.exec(this.substr(i)))
matches.push(result);
return matches.length ? matches : null;
}
Usage:
var results = 'sasas'.match_overlap(/sas/);
Returns:
An array of (overlapping) matches, or null.
Example:
Here's a jsFiddle in which this:
document.write("<pre>");
document.write('sasas'.match_overlap(/sas/));
document.write("\n");
document.write('aaaa'.match_overlap(/aa/));
document.write("\n");
document.write('my1name2is3pilchard'.match_overlap(/[a-z]{2}[0-9][a-z]{2}/));
document.write("</pre>");
returns this:
sas,sas
aa,aa,aa
my1na,me2is,is3pi
Explanation:
To explain a little bit, we intend for the user to pass a RegExp object to this new function, match_overlap, as he or she would do normally with match. From this we want to create a new RegExp object anchored at the beginning (to prevent duplicate overlapped matches—this part probably won't make sense unless you encounter the issue yourself—don't worry about it). Then, we simply match against each substring of the subject string this and push the results to an array, which is returned if non-empty (otherwise returning null). Note that if the user passes in an expression that is already anchored, this is inherently wrong—at first I stripped anchors out, but then I realized I was making an assumption in the user's stead, which we should avoid. Finally one could go further and somehow merge the resulting array of matches into a single match result resembling what would normally occur with the //g option; and one could go even further and make up a new flag, e.g. //o that gets parsed to do overlap-matching, but this is getting a little crazy.

Split string with a single occurence (not twice) of a delimiter in Javascript

This is better explained with an example. I want to achieve a split like this:
two-separate-tokens-this--is--just--one--token-another
->
["two", "separate", "tokens", "this--is--just--one--token", "another"]
I naively tried str.split(/-(?!-)/) and it won't match the first occurrence of double delimiters, but it will match the second (as it is not followed by the delimiter):
["two", "separate", "tokens", "this-", "is-", "just-", "one-", "token", "another"]
Do I have a better alternative than looping through the string?
By the way, the next step should be replacing the two consecutive delimiters by just one, so it's kind of escaping the delimiter by repeating it... So the final result would be this:
["two", "separate", "tokens", "this-is-just-one-token", "another"]
If that can be achieved in just one step, that should be really awesome!

str.match(/(?!-)(.*?[^\-])(?=(?:-(?!-)|$))/g);
Check this fiddle.
Explanation:
The non-greedy pattern (?!-)(.*?[^\-]) matches a string that neither starts nor ends with a dash, and the pattern (?=(?:-(?!-)|$)) requires that match to be followed by a single dash or by the end of the line. The /g modifier makes match find all occurrences, not just the first one.
Edit (based on OP's comment):
str.match(/(?:[^\-]|--)+/g);
Check this fiddle.
Explanation:
The pattern (?:[^\-]|--) matches either a non-dash character or a double dash. The + sign repeats that pattern as many times as possible. The /g modifier makes match find all occurrences, not just the first one.
Note:
Pattern /(?:[^-]|--)+/g works in JavaScript as well, but JSLint requires - to be escaped inside square brackets; otherwise it reports an error.
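To also collapse each doubled delimiter into a single one, as the question asks for as a next step, the tokens found by match can be post-processed. A minimal sketch assuming "-" is the delimiter; splitEscaped is a made-up name:

```javascript
// Split on single hyphens, treating "--" as an escaped hyphen inside a token.
function splitEscaped(str) {
  // match returns null when nothing matches, so fall back to an empty array.
  var tokens = str.match(/(?:[^-]|--)+/g) || [];
  return tokens.map(function (token) {
    // Unescape: every doubled hyphen becomes a single one.
    return token.replace(/--/g, '-');
  });
}

console.log(splitEscaped('two-separate-tokens-this--is--just--one--token-another'));
```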

You would need a negative lookbehind assertion as well as your negative lookahead:
(?<!-)-(?!-)
http://regexr.com?31qrn
Unfortunately the javascript regular expression parser does not support negative lookbehinds, I believe the only workaround is to inspect your results afterwards and remove any matches that would have failed the lookbehind assertion (or in this case, combine them back into a single match).

@Ωmega has the right idea in using match instead of split, but his regex is more complicated than it needs to be. Try this one:
s.match(/[^-]+(?:--[^-]+)*/g);
It reads exactly the way you expect it to work: Consume one or more non-hyphens, and if you encounter a double hyphen, consume that and go on consuming non-hyphens. Repeat as necessary.
EDIT: Apparently the source string may contain runs of two or more consecutive hyphens, which should not be treated as delimiters. That can be handled by adding a + to the second hyphen:
s.match(/[^-]+(?:--+[^-]+)*/g);
You can also use a {min,max} quantifier:
s.match(/[^-]+(?:-{2,}[^-]+)*/g);

I don't know how to do it purely with the regex engine in JS. You could do it this way that is a little less involved than manually parsing:
var str = "two-separate-tokens-this--is--just--one--token-another";
str = str.replace(/--/g, "#!!#");
var split = str.split(/-/);
for (var i = 0; i < split.length; i++) {
split[i] = split[i].replace(/#!!#/g, "--");
}
Working demo: http://jsfiddle.net/jfriend00/hAhAB/

You can achieve this without negative lookbehind (as @jbabey mentioned, these are not supported in JS) like this (inspired by this article):
\b-\b
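For illustration, here is how that pattern behaves with split. It relies on every token starting and ending with a word character (letters, digits or underscore), which is an assumption about the input; the function name splitOnSingleHyphen is hypothetical:

```javascript
// A hyphen counts as a delimiter only when it has a word character on both sides.
function splitOnSingleHyphen(str) {
  return str.split(/\b-\b/);
}

console.log(splitOnSingleHyphen('two-separate-tokens-this--is--just--one--token-another'));

// Limitation: a hyphen next to a non-word character is not treated as a delimiter.
console.log(splitOnSingleHyphen('a-!b'));
```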

Given that the regular expressions weren't very good with edge cases (like 5 consecutive delimiters) and I had to deal with replacing the double delimiters with a single one (and then again it would get tricky because '----'.replace('--', '-') gives '---' rather than '--')
I wrote a function that loops over the characters and does everything in one go (although I'm concerned that using the string accumulator can be slow :-s)
f = function(id, delim) {
var result = [];
var acc = '';
var i = 0;
while(i < id.length) {
if (id[i] == delim) {
if (id[i+1] == delim) {
acc += delim;
i++;
} else {
result.push(acc);
acc = '';
}
} else {
acc += id[i];
}
i++;
}
if (acc != '') {
result.push(acc);
}
return result;
}
and some tests:
> f('a-b--', '-')
["a", "b-"]
> f('a-b---', '-')
["a", "b-"]
> f('a-b---c', '-')
["a", "b-", "c"]
> f('a-b----c', '-')
["a", "b--c"]
> f('a-b----c-', '-')
["a", "b--c"]
> f('a-b----c-d', '-')
["a", "b--c", "d"]
> f('a-b-----c-d', '-')
["a", "b--", "c", "d"]
(If the last token is empty, it's meant to be skipped)
