I was trying to do a regex for someone else when I ran into this problem. The requirement was that the regex should return results from a set of strings that has, let's say, "apple" in it. For example, consider the following strings:
"I have an apple"
"You have two Apples"
"I give you one more orange"
The result set should have the first two strings.
The regex(es) I tried are:
/[aA]pple/ and /[^a-zA-Z0-9][aA]pple/
The problem with the first one is that words like "aapple", "bapple", etc (ok, so they are meaningless, but still...) test positive with it, and the problem with the second one is that when a string actually starts with the word "apple", "Apples and oranges", for example, it tests negative. Can someone explain why the second regex behaves this way and what the correct regex would be?
/(^.*?\bapples?\b.*$)/i
Edit: The above will match the entire string containing the word "apples", which I thought is what you were asking for. If you are just trying to see if the string contains the word, the following will work.
/\bapples?\b/i
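To make the difference between the two patterns concrete, here is a small sketch. The sample string comes from the question; the variable names are only illustrative.

```javascript
const sample = "You have two Apples";

// Whole-line pattern: the match covers the entire string.
const wholeLine = sample.match(/(^.*?\bapples?\b.*$)/i);

// Word-only pattern: the match covers just the word itself.
const wordOnly = sample.match(/\bapples?\b/i);

console.log(wholeLine[0]); // "You have two Apples"
console.log(wordOnly[0]);  // "Apples"
```

If you only need a yes/no answer, either pattern gives the same result with .test(); the difference only matters when you look at what was matched.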
The regex(es) I tried are:
/[aA]pple/ and /[^a-zA-Z0-9][aA]pple/
The first one just checks for the existence of the following characters, in order: a-p-p-l-e, regardless of what context they are used in. The \b, or word-boundary anchor, matches any position where a non-word character and a word character meet, à la \W\w.
The second one requires one non-alphanumeric character immediately before the occurrence of a-p-p-l-e; apart from that requirement, it is essentially the same as the first.
The one I answered with works as follows. From the beginning of the string, it matches any characters (if they exist) non-greedily until it encounters a word boundary. If the string starts with apple, the beginning of a string is a word boundary, so it still matches. It then matches the letters a-p-p-l-e, and s if it exists, followed by another word boundary. It then matches all characters to the end of the string. The /i at the end means it's case-insensitive, so 'Apple', 'APPLE', and 'apple' are all valid.
If you have the time, I would highly recommend walking through the tutorial at http://regular-expressions.info. It really goes in-depth and talks about how the regular expression engines match different expressions. It helped me a ton.
To build on @tj111, the reason your second regex fails is that [^a-zA-Z0-9] requires that a character matches; that is, there is some character in that position, and its value is not contained in the set [a-zA-Z0-9]. Markers like \b are called "zero-width assertions". \b, in particular, matches against boundaries between characters or at the beginning or end of a string. Because it is not matching against any character, its "width" is zero.
In sum, [^a-zA-Z0-9] requires a character that does not take a particular value be present, while \b requires only that a boundary be present.
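As a rough illustration of that difference, the sketch below runs the two patterns from the question and the \b pattern against a few strings. The sample list and the property names are made up for this example.

```javascript
const samples = [
  "I have an apple",
  "Apples and oranges",
  "a bapple",
  "I give you one more orange",
];

const plain = /[aA]pple/;                 // no context check at all
const needsChar = /[^a-zA-Z0-9][aA]pple/; // must consume one non-alphanumeric character
const boundary = /\bapples?\b/i;          // zero-width check, consumes nothing

const report = samples.map((text) => ({
  text,
  plain: plain.test(text),
  needsChar: needsChar.test(text),
  boundary: boundary.test(text),
}));

console.log(report);
// "Apples and oranges": plain true, needsChar false, boundary true
// "a bapple":           plain true, needsChar false, boundary false
```

The second row of the output is the case from the question: there is no character before the leading "A", so the character class has nothing to consume, while \b is satisfied by the start of the string.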
Edit: @tj111 has added most of this to his response. I'm in too late, again :)
This works for apple and apples and their case-insensitive spellings:
var strings = ["I have an apple", "You have two Apples", "I give you one more orange"];
var result = [];
var pattern = /\bapples?\b/i;
for (var i = 0; i < strings.length; i++) {
  if (pattern.test(strings[i])) {
    result.push(strings[i]);
  }
}
Your second regex requires a nonalphanumeric character before the first a in apple. "apple" doesn't satisfy this. As others note, "\b" matches not a character, but a word boundary position.
/\bapple/i
\b is a word boundary.
To explain why your attempts do not work, the first one does not check that it is the beginning of the word, so it can have something before it. The second regex you gave says that something must be before the word "apple", but it can't be alphanumeric.
Been stupid around lately. Recently I had no idea about how to capture every first letter in the string. I googled. As I understood, \b matches every first letter plus space before them. So, I wrote a regex: (\b)[^'] and it works perfectly with strings except for the part with the short form like 'aren't' or 'doesn't'. My reg captures this 't' that shouldn't be captured. Could you tell me what is wrong with my reg?
Thanks ahead. I am a beginner, so I am sorry if I have a dumb code xd
Code:
String.prototype.toJadenCase = function () {
  return this.replace(/(\b)[^']/g, function(x) {
    return x.toUpperCase();
  });
}
// output: How Can Mirrors Be Real If Our Eyes Aren'T Real
console.log("How can mirrors be real if our eyes aren't real".toJadenCase());
// output: Most Trees Are Blue
console.log('most Trees Are Blue'.toJadenCase());
// output: Why This Doesn'T Work
console.log("why This Doesn't work".toJadenCase());
As I understood, \b matches every first letter plus space before them.
This is not really what it does. It doesn't match a character. It matches a position without matching a single character. The position is matched when exactly one character among the character before and character after that position is alphanumeric (or underscore).
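One way to see those positions is to replace every \b match with a visible marker. This is only a sketch; the "|" marker and the variable names are arbitrary choices for the example.

```javascript
// Every \b match is an empty string at some position, so replacing it
// inserts the marker without removing any characters.
const marked = "aren't real".replace(/\b/g, "|");
console.log(marked); // "|aren|'|t| |real|"

// The match itself has zero length.
const firstBoundary = "aren't".match(/\b/);
console.log(firstBoundary[0].length); // 0
```

Note the marker on both sides of the apostrophe: that is the boundary that causes the stray capital T.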
So in this particular task you have to avoid a match with 't, since \b also matches the position between those two characters. You could use a look-behind to check that the character before that position is not a '. You should also require that the character after that position is a letter, as otherwise you'll also match the position just after every word.
So use: /(?<!')\b[a-z]/g
The (?<!') part asserts that the preceding character is not a quote.
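A minimal sketch of the fix, written as a standalone function rather than a String.prototype extension (that choice, and the function name, are assumptions made for this example). It needs an engine that supports lookbehind, and [a-z] only covers unaccented lowercase ASCII letters.

```javascript
function toJadenCase(text) {
  // Uppercase a lowercase letter at a word boundary,
  // unless the character just before it is an apostrophe.
  return text.replace(/(?<!')\b[a-z]/g, (ch) => ch.toUpperCase());
}

console.log(toJadenCase("How can mirrors be real if our eyes aren't real"));
// "How Can Mirrors Be Real If Our Eyes Aren't Real"
console.log(toJadenCase("why This Doesn't work"));
// "Why This Doesn't Work"
```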
http://jsfiddle.net/bxeLyneu/1/
function custom() {
  var str = document.getElementById('original').innerHTML;
  var replacement = str.replace(/\B:poop:\B/g,'REPLACED');
  document.getElementById('replaced').innerHTML = replacement;
}
custom()
Yes = :poop: should be replaced with "REPLACED"
No = :poop: should not be replaced. In other words, remain untouched.
Numbers 4, 5, and 6 don't seem to follow the rule provided. I do know why, but I don't have much idea how to combine multiple expressions into one.
I have tried many others, but I just can't get them to work the way I want them to. The odds aren't in my favor.
And yes, this is very similar to how Facebook emoji in chat box works.
New issue:
http://jsfiddle.net/xaekh8op/13/
/(^|\s):bin:(\s|$)/gm
It is unable to scan and replace the one in the middle.
How can I fix that?
\B means "Any location not at a word boundary" whereas \s means "Whitespace". Based upon your given examples, the following code works perfectly.
function custom() {
  var str = document.getElementById('original').innerHTML;
  var replacement = str.replace(/([\s>]|^):poop:(?=[\s<]|$)/gm,'$1REPLACED');
  document.getElementById('replaced').innerHTML = replacement;
}
custom()
http://jsfiddle.net/xaekh8op/15/
Explanation:
The regular expression ([\s>]|^):poop:(?=[\s<]|$) stands for the following:
[Figure: diagram of the regular expression, created in Debuggex]
By picking one of \s and > at the start (or using ^ meaning start of line), and grouping it as group 1, we can use it later. Similarly for after the :poop: (\s or < or end-of-line $). However, the second time, it is done using a look-ahead ((?= ...) is the syntax), which checks whether the [\s<]|$ portion is there after, but it doesn't consume it in the replacement. The < and > take care of any HTML tags that might be just beside the :poop:. The $1 in the replacement string $1REPLACED places the first group back, thereby rendering only the :poop: being replaced with REPLACED. The second "group" was just a look-ahead, and thus does not need to be replaced back.
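The following sketch shows why the look-ahead matters, using plain strings instead of the DOM. The function names, the test strings, and the '$1REPLACED$2' replacement used for the consuming pattern are assumptions for illustration.

```javascript
// Look-ahead version: the character after the token is checked but not consumed.
function replaceWithLookahead(text) {
  return text.replace(/([\s>]|^):poop:(?=[\s<]|$)/gm, "$1REPLACED");
}

// Consuming version (same shape as the :bin: pattern): the trailing space is
// consumed, so it is no longer available as the leading space of the next match.
function replaceConsuming(text) {
  return text.replace(/(^|\s):bin:(\s|$)/gm, "$1REPLACED$2");
}

console.log(replaceWithLookahead(":poop: :poop: :poop:")); // "REPLACED REPLACED REPLACED"
console.log(replaceWithLookahead("<b>:poop:</b>"));        // "<b>REPLACED</b>"
console.log(replaceWithLookahead("a:poop:b"));             // "a:poop:b"
console.log(replaceConsuming(":bin: :bin: :bin:"));        // "REPLACED :bin: REPLACED"
```

The last line reproduces the "one in the middle" problem: the first match swallows the only space that could introduce the second token.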
For further information on word boundaries, you can refer to http://www.regular-expressions.info/wordboundaries.html which says:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
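Those three cases can be checked directly by probing every index of a string with a sticky \b. This is only a sketch; the helper name wordBoundaryPositions is made up for the example.

```javascript
function wordBoundaryPositions(text) {
  const probe = /\b/y; // sticky flag: the match must start exactly at lastIndex
  const positions = [];
  for (let i = 0; i <= text.length; i++) {
    probe.lastIndex = i;
    if (probe.test(text)) {
      positions.push(i);
    }
  }
  return positions;
}

console.log(wordBoundaryPositions("apple pie")); // [0, 5, 6, 9]
console.log(wordBoundaryPositions("-ab-"));      // [1, 3]
```

In "apple pie", 0 and 9 are the start-of-string and end-of-string cases, while 5 and 6 are the positions on either side of the space.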
I have used Dylan's question on here regarding JavaScript syllable counting, and more specifically artfulhacker's answer, in my own code and, regardless of which single or multi word string I feed it, the function is always able to correctly count the number of syllables.
I have limited experience with RegEx and not enough prior knowledge to decipher what exactly is happening in the following code without some help. I'm not someone who is ever happy with having some code I pulled from somewhere just work without me knowing how it works. Is someone able to please articulate what is happening in the new_count(word) function below and help me decipher the use of RegEx and how it is that the function is able to correctly count syllables? Many thanks.
function new_count(word) {
  word = word.toLowerCase(); //word.downcase!
  if(word.length <= 3) { return 1; } //return 1 if word.length <= 3
  word = word.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, ''); //word.sub!(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '')
  word = word.replace(/^y/, ''); //word.sub!(/^y/, '')
  return word.match(/[aeiouy]{1,2}/g).length; //word.scan(/[aeiouy]{1,2}/).size
}
As far as I see it, we basically want to count the vowels, or vowel pairs, with some special cases. Let's start by the last line, which does that, i.e. count vowels and pairs:
return word.match(/[aeiouy]{1,2}/g).length;
This will match any vowel, or vowel pair. [...] means a character class, i.e. that if we go through the string character-by-character, we have a match, if the actual character is one of those. {1,2} is the number of repetitions, i.e. it means that we should match one or two such characters in a row (two if possible, since the quantifier is greedy).
The other two lines are for special cases.
word = word.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '');
This line will remove 'syllables' from the end of the word, which are either:
Xes (where X is anything but any of 'laeiouy', e.g. 'zes')
ed
Xe (where X is anything but any of 'laeiouy', e.g. 'xe')
(I'm not really sure what the grammatical meaning behind this is, but I guess, that 'syllables' at the end of the word, like '-ed', '-ded', '-xed' etc. don't really count as such.)
As for the regexp part: (?:...) is a non-capturing group. I guess it's not really important in this case that this group be non-capturing; this just means that we would like to group the whole expression, but then we do not need to refer back to it. However, we could have used a capturing group too (i.e. (...) )
The [^...] is a negated character class. It means, match any character, which is none of those listed here. (Compare to the (non-negated) character-class mentioned above.)
The pipe symbol, i.e. |, is the alternation operator, which means, that any of the expressions can match.
Finally, the $ anchor matches the end of the line, or string (depending on the context).
word = word.replace(/^y/, '');
This line removes 'y'-s from the beginning of words (probably 'y' at the beginning does not count as a syllable -- which makes sense in my opinion).
^ is the anchor for matching the beginning of the line, or string (cf. $ mentioned above).
Note: the algorithm only works if word really contains one single word.
/(?:[^laeiouy]es|ed|[^laeiouy]e)$/
That matches three possible substrings: a character other than 'l' or a vowel followed by 'es' (like "res" or "tes"); 'ed'; or a non-vowel, non-'l' followed by just an 'e'. Those patterns must appear at the end of the word to match because of the $ at the end of the pattern. The (?: ) is a non-capturing group: the leading ?: means it only groups the alternatives, without saving the matched text for later reference. The pattern could have been a little shorter:
/(?:[^laeiouy]es?|ed)$/
would do the same thing. In any case, if the pattern matches the characters involved are removed from the word.
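A quick way to convince yourself that the two patterns behave the same is to run both over a handful of words. This is just a spot check on a made-up word list, not a proof.

```javascript
const original = /(?:[^laeiouy]es|ed|[^laeiouy]e)$/;
const shorter = /(?:[^laeiouy]es?|ed)$/;

const words = ["makes", "baked", "time", "tales", "free", "apple"];

const rows = words.map((word) => ({
  word,
  original: word.replace(original, ""),
  shorter: word.replace(shorter, ""),
}));

console.log(rows);
// "makes" -> "ma", "baked" -> "bak", "time" -> "ti";
// "tales", "free" and "apple" are left unchanged by both patterns.
```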
Then,
/^y/
matches a 'y' at the beginning of a word. If a 'y' is found, it's removed.
Finally,
/[aeiouy]{1,2}/g
matches any one- or two-character stretch of vowels (including 'y'). The g suffix makes it a global match, so that the return value is an array consisting of all such spans of vowels. The length of that returned array is the number of syllables (according to this technique).
Note that the words "poem" and "lion" would be reported as one-syllable words, which may be correct for some English variants but not all.
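To follow the three steps on real words, here is a sketch that returns the intermediate values instead of only the final count. The function name traceSyllables and the returned object shape are invented for this example. The `|| []` guard is also an addition: String.prototype.match returns null when nothing matches, so the original function would throw for a word such as "shed", which is left with no vowels after the suffix is stripped.

```javascript
function traceSyllables(word) {
  const lower = word.toLowerCase();
  if (lower.length <= 3) {
    return { stripped: lower, groups: [], count: 1 };
  }
  const stripped = lower
    .replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, "") // drop silent-looking endings
    .replace(/^y/, "");                              // drop a leading y
  const groups = stripped.match(/[aeiouy]{1,2}/g) || []; // guard against null
  return { stripped, groups, count: groups.length };
}

console.log(traceSyllables("poem"));   // { stripped: "poem", groups: ["oe"], count: 1 }
console.log(traceSyllables("table"));  // { stripped: "table", groups: ["a", "e"], count: 2 }
console.log(traceSyllables("yellow")); // { stripped: "ellow", groups: ["e", "o"], count: 2 }
```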
Here is a pretty good reference for JavaScript regular expression operators.
I want to split up a string (sentence) in an array of words and keep the delimiters.
I have found and I am currently using this regex for this:
[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)
An explanation can be found here: http://regex101.com/
This works exactly as I want it to and effectively makes a string like
This is a sentence.
To an array of
["This", "is", "a", "sentence."]
The problem here is that it does not include spaces nor newlines. I want the string to be parsed as words as it already does but I also want the corresponding space and or newline character to belong to the previous word.
I have read about positive lookahead that should look for future characters (space and or newline) but still take them into account when extracting the word. Although this might be the solution I have failed to implement it.
If it makes any difference I am using JavaScript and the following code:
//save the regex -- g modifier to get all matches
var reg = /[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)/g;
//define variable for holding matches
var matches;
//loop through each match
while(matches = reg.exec(STRING_HERE)){
  //the word without spaces or newlines
  console.log(matches[0]);
}
The code works but as I said, it does not include spaces and newline characters.
You can try something simpler:
str.split(/\b(?!\s)/);
However, note non word characters (e.g. full stop) will be considered another word:
"This is a sentence.".split(/\b(?!\s)/);
// [ "This ", "is ", "a ", "sentence", "." ]
To fix that, you can use a character class with the characters that shouldn't begin another word:
str.split(/\b(?![\s.])/);
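For reference, this is what the second pattern produces on the sentence from the question and on a string containing a newline. The second sample string is made up for the example.

```javascript
const wordStart = /\b(?![\s.])/;

const oneLine = "This is a sentence.".split(wordStart);
console.log(oneLine); // ["This ", "is ", "a ", "sentence."]

const twoLines = "Hi there\nfriend.".split(wordStart);
console.log(twoLines); // ["Hi ", "there\n", "friend."]
```

Each space or newline stays attached to the word before it, and joining the pieces with an empty string gives back the original input.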
function split_string(str){
  var arr = str.split(" ");
  var last_i = arr.length - 1;
  for(var i=0; i<last_i; i++){
    arr[i]+=" ";
  }
  return arr;
}
It may be as simple as this:
var sentence = 'This is a sentence.';
sentence = sentence.split(' ').join(' ||');
sentence = sentence.split('\n').join('\n||');
var matches = sentence.split('||');
Note that I use 2 pipes as a delimiter, but of course you can use anything as long as it's unique.
Also note that I only split \n as a newline, but you may add \r\n or whatever you want to split as well.
General Solution
To keep the delimiters conjoined in the results, the regex needs to be a zero-width match. In other words, the regex can be thought of as matching the point between a delimiter and non-delimiter, rather than matching the delimiters themselves. This can be achieved with zero-width matching expressions, matching before, at, or after the split point (at most one each); let's call these A, B, and C. Sometimes a single sub-expression will do it, others you'll need two; offhand, I can't think of a case where you'd need three.
Not only look-aheads but lookarounds in general are the perfect candidates for this purpose: lookbehinds ((?<=...)) to match before the split point, and lookaheads ((?=...)) after. That's the essence of this approach. Positive or negative lookarounds can be used. The one pitfall is that lookbehinds are relatively new to JS regexes, so not all browsers or other JS engines will support them (current versions of Firefox, Chrome, Opera, Edge, and Node.js do; Safari gained support only in version 16.4). If you need to support a JS engine that doesn't support lookbehinds, you might still be able to write and use a regex that matches only at and after the split point (BC).
To have the delimiters appear at the end of each match, put them in A. To have them at the start, in C. Fortunately, JS regexes do not place restrictions on lookbehinds, so simply wrapping the delimiter regex in the positive lookaround markers should be all that's required for delimiters. If the delimiters aren't so simple (i.e. context-sensitive), it might take a little more work to write the regex, which doesn't need to match the entire delimiter.
Paired with the delimiter pattern, you'll need to write a pattern that matches the start (for C) or end (for A) of the non-delimiter. This step is likely the one that will require the most additional work.
The at-split-point match, B, will often (always?) be a simple boundary, such as \b.
Specific Solution
If spaces are the only delimiters, and they're to appear at the end of each match, the delimiter pattern would be (?<=\s), in A. However, there are some cases not covered in the problem description. For example, should words separated by only punctuation (e.g. "x.y") be split? Which side of a split point should quotation marks and hyphens appear, if any? Should they count as punctuation? Another option for the delimiter is to match (after) all non-word characters, in which case A would be (?<=\W).
Since the split-point is at a word boundary, B could be \b.
Since the start of a match is a word character, (?=\w) will suffice for C.
Any two of those three should suffice. One that is perhaps clearest in meaning (and splits at the most points) is /(?<=\W)(?=\w)/, which can be translated as "split at the start of each word". \b could be added, if you find it more understandable, though it has no functional effect: /(?<=\W)\b(?=\w)/.
Note Oriol's excellent solutions are given by B=\b and (C=(?!\s) or C=(?![\s.])).
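A short sketch of the A and C combinations on a few inputs. It assumes an engine with lookbehind support, and the sample strings other than the one from the question are made up.

```javascript
const startOfWord = /(?<=\W)(?=\w)/;       // A and C
const withBoundary = /(?<=\W)\b(?=\w)/;    // A, B and C
const afterWhitespace = /(?<=\s)/;         // A only

console.log("This is a sentence.".split(startOfWord));
// ["This ", "is ", "a ", "sentence."]

console.log("x.y".split(startOfWord));
// ["x.", "y"]  (punctuation alone also separates words with \W)

console.log("a  b".split(afterWhitespace));
// ["a ", " ", "b"]  (A alone splits after every whitespace character)

console.log("a  b".split(startOfWord));
// ["a  ", "b"]
```

Adding \b in the middle gives the same pieces as the two-part version on these inputs, which matches the remark that it has no functional effect.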
Additional
As a point of interest, there would be a simpler solution for this particular case if JS regexes supported TCL word boundaries: \m matches only at the start of a word, so str.split(/\m/) would split exactly at the start of each word. (\m is equivalent to (?<=\W)(?=\w).)
If you want to include the whitespace after the word, the regex \S+\s* should work.
const s = `This is a sentence.
This is another sentence.`;
console.log(s.match(/\S+\s*/g))
Given the following Regular Expression:
\b(MyString|MyString-Dash)\b
And the text:
AString
MyString
MyString-Dash
Running a match against the text never finds a match for the second thing (MyString-Dash) because the '-' (dash) character isn't a word boundary character. The following javascript always outputs "MyString,MyString" to the "matches" div (I would like to find MyString and MyString-Dash as distinct matches). How can I define a pattern that will match both MyString and MyString-Dash ?
<html>
<body>
<h1>Content</h1>
<div id="content">
AString
MyString
MyString-Dash
</div>
<br>
<h1>Matches (expecting MyString,MyString-Dash)</h1>
<div id="matches"></div>
</body>
<script>
var content = document.getElementById('content');
var matchesDiv = document.getElementById('matches');
var pattern = '\\b(MyString|MyString-Dash)\\b';
var matches = content.innerHTML.match(pattern);
matchesDiv.innerHTML = matches;
</script>
</html>
Swap the order of your matching so that the longest possible is first:
content.innerHTML.match(/\b(MyString-Dash|MyString)\b/)
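Alternatives are tried from left to right, and the first one that succeeds wins, so with the original order the shorter MyString matches before MyString-Dash is ever tried. I just tested this in Firebug, and it works.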
I would also change that pattern var to a regular expression literal, from '\\b(MyString-Dash|MyString)\\b' to /\b(MyString-Dash|MyString)\b/g
You want the /g in there because that will make the regular expression return all matches, rather than just the first one.
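Here is a small sketch of the effect of the alternative order, using a plain string in place of the div's innerHTML. The third pattern, with an optional -Dash suffix, is an extra variation added for comparison.

```javascript
const text = "AString\nMyString\nMyString-Dash";

const shortFirst = text.match(/\b(MyString|MyString-Dash)\b/g);
const longFirst = text.match(/\b(MyString-Dash|MyString)\b/g);
const optionalSuffix = text.match(/\bMyString(?:-Dash)?\b/g);

console.log(shortFirst);     // ["MyString", "MyString"]
console.log(longFirst);      // ["MyString", "MyString-Dash"]
console.log(optionalSuffix); // ["MyString", "MyString-Dash"]
```

With the short alternative first, the \b after MyString is satisfied by the following dash, so the engine never needs to try the longer alternative.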
Please see this answer for how to deal with words with dashes in them and the issues related to boundaries when you have those kinds of words.
There are a couple problems with your assumptions.
Running a match against the text never finds a match for the second thing (MyString-Dash) because the '-' (dash) character isn't a word boundary character.
There's no such thing as a word boundary character. A word boundary is the position between a character that matches \w and one that doesn't. The - does not match \w, so there is a word boundary on either side of it, but that won't break your match: the - is a literal dash in your regex and the \b's are far outside of it.
Second, regexen will always try to match the first thing they can in the string that matches your regex. As long as that first string in there matches, it will keep returning the first thing in there. You're asking for the first match when you ask for a match. That's the design. If you didn't want it to match MyString, don't ask for it.
Third, most regex engines prioritize 'completing a match' over length of a match. Thus, 'MyString', if it matches, will always be the first thing it returns. You'll have to wait until Perl 6 grammars for a regex engine that prioritizes length. :)
The only way for you to really do this is with two checks, one for the longer one, first, and then one for the shorter one. It will always match the first thing it finds that works. If you have a priority other than that, it's up to you to code it in as separate checks.
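It may also help to see where the "MyString,MyString" output in the question comes from. Without the g flag, match returns the first full match followed by its capture groups, and converting that array to a string joins the entries with commas. A sketch, using a plain string in place of the div content:

```javascript
const content = "\nAString\nMyString\nMyString-Dash\n";

// Same pattern string as in the question; no g flag, so only the first match is returned.
const firstOnly = content.match("\\b(MyString|MyString-Dash)\\b");

console.log(firstOnly[0]);      // "MyString"  (the whole match)
console.log(firstOnly[1]);      // "MyString"  (capture group 1)
console.log(String(firstOnly)); // "MyString,MyString"
```

So the two entries are the same single match reported twice, not two separate matches in the text.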