Regex Expressions For Emoji

Regex Expressions For Emoji - javascript

http://jsfiddle.net/bxeLyneu/1/
function custom() {
var str = document.getElementById('original').innerHTML;
var replacement = str.replace(/\B:poop:\B/g,'REPLACED');
document.getElementById('replaced').innerHTML = replacement;
}
custom()
Yes = :poop: should be replaced with "REPLACED"
No = :poop: should not be replaced. In other words, remain untouched.
Numbers 4, 5, and 6 don't seem to follow the rule provided. I do know why, but I don't have much idea how to combine multiple expressions into one.
I have tried many other patterns, but I just can't get them to work the way I want. The odds aren't in my favor.
And yes, this is very similar to how Facebook emoji in chat box works.
New issue:
http://jsfiddle.net/xaekh8op/13/
/(^|\s):bin:(\s|$)/gm
It is unable to scan and replace the one in the middle.
How can I fix that?

\B means "Any location not at a word boundary" whereas \s means "Whitespace". Based upon your given examples, the following code should handle all of them.
function custom() {
var str = document.getElementById('original').innerHTML;
var replacement = str.replace(/([\s>]|^):poop:(?=[\s<]|$)/gm,'$1REPLACED');
document.getElementById('replaced').innerHTML = replacement;
}
custom()
http://jsfiddle.net/xaekh8op/15/
Explanation:
The regular expression ([\s>]|^):poop:(?=[\s<]|$) stands for the following:
(image created in Debuggex)
By matching either \s or > before :poop: (or ^, meaning start of line) and capturing it as group 1, we can reuse it later. The same idea applies after :poop: (\s, <, or end-of-line $), but there it is done with a look-ahead ((?= ...) is the syntax), which checks that the [\s<]|$ portion follows without consuming it. Because the trailing whitespace is not consumed, it is still available as the leading delimiter of the next :poop:, which is why adjacent occurrences are all replaced. The < and > take care of any HTML tags that might be directly beside the :poop:. The $1 in the replacement string $1REPLACED puts the first group back, so only the :poop: itself is replaced with REPLACED. The second "group" is just a look-ahead, so it matches no text and nothing needs to be put back.
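To see why the look-ahead matters when several emoji sit next to each other, here is a minimal sketch comparing a consuming pattern (the one from the question, adapted to :poop:) with the look-ahead pattern from this answer. The sample strings and the helper names replaceConsuming and replaceLookahead are made up for illustration.

```javascript
// Consuming version: the trailing whitespace becomes part of the match,
// so it cannot also act as the leading delimiter of the next emoji.
function replaceConsuming(text) {
  return text.replace(/(^|\s):poop:(\s|$)/gm, '$1REPLACED$2');
}

// Look-ahead version: the trailing delimiter is only checked, not consumed.
function replaceLookahead(text) {
  return text.replace(/([\s>]|^):poop:(?=[\s<]|$)/gm, '$1REPLACED');
}

const sample = 'a :poop: :poop: :poop: b';
console.log(replaceConsuming(sample));          // a REPLACED :poop: REPLACED b
console.log(replaceLookahead(sample));          // a REPLACED REPLACED REPLACED b
console.log(replaceLookahead('x:poop:y'));      // x:poop:y (untouched)
console.log(replaceLookahead('<b>:poop:</b>')); // <b>REPLACED</b>
```

The consuming version skips the middle occurrence, which is the symptom described in the "New issue" part of the question.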
For further information on word boundaries, you can refer to http://www.regular-expressions.info/wordboundaries.html which says:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
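Applied to the original /\B:poop:\B/ attempt, these rules mean the following: since ":" is not a word character, \B next to it succeeds whenever the neighbour is also a non-word character (whitespace, punctuation) or the edge of the string, and fails only when the neighbour is a letter, digit, or underscore. A small sketch with made-up sample strings:

```javascript
// Non-global regex, so test() keeps no state between calls.
const pattern = /\B:poop:\B/;

const samples = [':poop:', ' :poop: ', '!:poop:!', 'a:poop:b', 'a:poop: '];
for (const s of samples) {
  console.log(JSON.stringify(s), pattern.test(s));
}
// ":poop:"    true   (string edges next to ":" are not word boundaries)
// " :poop: "  true
// "!:poop:!"  true   (punctuation counts as a non-word neighbour)
// "a:poop:b"  false
// "a:poop: "  false  (the leading "a" creates a word boundary)
```

So \B alone cannot distinguish whitespace from punctuation, which is why the answer switches to explicit delimiters.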

Related

How to code a Regex for a shared character between two correspondent patterns?

I want to find all 'aa' substrings in 'caaab', so I've used the following regular expression.
/aa/g
Using this expression, I expect JavaScript's match method to return two matches, because the shared middle 'a' produces two overlapping 'aa' substrings. Nonetheless, it returns only the first one. What is the problem with the regex, and how can I fix it?
let d = 'caaab';
let foundArray = d.match(/aa/g); // ["aa"]

Here is one way to approach this. We first record the length of the input string for later use. Then we do a global regex replacement of a(?=a) with the empty string. This removes every a that is immediately followed by another a, which is exactly one character per (possibly overlapping) occurrence of aa. Finally, we compare the length of the output against the input to figure out how many times aa occurred.
var input = "caaab";
var sLen = input.length;
var output = input.replace(/a(?=a)/g, "");
var eLen = output.length;
console.log("There were " + (sLen - eLen) + " occurrences of aa in the input");
Note that the difficulty you are encountering has to do with how the regex engine consumes input. A match of aa consumes both characters, including the a that would begin the next overlapping aa. Using a(?=a) gets around this problem, because the lookahead (?=a) does not consume the next a.

Use a lookahead
As mentioned in a comment, that's how regexes are designed to work:
it's working exactly as it's supposed to; once it consumes a character, it moves past it
Matches do not overlap. This isn't a limitation of JS; it's simply how regular expressions work.
The way to get around that is to use a zero-length match, i.e. a look-ahead or look-behind.
Tim's existing answer already does this, but can be simplified as follows:
var match = "caaab".match(/a(?=a)/g);
console.log(match); // ["a", "a"]
This is finding an a followed by another a (which is not returned as part of the match). So technically it's finding:
caaab
 ^ first match, single character
  ^ second match, single character
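If the overlapping substrings themselves (and their positions) are needed rather than just a count, one option is to put a capturing group inside the look-ahead. This is a sketch, not part of the original answers; it assumes an environment with String.prototype.matchAll (ES2020), and findOverlapping is a made-up helper name.

```javascript
function findOverlapping(text) {
  // (?=(aa)) matches the empty string at every position where "aa" starts;
  // group 1 captures the text without consuming it.
  return Array.from(text.matchAll(/(?=(aa))/g), (m) => ({
    index: m.index,
    text: m[1],
  }));
}

console.log(findOverlapping('caaab'));
// [ { index: 1, text: 'aa' }, { index: 2, text: 'aa' } ]
```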

Matching variants and misspellings of a word using RegEx in MS Word

I am trying to capture variants of a word using Microsoft Word's find and replace function. Here is a searchable snippet:
There are going to be 3 instances of the word successful for the purpose of Regex matching. Here is the second sucesfull and here is another succesfull , both spelt incorrectly.
This is my Regex expression used in Find and Replace with "Use Wildcards" selected (I have also tried this with replacing the braces with brackets with no joy)
<([Ss]uc[1,]es[1,]ful[1,])>

[Ss]uc{1,}es{1,}ful{1,}
Replace the [ ] around each 1, with { } and it should work fine. Curly braces specify how many times the preceding character may repeat ({1,} means one or more). Square brackets are used to specify a set of acceptable characters.
So the current regular expression will match the following.
succcccesssfulll
sucesful
successful
Successsssfull
and so on.
In standard regex syntax, I think this is cleaner and easier to type (note that Word's wildcard mode does not accept +):
[Ss]uc+es+ful+
"+" means one or more occurrences of the preceding character.

The search string you want would be:
<[sS]uc@es@ful@>
This searches for a whole word (the < and > symbols mark the start and end of a word) starting with either s or S and containing one or more (the @ symbol) of each of c, s, and l.
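Word's wildcard syntax is not the same as JavaScript's, but the pattern can be sanity-checked outside Word. The sketch below assumes the usual correspondence (Word's @ or {1,} becomes +, and < > become \b) and runs it against the snippet from the question.

```javascript
const text =
  'There are going to be 3 instances of the word successful for the ' +
  'purpose of Regex matching. Here is the second sucesfull and here is ' +
  'another succesfull , both spelt incorrectly.';

// \b plays the role of Word's < and >; + plays the role of @ / {1,}.
const variants = text.match(/\b[Ss]uc+es+ful+\b/g);
console.log(variants); // [ 'successful', 'sucesfull', 'succesfull' ]
```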

RegEx: Understanding Syllable Counter Code

I have used Dylan's question on here regarding JavaScript syllable counting, and more specifically artfulhacker's answer, in my own code and, regardless of which single or multi word string I feed it, the function is always able to correctly count the number of syllables.
I have limited experience with RegEx and not enough prior knowledge to decipher exactly what is happening in the following code without some help. I'm never happy having code I pulled from somewhere just work without knowing how it works. Could someone please explain what is happening in the new_count(word) function below, and help me decipher the RegEx and how the function manages to count syllables correctly? Many thanks.
function new_count(word) {
word = word.toLowerCase(); //word.downcase!
if(word.length <= 3) { return 1; } //return 1 if word.length <= 3
word = word.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, ''); //word.sub!(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '')
word = word.replace(/^y/, ''); //word.sub!(/^y/, '')
return word.match(/[aeiouy]{1,2}/g).length; //word.scan(/[aeiouy]{1,2}/).size
}

As far as I can see, we basically want to count the vowels, or vowel pairs, with some special cases. Let's start with the last line, which does exactly that, i.e. counts vowels and vowel pairs:
return word.match(/[aeiouy]{1,2}/g).length;
This will match any vowel, or vowel pair. [...] means a character class, i.e. as we go through the string character by character, we have a match if the current character is one of those listed. {1,2} is the number of repetitions, i.e. it means that we should match one or two such characters in a row (two if possible, because the quantifier is greedy).
The other two lines are for special cases.
word = word.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '');
This line will remove 'syllables' from the end of the word, which are either:
Xes (where X is anything but any of 'laeiouy', e.g. 'zes')
ed
Xe (where X is anything but any of 'laeiouy', e.g. 'xe')
(I'm not really sure what the grammatical meaning behind this is, but I guess, that 'syllables' at the end of the word, like '-ed', '-ded', '-xed' etc. don't really count as such.)
As for the regexp part: (?:...) is a non-capturing group. I guess it's not really important in this case that this group be non-capturing; this just means that we would like to group the whole expression, but then we do not need to refer back to it. However, we could have used a capturing group too (i.e. (...) )
The [^...] is a negated character class. It means, match any character, which is none of those listed here. (Compare to the (non-negated) character-class mentioned above.)
The pipe symbol, i.e. |, is the alternation operator, which means, that any of the expressions can match.
Finally, the $ anchor matches the end of the line, or string (depending on the context).
word = word.replace(/^y/, '');
This line removes 'y'-s from the beginning of words (probably 'y' at the beginning does not count as a syllable -- which makes sense in my opinion).
^ is the anchor for matching the beginning of the line, or string (cf. $ mentioned above).
Note: the algorithm only works if word really contains one single word.
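To make the three steps visible, here is a sketch that performs the same replacements as new_count but records the intermediate strings. traceSyllables is a made-up name, and the "|| []" guard is an addition that is not in the original function (match() returns null when there is no vowel at all).

```javascript
function traceSyllables(word) {
  const steps = { input: word };
  word = word.toLowerCase();
  if (word.length <= 3) return { ...steps, count: 1 }; // short words: 1 syllable

  // Step 1: drop a silent-looking ending (Xes, ed, Xe).
  word = word.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '');
  steps.afterSuffix = word;

  // Step 2: drop a leading y.
  word = word.replace(/^y/, '');
  steps.afterLeadingY = word;

  // Step 3: each run of one or two vowels counts as one syllable.
  steps.groups = word.match(/[aeiouy]{1,2}/g) || []; // guard added for this sketch
  return { ...steps, count: steps.groups.length };
}

console.log(traceSyllables('baked'));  // afterSuffix: 'bak', groups: ['a'], count: 1
console.log(traceSyllables('yellow')); // afterLeadingY: 'ellow', groups: ['e', 'o'], count: 2
console.log(traceSyllables('table'));  // nothing removed, groups: ['a', 'e'], count: 2
```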

/(?:[^laeiouy]es|ed|[^laeiouy]e)$/
That matches three possible substrings: a character other than 'l' or a vowel (here a, e, i, o, u, y), followed by 'es' (like "res" or "tes"); 'ed'; or such a character followed by just an 'e'. Those patterns must appear at the end of the word to match because of the $ at the end of the pattern. The (?: ) is a non-capturing group: it only groups the alternatives, and the leading ?: is what keeps it from capturing. The pattern could have been a little shorter:
/(?:[^laeiouy]es?|ed)$/
would do the same thing. In any case, if the pattern matches, the characters involved are removed from the word.
Then,
/^y/
matches a 'y' at the beginning of a word. If a 'y' is found, it's removed.
Finally,
/[aeiouy]{1,2}/g
matches any one- or two-character stretch of vowels (including 'y'). The g suffix makes it a global match, so that the return value is an array consisting of all such spans of vowels. The length of that returned array is the number of syllables (according to this technique).
Note that the words "poem" and "lion" would be reported as one-syllable words, which may be correct for some English variants but not all.
Here is a pretty good reference for JavaScript regular expression operators.
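Both claims in this answer can be checked with a short sketch: that the shorter suffix pattern removes the same text as the original, and that "poem" and "lion" each produce a single vowel group. The word list is made up, and countVowelGroups is a hypothetical helper covering only the final counting step.

```javascript
const original = /(?:[^laeiouy]es|ed|[^laeiouy]e)$/;
const shorter = /(?:[^laeiouy]es?|ed)$/;

const words = ['makes', 'baked', 'table', 'like', 'rules', 'poem', 'lion'];
for (const w of words) {
  // Both columns should be identical for every word.
  console.log(w, '->', w.replace(original, ''), '|', w.replace(shorter, ''));
}

// Only the last step of the algorithm: count runs of one or two vowels.
function countVowelGroups(word) {
  return (word.match(/[aeiouy]{1,2}/g) || []).length;
}

console.log(countVowelGroups('poem')); // 1 ("oe" is a single group)
console.log(countVowelGroups('lion')); // 1 ("io" is a single group)
```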

Javascript Regex for all words not between certain characters

I'm trying to return a count of all words NOT between square brackets. So given ..
[don't match these words] but do match these
I get a count of 4 for the last four words.
This works in .net:
\b(?<!\[)[\w']+(?!\])\b
but it won't work in JavaScript because it doesn't support lookbehind (true when this was asked; lookbehind was added to the language in ES2018)
Any ideas for a pure js regex solution?

Ok, I think this should work:
\[[^\]]+\](?:^|\s)([\w']+)(?!\])\b|(?:^|\s)([\w']+)(?!\])\b
You can test it here:
http://regexpal.com/
If you need an alternative with text in square brackets coming after the main text, it could be added as a second alternative and the current second one would become third.
It's a bit complicated but I can't think of a better solution right now.
If you need to do something with the actual matches you will find them in the capturing groups.
UPDATE:
Explanation:
So, we've got two alternatives here:
1. \[[^\]]+\](?:^|\s)([\w']+)(?!\])\b
This is saying:
\[[^\]]+\] - match everything in square brackets (don't capture)
(?:^|\s) - followed by line start or whitespace. Looking at it now, the caret makes no sense here, so this becomes just \s
([\w']+) - match all following word characters, as long as the next character is not the closing bracket, (?!\]). This lookahead is probably also unnecessary now, so let's try removing it
\b - and match a word boundary
2. (?:^|\s)([\w']+)(?!\])\b
If alternative 1 doesn't match, just match the word, without looking for square brackets, since the first alternative has already ensured they are not here.
Ok, so I removed all the things that we don't need (they stayed there because I tried quite a few options before it worked:-) and the revised regex is the one below:
\[[^\]]+\]\s([\w']+)(?!\])\b|(?:^|\s)([\w']+)\b

I would use something like \[[^\]]*\] to remove the words between square brackets, and then split the resulting string on spaces to count the remaining words.
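A minimal sketch of that remove-then-split idea in JavaScript; countWordsOutsideBrackets is a made-up name, and splitting on any run of whitespace (rather than single spaces) is an assumption of this sketch.

```javascript
function countWordsOutsideBrackets(text) {
  // Replace each [bracketed group] with a space so neighbouring words stay separate.
  const withoutBrackets = text.replace(/\[[^\]]*\]/g, ' ');
  // Split on whitespace and drop the empty strings left at the edges.
  return withoutBrackets.split(/\s+/).filter(Boolean).length;
}

console.log(countWordsOutsideBrackets("[don't match these words] but do match these")); // 4
```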

Chris, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a general question about how to exclude patterns in regex.)
Here's our simple regex (see it at work on regex101, looking at the Group captures in the bottom right panel):
\[[^\]]*\]|(\b\w+\b)
The left side of the alternation matches complete [bracketed groups]. We will ignore these matches. The right side matches and captures words to Group 1, and we know they are the right words because they were not matched by the expression on the left.
This program shows how to use the regex (see the count result in the online demo):
<script>
var subject = '[match ye not these words] but do match these';
var regex = /\[[^\]]*\]|(\b\w+\b)/g;
var group1Caps = [];
var match = regex.exec(subject);
// put Group 1 captures in an array
while (match != null) {
if( match[1] != null ) group1Caps.push(match[1]);
match = regex.exec(subject);
}
document.write("<br>*** Number of Matches ***<br>");
document.write(group1Caps.length);
</script>
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
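The same match-and-discard technique can be written without document.write. The sketch below is a variation, not the answer's code: it assumes String.prototype.matchAll (ES2020) is available and uses [\w']+ as in the question's .NET pattern, so that a word like "don't" outside the brackets counts once. wordsOutsideBrackets is a made-up name.

```javascript
function wordsOutsideBrackets(text) {
  // Left alternative: swallow whole [bracketed groups] (group 1 stays undefined).
  // Right alternative: capture a word, apostrophes included, into group 1.
  const regex = /\[[^\]]*\]|([\w']+)/g;
  const words = [];
  for (const m of text.matchAll(regex)) {
    if (m[1] !== undefined) words.push(m[1]);
  }
  return words;
}

console.log(wordsOutsideBrackets("[skip these] but don't skip these"));
// [ 'but', "don't", 'skip', 'these' ]
```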

Javascript regex

I was trying to do a regex for someone else when I ran into this problem. The requirement was that the regex should return results from a set of strings that has, let's say, "apple" in it. For example, consider the following strings:
"I have an apple"
"You have two Apples"
"I give you one more orange"
The result set should have the first two strings.
The regex(es) I tried are:
/[aA]pple/ and /[^a-zA-Z0-9][aA]pple/
The problem with the first one is that words like "aapple", "bapple", etc (ok, so they are meaningless, but still...) test positive with it, and the problem with the second one is that when a string actually starts with the word "apple", "Apples and oranges", for example, it tests negative. Can someone explain why the second regex behaves this way and what the correct regex would be?

/(^.*?\bapples?\b.*$)/i
Edit: The above will match the entire string containing the word "apples", which I thought is what you were asking for. If you are just trying to see if the string contains the word, the following will work.
/\bapples?\b/i
The regex(es) I tried are:
/[aA]pple/ and /[^a-zA-Z0-9][aA]pple/
The first one just checks for the existence of the following characters, in order: a-p-p-l-e, regardless of the context they appear in. The \b, or word-boundary anchor, matches any spot where a non-word character and a word character meet, à la \W\w.
The second one requires one additional character before the occurrence of a-p-p-l-e, and that character must not be a letter or digit. It is otherwise the same as the first, which is why it fails when the string starts with "apple": there is no character in front of it.
The one I answered with works as follows. From the beginning of the string, it matches any characters (if they exist) non-greedily until it encounters a word boundary. If the string starts with apple, the beginning of the string is a word boundary, so it still matches. It then matches the letters a-p-p-l-e, and s if it exists, followed by another word boundary. It then matches all characters to the end of the string. The /i at the end means it's case-insensitive, so 'Apple', 'APPLE', and 'apple' are all valid.
If you have the time, I would highly recommend walking through the tutorial at http://regular-expressions.info. It goes in depth on how regular expression engines match different expressions, and it helped me a ton.

To build on @tj111, the reason your second regex fails is that [^a-zA-Z0-9] requires that a character matches; that is, there is some character in that position, and its value is not contained in the set [a-zA-Z0-9]. Markers like \b are called "zero-width assertions". \b, in particular, matches against boundaries between characters or at the beginning or end of a string. Because it is not matching against any character, its "width" is zero.
In sum, [^a-zA-Z0-9] requires a character that does not take a particular value be present, while \b requires only that a boundary be present.
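The difference between consuming a character and asserting a boundary can be seen directly in a short sketch (the sample strings are taken from or modelled on the question):

```javascript
const consuming = /[^a-zA-Z0-9][aA]pple/; // needs a real character before "apple"
const boundary = /\b[aA]pple/;            // needs only a word boundary

const samples = ['Apples and oranges', 'I have an apple', 'bapple'];
for (const s of samples) {
  console.log(JSON.stringify(s), consuming.test(s), boundary.test(s));
}
// "Apples and oranges"  false true   (nothing precedes the "A")
// "I have an apple"     true  true
// "bapple"              false false

// The consumed character is part of the match; the boundary is not.
console.log(JSON.stringify('I have an apple'.match(consuming)[0])); // " apple"
console.log(JSON.stringify('I have an apple'.match(boundary)[0]));  // "apple"
```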
Edit: @tj111 has added most of this to his response. I'm in too late, again :)

This works for apple and apples and its case-insensitive spellings:
var strings = ["I have an apple", "You have two Apples", "I give you one more orange"];
var result = [];
var pattern = /\bapples?\b/i;
for (var i=0; i<strings.length; i++) {
if (pattern.test(strings[i])) {
result.push(strings[i]);
}
}

Your second regex requires a nonalphanumeric character before the first a in apple. "apple" doesn't satisfy this. As others note, "\b" matches not a character, but a word boundary position.

/\bapple/i
\b is a word boundary.
To explain why your attempts do not work, the first one does not check that it is the beginning of the word, so it can have something before it. The second regex you gave says that something must be before the word "apple", but it can't be alphanumeric.
