Matching on words with possibly special characters


I'm trying to replace all occurrences of a given word in a string, but the word may contain a special character that needs to be escaped. Here's an example:
The ERA is the mean of earned runs given up by a pitcher per nine
innings pitched. Meanwhile, the ERA+, the adjusted ERA, is a pitcher's
earned run average (ERA) according to the pitcher's ballpark (in case
the ballpark favors batters or pitchers) and the ERA of the pitcher's
league.
I would like to be able to do the following:
string = "The ERA..." // from above
string = string.replaceAll("ERA", "<b>ERA</b>");
string = string.replaceAll("ERA+", "<u>ERA+</u>");
without ERA and ERA+ conflicting. I've been using the prototype replaceAll posted previously, along with a regular expression found somewhere else on SO (unfortunately, I can't find the link in my history):
String.prototype.replaceAll = function (find, replace) {
var str = this;
return str.replace(new RegExp(find.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&'), 'g'), replace);
};
function loadfunc() {
var markup = document.getElementById('thetext').innerHTML;
var terms = Object.keys(acronyms);
for (var i = 0; i < terms.length; i++) {
markup = markup.replaceAll(terms[i], '<abbr title=\"' + acronyms[terms[i]] + '\">' + terms[i] + '</abbr>');
}
document.getElementById('thetext').innerHTML = markup;
}
Basically, the code wraps each abbreviation in an <abbr> tag so that its definition is shown on mouseover. The problem is that the current regular expression is far too loose. My previous attempts worked partially, but failed to distinguish between things like ERA and ERA+, or completely skipped over something like "K/9" or "IP/GS" (which should match as a whole, not as "IP" or "GS" individually).
I should mention that acronyms is an object that looks like:
var acronyms = {
"ERA": "Earned Run Average: ...",
"ERA+": "Earned Run Average adjusted to ..."
};
Also (although this is fairly obvious) 'thetext' is a dummy div containing some text. The loadfunc() function is executed from <body onload="loadfunc()">
Thanks!

OK, this is a lot to work with -- after looking at your jsFiddle.
I think the best you're going to get is searching for whole words that begin with a capital letter and may contain /, % or +. Something like this: ([A-Z][\w/%+]+)
Caveat: no matter how you do this, if you're doing it in the browser (e.g. because you can't update the raw data), it's going to be processing-intensive.
And you can implement it like this:
var repl = str.replace(/([A-Z][\w\/%+]+)/g, function(match) {
    // Only wrap candidates that are actually defined in the acronyms object
    if (match in acronyms)
        return "<abbr title='" + acronyms[match] + "'>" + match + "</abbr>";
    else
        return match;
});
Here's a working jsFiddle: http://jsfiddle.net/remus/9z6fg/
Note that jQuery isn't required, just used it in this case for ease of updating the DOM in jsFiddle.

You want to use regular expressions with negative lookahead:
string.replace(/\bERA(?!\+)\b/g, "<b>ERA</b>");
and
string.replace(/\bERA\+/g, "<u>ERA+</u>");
The zero-width word boundary \b has been added for good measure, so you don't accidentally match strings like 'BERA', etc.
Another idea is to sort the list of acronyms from longest key to shortest. This way you are sure to substitute all 'ERA+' before 'ERA', so the shorter term cannot claim part of the longer one first.
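The longest-first idea can also be folded into a single pass, which avoids re-matching ERA inside markup already generated for ERA+. The following is only a sketch: markAcronyms and escapeRegExp are hypothetical helper names, the trailing (?![\w+]) guard is an assumption about what counts as the end of a term, and the definitions are assumed not to contain double quotes.

```javascript
// Hypothetical helper: escape characters that are special in a regex.
function escapeRegExp(s) {
  return s.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
}

// Hypothetical helper: wrap every known acronym in an <abbr> tag in one pass.
function markAcronyms(text, acronyms) {
  // Longest keys first, so "ERA+" is tried before "ERA" at each position.
  const terms = Object.keys(acronyms).sort((a, b) => b.length - a.length);
  if (terms.length === 0) return text;
  const re = new RegExp(
    '\\b(?:' + terms.map(escapeRegExp).join('|') + ')(?![\\w+])', 'g');
  // A replacer function avoids surprises with "$" in the definitions.
  return text.replace(re, function (m) {
    return '<abbr title="' + acronyms[m] + '">' + m + '</abbr>';
  });
}

const marked = markAcronyms('The ERA and the ERA+ (ERA).', {
  'ERA': 'Earned Run Average',
  'ERA+': 'Adjusted ERA'
});
console.log(marked);
```

Because the text is scanned only once, the generated title attributes and tags are never searched again.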

Related

RegEx working in JavaScript but not in C#

I currently have a working WordWrap function in Javascript that uses RegEx. I pass the string I want wrapped and the length I want to begin wrapping the text, and the function returns a new string with newlines inserted at appropriate locations in the string as shown below:
function wordWrap(string, width) {
    let newString = string.replace(
        new RegExp(`(?![^\\n]{1,${width}}$)([^\\n]{1,${width}})\\s`, 'g'), '$1\n'
    );
    return newString;
}
For consistency purposes I won't go into, I need to use an identical or similar RegEx in C#, but I am having trouble successfully replicating the function. I've been through a lot of iterations of this, but this is what I currently have:
private static string WordWrap(string str, int width)
{
Regex rgx = new Regex("(?![^\\n]{ 1,${" + width + "}}$)([^\\n]{1,${" + width + "}})\\s");
MatchCollection matches = rgx.Matches(str);
string newString = string.Empty;
if (matches.Count > 0)
{
foreach (Match match in matches)
{
newString += match.Value + "\n";
}
}
else
{
newString = "No matches found";
}
return newString;
}
This inevitably ends up finding no matches regardless of the string and length I pass. I've read that the RegEx used in JavaScript is different than the standard RegEx functionality in .NET. I looked into PCRE.NET but have had no luck with that either.
Am I heading in the right general direction with this? Can anyone help me convert the first code block in JavaScript to something moderately close in C#?
edit: For those looking for more clarity on what the working function does and what I am looking for the C# function to do: What I am looking to output is a string that has a newline (\n) inserted at the width passed to the function. One thing I forgot to mention (but really isn't related to my issue here) is that the working JavaScript version finds the end of the word so it doesn't cut up the word. So for example this string:
"This string is really really long so we want to use the word wrap function to keep it from running off the page.\n"
...would be converted to this with the width set to 20:
"This string is really \nreally long so we want \nto use the word wrap \nfunction to keep it \nfrom running off the \npage.\n"
Hope that clears it up a bit.

The JavaScript and C# regex engines are different. Each language has its own regex implementation, so regex syntax and features are language-dependent: a pattern that works in one language will not necessarily work in another.
For example, C# has long supported named groups and lookbehind, while JavaScript only gained them in ES2018.
So you can find multiple differences between the regex flavors of these two languages.

There are issues with the way you've translated the regex pattern from a JavaScript string to a C# string.
You have extra whitespace in the C# version, and you've also left in the $ symbols and curly brackets { } that are part of the string interpolation syntax in the JavaScript version (they are not part of the actual regex pattern).
You have:
"(?![^\\n]{ 1,${" + width + "}}$)([^\\n]{1,${" + width + "}})\\s"
when what I believe you want is:
"(?![^\\n]{1," + width + "}$)([^\\n]{1," + width + "})\\s"
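To see why, it can help to print what the JavaScript template literal actually produces: ${width} is interpolation, so neither the $ nor the surrounding braces reach the regex engine. A small sketch (buildWrapPattern and buildWrapPatternConcat are hypothetical helper names, not part of the original function):

```javascript
// Hypothetical helper: the pattern text that the JavaScript wordWrap builds.
function buildWrapPattern(width) {
  return `(?![^\\n]{1,${width}}$)([^\\n]{1,${width}})\\s`;
}

// The same pattern written with plain concatenation, as one would in C#.
function buildWrapPatternConcat(width) {
  return "(?![^\\n]{1," + width + "}$)([^\\n]{1," + width + "})\\s";
}

console.log(buildWrapPattern(20)); // (?![^\n]{1,20}$)([^\n]{1,20})\s
console.log(buildWrapPattern(20) === buildWrapPatternConcat(20)); // true
```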

Match and replace a substring while ignoring special characters

I am currently looking for a way to wrap matching text in a bold HTML tag. I have it partially working, but special characters are giving me problems: I want to keep the original string intact in the output, but not use the original string for the comparison.
Example:
Given the original string:
Taco John's is my favorite place to eat.
And wanting to match:
is my 'favorite'
To get the desired result:
Taco John's <b>is my favorite</b> place to eat.
The way I'm currently getting around the extra quotes in the matching string is by stripping them:
let regex = new RegExp('('+escapeRegexCharacters(matching_text.replace(/[^a-z 0-9]/gi,''))+')',"gi")
let html = full_text.replace(/[^a-z 0-9]/gi,'').replace(regex, "<b>$1</b>")
This almost works, except that I lose all punctuation:
Taco Johns <b>is my favorite</b> place to eat
Is there any way to use regex, or another method, to add tags surrounding a matching phrase while ignoring both case and special characters during the matching process?
UPDATE #1:
It seems that I am being unclear. I need the original string's punctuation to remain in the end result's HTML, and I need the matching logic to ignore all special characters and capitalization. So is my favorite, is My favorite, and is my 'favorite' should all trigger a match.

Instead of removing the special characters from the string being searched, you could inject in your regular expression a pattern between each character-to-match that will skip any special characters that might occur. That way you build a regular expression that can be applied directly to the string being searched, and the replacing operation will thus not touch the special characters outside of the matches:
let escapeRegexCharacters =
        s => s.replace(/[\-\[\]\/\{\}\(\)\*\+\?\.\\\^\$\|]/g, "\\$&"),
    full_text = "Taco John's is My favorite place to eat.",
    matching_text = "is my 'favorite'",
    // Strip special characters from the search phrase, then allow any run of
    // special characters between each pair of remaining characters
    regex = new RegExp(matching_text.replace(/[^a-z\s\d]/gi, '')
        .split('').map(escapeRegexCharacters).join('[^a-z\\s\\d]*'), "gi"),
    html = full_text.replace(regex, "<b>$&</b>");
console.log(html);

Regexps are useful where there is a pattern but, in this case, you have a direct match, so a good approach is to use String.prototype.replace:
function wrap(source, part, tagName) {
return source
.replace(part,
`<${tagName}>${part}</${tagName}>`
)
;
}
At least, if there is a pattern, you should edit your question and provide it.

As an option, for single occurrence case - use String.split
Example replacing '###' with '<b>###</b>':
let inputString = '1234###5678'
const chunks = inputString.split('###')
inputString = `${chunks[0]}<b>###</b>${chunks[1]}`

It's possible to avoid using a capture group with the $& replacement string, which means "entire matched substring":
var phrase = "Taco John's is my favorite place to eat."
var matchingText = "is my favorite"
var re = new RegExp(escapeRegexCharacters(matchingText), "ig");
phrase.replace(re, "<b>$&</b>");
(Code based on obarakon's answer.)

Generalizing, the regex you could use is /is my \w*/. You can use that with a replacer function so that you can manipulate the matched text in JavaScript:
var str = "Taco John's is my favorite place to eat.";
var html = str.replace(/is my \w*/, function (x) {
return "<b>" + x + "</b>";
} );
console.log(html);

How to check multiple matching words with regex in Javascript

Hey I have code like this
var text = "We are downing to earth"
var regexes = "earth|art|ear"
if (regexes.length) {
var reg = new RegExp(regexes, "ig");
console.log(reg)
while ((regsult = reg.exec(text)) !== null) {
var word = regsult[0];
console.log(word)
}
}
I want to get the matching words from the text. The result should include "earth", "art" and "ear", because "earth" contains those substrings. Instead, it only produces "earth".
Is there any mistake with my regex pattern?
Or should I use another approach in JS?
Thanks

As discussed in another answer, a single regexp cannot match multiple overlapping alternatives. In your case, simply do a separate regexp test for each word you are looking for:
var text = "We are downing to earth"
var regexes = ["earth", "art", "ear"];
var results = [];
for (var i = 0; i < regexes.length; i++ ) {
var word = regexes[i];
if (text.match(word)) results.push(word);
}
You could tighten this up a little bit by doing
regexes . filter(function(word) { return (text.match(word) || [])[0]; });
If your "regexes" are actually just strings, you could just use indexOf and keep things simpler:
regexes . filter(function(word) { return text.indexOf(word) !== -1; });

You only get earth as a match because the regex engine has matched earth as the first alternative and then moved on in the source string, oblivious to the fact that you could also have matched ear or art. This is expected behavior with all regex engines - they don't try to return all possible matches, just the first one, and matches generally can't overlap.
Whether earth or ear is returned depends on the regex engine. A POSIX ERE engine will always return the leftmost, longest match, whereas most current regex engines (including JavaScript's) will return the first possible match, depending on the order of alternation in the regex.
So art|earth|ear would return earth, whereas ear|art|earth would return ear.
You can make the regex find overlapping matches (as long as they start in different positions in the string) by using positive lookahead assertions:
(?=(ear|earth|art))
will find ear and art, but not earth because it starts at the same position as ear. Note that in this case you must not look at the regex's entire match (regsult[0] in your code) but at the content of the capturing group (regsult[1]).
The only way around this that I can think of at the moment would be to use
(?=(ear(th)?|art))
which would have a result like [["", "ear", "th"], ["", "art", undefined]].
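As a rough illustration of the lookahead approach, here is a sketch that collects the captured group at each position. findOverlapping is a hypothetical helper name, and the words are assumed to contain no regex special characters. Because every match is zero-length, lastIndex has to be advanced by hand, otherwise exec would keep returning the same position.

```javascript
// Hypothetical helper: collect matches that may overlap, one per start position.
function findOverlapping(text, words) {
  const re = new RegExp('(?=(' + words.join('|') + '))', 'gi');
  const found = [];
  let m;
  while ((m = re.exec(text)) !== null) {
    found.push(m[1]); // the captured word, not the (empty) full match
    re.lastIndex++;   // zero-length match: step forward manually
  }
  return found;
}

const text = "We are downing to earth";
console.log(findOverlapping(text, ["ear", "earth", "art"])); // [ 'ear', 'art' ]
console.log(findOverlapping(text, ["earth", "art", "ear"])); // [ 'earth', 'art' ]
```

The second call shows that the order of the alternatives decides which word is reported when two of them start at the same position.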

Count parentheses with regular expression

My string is: (as(dh(kshd)kj)ad)... ()()
How is it possible to count the parentheses with a regular expression? I would like to select the string which begins at the first opening bracket and ends before the ...
Applying that to the above example, that means I would like to get this string: (as(dh(kshd)kj)ad)
I tried to write it, but this doesn't work:
var str = "(as(dh(kshd)kj)ad)... ()()";
document.write(str.match(/(.*)/m));

As I said in the comments, contrary to popular belief (don't believe everything people say) matching nested brackets is possible with regex.
The downside of using it is that you can only do it up to a fixed level of nesting. And for every additional level you wish to support, your regex will be bigger and bigger.
But don't take my word for it. Let me show you. The regex \([^()]*\) matches one level. For up to two levels, use \(([^()]*|\([^()]*\))*\). To match your case, you'd need:
\(([^()]*|\(([^()]*|\([^()]*\))*\))*\)
It would match the (as(dh(kshd)kj)ad) part of (as(dh(kshd)kj)ad)... ()()
Try it with more deeply nested input to see what I mean by a fixed level of nesting.
And so on. To keep adding levels, all you have to do is change the last [^()]* part to ([^()]*|\([^()]*\))*. As I said, it will get bigger and bigger.
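Since each level follows the same recipe, the pattern can be generated instead of written by hand. This is only a sketch: nestedParensPattern is a hypothetical helper name, and it uses the slightly different form (?:[^()]|inner)* rather than ([^()]*|inner)* to avoid a quantifier nested directly inside another one.

```javascript
// Hypothetical helper: build a regex that matches balanced parentheses
// nested up to `levels` deep (levels >= 1).
function nestedParensPattern(levels) {
  let pattern = '\\([^()]*\\)'; // one level: ( ... ) with no parentheses inside
  for (let i = 1; i < levels; i++) {
    // Wrap the previous level: ( followed by plain characters or inner groups, then )
    pattern = '\\((?:[^()]|' + pattern + ')*\\)';
  }
  return new RegExp(pattern);
}

const str = "(as(dh(kshd)kj)ad)... ()()";
console.log(str.match(nestedParensPattern(3))[0]); // (as(dh(kshd)kj)ad)
console.log(str.match(nestedParensPattern(2))[0]); // (dh(kshd)kj)
```

With only two levels the outermost pair can no longer be matched, so the first match starts at the second opening parenthesis. That is the fixed-depth limitation in practice.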

See Tim's answer for why this won't work, but here's a function that'll do what you're after instead.
function getFirstBracket(str){
var pos = str.indexOf("("),
bracket = 0;
if(pos===-1) return false;
for(var x=pos; x<str.length; x++){
var char = str.substr(x, 1);
bracket = bracket + (char=="(" ? 1 : (char==")" ? -1 : 0));
if(bracket==0) return str.substr(pos, (x+1)-pos);
}
return false;
}
getFirstBracket("(as(dh(kshd)kj)ad)... ()(");

There is a way, and your approach was quite good:
match gives you an array if there were any hits, and you can then check the array's length.
var str = "(as(dh(kshd)kj)ad)... ()()",
match = str.match(new RegExp('.*?(?:\\(|\\)).*?', 'g')),
count = match ? match.length : 0;
This regular expression will get all parts of your text that include round brackets. See http://gskinner.com/RegExr/ for a nice online regex tester.
Now you can use count for all brackets.
match will deliver an array that looks like:
["(", "as(", "dh(", "kshd)", "kj)", "ad)", "... (", ")", "(", ")"]
Now you can start sorting your results:
var newStr = '', open = 0, close = 0;
for (var n = 0, m = match.length; n < m; n++) {
if (match[n].indexOf('(') !== -1) {
open++;
newStr += match[n];
} else {
if (open > close) newStr += match[n];
close++;
}
if (open === close) break;
}
... and newStr will be (as(dh(kshd)kj)ad)
This is probably not the nicest code but it will make it easier to understand what you're doing.
With this approach there is no limit of nesting levels.

This is not possible with a JavaScript regex. Generally, regular expressions can't handle arbitrary nesting because that can no longer be described by a regular language.
Several modern regex flavors do have extensions that allow for recursive matching (like PHP, Perl or .NET), but JavaScript is not among them.

No. Regular expressions express regular languages. Finite automata (FA) are the machines that recognise regular languages. An FA is, as its name implies, finite in memory. With finite memory, an FA cannot remember an arbitrary number of parentheses, a feature that is needed in order to do what you want.
I suggest you use an algorithm involving a counter to solve your problem.

Try this:
var str = "(as(dh(kshd)kj)ad)... ()()";
document.write(str.match(/(\(.*?)\.\.\./)[1]);

Regex lookbehind workaround for Javascript?

I am terrible at regex so I will communicate my question a bit unconventionally in the name of trying to better describe my problem.
var TheBadPattern = /(\d{2}:\d{2}:\d{2},\d{3})/;
var TheGoodPattern = /([a-zA-Z0-9\-,.;:'"])(?:\r\n?|\n)([a-zA-Z0-9\-])/gi;
// My goal is to then do this
inputString = inputString.replace(TheGoodPattern, '$1 $2');
Question: I want to match all the good patterns and do the subsequent find/replace UNLESS they are preceded by the bad pattern. Any ideas on how? I was able to accomplish this in other languages that support lookbehind, but I am at a loss without it. (PS: from what I understand, JS does not support lookahead/lookbehind or, if you prefer, '?<!', '?<='.)

JavaScript does support lookaheads. And since you only need a lookbehind (and not a lookahead, too), there is a workaround (which doesn't really aid the readability of your code, but it works!). So what you can do is reverse both the string and the pattern.
inputString = inputString.split("").reverse().join("");
var pattern = /([a-z0-9\-])(?:\n\r?|\r)([a-z0-9\-,.;:'"])(?!\d{3},\d{2}:\d{2}:\d{2})/gi;
inputString = inputString.replace(pattern, '$1 $2');
inputString = inputString.split("").reverse().join("");
Note that you had redundantly used the upper case letters (they are taken care of by the i modifier).
I would actually test it for you if you supplied some example input.

I have also used the reverse methodology recommended by m.buettner, and it can get pretty tricky depending on your patterns. I find that workaround works well if you are matching simple patterns or strings.
With that said I thought I would go a bit outside the box just for fun. This solution is not without its own foibles, but it also works and it should be easy to adapt to existing code with medium to complicated regular expressions.
http://jsfiddle.net/52QBx/
js:
function negativeLookBehind(lookBehindRegExp, matchRegExp, modifiers)
{
var text = $('#content').html();
var badGoodRegex = regexMerge(lookBehindRegExp, matchRegExp, modifiers);
var badGoodMatches = text.match(badGoodRegex) || []; // match() returns null when nothing matches
var placeHolderMap = {};
for(var i = 0;i<badGoodMatches.length;i++)
{
var match = badGoodMatches[i];
var placeHolder = "${item"+i+"}"
placeHolderMap[placeHolder] = match;
$('#content').html($('#content').html().replace(match, placeHolder));
}
var text = $('#content').html();
var goodRegex = matchRegExp;
var goodMatches = text.match(goodRegex);
for(prop in placeHolderMap)
{
$('#content').html($('#content').html().replace(prop, placeHolderMap[prop]));
}
return goodMatches;
}
function regexMerge(regex1, regex2, modifiers)
{
/*this whole concept could be its own beast, so I just asked to have modifiers for the combined expression passed in rather than determined from the two regexes passed in.*/
return new RegExp(regex1.source + regex2.source, modifiers);
}
var result = negativeLookBehind(/(bad )/gi, /(good\d)/gi, "gi");
alert(result);

html:
<div id="content">Some random text trying to find good1 text but only when that good2 text is not preceded by bad text so bad good3 should not be found bad good4 is a bad oxymoron anyway.</div>
The main idea is find all the total patterns (both the lookbehind and the real match) and temporarily remove those from the text being searched. I utilized a map as the values being hidden could vary and thus each replacement had to be reversible. Then we can run just the regex for the items you really wanted to find without the ones that would have matched the lookbehind getting in the way. After the results are determined we swap back in the original items and return the results. It is a quirky, yet functional, workaround.
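For what it's worth, engines that implement ES2018 (current versions of Node.js and the major browsers, as far as I know) support lookbehind natively, so where older engines are not a concern the placeholder trick may be unnecessary. A minimal sketch on the sample text above, assuming such an engine:

```javascript
const text = "Some random text trying to find good1 text but only when that " +
  "good2 text is not preceded by bad text so bad good3 should not be found " +
  "bad good4 is a bad oxymoron anyway.";

// (?<!bad ) is a negative lookbehind: match good\d only when "bad " does not
// come immediately before it.
const goodMatches = text.match(/(?<!bad )good\d/gi) || [];
console.log(goodMatches); // [ 'good1', 'good2' ]
```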
