Finding exact words in text, excluding quoted words - javascript

In the javascript code below I need to find in a text exact words, but excluding the words that are between quotes. This is my attempt, what's wrong with the regex? It should find all the words excluding word22 and "word3". If I use only \b in the regex it selects exact words but it doesn't exclude the words between quotes.
var text = 'word1, word2, word22, "word3" and word4';
var words = [ 'word1', 'word2', 'word3' , 'word4' ];
words.forEach(function(word){
var re = new RegExp('\\b^"' + word + '^"\\b', 'i');
var pos = text.search(re);
if (pos > -1)
alert(word + " found in position " + pos);
});

First, we'll use a function to escape the characters of the word, just in case there's some that have special meaning for regexp.
// from https://stackoverflow.com/a/30851002/240443
function regExpEscape(literal_string) {
return literal_string.replace(/[-[\]{}()*+!<=:?.\/\\^$|#\s,]/g, '\\$&');
}
Then, we construct a regular expression as an alternation between individual word regexps. For each word, we assert that it starts with a word boundary, ends with a word boundary, and has an even number of quote characters between its end, and the end of string. (Note that from the end of word3, there is only one quote till the end of string, which is odd.)
let text = 'word1, word2, word22, "word3" and word4';
let words = [ 'word1', 'word2', 'word3' , 'word4' ];
let regexp = new RegExp(words.map(word =>
'\\b' + regExpEscape(word) + '\\b(?=(?:[^"]*"[^"]*")*[^"]*$)').join('|'), 'g')
text.match(regexp)
// => word1, word2, word4
while ((m = regexp.exec(text))) {
console.log(m[0], m.index);
}
// word1 0
// word2 7
// word4 34
EDIT: Actually, we can speed the regexp up a bit if we factor out the surrounding conditions:
let regexp = new RegExp(
'\\b(?:' +
words.map(regExpEscape).join('|') +
')\\b(?=(?:[^"]*"[^"]*")*[^"]*$)', 'g')

Your excluding of the quote character is wrong, that's actually matching the beginning of the string followed by a quote. Trying this instead
var re = new RegExp('\\b[^"]' + word + '[^"]\\b', 'i');
Also, this site is amazing to help you debug regex : https://regexpal.com
Edit: Because \b will match on quotation marks, this needs to be tweaked further. Unfortunately javascript doesn't support lookbehinds, so we have to get a little tricky.
var re = new RegExp('(?:^|[^"\\w])' + word + '(?:$|[^"\\w])','i')
So what this is doing is saying
(?: Don't capture this group
^ | [^"\w]) either match the start of the line, or any non word (alphanumeric and underscore) character that isn't a quote
word capture and match your word here
(?: Don't capture this group either
$|[^"\w) either match the end of the line, or any non word character that isn't a quote again

Related

Is there a javascript method to recognize a string even if the words are out of order? [duplicate]

I would like to find all the matches of given strings (divided by spaces) in a string.
(The way for example, iTunes search box works).
That, for example, both "ab de" and "de ab" will return true on "abcde" (also "bc e a" or any order should return true)
If I replace the white space with a wild card, "ab*de" would return true on "abcde", but not "de*ab".
[I use * and not Regex syntax just for this explanation]
I could not find any pure Regex solution for that.
The only solution I could think of is spliting the search term and run multiple Regex.
Is it possible to find a pure Regex expression that will cover all these options ?
Returns true when all parts (divided by , or ' ') of a searchString occur in text. Otherwise false is returned.
filter(text, searchString) {
const regexStr = '(?=.*' + searchString.split(/\,|\s/).join(')(?=.*') + ')';
const searchRegEx = new RegExp(regexStr, 'gi');
return text.match(searchRegEx) !== null;
}
I'm pretty sure you could come up with a regex to do what you want, but it may not be the most efficient approach.
For example, the regex pattern (?=.*bc)(?=.*e)(?=.*a) will match any string that contains bc, e, and a.
var isMatch = 'abcde'.match(/(?=.*bc)(?=.*e)(?=.*a)/) != null; // equals true
var isMatch = 'bcde'.match(/(?=.*bc)(?=.*e)(?=.*a)/) != null; // equals false
You could write a function to dynamically create an expression based on your search terms, but whether it's the best way to accomplish what you are doing is another question.
Alternations are order insensitive:
"abcde".match(/(ab|de)/g); // => ['ab', 'de']
"abcde".match(/(de|ab)/g); // => ['ab', 'de']
So if you have a list of words to match you can build a regex with an alternation on the fly like so:
function regexForWordList(words) {
return new RegExp('(' + words.join('|') + ')', 'g');
}
'abcde'.match(['a', 'e']); // => ['a', 'e']
Try this:
var str = "your string";
str = str.split( " " );
for( var i = 0 ; i < str.length ; i++ ){
// your regexp match
}
This is script which I use - it works also with single word searchStrings
var what="test string with search cool word";
var searchString="search word";
var search = new RegExp(searchString, "gi"); // one-word searching
// multiple search words
if(searchString.indexOf(' ') != -1) {
search="";
var words=searchString.split(" ");
for(var i = 0; i < words.length; i++) {
search+="(?=.*" + words[i] + ")";
}
search = new RegExp(search + ".+", "gi");
}
if(search.test(what)) {
// found
} else {
// notfound
}
I assume you are matching words, or parts of words. You want space-separated search terms to limit search results, and it seems you intend to return only those entries which have all the words that the user supplies. And you intend a wildcard character * to stand for 0 or more characters in a matching word.
For example, if the user searches for the words term1 term2, you intend to return only those items which have both words term1 and term2. If the user searches for the word term*, it would match any word beginning with term.
There are suitable regular expressions which are equivalent to this search language and can be generated from it.
A simple example, the word term, can be asserted in regex by converting to \bterm\b. But two or more words which must match in any order require lookahead assertions. Using extended syntax, the equivalent regex is:
(?= .* \b term1 \b )
(?= .* \b term2 \b )
The asterisk wildcard can be asserted in regex with a character class followed by asterisk. The character class identifies which letters you consider to be part of word. For example, you might find that [A-Za-z0-9]* fits the bill.
In short, you might be satisfied if you convert an expression such as:
foo ba* quux
to:
(?= .* \b foo \b )
(?= .* \b ba[A-Za-z0-9]* \b )
(?= .* \b quux \b )
That is a simple matter of search and replace. But do be careful to sanitize the input string to avoid injection attacks by removing punctuation, etc.
I think you may be barking up the wrong tree with RegEx. What you might want to look at is the Levenshtein distance of two input strings.
There's a Javascript implementation here and a usage example here.

Non-capturing group matching whitespace boundaries in JavaScript regex

I have this function that finds whole words and should replace them. It identifies spaces but should not replace them, ie, not capture them.
function asd (sentence, word) {
str = sentence.replace(new RegExp('(?:^|\\s)' + word + '(?:$|\\s)'), "*****");
return str;
};
Then I have the following strings:
var sentence = "ich mag Äpfel";
var word = "Äpfel";
The result should be something like:
"ich mag *****"
and NOT:
"ich mag*****"
I'm getting the latter.
How can I make it so that it identifies the space but ignores it when replacing the word?
At first this may seem like a duplicate but I did not find an answer to this question, that's why I'm asking it.
Thank you
You should put back the matched whitespaces by using a capturing group (rather than a non-capturing one) with a replacement backreference in the replacement pattern, and you may also leverage a lookahead for the right whitespace boundary, which is handy in case of consecutive matches:
function asd (sentence, word) {
str = sentence.replace(new RegExp('(^|\\s)' + word + '(?=$|\\s)'), "$1*****");
return str;
};
var sentence = "ich mag Äpfel";
var word = "Äpfel";
console.log(asd(sentence, word));
See the regex demo.
Details
(^|\s) - Group 1 (later referred to with the help of a $1 placeholder in the replacement pattern): a capturing group that matches either start of string or a whitespace
Äpfel - a search word
(?=$|\s) - a positive lookahead that requires the end of string or whitespace immediately to the right of the current location.
NOTE: If the word can contain special regex metacharacters, escape them:
function asd (sentence, word) {
str = sentence.replace(new RegExp('(^|\\s)' + word.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&') + '(?=$|\\s)'), "$1*****");
return str;
};

Return word before or after a string with newline characters

In short: I want to return the word right before or after a newline character in a string. How would I accomplish that?
I want to return: 1,150 and Svendborg
This is my string:
var newline = /\n/;
var str = "Specialzed Road Expert 2017\nkr.1,150 - Svendborg\n\nSpecialzed"
This will essentially match a whole line with a leading and trailing newline character with groups to match just the first and last "words".
var str = "Specialzed Road Expert 2017\nkr.1,150 - Svendborg\n\nSpecialzed";
var matches = str.match(/\n([^\s]+).*?([^\s]+)\n/);
console.log(matches);
Your words would be in matches[1] and matches[2] with matches[0] being the whole line.

How to ban words with diacritics using a blacklist array and regex?

I have an input of type text where I return true or false depending on a list of banned words. Everything works fine. My problem is that I don't know how to check against words with diacritics from the array:
var bannedWords = ["bad", "mad", "testing", "băţ"];
var regex = new RegExp('\\b' + bannedWords.join("\\b|\\b") + '\\b', 'i');
$(function () {
$("input").on("change", function () {
var valid = !regex.test(this.value);
alert(valid);
});
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type='text' name='word_to_check'>
Now on the word băţ it returns true instead of false for example.
Chiu's comment is right: 'aaáaa'.match(/\b.+?\b/g) yelds quite counter-intuitive [ "aa", "á", "aa" ], because "word character" (\w) in JavaScript regular expressions is just a shorthand for [A-Za-z0-9_] ('case-insensitive-alpha-numeric-and-underscore'), so word boundary (\b) matches any place between chunk of alpha-numerics and any other character. This makes extracting "Unicode words" quite hard.
For non-unicase writing systems it is possible to identify "word character" by its dual nature: ch.toUpperCase() != ch.toLowerCase(), so your altered snippet could look like this:
var bannedWords = ["bad", "mad", "testing", "băţ", "bať"];
var bannedWordsRegex = new RegExp('-' + bannedWords.join("-|-") + '-', 'i');
$(function() {
$("input").on("input", function() {
var invalid = bannedWordsRegex.test(dashPaddedWords(this.value));
$('#log').html(invalid ? 'bad' : 'good');
});
$("input").trigger("input").focus();
function dashPaddedWords(str) {
return '-' + str.replace(/./g, wordCharOrDash) + '-';
};
function wordCharOrDash(ch) {
return isWordChar(ch) ? ch : '-'
};
function isWordChar(ch) {
return ch.toUpperCase() != ch.toLowerCase();
};
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type='text' name='word_to_check' value="ba">
<p id="log"></p>
Let's see what's going on:
alert("băţ".match(/\w\b/));
This is [ "b" ] because word boundary \b doesn't recognize word characters beyond ASCII. JavaScript's "word characters" are strictly [0-9A-Z_a-z], so aä, pπ, and zƶ match \w\b\W since they contain a word character, a word boundary, and a non-word character.
I think the best you can do is something like this:
var bound = '[^\\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]';
var regex = new RegExp('(?:^|' + bound + ')(?:'
+ bannedWords.join('|')
+ ')(?=' + bound + '|$)', 'i');
where bound is a reversed list of all ASCII word characters plus most Latin-esque letters, used with start/end of line markers to approximate an internationalized \b. (The second of which is a zero-width lookahead that better mimics \b and therefore works well with the g regex flag.)
Given ["bad", "mad", "testing", "băţ"], this becomes:
/(?:^|[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe])(?:bad|mad|testing|băţ)(?=[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]|$)/i
This doesn't need anything like ….join('\\b|\\b')… because there are parentheses around the list (and that would create things like \b(?:hey\b|\byou)\b, which is akin to \bhey\b\b|\b\byou\b, including the nonsensical \b\b – which JavaScript interprets as merely \b).
You can also use var bound = '[\\s!-/:-#[-`{-~]' for a simpler ASCII-only list of acceptable non-word characters. Be careful about that order! The dashes indicate ranges between characters.
You need a Unicode aware word boundary. The easiest way is to use XRegExp package.
Although its \b is still ASCII based, there is a \p{L} (or a shorter pL version) construct that matches any Unicode letter from the BMP plane. To build a custom word boundary using this contruct is easy:
\b word \b
---------------------------------------
| | |
([^\pL0-9_]|^) word (?=[^\pL0-9_]|$)
The leading word boundary can be represented with a (non)capturing group ([^\pL0-9_]|^) that matches (and consumes) either a character other than a Unicode letter from the BMP plane, a digit and _ or a start of the string before the word.
The trailing word boundary can be represented with a positive lookahead (?=[^\pL0-9_]|$) that requires a character other than a Unicode letter from the BMP plane, a digit and _ or the end of string after the word.
See the snippet below that will detect băţ as a banned word, and băţy as an allowed word.
var bannedWords = ["bad", "mad", "testing", "băţ"];
var regex = new XRegExp('(?:^|[^\\pL0-9_])(?:' + bannedWords.join("|") + ')(?=$|[^\\pL0-9_])', 'i');
$(function () {
$("input").on("change", function () {
var valid = !regex.test(this.value);
//alert(valid);
console.log("The word is", valid ? "allowed" : "banned");
});
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>
<input type='text' name='word_to_check'>
In stead of using word boundary, you could do it with
(?:[^\w\u0080-\u02af]+|^)
to check for start of word, and
(?=[^\w\u0080-\u02af]|$)
to check for the end of it.
The [^\w\u0080-\u02af] matches any characters not (^) being basic Latin word characters - \w - or the Unicode 1_Supplement, Extended-A, Extended-B and Extensions. This include some punctuation, but would get very long to match just letters. It may also have to be extended if other character sets have to be included. See for example Wikipedia.
Since javascript doesn't support look-behinds, the start-of-word test consumes any before mentioned non-word characters, but I don't think that should be a problem. The important thing is that the end-of-word test doesn't.
Also, putting these test outside a non capturing group that alternates the words, makes it significantly more effective.
var bannedWords = ["bad", "mad", "testing", "băţ", "båt", "süß"],
regex = new RegExp('(?:[^\\w\\u00c0-\\u02af]+|^)(?:' + bannedWords.join("|") + ')(?=[^\\w\\u00c0-\\u02af]|$)', 'i');
function myFunction() {
document.getElementById('result').innerHTML = 'Banned = ' + regex.test(document.getElementById('word_to_check').value);
}
<!DOCTYPE html>
<html>
<body>
Enter word: <input type='text' id='word_to_check'>
<button onclick='myFunction()'>Test</button>
<p id='result'></p>
</body>
</html>
When dealing with characters outside my base set (which can show up at any time), I convert them to an appropriate base equivalent (8bit, 16bit, 32bit). before running any character matching over them.
var bannedWords = ["bad", "mad", "testing", "băţ"];
var bannedWordsBits = {};
bannedWords.forEach(function(word){
bannedWordsBits[word] = "";
for (var i = 0; i < word.length; i++){
bannedWordsBits[word] += word.charCodeAt(i).toString(16) + "-";
}
});
var bannedWordsJoin = []
var keys = Object.keys(bannedWordsBits);
keys.forEach(function(key){
bannedWordsJoin.push(bannedWordsBits[key]);
});
var regex = new RegExp(bannedWordsJoin.join("|"), 'i');
function checkword(word) {
var wordBits = "";
for (var i = 0; i < word.length; i++){
wordBits += word.charCodeAt(i).toString(16) + "-";
}
return !regex.test(wordBits);
};
The separator "-" is there to make sure that unique characters don't bleed together creating undesired matches.
Very useful as it brings all the characters down to a common base that everything can interact with. And this can be re-encoded back to it's original without having to ship it in key/value pair.
For me the best thing about it is that I don't have to know all of the rules for all of the character sets that I might intersect with, because I can pull them all into a common playing field.
As a side note:
To speed things up, rather than passing the large regex statement that you probably have, which takes exponentially longer to pass with the length of the words that you're banning, I would pass each separate word in the sentence through the filter. And break the filter up into length based segments. like;
checkword3Chars();
checkword4Chars();
checkword5chars();
who's functions you can generate systematically and even create on the fly as and when they become required.

RegExp match word till space or character

I'm trying to match all the words starting with # and words between 2 # (see example)
var str = "#The test# rain in #SPAIN stays mainly in the #plain";
var res = str.match(/(#)[^\s]+/gi);
The result will be ["#The", "#SPAIN", "#plain"] but it should be ["#The test#", "#SPAIN", "#plain"]
Extra: would be nice if the result would be without the #.
Does anyone has a solution for this?
You can use
/#\w+(?:(?: +\w+)*#)?/g
See the demo here
The regex matches:
# - a hash symbol
\w+ - one or more alphanumeric and underscore characters
(?:(?: +\w+)*#)? - one or zero occurrence of:
(?: +\w+)* - zero or more occurrences of one or more spaces followed with one or more word characters followed with
# - a hash symbol
NOTE: If there can be characters other than word characters (those in the [A-Za-z0-9_] range), you can replace \w with [^ #]:
/#[^ #]+(?:(?: +[^ #]+)*#)?/g
See another demo
var re = /#[^ #]+(?:(?: +[^ #]+)*#)?/g;
var str = '#The test-mode# rain in #SPAIN stays mainly in the #plain #SPAIN has #the test# and more #here';
var m = str.match(re);
if (m) {
// Using ES6 Arrow functions
m = m.map(s => s.replace(/#$/g, ''));
// ES5 Equivalent
/*m = m.map(function(s) {
return s.replace(/#$/g, '');
});*/ // getting rid of the trailing #
document.body.innerHTML = "<pre>" + JSON.stringify(m, 0, 4) + "</pre>";
}
You can also try this regex.
#(?:\b[\s\S]*?\b#|\w+)
(?: opens a non capture group for alternation
\b matches a word boundary
\w matches a word character
[\s\S] matches any character
See demo at regex101 (use with g global flag)

Categories