How to search for accented characters in mongodb collection using nodejs

MongoDB treats É and E as two different characters, so a search for E does not find É.
Is there a way to make MongoDB treat them as the same character?
I am running:
var find = Users.find();
var re = new RegExp(name, 'i');
find.where('info.name').equals(re);
How do I match strings containing accented characters and get the result?

MongoDB did not support this when this answer was written (versions from 3.4 onward add collation options that can ignore diacritics in string comparisons). One way around it is to store an extra field in each document containing the simple form of each name, in lowercase.
{
  info: {"name": "Éva", "search": "eva"}
}
{
  info: {"name": "Eva", "search": "eva"}
}
With your documents structured like this, you have some advantages.
You can create an index over the search field,
db.user.createIndex({"info.search": 1})
and fire a simple query to find the match. When you search for a particular term, convert that term to its simple form and to lowercase, and then do a find.
User.find({"info.search": "eva"});
This also makes use of the index, which a regex query like the one in the question generally would not do efficiently.
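One way to produce the simple form is Unicode normalization. This is only a sketch: toSearchKey is a hypothetical helper name (not part of MongoDB or Mongoose), and it assumes a runtime that supports String.prototype.normalize. Letters that do not decompose into a base letter plus a combining mark (for example ø or ß) pass through unchanged, so they would need their own mapping.

```javascript
// Hypothetical helper: reduce a name to a lowercase, accent-free search key.
function toSearchKey(name) {
  return name
    .normalize("NFD")                 // split "É" into "E" plus a combining accent
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining diacritical marks
    .toLowerCase();
}

console.log(toSearchKey("Éva")); // eva
console.log(toSearchKey("Eva")); // eva
```

The same function would be applied both when saving a document (to fill the search field) and to the user's search term before querying.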
See Also: Mongodb match accented characters as underlying character
If you want to do it the hard way, which is not recommended, here it is for the record.
You need a mapping between the plain letters and their possible accented forms. For example:
var map = {"A":"[AÀÁÂÃÄÅ]"};
Say the search term is a, but the database document has its accented form; then you need to build a dynamic regex yourself before passing it to the find() query.
var searchTerm = "a".toUpperCase();
var term = [];
for (var i = 0; i < searchTerm.length; i++) {
  var char = searchTerm.charAt(i);
  // Fall back to the character itself when the map has no entry for it
  var reg = map[char] || char;
  term.push(reg);
}
// The "i" flag is there so that lowercase letters can match too
var regexp = new RegExp(term.join(""), "i");
User.find({"info.name": {$regex: regexp}});
Note that this example can handle a search term of length > 1 too.
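The same idea can be wrapped in a reusable function. The sketch below is an illustration, not a complete solution: ACCENT_CLASSES and accentInsensitiveRegExp are hypothetical names, the map covers only a handful of vowels, and characters without an entry are escaped so that they are matched literally.

```javascript
// Hypothetical, deliberately incomplete map from base letters to character classes.
const ACCENT_CLASSES = {
  a: "[aàáâãäå]",
  e: "[eèéêë]",
  i: "[iìíîï]",
  o: "[oòóôõö]",
  u: "[uùúûü]",
};

// Build a case-insensitive pattern that matches the term with or without accents.
function accentInsensitiveRegExp(term) {
  const source = Array.from(term.toLowerCase(), (ch) =>
    // Use the class when one exists; otherwise escape the character literally.
    ACCENT_CLASSES[ch] || ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
  ).join("");
  return new RegExp(source, "i");
}

console.log(accentInsensitiveRegExp("eva").test("\u00c9va")); // true ("\u00c9" is É)
console.log(accentInsensitiveRegExp("eva").test("Ava"));      // false
```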

Related

Regex to match #word [duplicate]

I am writing an application in Node.js that allows users to mention each other in messages, like on Twitter. I want to be able to find the user and send them a notification. To do this, I need to pull #usernames out of a string to find mentions in Node.js.
Any advice, regexes, or pitfalls?

I have found that this is the best way to find mentions inside a string in JavaScript.
var str = "#jpotts18 what is up man? Are you hanging out with #kyle_clegg";
var pattern = /\B#[a-z0-9_-]+/gi;
str.match(pattern);
// ["#jpotts18", "#kyle_clegg"]
I have purposely restricted it to upper- and lowercase alphanumerics and the (-, _) symbols, so that periods are not mistaken for part of a username, as in (#j.potts).
This is what twitter-text.js is doing behind the scenes.
// Mention related regex collection
twttr.txt.regexen.validMentionPrecedingChars = /(?:^|[^a-zA-Z0-9_!#$%&*#＠]|RT:?)/;
twttr.txt.regexen.atSigns = /[#＠]/;
twttr.txt.regexen.validMentionOrList = regexSupplant(
'(#{validMentionPrecedingChars})' + // $1: Preceding character
'(#{atSigns})' + // $2: At mark
'([a-zA-Z0-9_]{1,20})' + // $3: Screen name
'(\/[a-zA-Z][a-zA-Z0-9_\-]{0,24})?' // $4: List (optional)
, 'g');
twttr.txt.regexen.endMentionMatch = regexSupplant(/^(?:#{atSigns}|[#{latinAccentChars}]|:\/\/)/);
Please let me know if you have used anything that is more efficient, or accurate. Thanks!

Twitter has a library that you should be able to use for this. https://github.com/twitter/twitter-text-js.
I haven't used it, but according to its description, "the library provides autolinking and extraction for URLs, usernames, lists, and hashtags." You should be able to use it in Node with npm install twitter-text.
While I understand that you're not looking for Twitter usernames, the same logic still applies and you should be able to use it fine (it does not validate that extracted usernames are valid twitter usernames). If not, forking it for your own purposes may be a very good place to start.
Edit: I looked at the docs closer, and there is a perfect example of what you need right here.
var usernames = twttr.txt.extractMentions("Mentioning #twitter and #jack")
// usernames == ["twitter", "jack"]

Here is how you extract mentions from an Instagram caption with JavaScript and underscore.
var _ = require('underscore');
function parseMentions(text) {
  var mentionsRegex = new RegExp('#([a-zA-Z0-9_.]+)', 'gim');
  var matches = text.match(mentionsRegex);
  if (matches && matches.length) {
    // Strip the leading # from each match
    matches = matches.map(function(match) {
      return match.slice(1);
    });
    return _.uniq(matches);
  } else {
    return [];
  }
}

I would respect names with diacritics, or characters from any language, by using \p{L}.
/(?<=^| )#\p{L}+/gu
Example on Regex101.com with description.
PS:
Don't use \B, since it will match ##wrong.
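A quick way to see the difference between the two patterns is to run both on the same input. This is a small sketch; it assumes a JavaScript engine with lookbehind and Unicode property escape support (reasonably recent Node.js versions have both), and the sample text is made up.

```javascript
// "\u00e9" is "é"; the escape keeps the example independent of file encoding.
const text = "Hi #\u00e9va, meet #bob and ##wrong";

// Lookbehind version: the # must be at the start or directly after a space.
const mentions = text.match(/(?<=^| )#\p{L}+/gu) || [];
console.log(mentions); // [ '#éva', '#bob' ]

// \B version: also picks up the tail of "##wrong".
const loose = "##wrong".match(/\B#[a-z0-9_-]+/gi) || [];
console.log(loose); // [ '#wrong' ]
```

Note that \p{L} covers letters only, so a name such as #kyle_clegg would be cut off at the underscore unless digits and underscores are added to the pattern.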

How to find any of the specific characters exists in a string

I'm looking for a way to check whether any of a given set of characters exists in a string. That is, if any of the given characters is present in the string, it should return true.
Right now I am doing it with arrays and loops, but honestly I feel it's not a good way. Is there an easier way without an array or a loop?
var special = ['$', '%', '#'];
var mystring = ' using it to replace VLOOKUP entirely.$ But there are still a few lookups that you are not sure how to perform. Most importantly, you would like to be able to look up a value based on multiple criteria within separate columns.';
var exists = false;
$.each(special, function(index, item) {
  if (mystring.indexOf(item) >= 0) {
    exists = true;
  }
});
console.info(exists);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>

Try with a regex:
var patt = /[$%#]/;
console.log(patt.test("using it to replace VLOOKUP entirely.$ But there are still a few lookups that you are not sure how to perform. Most importantly, you would like to be able to look up a value based on multiple criteria within separate columns."));

Be aware that [x] in a regex matches single characters only.
If you wanted to search for, say, replace, then [replace] would match any string containing one of the characters r, e, p, l, a, c.
Another thing to be aware of with regexes is escaping. Using the simple escape regex from "Is there a RegExp.escape function in Javascript?", I've made a more generic find-in-string.
Of course, you asked about given characters in a string, so this is more of an addendum for anyone finding this post on SO. Looking at the array of strings in your original question, it would be easy for people to think they could just pass that to the regex. IOW: your question wasn't "how can I find out if $, %, # exist in a string".
var mystring = ' using it to replace VLOOKUP entirely.$ But there are still a few lookups that you are not sure how to perform. Most importantly, you would like to be able to look up a value based on multiple criteria within separate columns.';
function makeStrSearchRegEx(findlist) {
  return new RegExp('(' + findlist.map(
    s => s.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&')).join('|') + ')');
}
var re = makeStrSearchRegEx(['$', '%', '#', 'VLOOKUP']);
console.log(re.test(mystring)); //true
console.log(re.test('...VLOOKUP..')); //true
console.log(re.test('...LOOKUP..')); //false
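To make the character-class caveat from this answer concrete, here is a small sketch comparing a character class with an alternation of whole strings. The variable names are only for illustration.

```javascript
// A character class tests single characters, not the word spelled inside it.
const charClass = /[replace]/;
console.log(charClass.test("cat"));     // true: "c" and "a" are in the class
console.log(charClass.test("xyz"));     // false: none of r, e, p, l, a, c

// An alternation of (escaped) strings matches whole substrings instead.
const wholeStrings = /(?:replace|\$)/;
console.log(wholeStrings.test("cat"));        // false
console.log(wholeStrings.test("to replace")); // true
console.log(wholeStrings.test("costs $5"));   // true
```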

The best way is to use regular expressions. You can read more about it here.
In your case you should do something like this:
const specialCharacters = /[$%#]/;
const myString = ' using it to replace VLOOKUP entirely.$ But there are still a few lookups that you are not sure how to perform. Most importantly, you would like to be able to look up a value based on multiple criteria within separate columns.';
if (specialCharacters.test(myString)) {
  console.info("Exists...");
}
Please note that it is good practice to store a regular expression in a variable, to avoid creating the regular expression (which is not the fastest operation) each time you use it.

Mongoose.js and the query object with Regexp

In a locomotive app, in a search engine that queries my models (with rather specific metadata that I won't go into here for the usual reasons) I need to include a regexp engine to check against the keywords field.
My approach is as follows:
this.keywords = strings.makeSafe(this.param("keywords")).toLowerCase();
console.log(this.keywords);
if (strings.exists(this.keywords)) {
  keywords = this.keywords.split(", ");
  var len = keywords.length - 1;
  do {
    query.regex("/" + this.keywords[len] + "/ig", "keywords");
  } while (len--);
}
(It's this.keywords so that I can pass it to the view should I need to.)
However, I'm not matching data that I know to be available in the documents in the collection.
The strings.makeSafe call simply does this:
strings.makeSafe = function(str) {
  str = String(str);
  var re = /\$/gi;
  str = str.replace(re, "U+FF04");
  re = /\./gi;
  return str.replace(re, "U+FF08");
};
and is an attempt to deal with Mongoose's vulnerability to code injection via the "." and "$" characters. It's been tested and shouldn't be driving the issue.
I'm of the mind right now that it's something to do with the structure of the regexp or the calling method. Is this the correct syntax to accomplish a search on a comma-separated list of keywords in Mongoose?

There are a few problems with your approach:
Assuming that query is a Mongoose Query object, you've got the order swapped for the parameters to the regex method as the path comes first, then the regex value.
You need to construct the regular expression using the RegExp constructor function as the literal notation with the / chars can't be dynamically constructed from strings.
Calling query.regex in a loop like that doesn't OR the conditions together as each call simply overwrites the previous one. Instead, you can join the keywords together into a combined regex that matches any of them by using |.
Putting it all together it should be something like:
this.keywords = strings.makeSafe(this.param("keywords")).toLowerCase();
console.log(this.keywords);
if (strings.exists(this.keywords)) {
  keywords = this.keywords.split(", ");
  query.regex("keywords", new RegExp(keywords.join("|"), "i"));
}
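Because the keywords come from user input, it may also be worth escaping regex metacharacters before joining them. The following is a standalone sketch of that step only; escapeRegExp and buildKeywordRegExp are hypothetical helper names, not Mongoose APIs, and splitting on a bare comma with trimming is an assumption about the input format.

```javascript
// Escape regex metacharacters so a keyword is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: one case-insensitive regex that matches any keyword.
function buildKeywordRegExp(input) {
  const keywords = input
    .split(",")
    .map((keyword) => keyword.trim())
    .filter((keyword) => keyword.length > 0)
    .map(escapeRegExp);
  // With no keywords, return null rather than a regex that matches everything.
  return keywords.length > 0 ? new RegExp(keywords.join("|"), "i") : null;
}

const keywordRe = buildKeywordRegExp("node.js, mongo");
console.log(keywordRe.test("Learning MongoDB")); // true
console.log(keywordRe.test("nodeXjs"));          // false, the dot is literal
```

The resulting regex could then be passed as the second argument of query.regex("keywords", ...) as in the code above, skipping the call when the helper returns null.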

Regular Expression with multiple words (in any order) without repeat

I'm trying to execute a search of sorts (using JavaScript) on a list of strings. Each string in the list has multiple words.
A search query may also include multiple words, but the ordering of the words should not matter.
For example, on the string "This is a random string", the query "trin and is" should match. However, these terms cannot overlap. For example, "random random" as a query on the same string should not match.
I'm going to be sorting the results based on relevance, but I should have no problem doing that myself, I just can't figure out how to build up the regular expression(s). Any ideas?

The query trin and is becomes the following regular expression:
/trin.*(?:and.*is|is.*and)|and.*(?:trin.*is|is.*trin)|is.*(?:trin.*and|and.*trin)/
In other words, don't use regular expressions for this.

It probably isn't a good idea to do this with just a regular expression. A (pure, computer science) regular expression "can't count". The only "memory" it has at any point is the state of the DFA. To match multiple words in any order without repeats you'd need on the order of 2^n states, so probably a really horrible regex.
(Aside: I mention "pure, computer science" regular expressions because most implementations are actually an extension, and let you do things that are non-regular. I'm not aware of any extensions, certainly none in JavaScript, that make what you want to do any less painful with a single pattern.)
A better approach would be to keep a dictionary (Object, in JavaScript) that maps from words to counts. Initialize it to your set of words with the appropriate counts for each. You can use a regular expression to match words, and then for each word you find, decrement the corresponding entry in the dictionary. If the dictionary contains any non-zero values at the end, or if somewhere along the way you try to over-decrement a value (or decrement one that doesn't exist), then you have a failed match.
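The counting approach described above could look roughly like this. It is a sketch of a variant, under two assumptions the answer does not spell out: words are compared whole and case-insensitively (so the substring query "trin" would not match "string"), and words in the target that are not in the query are simply ignored rather than treated as a failure. matchesAllWords is a hypothetical name.

```javascript
// Hypothetical helper: true when every query word occurs in the target as a
// whole word at least as many times as it occurs in the query.
function matchesAllWords(target, query) {
  // Count how many times each query word still has to be found.
  const remaining = new Map();
  for (const word of query.toLowerCase().split(/\s+/).filter(Boolean)) {
    remaining.set(word, (remaining.get(word) || 0) + 1);
  }
  // Walk the target's words and use each one to satisfy at most one request.
  for (const word of target.toLowerCase().match(/\S+/g) || []) {
    const count = remaining.get(word);
    if (count > 0) remaining.set(word, count - 1);
  }
  // The match succeeds only if nothing is left to find.
  return [...remaining.values()].every((count) => count === 0);
}

console.log(matchesAllWords("This is a random string", "random is"));     // true
console.log(matchesAllWords("This is a random string", "random random")); // false
```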

I'm totally not sure if I get you right there, so I'll just post my suggestion for it.
var query = "trin and is",
    target = "This is a random string",
    search = { },
    matches = 0;
query.split( /\s+/ ).forEach(function( word ) {
    search[ word ] = true;
});
Object.keys( search ).forEach(function( word ) {
    matches += +new RegExp( word ).test( target );
});
// do something useful with "matches" for the query, should be "3"
alert( matches );
So, the variable matches will contain the number of unique matches for the query. The first split loop just makes sure that no "doubles" are counted, since we would overwrite the entry in our search object. The second loop checks for the individual words within the target string and uses the nifty + to cast the result (either true or false) into a number, hence +1 on a match or +0.

I was looking for a solution to this issue and none of the solutions presented here was good enough, so this is what I came up with:
function filterMatch(itemStr, keyword){
  var words = keyword.split(' '), i = 0, w, reg;
  for(; w = words[i++] ;){
    reg = new RegExp(w, 'ig');
    if (reg.test(itemStr) === false) return false; // word not found
    itemStr = itemStr.replace(reg, ''); // remove matched word from original string
  }
  return true;
}
// test
filterMatch('This is a random string', 'trin and is'); // true
filterMatch('This is a random string', 'trin not is'); // false
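A possible variation on filterMatch, sketched under a few assumptions of mine: query words should be matched literally (so metacharacters are escaped), each query word should consume only one occurrence (the 'ig' replace above removes every occurrence, so a query like "is is" fails even though the string contains "is" twice), and the removed text is replaced by a space so the pieces on either side cannot join into a new match. filterMatchOnce and escapeRegExp are hypothetical names.

```javascript
// Escape regex metacharacters so a query word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Each query word must consume its own, distinct stretch of the item string.
function filterMatchOnce(itemStr, keyword) {
  let rest = itemStr;
  for (const word of keyword.split(/\s+/).filter(Boolean)) {
    const reg = new RegExp(escapeRegExp(word), "i"); // no "g": first hit only
    if (!reg.test(rest)) return false;               // word not found
    rest = rest.replace(reg, " ");                   // consume that one hit
  }
  return true;
}

console.log(filterMatchOnce("This is a random string", "trin and is"));   // true
console.log(filterMatchOnce("This is a random string", "random random")); // false
console.log(filterMatchOnce("This is a random string", "is is"));         // true
```

The matching is greedy in query order, so a short word can still use up text that a later, longer word needed; trying longer words first would reduce that, but it is not handled here.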
