How to pull # mentions out of strings like Twitter in JavaScript

I am writing an application in Node.js that allows users to mention each other in messages, like on Twitter. I want to be able to find the mentioned user and send them a notification. To do this, I need to pull #usernames out of a string in Node.js.
Any advice, regex, problems?

This is the best way I have found to find mentions inside a string in JavaScript.
var str = "#jpotts18 what is up man? Are you hanging out with #kyle_clegg";
var pattern = /\B#[a-z0-9_-]+/gi;
str.match(pattern);
// ["#jpotts18", "#kyle_clegg"]
I have purposely restricted it to upper- and lowercase alphanumeric characters plus the - and _ symbols, so that periods are not mistaken for part of a username, as in (#j.potts).
This is what twitter-text.js is doing behind the scenes.
// Mention related regex collection
twttr.txt.regexen.validMentionPrecedingChars = /(?:^|[^a-zA-Z0-9_!#$%&*#@]|RT:?)/;
twttr.txt.regexen.atSigns = /[#@]/;
twttr.txt.regexen.validMentionOrList = regexSupplant(
'(#{validMentionPrecedingChars})' + // $1: Preceding character
'(#{atSigns})' + // $2: At mark
'([a-zA-Z0-9_]{1,20})' + // $3: Screen name
'(\/[a-zA-Z][a-zA-Z0-9_\-]{0,24})?' // $4: List (optional)
, 'g');
twttr.txt.regexen.endMentionMatch = regexSupplant(/^(?:#{atSigns}|[#{latinAccentChars}]|:\/\/)/);
Please let me know if you have used anything that is more efficient, or accurate. Thanks!

Twitter has a library that you should be able to use for this. https://github.com/twitter/twitter-text-js.
I haven't used it, but according to its description, "the library provides autolinking and extraction for URLs, usernames, lists, and hashtags." You should be able to use it in Node with npm install twitter-text.
While I understand that you're not looking for Twitter usernames, the same logic still applies and you should be able to use it fine (it does not validate that extracted usernames are valid Twitter usernames). If not, forking it for your own purposes may be a very good place to start.
Edit: I looked at the docs more closely, and there is a perfect example of what you need right here.
var usernames = twttr.txt.extractMentions("Mentioning #twitter and #jack")
// usernames == ["twitter", "jack"]

Here is how you can extract mentions from an Instagram caption with JavaScript and underscore.
var _ = require('underscore');
function parseMentions(text) {
  // Match "#" followed by letters, digits, underscores, or periods
  var mentionsRegex = new RegExp('#([a-zA-Z0-9_.]+)', 'gim');
  var matches = text.match(mentionsRegex);
  if (matches && matches.length) {
    // Drop the leading "#" from each match
    matches = matches.map(function(match) {
      return match.slice(1);
    });
    return _.uniq(matches);
  } else {
    return [];
  }
}
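For comparison, the same deduplication can be sketched without underscore by collecting the captured names in a Set. This is only a sketch, not the answer's original code: the function name parseMentionsNoDeps is made up for illustration, and it assumes a runtime that supports String.prototype.matchAll.

```javascript
// Hypothetical dependency-free variant of parseMentions (name is illustrative).
function parseMentionsNoDeps(text) {
  // Capture the name after "#": letters, digits, underscores, or periods
  const mentionsRegex = /#([a-zA-Z0-9_.]+)/g;
  const names = new Set();
  for (const match of text.matchAll(mentionsRegex)) {
    names.add(match[1]); // match[1] is the name without the leading "#"
  }
  // A Set keeps insertion order, so the first occurrence of each name wins
  return Array.from(names);
}

console.log(parseMentionsNoDeps("#jpotts18 hi #kyle_clegg and #jpotts18 again"));
```

Using the capture group avoids the separate slice(1) step, at the cost of requiring a newer JavaScript engine.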

I would also respect names with diacritics, or letters from any language, by using \p{L}.
/(?<=^| )#\p{L}+/gu
Example on Regex101.com with description.
PS:
Don't use \B, since it will also match the second # in ##wrong.
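A minimal sketch of how a lookbehind pattern like this might be used to collect the mentioned names. The function name findMentions is hypothetical, and the widened character set (digits via \p{N}, the underscore, and any whitespace via \s before the #) is an assumption added for illustration. Lookbehind and Unicode property escapes both need a reasonably recent JavaScript engine.

```javascript
// Hypothetical helper: "#" must sit at the start of the text or after whitespace.
const mentionPattern = /(?<=^|\s)#[\p{L}\p{N}_]+/gu;

function findMentions(text) {
  // match() returns null when nothing matches, so fall back to an empty list
  return (text.match(mentionPattern) || []).map((m) => m.slice(1));
}

console.log(findMentions("#éva hi ##wrong and a#b #kyle_clegg"));
```

In this sample, ##wrong and a#b are skipped because neither # is followed by a name while also being preceded by whitespace or the start of the string.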

Related

How do I parse a URL after searching?

I'm trying to parse a specific part of a URL after a search, using
any language (ideally JavaScript, but I'm open to Python).
How do I get a specific part of a URL and save/store it?
For example,
on songkick.com,
the way to get the artist_id is to check a specific part of the URL after searching for the artist name
in the search bar of the website.
In the case below,
the artist id is 301329.
https://www.songkick.com/artists/301329-rac
I strongly believe there is a way to parse this part using either Python or JS,
given that I have a CSV file with the artist names in one column. Instead of searching for all the artists one by one, I am looking for an algorithm that iterates over my CSV column, searches for each name, parses the URL, and saves/stores the ID.
I would be very grateful for even a hint to start with.
Thank you so much always.
It can be done using regular expressions.
Here's an example of a JavaScript implementation:
const url = "https://www.songkick.com/artists/301329-rac";
const regex = /https:\/\/www\.songkick\.com\/artists\/(\d+)-.+/;
const match = url.match(regex);
if (match) {
  console.log('Artist ID: ' + match[1]);
} else {
  console.log('No Artist ID found!');
}
The regular expression /https:\/\/www\.songkick\.com\/artists\/(\d+)-.+/ matches a string that contains https://www.songkick.com/artists/, followed by a group of digits, a dash, and then one or more of any character.
The match() method retrieves the result of matching a string against a regular expression.
For a regex without the g flag, it returns the whole matched string at index 0, followed by the captured (\d+) group at index 1 (match[1] in our case).
If you're not sure of the protocol (http vs https), you can add a ? in the regex right after https. That makes the s in https optional, so the regex becomes /https?:\/\/www\.songkick\.com\/artists\/(\d+)-.+/.
Let me know if you need more explanation.
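To connect this to the "many artists" part of the question, here is a small sketch that wraps the regex in a function and maps it over a list of URLs. The function name extractArtistId and the sample list are made up for illustration. How the URLs are obtained (for example from the CSV plus a search step) is outside this sketch.

```javascript
// Hypothetical helper: returns the numeric ID as a string, or null if the URL does not match.
function extractArtistId(url) {
  const match = /^https?:\/\/www\.songkick\.com\/artists\/(\d+)-/.exec(url);
  return match ? match[1] : null;
}

// Illustrative input only: the same artist page over https and http, plus a non-artist URL
const urls = [
  "https://www.songkick.com/artists/301329-rac",
  "http://www.songkick.com/artists/301329-rac",
  "https://www.songkick.com/search?query=rac",
];

console.log(urls.map(extractArtistId));
```

Returning null for non-matching URLs lets the caller decide whether to skip or log those rows.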
Alternatively, you can simply use a regex in Python:
import re
url = 'https://www.songkick.com/artists/301329-rac'
# Raw string so that \d and \w reach the regex engine unchanged
pattern = r'/artists/(\d+)-\w'
match = re.search(pattern, url)
if match:
    artist_id = match.group(1)
I hope this will help you.

How do I write a regular expression that splits on some tokens but not on others?

Hi, I am not good at using regex,
so I have a question.
I want to split text on specific tokens.
The token list is ". " and "? ".
I also want to exclude a specific word from being split.
The excluded word list is 'Mr.'.
e.g. Mr. Smith bought this. and me too. -> ["Mr. Smith bought this.", "and me too."]
I want to split this text using a (JavaScript) regex.
How can I do this?
The following is a simple regular expression matching the fixed constraints that you have provided. However, I suspect it might not be that usable in the end, especially if you intend to use dynamic split/ignore lists (which would mean building the regex pattern dynamically). In any case, I hope the pattern itself is a useful learning example.
var example = "Mr. Smith bought this. and me too.";
var regexp = /((Mr\.)|[^.?]+?)*[.?]/gi;
var result = [];
var captures;
while ((captures = regexp.exec(example)) != null) {
  // Sentences after the first keep their leading space; trim if needed
  result.push(captures[0]);
}
console.log(result);
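For the dynamic split/ignore lists mentioned above, one possible approach is to build a splitting regex from the two lists with lookbehind. This is a sketch under assumptions, not the answer's method: the names escapeRegExp and buildSplitter are hypothetical, the terminators are given without their trailing space (the regex splits on the whitespace that follows them), each exception is assumed to start with a word character, and the engine must support lookbehind.

```javascript
// Escape characters that have a special meaning inside a regular expression
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical builder: split on whitespace that follows a terminator,
// unless the text right before that whitespace is one of the exceptions.
function buildSplitter(terminators, exceptions) {
  const ends = terminators.map(escapeRegExp).join("|");
  const guards = exceptions
    .map((e) => `(?<!\\b${escapeRegExp(e)})`)
    .join("");
  return new RegExp(`(?<=${ends})${guards}\\s+`, "g");
}

const splitter = buildSplitter([".", "?"], ["Mr."]);
console.log("Mr. Smith bought this. and me too.".split(splitter));
```

Because the split happens on the whitespace itself, the resulting pieces do not carry a leading space.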

How to search for accented characters in a MongoDB collection using Node.js

MongoDB treats É and E as two separate things, so when I search for E it will not find É.
Is there a way to make MongoDB think of them as the same thing?
I am running
var find =Users.find();
var re = new RegExp(name, 'i');
find.where('info.name').equals(re);
How do I match for strings containing accented characters and get the result?
MongoDB's regex matching does not support this, and I doubt it will in the near future. To work around it, store an extra field in each document that holds the simple (unaccented, lowercase) form of each name.
{
info: {"name": "Éva", "search": "eva"}
}
{
info: {"name": "Eva", "search": "eva"}
}
With this document structure you gain a couple of advantages.
You can create an index on the search field,
db.user.createIndex({"info.search": 1})
and run a simple query to find the match. When you search for a term, convert it to its simple, lowercase form and then do a find.
User.find({"info.search": "eva"});
This makes use of the index as well, which an unanchored or case-insensitive regex query cannot do efficiently.
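The "convert to its simple form" step can be sketched with Unicode normalization: decompose each character, strip the combining marks, and lowercase the result. The function name toSearchKey is hypothetical. Letters that do not decompose into a base letter plus a mark (for example ø or ß) are left unchanged by this approach.

```javascript
// Hypothetical helper: produce the unaccented, lowercase form used for the search field.
function toSearchKey(name) {
  return name
    .normalize("NFD")        // split "É" into "E" + combining acute accent
    .replace(/\p{M}/gu, "")  // remove the combining marks
    .toLowerCase();
}

console.log(toSearchKey("Éva"), toSearchKey("Eva"));
```

The same function would be applied both when writing the document and when preparing the user's search term, so that the two sides always agree.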
See Also: Mongodb match accented characters as underlying character
If you want to do it the hard way, which is not recommended, here it is for the record.
You need a mapping between the plain letters and their possible accented forms. For example:
var map = {"A":"[AÀÁÂÃÄÅ]"};
Say the search term is a, but the database document holds an accented form. You then need to build a dynamic regex yourself before passing it to the find() query.
var searchTerm = "a".toUpperCase();
var term = [];
for (var i = 0; i < searchTerm.length; i++) {
  var char = searchTerm.charAt(i);
  // Fall back to the character itself when the map has no entry for it
  var reg = map[char] || char;
  term.push(reg);
}
var regexp = new RegExp(term.join(""));
User.find({"info.name":{$regex:regexp}})
Note that this example can handle search terms longer than one character too.
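A slightly fuller sketch of the same mapping idea, runnable without a database so the generated pattern can be checked on plain strings. The names accentMap, escapeRegExp, and buildAccentRegex are hypothetical, the map covers only a few vowels for illustration, and the i flag is an added assumption so that both cases match.

```javascript
// Illustrative, deliberately incomplete map from plain vowels to accented variants
const accentMap = {
  a: "[aàáâãäå]",
  e: "[eèéêë]",
  i: "[iìíîï]",
  o: "[oòóôõö]",
  u: "[uùúûü]",
};

// Escape characters that have a special meaning inside a regular expression
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace mapped letters with a character class and keep every other character literal
function buildAccentRegex(term) {
  const source = Array.from(term.toLowerCase(), (ch) => accentMap[ch] || escapeRegExp(ch)).join("");
  return new RegExp(source, "i");
}

const re = buildAccentRegex("eva");
console.log(re.test("Éva"), re.test("Ivo"));
```

Escaping the unmapped characters keeps a search term such as "a.b" from turning its dot into a wildcard.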

Regex lookbehind workaround for Javascript?

I am terrible at regex so I will communicate my question a bit unconventionally in the name of trying to better describe my problem.
var TheBadPattern = /(\d{2}:\d{2}:\d{2},\d{3})/;
var TheGoodPattern = /([a-zA-Z0-9\-,.;:'"])(?:\r\n?|\n)([a-zA-Z0-9\-])/gi;
// My goal is to then do this
inputString = inputString.replace(TheGoodPattern, '$1 $2');
Question: I want to match all the good patterns and do the subsequent find/replace UNLESS they are preceded by the bad pattern. Any ideas on how? I was able to accomplish this in other languages that support lookbehind, but I am at a loss without it. (PS: from what I understand, JS does not support lookahead/lookbehind or, if you prefer, '?<!', '?<='.)
JavaScript does support lookaheads. And since you only need a lookbehind (and not a lookahead, too), there is a workaround (which doesn't really aid the readability of your code, but it works!). So what you can do is reverse both the string and the pattern.
inputString = inputString.split("").reverse().join("");
var pattern = /([a-z0-9\-])(?:\n\r?|\r)([a-z0-9\-,.;:'"])(?!\d{3},\d{2}:\d{2}:\d{2})/gi;
inputString = inputString.replace(pattern, '$1 $2');
inputString = inputString.split("").reverse().join("");
Note that you had redundantly listed the upper case letters (the i modifier already takes care of them).
I would actually test it for you if you supplied some example input.
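To make the reversal trick concrete, here is a small self-contained sketch on a simpler pattern: find good followed by a digit unless it is preceded by "bad ". The function names and the sample sentence are made up for illustration. The second function uses a native negative lookbehind, which newer JavaScript engines (ES2018 and later) support, so the two results can be compared.

```javascript
// Reverse a string by code point
const reverse = (s) => Array.from(s).reverse().join("");

// Reversal workaround: the lookbehind (?<!bad ) becomes the lookahead (?! dab)
function findWithReversal(text) {
  const reversed = reverse(text);
  const pattern = /\ddoog(?! dab)/g; // reversed form of /good\d/
  // Un-reverse each match, then restore the original left-to-right order
  return (reversed.match(pattern) || []).map(reverse).reverse();
}

// Native negative lookbehind, for engines that support it
function findWithLookbehind(text) {
  return text.match(/(?<!bad )good\d/g) || [];
}

const sample = "find good1 but skip bad good2 and keep good3";
console.log(findWithReversal(sample));
console.log(findWithLookbehind(sample));
```

Both calls should report the same matches. The reversed pattern has to be written by hand, which is where the approach gets tricky for complicated expressions.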
I have also used the reverse methodology recommended by m.buettner, and it can get pretty tricky depending on your patterns. I find that workaround works well if you are matching simple patterns or strings.
With that said I thought I would go a bit outside the box just for fun. This solution is not without its own foibles, but it also works and it should be easy to adapt to existing code with medium to complicated regular expressions.
http://jsfiddle.net/52QBx/
js:
function negativeLookBehind(lookBehindRegExp, matchRegExp, modifiers)
{
    var text = $('#content').html();
    var badGoodRegex = regexMerge(lookBehindRegExp, matchRegExp, modifiers);
    // match() returns null when nothing matches, so fall back to an empty list
    var badGoodMatches = text.match(badGoodRegex) || [];
    var placeHolderMap = {};
    // Temporarily hide every "lookbehind + match" occurrence behind a placeholder
    for(var i = 0;i<badGoodMatches.length;i++)
    {
        var match = badGoodMatches[i];
        var placeHolder = "${item"+i+"}";
        placeHolderMap[placeHolder] = match;
        $('#content').html($('#content').html().replace(match, placeHolder));
    }
    // Search the masked text for the pattern we actually want
    text = $('#content').html();
    var goodRegex = matchRegExp;
    var goodMatches = text.match(goodRegex);
    // Restore the hidden occurrences
    for(var prop in placeHolderMap)
    {
        $('#content').html($('#content').html().replace(prop, placeHolderMap[prop]));
    }
    return goodMatches;
}
function regexMerge(regex1, regex2, modifiers)
{
    /*this whole concept could be its own beast, so I just asked to have modifiers for the combined expression passed in rather than determined from the two regexes passed in.*/
    return new RegExp(regex1.source + regex2.source, modifiers);
}
var result = negativeLookBehind(/(bad )/gi, /(good\d)/gi, "gi");
alert(result);
html:
<div id="content">Some random text trying to find good1 text but only when that good2 text is not preceded by bad text so bad good3 should not be found bad good4 is a bad oxymoron anyway.</div>
The main idea is to find all the combined patterns (the lookbehind followed by the real match) and temporarily remove those from the text being searched. I used a map because the hidden values can vary, so each replacement has to be reversible. Then we can run just the regex for the items you really want to find, without the ones that would have matched the lookbehind getting in the way. After the results are determined, we swap the original items back in and return the results. It is a quirky, yet functional, workaround.
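The same masking idea can be sketched on a plain string, without jQuery or the DOM. This is a simplified variant rather than the code above: because it works on a copy of the text, nothing has to be swapped back afterwards. The function name matchNotPrecededBy is hypothetical, and the sketch assumes that the flags include g and that the text never contains the \u0000 placeholder character.

```javascript
// Hypothetical helper: return matches of `target` that are not directly preceded by `lookBehind`.
function matchNotPrecededBy(text, lookBehind, target, flags = "g") {
  // Combined pattern: the unwanted prefix immediately followed by the target
  const combined = new RegExp(lookBehind.source + target.source, flags);
  // Mask every "prefix + target" occurrence in a copy of the text
  const masked = text.replace(combined, "\u0000");
  // Whatever still matches the target was not preceded by the prefix
  return masked.match(new RegExp(target.source, flags)) || [];
}

const content =
  "Some random text trying to find good1 text but only when that good2 text " +
  "is not preceded by bad text so bad good3 should not be found bad good4 is a bad oxymoron anyway.";
console.log(matchNotPrecededBy(content, /(bad )/, /(good\d)/, "gi"));
```

Like the original, this only reports the matches. Doing a replacement on them would need the placeholder map so that the masked text can be restored.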
