I need help writing a regex pattern for these conditions:
Limitations on Hashtag Characters
Length
You only need to add a # before a word to make it hashtag. However, because a Tweet is only limited to under 140 characters, the best hashtags are those composed of a single word or a few letters. Twitter experts recommend keeping the keyword under 6 characters.
Use only numbers and letters in your keyword. You may use an underscore but do this sparingly for aesthetic reasons. Hyphens and dashes will not work.
No Spaces
Hashtags do not support spaces. So if you're using two words, skip the space. For example, hashtags for following the US election are tagged as #USelection, not #US election.
No Special Characters
Hashtags only work with the # sign. Special characters like "!, $, %, ^, &, *, +, ." will not work. Twitter recognizes the pound sign and then converts the hashtag into a clickable link.
HashTags can start by numbers
Hashtags can be in any language
Hashtags can be emojis or symbols
I came up with the following idea, but it does not cover the last two conditions:
const subStr = postText.split(/(?=[\s:#,+/][a-zA-Z\d]+)(#+\w{2,})/gm);
const result = _.filter(subStr, word => word.startsWith('#')).map(hashTag => hashTag.substr(1)) || [];
EDIT:
Example: If I have:
const postText = "#hello12#123 #hi #£hihi #This is #👩 #Hyvääpäivää #Dzieńdobry #जलवायुपरिवर्तन an #example of some text with #hash-tags - http://www.example.com/#anchor but dont want the link,#hashtag1,hi #123 hfg skjdf kjsdhf jsdhf kjhsdf kjhsdf khdsf kjhsdf kjhdsf hjjhjhf kjhsdjhd kjhsdfkjhsd #lasthashtag";
Result should be:
["hello12", "123", "hi", "This", "👩", "Hyvääpäivää", "Dzieńdobry", "जलवायुपरिवर्तन", "example", "hash", "anchor", "hashtag1", "123", "lasthashtag"]
What I have now:
["hello12", "123", "hi", "This", "Hyv", "Dzie", "example", "hash", "anchor", "hashtag1", "123", "lasthashtag"]
Note: I don't want to use JavaScript library.
Thanks
Assuming the characters that are not allowed in a hashtag are !$%^&*+. (the ones you mentioned) and , (based on your example), you can use the following regex pattern:
/#[^\s!$%^&*+.,#]+/gm
Here's a demo.
Note: To exclude more characters, you can add them in the character class as I did above. Obviously, you can't rely on alphanumeric characters only because you want to support other Unicode symbols and emojis.
JavaScript code sample:
const regex = /#[^\s!$%^&*+.,#]+/gm;
const str = "#hello12#123 #hi #£hihi #This is #👩 #Hyvääpäivää #Dzieńdobry #जलवायुपरिवर्तन an #example of some text with #hash-tags - http://www.example.com/#anchor but dont want the link,#hashtag1,hi #123 hfg skjdf kjsdhf jsdhf kjhsdf kjhsdf khdsf kjhsdf kjhdsf hjjhjhf kjhsdjhd kjhsdfkjhsd #lasthashtag";
let m;
while ((m = regex.exec(str)) !== null) {
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
m.forEach((match) => {
console.log("Found match: " + match);
});
}
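As an alternative to listing the excluded characters, a whitelist built from Unicode property escapes can be sketched as below. This is only a sketch under assumptions: it needs an engine that supports the u flag with \p{...} and String.prototype.matchAll, the function name extractHashtags is hypothetical, and the choice of categories (letters, combining marks, numbers, underscore, pictographic emoji) is my reading of the conditions in the question. Combining marks (\p{M}) are included because scripts such as Devanagari use them inside words.

```javascript
// Hypothetical helper: returns hashtag bodies (without the leading "#").
function extractHashtags(text) {
  // \p{L} letters, \p{M} combining marks, \p{N} numbers, "_",
  // \p{Extended_Pictographic} for emoji such as 👩.
  const pattern = /#([\p{L}\p{M}\p{N}_\p{Extended_Pictographic}]+)/gu;
  // matchAll yields one match object per hashtag; m[1] is the captured body.
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

console.log(extractHashtags("#hello12#123 #£hihi #👩 #hash-tags"));
```

With this character set, "#£hihi" is skipped because £ is a currency symbol, and "#hash-tags" yields only "hash" because the hyphen ends the match.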
This is one possible solution without a while loop that worked for me. Thanks to @Ahmed Abdelhameed for the pattern:
function getHashTags(postText) {
  const regex = /#[^\s!$%^&*+.,£#]+/gm;
  // match() returns null when nothing matches, so fall back to an empty array.
  const matches = postText.match(regex) || [];
  // Drop the leading "#" from each hashtag.
  return matches.map(hashTag => hashTag.slice(1));
}
Related
I would like to find all the matches of given strings (divided by spaces) in a string.
(The way for example, iTunes search box works).
That, for example, both "ab de" and "de ab" will return true on "abcde" (also "bc e a" or any order should return true)
If I replace the white space with a wild card, "ab*de" would return true on "abcde", but not "de*ab".
[I use * and not Regex syntax just for this explanation]
I could not find any pure Regex solution for that.
The only solution I could think of is splitting the search term and running multiple regexes.
Is it possible to find a pure regex that covers all these options?
Returns true when all parts (divided by , or ' ') of a searchString occur in text. Otherwise false is returned.
function filter(text, searchString) {
const regexStr = '(?=.*' + searchString.split(/\,|\s/).join(')(?=.*') + ')';
const searchRegEx = new RegExp(regexStr, 'gi');
return text.match(searchRegEx) !== null;
}
I'm pretty sure you could come up with a regex to do what you want, but it may not be the most efficient approach.
For example, the regex pattern (?=.*bc)(?=.*e)(?=.*a) will match any string that contains bc, e, and a.
var isMatch = 'abcde'.match(/(?=.*bc)(?=.*e)(?=.*a)/) != null; // equals true
var isMatch = 'bcde'.match(/(?=.*bc)(?=.*e)(?=.*a)/) != null; // equals false
You could write a function to dynamically create an expression based on your search terms, but whether it's the best way to accomplish what you are doing is another question.
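A function that builds such an expression from the search terms might look like the sketch below. The names escapeRegExp and containsAllTerms are hypothetical, and the s (dotAll) flag assumes an ES2018-capable engine. Escaping the terms first keeps characters such as . or * in the user's input from being interpreted as regex syntax.

```javascript
// Hypothetical helper: escape regex metacharacters in a literal search term.
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Returns true when every space- or comma-separated term occurs in text,
// in any order. One lookahead is generated per term.
function containsAllTerms(text, searchString) {
  const terms = searchString.split(/[\s,]+/).filter(Boolean);
  const lookaheads = terms.map((t) => '(?=.*' + escapeRegExp(t) + ')').join('');
  // Anchoring at ^ means the lookaheads are only evaluated from the start.
  return new RegExp('^' + lookaheads, 'is').test(text);
}

console.log(containsAllTerms('abcde', 'de ab')); // true
console.log(containsAllTerms('abcde', 'ab x'));  // false
```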
Alternations are order insensitive:
"abcde".match(/(ab|de)/g); // => ['ab', 'de']
"abcde".match(/(de|ab)/g); // => ['ab', 'de']
So if you have a list of words to match you can build a regex with an alternation on the fly like so:
function regexForWordList(words) {
return new RegExp('(' + words.join('|') + ')', 'g');
}
'abcde'.match(regexForWordList(['a', 'e'])); // => ['a', 'e']
Try this:
var str = "your string";
str = str.split( " " );
for( var i = 0 ; i < str.length ; i++ ){
// your regexp match
}
This is the script I use. It also works with single-word search strings:
var what="test string with search cool word";
var searchString="search word";
var search = new RegExp(searchString, "gi"); // one-word searching
// multiple search words
if(searchString.indexOf(' ') != -1) {
search="";
var words=searchString.split(" ");
for(var i = 0; i < words.length; i++) {
search+="(?=.*" + words[i] + ")";
}
search = new RegExp(search + ".+", "gi");
}
if(search.test(what)) {
// found
} else {
// notfound
}
I assume you are matching words, or parts of words. You want space-separated search terms to limit search results, and it seems you intend to return only those entries which have all the words that the user supplies. And you intend a wildcard character * to stand for 0 or more characters in a matching word.
For example, if the user searches for the words term1 term2, you intend to return only those items which have both words term1 and term2. If the user searches for the word term*, it would match any word beginning with term.
There are suitable regular expressions which are equivalent to this search language and can be generated from it.
A simple example, the word term, can be asserted in regex by converting to \bterm\b. But two or more words which must match in any order require lookahead assertions. Using extended syntax, the equivalent regex is:
(?= .* \b term1 \b )
(?= .* \b term2 \b )
The asterisk wildcard can be asserted in regex with a character class followed by asterisk. The character class identifies which letters you consider to be part of word. For example, you might find that [A-Za-z0-9]* fits the bill.
In short, you might be satisfied if you convert an expression such as:
foo ba* quux
to:
(?= .* \b foo \b )
(?= .* \b ba[A-Za-z0-9]* \b )
(?= .* \b quux \b )
That is a simple matter of search and replace. But do be careful to sanitize the input string to avoid injection attacks by removing punctuation, etc.
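The conversion described above can be sketched as follows. The function name buildSearchRegex is hypothetical, and [A-Za-z0-9]* is just the example character class from the text. Instead of stripping punctuation, this sketch escapes regex metacharacters (other than the * wildcard) before building the pattern.

```javascript
// Hypothetical helper: "foo ba* quux" -> /^(?=.*\bfoo\b)(?=.*\bba[A-Za-z0-9]*\b)(?=.*\bquux\b)/i
function buildSearchRegex(query) {
  const lookaheads = query
    .trim()
    .split(/\s+/)
    .filter(Boolean)
    .map((term) => {
      // Escape every regex metacharacter except "*".
      const escaped = term.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
      // Expand the "*" wildcard into a run of word characters.
      const expanded = escaped.replace(/\*/g, '[A-Za-z0-9]*');
      return '(?=.*\\b' + expanded + '\\b)';
    });
  return new RegExp('^' + lookaheads.join(''), 'i');
}

const re = buildSearchRegex('foo ba* quux');
console.log(re.test('quux bar foo'));  // true
console.log(re.test('food bar quux')); // false
```

Note that \b is ASCII-based in JavaScript, so this sketch only behaves as described for terms made of ASCII word characters.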
I think you may be barking up the wrong tree with RegEx. What you might want to look at is the Levenshtein distance of two input strings.
There's a Javascript implementation here and a usage example here.
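Since the linked implementation is not reproduced here, the following is a generic sketch of the standard dynamic-programming Levenshtein distance, not necessarily the code the links pointed to. The function name levenshtein is an assumption.

```javascript
// Number of single-character insertions, deletions, or substitutions
// needed to turn string a into string b.
function levenshtein(a, b) {
  // prev[j] = distance between the previous prefix of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // distance from the first i chars of a to the empty string
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(levenshtein('kitten', 'sitting')); // 3
```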
I'm trying to make the code cleaner and more concise. The main goal is to change the string to meet my requirements.
Requirements
I want to remove any empty lines (like the one in the middle of the two sentences down below)
I want to remove the * in front of each sentence, if there is.
I want to make the first letter of each word capital and the rest lowercase (except words that have $ in front of it)
This is what I've done so far:
const string =
`*SQUARE HAS ‘NO PLANS’ TO BUY MORE BITCOIN: FINANCIAL NEWS
$SQ
*$SQ UPGRADED TO OUTPERFORM FROM PERFORM AT OPPENHEIMER, PT $185`
const nostar = string.replace(/\*/g, ''); // gets rid of the * of each line
const noemptylines = nostar.replace(/^\s*[\r\n]/gm, ''); //gets rid of empty blank lines
const lowercasestring = noemptylines.toLowerCase(); //turns it to lower case
const tweets = lowercasestring.replace(/(^\w{1})|(\s{1}\w{1})/g, match => match.toUpperCase()); //makes first letter of each word capital
console.log(tweets)
I've done most of the code. However, I want words with $ in front of them to stay uppercase, and I don't know how to do that.
Furthermore, I was wondering if it's possible to combine the regex expressions so the code is even shorter and more concise.
You could make use of capture groups and the callback function of replace.
^(\*|[\r\n]+)|\$\S*|(\S+)
^ Start of string (start of each line when the m flag is set)
(\*|[\r\n]+) Capture group 1, match either * or 1 or more newlines
| Or
\$\S* Match $ followed by optional non whitespace chars (which will be returned unmodified in the code)
| Or
(\S+) Capture group 2, match 1+ non whitespace chars
const regex = /^(\*|[\r\n]+)|\$\S*|(\S+)/gm;
const string =
`*SQUARE HAS ‘NO PLANS’ TO BUY MORE BITCOIN: FINANCIAL NEWS
$SQ
*$SQ UPGRADED TO OUTPERFORM FROM PERFORM AT OPPENHEIMER, PT $185`;
const res = string.replace(regex, (m, g1, g2) => {
if (g1) return ""
if (g2) {
g2 = g2.toLowerCase();
return g2.charAt(0).toUpperCase() + g2.slice(1);
}
return m;
});
console.log(res);
Making it readable is more important than making it short.
const tweets = string
.replace(/\*/g, '') // gets rid of the * of each line
.replace(/^\s*[\r\n]/gm, '') //gets rid of empty blank lines
.toLowerCase() //turns it to lower case
.replace(/(^\w{1})|(\s{1}\w{1})/g, match => match.toUpperCase()) //makes first letter of each word capital
.replace(/\B\$(\w+)\b/g, match => match.toUpperCase()); //keep words that have $ in front of it, capital
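In the same spirit of readability, the three requirements can also be handled line by line and word by word, with very little regex. This is only a sketch: the function name formatTweets is hypothetical, and it assumes words are separated by whitespace and that a word "has $ in front" when it starts with $.

```javascript
// Hypothetical helper implementing the three requirements from the question.
function formatTweets(text) {
  return text
    .split(/\r?\n/)
    // Requirement 2: remove a leading "*" from each line.
    .map((line) => line.trim().replace(/^\*/, ''))
    // Requirement 1: drop empty lines.
    .filter((line) => line !== '')
    // Requirement 3: capitalize each word, leaving $-prefixed words untouched.
    .map((line) =>
      line
        .split(/\s+/)
        .map((word) =>
          word.startsWith('$')
            ? word
            : word.charAt(0).toUpperCase() + word.slice(1).toLowerCase()
        )
        .join(' ')
    )
    .join('\n');
}

console.log(formatTweets('*SQUARE HAS NO PLANS\n\n$SQ\n*$SQ UPGRADED, PT $185'));
```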
I want to implement a function that outputs the respective strings as an array from an input string like "str1|str2#str3":
function myFunc(string) { ... }
For the input string, however, it is only necessary that str1 is present. str2 and str3 (with their delimiters) are both optional. For that I have already written a regular expression that performs a kind of split. I can not do a (normal) split because the delimiters are different characters and also the order of str1, str2, and str3 is important. This works kinda with my regex pattern. Now, I'm struggling how to extend this pattern so that you can escape the two delimiters by using \| or \#.
How exactly can I solve this best?
var strings = [
'meaning',
'meaning|description',
'meaning#id',
'meaning|description#id',
'|description',
'|description#id',
'#id',
'meaning#id|description',
'sub1\\|sub2',
'mea\\|ning|descri\\#ption',
'mea\\#ning#id',
'meaning|description#identific\\|\\#ation'
];
var pattern = /^(\w+)(?:\|(\w*))?(?:\#(\w*))?$/ // works without escaping
console.log(pattern.exec(strings[3]));
According to the problem definition, strings 0-3 and 8-11 should be valid and the rest should not. myFunc(strings[3]) should return ['meaning','description','id'] and myFunc(strings[8]) should return ['sub1\|sub2',null,null].
You need to allow \\[|#] alongside \w in the pattern, replacing each \w with (?:\\[#|]|\w):
var strings = [
'meaning',
'meaning|description',
'meaning#id',
'meaning|description#id',
'|description',
'|description#id',
'#id',
'meaning#id|description',
'sub1\\|sub2',
'mea\\|ning|descri\\#ption',
'mea\\#ning#id',
'meaning|description#identific\\|\\#ation'
];
var pattern = /^((?:\\[#|]|\w)+)(?:\|((?:\\[#|]|\w)*))?(?:#((?:\\[#|]|\w)*))?$/;
for (var s of strings) {
if (pattern.test(s)) {
console.log(s, "=> MATCHES");
} else {
console.log(s, "=> FAIL");
}
}
Pattern details
^ - string start
((?:\\[#|]|\w)+) - Group 1: 1 or more repetitions of \ followed with # or | or a word char
(?:\|((?:\\[#|]|\w)*))? - an optional group matching 1 or 0 occurrences of
\| - a | char
((?:\\[#|]|\w)*) - Group 2: 0 or more repetitions of \ followed with # or | or a word char
(?:#((?:\\[#|]|\w)*))? - an optional group matching 1 or 0 occurrences of
# - a # char
((?:\\[#|]|\w)*) Group 3: 0 or more repetitions of \ followed with # or | or a word char
$ - end of string.
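To get the array the question asks for, the same pattern can be used with exec instead of test. This is a minimal sketch: returning null for invalid input and mapping unmatched optional groups to null are assumptions based on the expected results given in the question.

```javascript
const pattern = /^((?:\\[#|]|\w)+)(?:\|((?:\\[#|]|\w)*))?(?:#((?:\\[#|]|\w)*))?$/;

// Returns [str1, str2, str3] with null for missing optional parts,
// or null when the input does not match the expected format.
function myFunc(input) {
  const m = pattern.exec(input);
  if (m === null) return null;
  // m[0] is the whole match; groups 1..3 are the three parts.
  return m.slice(1).map((part) => (part === undefined ? null : part));
}

console.log(myFunc('meaning|description#id')); // ['meaning', 'description', 'id']
console.log(myFunc('sub1\\|sub2'));            // ['sub1\\|sub2', null, null]
console.log(myFunc('|description'));           // null
```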
My guess is that you wish to split all your strings, for which we'd be adding those delimiters in a char class maybe, similar to:
([|#\\]+)?([\w]+)
If we don't, we might want to do so for validations, otherwise our validation would become very complicated as the combinations would increase.
const regex = /([|#\\]+)?([\w]+)/gm;
const str = `meaning
meaning|description
meaning#id
meaning|description#id
|description
|description#id
#id
meaning#id|description
sub1\\|sub2
mea\\|ning|descri\\#ption
mea\\#ning#id
meaning|description#identific\\|\\#ation`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Seems like what you're looking for may be this?
((?:\\#|\\\||[^\|#])*)*
Explanation:
Matches all sets that include "\#", "\|", or any character except "#" and "|".
https://regexr.com/4fr68
I have a string received from backend, and I need to extract hashtags. The tags are written in one of these two forms
type 1. #World is a #good #place to #live.
type 2. #World#place#live.
I managed to extract from first type by : str.replace(/#(\S*)/g
How can I change the second format into space-separated tags, like the first format?
Basically, I want format two to be converted from
#World#place#live.
to
#World #place #live.
You can use String.match, with regex #\w+:
var str = `
type 1. #World is a #good #place to #live.
type 2. #World#place#live.`
var matches = str.match(/#\w+/g)
console.log(matches)
\w+ matches one or more word characters ([a-zA-Z0-9_]), so you might want to tweak that.
Once you have the matches in an array you can rearrange them to your likes.
The pattern #(\S*) will match a # followed by 0+ times a non whitespace character in a captured group. That would match a single # as well. The string #World#place#live. contains no whitespace character so the whole string will be matched.
You could match them instead by using a negated character class. Match #, followed by a negated character class that matches not a # or a whitespace character.
#[^#\s]+
const strings = [
"#World is a #good #place to #live.",
"#World#place#live."
];
let pattern = /#[^#\s]+/g;
strings.forEach(s => {
console.log(s.match(pattern));
});
How about using the regex /#([\w]+\b)/gm and joining by space, as below, to extract hashtags from your string? Or you can use str.replace(/\b#[^\s#]+/g, " $&") as commented by @Wiktor.
function findHashTags(str) {
var regex = /#([\w]+\b)/gm;
var matches = [];
var match;
while ((match = regex.exec(str))) {
matches.push(match[0]);
}
return matches;
}
let str1 = "#World is a #good #place to #live."
let str2 = "#World#place#live";
let res1 = findHashTags(str1);
let res2 = findHashTags(str2);
console.log(res1.join(' '));
console.log(res2.join(' '));
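The replace-based variant mentioned above can be sketched as follows. The function name spaceOutHashtags is hypothetical. The \b before # only matches when the # directly follows a word character, so tags that are already preceded by a space (or sit at the start of the string) are left alone.

```javascript
// Inserts a space before every hashtag that is glued to the preceding word.
function spaceOutHashtags(str) {
  // "$&" in the replacement stands for the matched hashtag itself.
  return str.replace(/\b#[^\s#]+/g, ' $&');
}

console.log(spaceOutHashtags('#World#place#live.')); // "#World #place #live."
```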
I have an input of type text where I return true or false depending on a list of banned words. Everything works fine. My problem is that I don't know how to check against words with diacritics from the array:
var bannedWords = ["bad", "mad", "testing", "băţ"];
var regex = new RegExp('\\b' + bannedWords.join("\\b|\\b") + '\\b', 'i');
$(function () {
$("input").on("change", function () {
var valid = !regex.test(this.value);
alert(valid);
});
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type='text' name='word_to_check'>
Now on the word băţ it returns true instead of false for example.
Chiu's comment is right: 'aaáaa'.match(/\b.+?\b/g) yields quite counter-intuitive [ "aa", "á", "aa" ], because "word character" (\w) in JavaScript regular expressions is just a shorthand for [A-Za-z0-9_] ('case-insensitive-alpha-numeric-and-underscore'), so word boundary (\b) matches any place between chunk of alpha-numerics and any other character. This makes extracting "Unicode words" quite hard.
For non-unicase writing systems it is possible to identify "word character" by its dual nature: ch.toUpperCase() != ch.toLowerCase(), so your altered snippet could look like this:
var bannedWords = ["bad", "mad", "testing", "băţ", "bať"];
var bannedWordsRegex = new RegExp('-' + bannedWords.join("-|-") + '-', 'i');
$(function() {
$("input").on("input", function() {
var invalid = bannedWordsRegex.test(dashPaddedWords(this.value));
$('#log').html(invalid ? 'bad' : 'good');
});
$("input").trigger("input").focus();
function dashPaddedWords(str) {
return '-' + str.replace(/./g, wordCharOrDash) + '-';
};
function wordCharOrDash(ch) {
return isWordChar(ch) ? ch : '-'
};
function isWordChar(ch) {
return ch.toUpperCase() != ch.toLowerCase();
};
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type='text' name='word_to_check' value="ba">
<p id="log"></p>
Let's see what's going on:
alert("băţ".match(/\w\b/));
This is [ "b" ] because word boundary \b doesn't recognize word characters beyond ASCII. JavaScript's "word characters" are strictly [0-9A-Z_a-z], so aä, pπ, and zƶ match \w\b\W since they contain a word character, a word boundary, and a non-word character.
I think the best you can do is something like this:
var bound = '[^\\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]';
var regex = new RegExp('(?:^|' + bound + ')(?:'
+ bannedWords.join('|')
+ ')(?=' + bound + '|$)', 'i');
where bound is a reversed list of all ASCII word characters plus most Latin-esque letters, used with start/end of line markers to approximate an internationalized \b. (The second of which is a zero-width lookahead that better mimics \b and therefore works well with the g regex flag.)
Given ["bad", "mad", "testing", "băţ"], this becomes:
/(?:^|[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe])(?:bad|mad|testing|băţ)(?=[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]|$)/i
This doesn't need anything like ….join('\\b|\\b')… because there are parentheses around the list (and that would create things like \b(?:hey\b|\byou)\b, which is akin to \bhey\b\b|\b\byou\b, including the nonsensical \b\b – which JavaScript interprets as merely \b).
You can also use var bound = '[\\s!-/:-@[-`{-~]' for a simpler ASCII-only list of acceptable non-word characters. Be careful about that order! The dashes indicate ranges between characters.
You need a Unicode aware word boundary. The easiest way is to use XRegExp package.
Although its \b is still ASCII based, there is a \p{L} (or a shorter pL version) construct that matches any Unicode letter from the BMP plane. To build a custom word boundary using this construct is easy:
\b word \b
---------------------------------------
| | |
([^\pL0-9_]|^) word (?=[^\pL0-9_]|$)
The leading word boundary can be represented with a (non)capturing group ([^\pL0-9_]|^) that matches (and consumes) either a character other than a Unicode letter from the BMP plane, a digit and _ or a start of the string before the word.
The trailing word boundary can be represented with a positive lookahead (?=[^\pL0-9_]|$) that requires a character other than a Unicode letter from the BMP plane, a digit and _ or the end of string after the word.
See the snippet below that will detect băţ as a banned word, and băţy as an allowed word.
var bannedWords = ["bad", "mad", "testing", "băţ"];
var regex = new XRegExp('(?:^|[^\\pL0-9_])(?:' + bannedWords.join("|") + ')(?=$|[^\\pL0-9_])', 'i');
$(function () {
$("input").on("change", function () {
var valid = !regex.test(this.value);
//alert(valid);
console.log("The word is", valid ? "allowed" : "banned");
});
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>
<input type='text' name='word_to_check'>
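In engines that support ES2018 lookbehind and Unicode property escapes, a similar boundary can be sketched without an extra library. This is an assumption about the runtime, not part of the XRegExp approach above; the function name buildBannedWordsRegex and the inclusion of \p{M} (combining marks) and \p{N} (numbers) in the "word character" set are my own choices.

```javascript
// Hypothetical helper: Unicode-aware whole-word matcher for a banned-word list.
function buildBannedWordsRegex(words) {
  // Escape regex metacharacters so words are matched literally.
  const escaped = words.map((w) => w.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
  const wordChar = '[\\p{L}\\p{M}\\p{N}_]';
  // (?<!...) and (?!...) assert that no word character touches the match.
  return new RegExp(
    '(?<!' + wordChar + ')(?:' + escaped.join('|') + ')(?!' + wordChar + ')',
    'iu'
  );
}

const bannedRegex = buildBannedWordsRegex(['bad', 'mad', 'testing', 'băţ']);
console.log(bannedRegex.test('băţ'));  // true  (banned)
console.log(bannedRegex.test('băţy')); // false (allowed)
```

Because both assertions are zero-width, nothing around the word is consumed.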
Instead of using a word boundary, you could do it with
(?:[^\w\u0080-\u02af]+|^)
to check for start of word, and
(?=[^\w\u0080-\u02af]|$)
to check for the end of it.
The [^\w\u0080-\u02af] matches any character that is not (^) a basic Latin word character - \w - or in the Unicode Latin-1 Supplement, Latin Extended-A, Latin Extended-B and IPA Extensions blocks. This includes some punctuation, but it would get very long to match just letters. It may also have to be extended if other character sets have to be included. See for example Wikipedia.
Since JavaScript did not support lookbehinds when this was written (they were added in ES2018), the start-of-word test consumes any of the aforementioned non-word characters, but I don't think that should be a problem. The important thing is that the end-of-word test doesn't.
Also, putting these tests outside a non-capturing group that alternates the words makes it significantly more efficient.
var bannedWords = ["bad", "mad", "testing", "băţ", "båt", "süß"],
regex = new RegExp('(?:[^\\w\\u00c0-\\u02af]+|^)(?:' + bannedWords.join("|") + ')(?=[^\\w\\u00c0-\\u02af]|$)', 'i');
function myFunction() {
document.getElementById('result').innerHTML = 'Banned = ' + regex.test(document.getElementById('word_to_check').value);
}
<!DOCTYPE html>
<html>
<body>
Enter word: <input type='text' id='word_to_check'>
<button onclick='myFunction()'>Test</button>
<p id='result'></p>
</body>
</html>
When dealing with characters outside my base set (which can show up at any time), I convert them to an appropriate base equivalent (8-bit, 16-bit, 32-bit) before running any character matching over them.
var bannedWords = ["bad", "mad", "testing", "băţ"];
var bannedWordsBits = {};
bannedWords.forEach(function(word){
bannedWordsBits[word] = "";
for (var i = 0; i < word.length; i++){
bannedWordsBits[word] += word.charCodeAt(i).toString(16) + "-";
}
});
var bannedWordsJoin = []
var keys = Object.keys(bannedWordsBits);
keys.forEach(function(key){
bannedWordsJoin.push(bannedWordsBits[key]);
});
var regex = new RegExp(bannedWordsJoin.join("|"), 'i');
function checkword(word) {
var wordBits = "";
for (var i = 0; i < word.length; i++){
wordBits += word.charCodeAt(i).toString(16) + "-";
}
return !regex.test(wordBits);
};
The separator "-" is there to make sure that unique characters don't bleed together creating undesired matches.
This is very useful, as it brings all the characters down to a common base that everything can interact with. It can also be re-encoded back to its original form without having to ship it as a key/value pair.
For me the best thing about it is that I don't have to know all of the rules for all of the character sets that I might intersect with, because I can pull them all into a common playing field.
As a side note:
To speed things up, rather than using the one large regex that you probably have, which gets slower as the list of banned words grows, I would pass each word of the sentence through the filter separately, and break the filter up into length-based segments, like:
checkword3Chars();
checkword4Chars();
checkword5chars();
whose functions you can generate systematically and even create on the fly as and when they become required.