Remove punctuation, retain spaces, toLowerCase, add dashes succinctly - javascript

I need to do the following to a string:
Remove any punctuation (but retain spaces) (can include removal of foreign chars)
Add dashes instead of spaces
toLowercase
I'd like to be able to do this as succinctly as possible, so on one line for example.
At the moment I have:
const ele = str.replace(/[^\w\s]/, '').replace(/\s+/g, '-').toLowerCase();
Few problems I'm having. Firstly the line above is syntactically incorrect. I think it's a problem with /[^\w\s] but I am not sure what I've done wrong.
Secondly I wonder if it is possible to write a regex statement that removes the punctuation AND converts spaces to dashes?
An example of what I want to change:
Where to? = where-to
Destination(s) = destinations
Travel dates?: = travel-dates
EDIT: I have updated the missing / from the first regex replace. I am finding that Destination(s) is becoming destinations) which is peculiar.
Codepen: http://codepen.io/anon/pen/mAdXJm?editors=0011

You may use the following regex to match only ASCII punctuation and some symbols (_ has been left out of it):
var punct = /[!"#$%&'()*+,.\/:;<=>?@\[\\\]^`{|}~-]+/g;
or a more contracted one, since some of these symbols appear in the ASCII table as consecutive chars:
var punct = /[!-\/:-@\[-^`{-~]+/g;
You may chain 2 regex replacements.
var punct = /[!"#$%&'()*+,.\/:;<=>?@\[\\\]^`{|}~-]+/g;
var s = "Where to?"; // = where-to
console.log(s.replace(punct, '').replace(/\s+/g, '-').toLowerCase());
s = "Destination(s)"; // = destinations
console.log(s.replace(punct, '').replace(/\s+/g, '-').toLowerCase());
s = "Travel dates?:"; // = travel-dates
console.log(s.replace(punct, '').replace(/\s+/g, '-').toLowerCase());
Or use an anonymous method inside the replace with arrow functions (less compatibility, but succinct):
var s="Travel dates?:"; // = travel-dates
var o=/([!-\/:-@\[-^`{-~]+)|\s+/g;
console.log(s.replace(o,(m,g)=>g?'':'-').toLowerCase());
Note you may also use XRegExp to match any Unicode punctuation with \pP construct.
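In engines that support Unicode property escapes (ES2018 and later), the same two-step idea can be sketched without XRegExp. The helper name slugify is hypothetical, and removing \p{S} (symbols) alongside \p{P} is an assumption about what should count as punctuation here:

```javascript
// slugify is a hypothetical helper name, not part of the answers above.
// Assumes an ES2018+ engine: \p{...} requires the u flag.
function slugify(str) {
  return str
    .replace(/[\p{P}\p{S}]+/gu, '') // drop Unicode punctuation and symbols
    .trim()                         // avoid leading/trailing dashes
    .replace(/\s+/g, '-')           // turn each whitespace run into one dash
    .toLowerCase();
}

console.log(slugify('Where to?'));      // where-to
console.log(slugify('Destination(s)')); // destinations
console.log(slugify('Travel dates?:')); // travel-dates
```

Note that _ belongs to \p{P} (connector punctuation), so this sketch removes underscores too.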

Wiktor touched on the subject, but my first thought was an anonymous function using the regex /(\s+)|([\W])/g like this:
var inputs = ['Where to?', 'Destination(s)', 'Travel dates?:'],
res,
idx;
for( idx=0; idx<inputs.length; idx++ ) {
res = inputs[idx].replace(/(\s+)|([\W])/g, function(a, b) {return b ? '-' : '';}).toLowerCase();
document.getElementById('output').innerHTML += '"' + inputs[idx] + '" -> "'
+ res + '"<br/>';
}
<!DOCTYPE html>
<html>
<body>
<p id='output'></p>
</body>
</html>
The regex captures either whitespace (1+) or a non-word character. If the first is true the anonymous function returns -, otherwise an empty string.
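As for the "destinations)" result mentioned in the question's edit: a regex without the g flag makes replace stop after the first match, so only the opening parenthesis is removed. A small sketch of the difference (variable names are illustrative):

```javascript
var s = 'Destination(s)';

// Without g, only the first match "(" is removed.
var withoutG = s.replace(/[^\w\s]/, '').toLowerCase();

// With g, every non-word, non-space character is removed.
var withG = s.replace(/[^\w\s]/g, '').toLowerCase();

console.log(withoutG); // destinations)
console.log(withG);    // destinations
```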

Related

How to use a variable inside Regex?

I have this line in my loop:
var regex1 = new RegExp('' + myClass + '[:*].*');
var rule1 = string.match(regex1)
Where "string" is a string of class selectors, for example: .hb-border-top:before, .hb-border-left
and "myClass" is a class: .hb-border-top
As I cycle through strings, I need to match strings that have "myClass" in them, including :before and :hover but not including things like hb-border-top2.
My idea for this regex is to match hb-border-top and then :* to match none or more colons and then the rest of the string.
I need to match:
.hb-fill-top::before
.hb-fill-top:hover::before
.hb-fill-top
.hb-fill-top:hover
but the above returns only:
.hb-fill-top::before
.hb-fill-top:hover::before
.hb-fill-top:hover
and doesn't return .hb-fill-top itself.
So, it has to match .hb-fill-top itself and then anything that follows as long as it starts with :
EDIT:
Picture below: my strings are the contents of {selectorText}.
A string is either a single class, a class with a pseudo-element, or a rule with a few classes in it, separated by commas.
Each string that contains .hb-fill-top ONLY or .hb-fill-top: + something (hover, after, etc.) has to be selected. The class is going to be in the variable "myClass", hence my issue, as I can't be too precise.
I understand you want to get any CSS selector name that contains the value anywhere inside and has EITHER : and 0+ chars up to the end of string OR finish right there.
Then, to get matches for the .hb-fill-top value you need a solution like
/\.hb-fill-top(?::.*)?$/
and the following JS code to make it all work:
var key = ".hb-fill-top";
var rx = RegExp(key.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&') + "(?::.*)?$");
var ss = ["something.hb-fill-top::before","something2.hb-fill-top:hover::before","something3.hb-fill-top",".hb-fill-top:hover",".hb-fill-top2:hover",".hb-fill-top-2:hover",".hb-fill-top-bg-br"];
var res = ss.filter(x => rx.test(x));
console.log(res);
Note that .replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&') code is necessary to escape the . that is a special regex metacharacter that matches any char but a line break char. See Is there a RegExp.escape function in Javascript?.
The pattern has no ^ anchor, so the class may appear anywhere in the selector (as in something.hb-fill-top::before).
(?::.*)?$ breaks down as follows:
(?::.*)? - an optional non-capturing group (the trailing ? quantifier matches 1 or 0 occurrences of the group) containing:
: - a colon
.* - any 0+ chars other than line break chars
$ - end of the string.
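Since the question mentions rules that hold several selectors separated by commas, the pattern above could be applied per selector. This is only a sketch: escapeRegExp and matchingSelectors are hypothetical helper names, and splitting on every comma is an assumption that breaks for selectors such as :not(a, b).

```javascript
// escapeRegExp and matchingSelectors are hypothetical helper names.
function escapeRegExp(text) {
  // Same escaping idea as in the answer above.
  return text.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
}

function matchingSelectors(selectorText, myClass) {
  // The class, optionally followed by ":" plus anything, up to the end.
  var rx = new RegExp(escapeRegExp(myClass) + '(?::.*)?$');
  return selectorText
    .split(',') // assumption: commas only separate selectors
    .map(function (part) { return part.trim(); })
    .filter(function (part) { return rx.test(part); });
}

console.log(matchingSelectors('.hb-border-top:before, .hb-border-left', '.hb-border-top'));
// [ '.hb-border-top:before' ]
console.log(matchingSelectors('.hb-fill-top, .hb-fill-top2:hover, .hb-fill-top-bg-br', '.hb-fill-top'));
// [ '.hb-fill-top' ]
```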
var regex1 = new RegExp(`^\\${myClass}(:{1,2}\\w+)*$`)
var passes = [
'.hb-fill-top::before',
'.hb-fill-top:hover::before',
'.hb-fill-top',
'.hb-fill-top:hover',
'.hb-fill-top::before',
'.hb-fill-top:hover::before',
'.hb-fill-top:hover'
];
var fails = ['.hb-fill-top-bg-br'];
var myClass = '.hb-fill-top';
var regex = new RegExp(`^\\${myClass}(:{1,2}\\w+)*$`);
passes.forEach(p => console.log(regex.test(p)));
console.log('---');
fails.forEach(f => console.log(regex.test(f)));
var regex1 = new RegExp('\\' + myClass + '(?::[^\\s]*)?');
var rule1 = string.match(regex1)
This regex selects my class and everything after it, provided that it starts with : and stops at the first whitespace character.
See the regex in action.
Notice also that I added '\\' at the beginning. This is in order to escape the dot in your className. Otherwise it would have matched something else like
ahb-fill-top
.some-other-hb-fill-top
Also be careful with .*, as it may match something else further along (I don't know your set of strings). You might want to be more precise with :{1,2}[\w()-]+ in the last group. So:
var regex1 = new RegExp('\\' + myClass + '(?::{1,2}[\\w()-]+)?');
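One thing to watch when a pattern is built from a string: a single backslash before a letter such as s or w is consumed by the string literal, so the regex engine never sees it. A short sketch of the effect (variable names are illustrative):

```javascript
var single = '(?::[^\s]*)?';   // "\s" collapses to a plain "s" inside a string
var doubled = '(?::[^\\s]*)?'; // "\\s" reaches the regex engine as \s

console.log(single);  // (?::[^s]*)?
console.log(doubled); // (?::[^\s]*)?

var myClass = '.hb-fill-top';
var rx = new RegExp('\\' + myClass + doubled);
var found = '.hb-fill-top:hover .other'.match(rx)[0];
console.log(found); // .hb-fill-top:hover
```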

How to ban words with diacritics using a blacklist array and regex?

I have an input of type text where I return true or false depending on a list of banned words. Everything works fine. My problem is that I don't know how to check against words with diacritics from the array:
var bannedWords = ["bad", "mad", "testing", "băţ"];
var regex = new RegExp('\\b' + bannedWords.join("\\b|\\b") + '\\b', 'i');
$(function () {
$("input").on("change", function () {
var valid = !regex.test(this.value);
alert(valid);
});
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type='text' name='word_to_check'>
Now on the word băţ it returns true instead of false for example.
Chiu's comment is right: 'aaáaa'.match(/\b.+?\b/g) yields the quite counter-intuitive [ "aa", "á", "aa" ], because "word character" (\w) in JavaScript regular expressions is just a shorthand for [A-Za-z0-9_] ('case-insensitive-alpha-numeric-and-underscore'), so a word boundary (\b) matches any place between a chunk of alphanumerics and any other character. This makes extracting "Unicode words" quite hard.
For non-unicase writing systems it is possible to identify "word character" by its dual nature: ch.toUpperCase() != ch.toLowerCase(), so your altered snippet could look like this:
var bannedWords = ["bad", "mad", "testing", "băţ", "bať"];
var bannedWordsRegex = new RegExp('-' + bannedWords.join("-|-") + '-', 'i');
$(function() {
$("input").on("input", function() {
var invalid = bannedWordsRegex.test(dashPaddedWords(this.value));
$('#log').html(invalid ? 'bad' : 'good');
});
$("input").trigger("input").focus();
function dashPaddedWords(str) {
return '-' + str.replace(/./g, wordCharOrDash) + '-';
};
function wordCharOrDash(ch) {
return isWordChar(ch) ? ch : '-'
};
function isWordChar(ch) {
return ch.toUpperCase() != ch.toLowerCase();
};
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type='text' name='word_to_check' value="ba">
<p id="log"></p>
Let's see what's going on:
alert("băţ".match(/\w\b/));
This is [ "b" ] because word boundary \b doesn't recognize word characters beyond ASCII. JavaScript's "word characters" are strictly [0-9A-Z_a-z], so aä, pπ, and zƶ match \w\b\W since they contain a word character, a word boundary, and a non-word character.
I think the best you can do is something like this:
var bound = '[^\\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]';
var regex = new RegExp('(?:^|' + bound + ')(?:'
+ bannedWords.join('|')
+ ')(?=' + bound + '|$)', 'i');
where bound is a reversed list of all ASCII word characters plus most Latin-esque letters, used with start/end of line markers to approximate an internationalized \b. (The second of which is a zero-width lookahead that better mimics \b and therefore works well with the g regex flag.)
Given ["bad", "mad", "testing", "băţ"], this becomes:
/(?:^|[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe])(?:bad|mad|testing|băţ)(?=[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]|$)/i
This doesn't need anything like ….join('\\b|\\b')… because there are parentheses around the list (and that would create things like \b(?:hey\b|\byou)\b, which is akin to \bhey\b\b|\b\byou\b, including the nonsensical \b\b – which JavaScript interprets as merely \b).
You can also use var bound = '[\\s!-/:-@[-`{-~]' for a simpler ASCII-only list of acceptable non-word characters. Be careful about that order! The dashes indicate ranges between characters.
You need a Unicode aware word boundary. The easiest way is to use XRegExp package.
Although its \b is still ASCII based, there is a \p{L} (or the shorter \pL) construct that matches any Unicode letter from the BMP plane. Building a custom word boundary with this construct is easy:
\b word \b
---------------------------------------
| | |
([^\pL0-9_]|^) word (?=[^\pL0-9_]|$)
The leading word boundary can be represented with a (non)capturing group ([^\pL0-9_]|^) that matches (and consumes) either a character other than a Unicode letter from the BMP plane, a digit and _ or a start of the string before the word.
The trailing word boundary can be represented with a positive lookahead (?=[^\pL0-9_]|$) that requires a character other than a Unicode letter from the BMP plane, a digit and _ or the end of string after the word.
See the snippet below that will detect băţ as a banned word, and băţy as an allowed word.
var bannedWords = ["bad", "mad", "testing", "băţ"];
var regex = new XRegExp('(?:^|[^\\pL0-9_])(?:' + bannedWords.join("|") + ')(?=$|[^\\pL0-9_])', 'i');
$(function () {
$("input").on("change", function () {
var valid = !regex.test(this.value);
//alert(valid);
console.log("The word is", valid ? "allowed" : "banned");
});
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>
<input type='text' name='word_to_check'>
Instead of using a word boundary, you could do it with
(?:[^\w\u0080-\u02af]+|^)
to check for start of word, and
(?=[^\w\u0080-\u02af]|$)
to check for the end of it.
The [^\w\u0080-\u02af] matches any character that is not (^) a basic Latin word character (\w) or part of the Unicode Latin-1 Supplement, Latin Extended-A, Latin Extended-B and IPA Extensions blocks. This includes some punctuation, but a class matching just letters would get very long. It may also have to be extended if other character sets have to be included. See for example Wikipedia.
Since JavaScript didn't support look-behinds when this was written (they were added in ES2018), the start-of-word test consumes any of the aforementioned non-word characters, but I don't think that should be a problem. The important thing is that the end-of-word test doesn't.
Also, putting these tests outside a non-capturing group that alternates the words makes it significantly more efficient.
var bannedWords = ["bad", "mad", "testing", "băţ", "båt", "süß"],
regex = new RegExp('(?:[^\\w\\u00c0-\\u02af]+|^)(?:' + bannedWords.join("|") + ')(?=[^\\w\\u00c0-\\u02af]|$)', 'i');
function myFunction() {
document.getElementById('result').innerHTML = 'Banned = ' + regex.test(document.getElementById('word_to_check').value);
}
<!DOCTYPE html>
<html>
<body>
Enter word: <input type='text' id='word_to_check'>
<button onclick='myFunction()'>Test</button>
<p id='result'></p>
</body>
</html>
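In engines that implement ES2018, a boundary of this kind can be sketched with a lookbehind and Unicode property escapes, so nothing before the word is consumed. The helper name buildBannedRegex is hypothetical, and treating letters, numbers and _ as word characters is an assumption modelled on \w:

```javascript
// buildBannedRegex is a hypothetical helper name.
// Assumes an ES2018+ engine (lookbehind, and \p{...} with the u flag).
function buildBannedRegex(words) {
  var escaped = words.map(function (w) {
    // Escape regex syntax characters in case a word contains any.
    return w.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  });
  var wordChar = '[\\p{L}\\p{N}_]';
  return new RegExp(
    '(?<!' + wordChar + ')(?:' + escaped.join('|') + ')(?!' + wordChar + ')',
    'iu'
  );
}

var bannedRegex = buildBannedRegex(['bad', 'mad', 'testing', 'băţ']);
console.log(bannedRegex.test('un băţ mare')); // true
console.log(bannedRegex.test('băţy'));        // false
```

Input typed with combining marks may need String.prototype.normalize('NFC') first so that it compares equal to the precomposed letters in the list.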
When dealing with characters outside my base set (which can show up at any time), I convert them to an appropriate base equivalent (8-bit, 16-bit, 32-bit) before running any character matching over them.
var bannedWords = ["bad", "mad", "testing", "băţ"];
var bannedWordsBits = {};
bannedWords.forEach(function(word){
bannedWordsBits[word] = "";
for (var i = 0; i < word.length; i++){
bannedWordsBits[word] += word.charCodeAt(i).toString(16) + "-";
}
});
var bannedWordsJoin = []
var keys = Object.keys(bannedWordsBits);
keys.forEach(function(key){
bannedWordsJoin.push(bannedWordsBits[key]);
});
var regex = new RegExp(bannedWordsJoin.join("|"), 'i');
function checkword(word) {
var wordBits = "";
for (var i = 0; i < word.length; i++){
wordBits += word.charCodeAt(i).toString(16) + "-";
}
return !regex.test(wordBits);
};
The separator "-" is there to make sure that unique characters don't bleed together creating undesired matches.
This is very useful, as it brings all the characters down to a common base that everything can interact with. It can also be re-encoded back to its original form without having to ship it as a key/value pair.
For me the best thing about it is that I don't have to know all of the rules for all of the character sets that I might intersect with, because I can pull them all into a common playing field.
As a side note:
To speed things up, rather than running the large regex that you probably have, which gets slower as the list of banned words grows, I would pass each separate word in the sentence through the filter, and break the filter up into length-based segments, like:
checkword3Chars();
checkword4Chars();
checkword5Chars();
whose functions you can generate systematically and even create on the fly as and when they become required.
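The per-word idea can also be sketched with a plain Set lookup instead of length-based functions. This swaps in a different technique from the one described above; containsBannedWord is a hypothetical name, and the word-splitting regex assumes an ES2018+ engine:

```javascript
var banned = new Set(['bad', 'mad', 'testing', 'băţ']);

// containsBannedWord is a hypothetical helper name.
function containsBannedWord(text) {
  // Split into runs of Unicode letters, digits and underscores.
  var words = text.toLowerCase().match(/[\p{L}\p{N}_]+/gu) || [];
  return words.some(function (w) { return banned.has(w); });
}

console.log(containsBannedWord('this is băţ!')); // true
console.log(containsBannedWord('băţy'));         // false
```

Each lookup is a hash check, so the cost per input word does not depend on how many words are banned.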

RegEx to match words in comma delimited list

How can you match text that appears between delimiters, but not match the delimiters themselves?
Text
DoNotFindMe('DoNotFindMe')
DoNotFindMe(FindMe)
DoNotFindMe(FindMe,FindMe)
DoNotFindMe(FindMe,FindMe,FindMe)
Script
text = text.replace(/[\(,]([a-zA-Z]*)[,\)]/g, function(item) {
return "'" + item + "'";
});
Expected Result
DoNotFindMe('DoNotFindMe')
DoNotFindMe('FindMe')
DoNotFindMe('FindMe','FindMe')
DoNotFindMe('FindMe','FindMe','FindMe')
https://regex101.com/r/tB1nE2/1
Here's a pretty simple way to do it:
([a-zA-Z]+)(?=,|\))
This looks for any word that is followed by either a comma or a close-parenthesis.
var s = "DoNotFindMe('DoNotFindMe')\nDoNotFindMe(FindMe)\nDoNotFindMe(FindMe,FindMe)\nDoNotFindMe(FindMe,FindMe,FindMe)";
var r = s.replace(/([a-zA-Z]+)(?=,|\))/g, "'$1'" );
alert(r);
Used the same test code as the other two answers; thanks!
You can use:
var s = "DoNotFindMe('DoNotFindMe')\nDoNotFindMe(FindMe)\nDoNotFindMe(FindMe,FindMe)\nDoNotFindMe(FindMe,FindMe,FindMe)";
var r = s.replace(/(\([^)]+\))/g, function($0, $1) {
return $1.replace(/(\b[a-z]+(?=[,)]))/gi, "'$1'");
});
DoNotFindMe('DoNotFindMe')
DoNotFindMe('FindMe')
DoNotFindMe('FindMe','FindMe')
DoNotFindMe('FindMe','FindMe','FindMe')
Here's a solution that avoids the function argument. It's a bit wonky, but works. Basically, you explicitly match the left delimiter and include it in the replacement string via backreference so it won't get dropped, but then you have to use a positive look-ahead assertion for the right delimiter, because otherwise the match pointer would be moved ahead of the right delimiter for the next match, and so it then wouldn't be able to match that delimiter as the left delimiter of the following delimited word:
var s = "DoNotFindMe('DoNotFindMe')\nDoNotFindMe(FindMe)\nDoNotFindMe(FindMe,FindMe)\nDoNotFindMe(FindMe,FindMe,FindMe)";
var r = s.replace(/([,(])([a-zA-Z]*)(?=[,)])/g, "$1'$2'" );
alert(r);
results in
DoNotFindMe('DoNotFindMe')
DoNotFindMe('FindMe')
DoNotFindMe('FindMe','FindMe')
DoNotFindMe('FindMe','FindMe','FindMe')
(Thanks anubhava, I stole your code template, cause it was perfect for my testing! I gave you an upvote for it.)
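To see why the pattern in the question misbehaves, it may help to run it next to a lookahead version on one line of input. This is a sketch with illustrative variable names; the second pattern uses + rather than * so that empty argument lists are left alone, which is an assumption about the desired behaviour:

```javascript
var line = 'DoNotFindMe(FindMe,FindMe,FindMe)';

// Pattern from the question: each match consumes its right-hand delimiter,
// and the callback receives the whole match, delimiters included.
var consuming = line.replace(/[\(,]([a-zA-Z]*)[,\)]/g, function (item) {
  return "'" + item + "'";
});

// Lookahead version: the right-hand delimiter is checked but not consumed,
// so it is still available as the left-hand delimiter of the next word.
var quoted = line.replace(/([(,])([a-zA-Z]+)(?=[,)])/g, function (match, open, word) {
  return open + "'" + word + "'";
});

console.log(consuming); // DoNotFindMe'(FindMe,'FindMe',FindMe)'
console.log(quoted);    // DoNotFindMe('FindMe','FindMe','FindMe')
```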

How do I extract only alphabet from a alphanumeric string

I have a string "5A" or "a6". I want to get only "A" or "a" on the result. I am using the following but it's not working.
Javascript
var answer = '5A';
answer = answer.replace(/^[0-9]+$/i);
//console.log(answer) should be 'A';
let answer = '5A';
answer = answer.replace(/[^a-z]/gi, '');
// [^a-z] matches everything but a-z
// the flag `g` means it should match all occurrences, not just the first
// the flag `i` makes the match case-insensitive, which means that `A` and `a` are treated as the same character (and `B,b`, `C,c` etc.)
Instead of a-z you can use \p{L} with the /u modifier, which will match any letter and not just a through z, for instance:
'50Æ'.replace(/[^\p{L}]/gu, ''); // Æ
// [^\p{L}] matches everything but a unicode letter, this includes lower and uppercase letters
// the flag `g` means it should match all occurrences, not just the first
// the flag `u` will enable the support for unicode character classes.
See https://caniuse.com/mdn-javascript_builtins_regexp_unicode for support
var answer = '5A';
answer = answer.replace(/[^A-Za-z]/g, '');
g for global, no ^ or $, and '' to replace it with nothing. Leaving off the second parameter replaces it with the string 'undefined'.
I wondered if something like this might be faster, but it and variations are much slower:
function alphaOnly(a) {
var b = '';
for (var i = 0; i < a.length; i++) {
// two separate ranges: 'A'..'z' would also let [ \ ] ^ _ ` through
if ((a[i] >= 'A' && a[i] <= 'Z') || (a[i] >= 'a' && a[i] <= 'z')) b += a[i];
}
return b;
}
http://jsperf.com/strip-non-alpha
The way you asked, you want to find the letter rather than remove the number (same thing in this example, but could be different depending on your circumstances) - if that's what you want, there's a different path you can choose:
var answer = "5A";
var match = answer.match(/[a-zA-Z]/);
answer = match ? match[0] : null;
It looks for a match on the letter, rather than removing the number. If a match is found, then match[0] will be the first letter; otherwise match will be null.
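If the string may hold more than one letter, the same match-based approach can collect all of them with the g flag. A minimal sketch, assuming an ES2018+ engine for \p{L}; lettersOnly is a hypothetical name:

```javascript
// lettersOnly is a hypothetical helper name.
function lettersOnly(str) {
  // With g, match returns every letter found, or null if there are none.
  var letters = str.match(/\p{L}/gu);
  return letters ? letters.join('') : '';
}

console.log(lettersOnly('5A'));  // A
console.log(lettersOnly('a6'));  // a
console.log(lettersOnly('50Æ')); // Æ
```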
var answer = '5A';
answer = answer.replace(/[0-9]/g, '');
You can also do it without a regular expression if you care about performance ;)
Your code had multiple issues:
string.replace takes two parameters: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace
the flag i, standing for case insensitive, doesn't make sense since you are dealing with numbers (what's an upper-case 1?!)
/^[0-9]+$/ would only match a string consisting entirely of digits, nothing more. You should check this out: http://www.regexper.com/. Enter your regex (without the slashes) in the box and hit enter!
In general I would advise you to learn a bit about basic regular expressions. Here is a useful app to play with them: http://rubular.com/
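The issues listed above can be observed directly. A short sketch (variable names are illustrative):

```javascript
var answer = '5A';

// Anchored pattern: '5A' is not all digits, so nothing matches
// and the string comes back unchanged.
var anchored = answer.replace(/^[0-9]+$/);

// Unanchored but without a second argument: the match is replaced
// by the string "undefined".
var missingArg = answer.replace(/[0-9]+/);

// Unanchored, global, with an empty replacement string.
var fixed = answer.replace(/[0-9]+/g, '');

console.log(anchored);   // 5A
console.log(missingArg); // undefinedA
console.log(fixed);      // A
```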
JavaScript:
This extracts all alphabetic characters from any string:
var answer = '5A';
answer = answer.replace(/[^a-zA-Z]/g, '');
/*var answer = '5A';
answer = answer.replace(/[^a-zA-Z]/g, '');*/
$("#check").click(function(){
$("#extdata").html("Extracted Alphabets : <i >"+$("#data").val().replace(/[^a-zA-Z]/g, '')+"</i>");
});
i{
color:green;
letter-spacing:1px;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div>
<input type="text" id="data">
<button id="check">Click To Extract</button><br/>
<h5 id="extdata"></h5>
</div>
You can simplify @TrevorDixon's and @Aegis's answers a bit using \d (digit) instead of [0-9]
var answer = '5A';
answer = answer.replace(/\d/g, '');
var answer = '5A';
answer = answer.replace(/\W/g,"").replace(/\d/g,"");

RegExp to match any string except two reserved strings?

Probably a simple one but my knowledge of creating regular expressions is a little vague.
I'm trying to match any string followed by a comma and a space except if it is 'Bair Hugger' or 'Fluid Warmer'
Here is what I have so far
var re_comma = new RegExp("\w+[^Bair Hugger|Fluid Warmer]" + ", ", "i");
Any ideas?
New answer
Regarding your example I'd say it is really easier to split the string and iterate over it:
function filter(str, delim, test) {
var parts = str.split(delim),
result = [];
for(var i = 0, len = parts.length; i < len; i++) {
if(test(parts[i])) result.push(parts[i]);
}
return result.join(delim);
}
str = filter(str, ', ', function(s) {
s = s.toLowerCase();
return s === 'bair hugger' || s === 'fluid warmer';
});
Otherwise, your expression becomes something like this:
new RegExp("(^|, )(?!(?:Bair Hugger|Fluid Warmer)(?:$|, )).+?(, |$)", "i");
and you have to use a callback for the replacement to decide whether to remove the preceding , or trailing , or not:
str = str.replace(re_comma, function(str, pre, tail) {
return pre && tail ? tail : '';// middle of the string, leave one
});
The intention of this code is less clear. Maybe there is a simpler expression, but I think filtering the array is still cleaner.
Old answer: (doesn't solve the problem at hand but provides information regarding regular expressions).
[] denotes a character class and will only match one character out of the ones you provided. [^Bair Hugger|Fluid Warmer] is the same as [^Bair Huge|FldWm].
You could use a negative lookahead:
new RegExp("^(?!(Bair Hugger|Fluid Warmer), ).+?, $", "i");
Note that you have to use \\ inside a string to produce one \. Otherwise, "\w" becomes w and is not a special character sequence anymore. You also have to anchor the expression.
Update: As you mentioned you want to match any string before the comma, I decided to use . instead of \w, to match any character.
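A quick check of the negative lookahead expression from the old answer, built from a list of reserved names. This is a sketch: the reserved array is an illustrative way to assemble the pattern, and 'Infusion Pump' is a made-up sample value.

```javascript
var reserved = ['Bair Hugger', 'Fluid Warmer'];

// ^(?!(?:A|B), ) rejects the string if it starts with a reserved name
// followed by ", "; otherwise .+?, $ matches text ending in ", ".
var re = new RegExp('^(?!(?:' + reserved.join('|') + '), ).+?, $', 'i');

console.log(re.test('Bair Hugger, '));   // false
console.log(re.test('fluid warmer, '));  // false (the i flag ignores case)
console.log(re.test('Infusion Pump, ')); // true
```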
I love regex and Felix Kling answer is correct :)
However for such simple matching I would normally use something like below
function contains(str, text) {
return str.indexOf(text) >= 0;
}
if(contains(myString, 'random')) {
//myString contains "random"
}
Solution:
reg =/(?:Bair Hugger|Fluid Warmer),( .*)/
str='Bair Hugger, lalala'
reg.exec(str)
>> ["Bair Hugger, lalala", " lalala"]
newStr = reg.exec(str)[1]
