Regex replace punctuation with whitespace - javascript

I have a word counter function but it doesn't account for people using poor punctuation, for example:
"hello.world"
That would only count it as 1 word.
Instead it should count that as 2 words.
So I need a regex that replaces commas, full stops and any run of one or more whitespace characters with a single space.
Here's what I have so far:
proWords = proWords.replace(/[,\s]/, '\s');
negWords = negWords.replace(/[,\s]/, '\s');

The replacement is just an ordinary string; it shouldn't contain regular expression escape sequences like \s. In a JavaScript string literal, '\s' is simply the letter s.
proWords = proWords.replace(/[,.\s]+/g, ' ');
The + quantifier makes it replace any run of one or more of those characters as a unit, and you need the g flag to replace every occurrence rather than only the first.
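As a rough sketch of how that replacement could feed the word counter from the question: the helper name countWords is made up for illustration, and it assumes words are separated only by commas, full stops and whitespace.

```javascript
// Hypothetical helper (not from the original post): counts words after
// collapsing commas, full stops and whitespace runs into single spaces.
function countWords(text) {
  const normalized = text.replace(/[,.\s]+/g, ' ').trim();
  // An empty string would otherwise split into [''] and count as 1.
  return normalized === '' ? 0 : normalized.split(' ').length;
}

console.log(countWords('hello.world'));        // 2
console.log(countWords('one,  two.\nthree ')); // 3
```

The trim() call matters because leading or trailing separators would otherwise produce empty entries after the split.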

Change
proWords = proWords.replace(/[,\s]/, '\s');
negWords = negWords.replace(/[,\s]/, '\s');
to
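proWords = proWords.replace(/[,.\s]+/g, ' ');
negWords = negWords.replace(/[,.\s]+/g, ' ');
This should work: the replacement is a literal space rather than '\s', the + collapses runs of separators, and the g flag replaces every occurrence instead of only the first.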
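To see why the g flag matters for this kind of replacement, here is a small comparison on a made-up sample string:

```javascript
const sample = 'one,two.three  four';

// Without the g flag only the first match is replaced.
const firstOnly = sample.replace(/[,.\s]+/, ' ');

// With the g flag every run of separators becomes one space.
const all = sample.replace(/[,.\s]+/g, ' ');

console.log(firstOnly); // "one two.three  four"
console.log(all);       // "one two three four"
```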

Related

Javascript - Regex - how to filter characters that are not part of regex

I want to accept words and some special characters, so if my regex
does not fully match, let's say I display an error,
var re = /^[[:alnum:]\-_.&\s]+$/;
var string = 'this contains invalid chars like ##';
var valid = re.test(string);
But now I want to "filter" a phrase, removing all characters that do not match the regex.
Usually one uses replace, but how do I list all the characters that do not match the regex?
var validString = string.filter(re); // something similar to this
How do I do this?
Regards
Wiktor Stribiżew's solution works fine:
regex=/[^a-zA-Z\-_.&\s]+/g;
let s='some bloody-test #rfdsfds';
s = s.replace(/[^\w\s.&-]+/g, '');
console.log(s);
Rajesh's solution:
regex=/^[a-zA-Z\-_.&\s]+$/;
let s='some -test #rfdsfds';
s=s.split(' ').filter(x=> regex.test(x));
console.log(s);
The JS regex engine does not support POSIX character classes like [:alnum:]. You may use [A-Za-z0-9] instead, but it only matches ASCII letters and digits.
Your current regex matches a whole string that consists only of allowed chars, so it cannot be used to return the chars that are not allowed. For that, use a negated character class such as [^a-zA-Z0-9_.&\s-].
You may remove the unwanted chars with
var s = 'this contains invalid chars like ##';
var res = s.replace(/[^\w\s.&-]+/g, '');
var notallowedchars = s.match(/[^\w\s.&-]+/g);
console.log(res);
console.log(notallowedchars);
The /[^\w\s.&-]+/g pattern matches multiple occurrences (due to /g) of any one or more (due to +) chars other than word chars (digits, letters, _, matched with \w), whitespace (\s), ., & and -.
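One way to keep the validating regex and the filtering regex in sync is to build both from the same character set. This is only a sketch: the names allowed, validate, reject and check are invented here, and the set is the one used in the answer above.

```javascript
// The allowed set, written once (backslashes doubled for the string literal).
const allowed = '\\w\\s.&-';

// Anchored class: the whole string must consist of allowed characters.
const validate = new RegExp('^[' + allowed + ']+$');
// Negated class: matches runs of characters that are NOT allowed.
const reject = new RegExp('[^' + allowed + ']+', 'g');

function check(input) {
  return {
    valid: validate.test(input),
    cleaned: input.replace(reject, ''),
    rejected: input.match(reject) || [] // match() returns null when nothing is found
  };
}

const result = check('this contains invalid chars like ##');
console.log(result.valid);    // false
console.log(result.rejected); // ["##"]
```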
To match all characters that are not alphanumeric or one of -_.&, move the ^ inside the character class []:
var str = 'asd.=!_#$%^&*()564';
console.log(
str.match(/[^a-z0-9\-_.&\s]/gi),
str.replace(/[^a-z0-9\-_.&\s]/gi, '')
);

JavaScript Regex to create array of whole words only.

I'm trying to split out the whole words out of a string without whitespace or special characters.
So from
'(votes + downvotes) / views'
I'd like to create an array like the following
['votes', 'downvotes', 'views']
I have tried the following, but it catches the parens and some whitespace.
https://regex101.com/r/yX9iW8/1
You could use /\w+/g as regular expression in combination with String#match
var array = '(votes + downvotes) / views'.match(/\w+/g);
console.log(array);
You can use \W+ to split on all non-word characters
'(votes + downvotes) / views'.split(/\W+/g).filter(x => x !== '');
// ["votes", "downvotes", "views"]
Or \w+ to match on all word characters
'(votes + downvotes) / views'.match(/\w+/g);
// ["votes", "downvotes", "views"]
It's very simple...
var matches = '(votes + downvotes) / views'.match(/[a-z]+/ig);
You must decide for your project which characters make up a word, and the minimum length of a word. For example, it could be letters, digits and dashes with a minimum length of 3 characters:
[a-z0-9-]{3,}
Good luck!
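A small sketch of how the choice of word pattern changes the result. The helper name extractWords and the second sample string are made up for illustration:

```javascript
// Hypothetical helper: returns all matches of a global pattern, or an
// empty array when there are none (match() itself would return null).
function extractWords(text, pattern) {
  return text.match(pattern) || [];
}

const formula = '(votes + downvotes) / views';

console.log(extractWords(formula, /\w+/g));
// ["votes", "downvotes", "views"]

// Letters, digits and dashes, at least 3 characters long.
console.log(extractWords('a re-run of v2 is ok', /[a-z0-9-]{3,}/gi));
// ["re-run"]

console.log(extractWords('+ / ()', /\w+/g));
// []
```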

Regexp match spaces not followed by a specific word

I have spent the last couple of hours trying to figure out how to match all whitespace (\s) unless followed by AND\s or preceded by \sAND.
I have this so far
\s(?!AND\s)
but it is then matching the space after \sAND, but I don't want that.
Any help would be appreciated.
Often, when you want to split by a single character that appears in a specific context, you can replace the splitting approach with a matching one.
I suggest first matching all sequences of non-whitespace characters joined by AND surrounded by whitespace, and then matching any other non-whitespace sequences. That way we get an array of the required substrings:
\S+\sAND\s\S+|\S+
See regex demo
I assume the \sAND\s pattern appears between some non-whitespace characters.
var re = /\S+\sAND\s\S+|\S+/g;
var str = 'split this but don\'t split this AND this';
var res = str.match(re);
document.write(JSON.stringify(res));
As Alan Moore suggests, the alternation can be unrolled into \S+(?:\sAND\s\S+)*:
\S+ - 1 or more non-whitespace characters
(?:\sAND\s\S+)* - 0 or more (thus, it is optional) sequences of...
\s - one whitespace (add + to match 1 or more)
AND - literal AND character sequence
\s - one whitespace (add + to match 1 or more)
\S+ - one or more non-whitespace symbols.
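The unrolled pattern can be tried on the sample string from the question. The second input is made up to show that chained ANDs stay together:

```javascript
const pattern = /\S+(?:\sAND\s\S+)*/g;

const input = "split this but don't split this AND this";
const parts = input.match(pattern) || [];
console.log(parts);
// ["split", "this", "but", "don't", "split", "this AND this"]

// The * quantifier lets one match absorb several AND-joined pieces.
const chained = 'a AND b AND c d'.match(pattern) || [];
console.log(chained);
// ["a AND b AND c", "d"]
```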
Since JS did not support lookbehinds before ES2018, you can use the following trick:
Match (\sAND\s)|\s
Throw away any match where $1 has a value
Here's a short example which replaces the spaces you want with an underscore:
var str = "split this but don't split this AND this";
str = str.replace(/(\sAND\s)|\s/g, function(m, a) {
return a ? m : "_";
});
document.write(str);
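In engines that implement ES2018 lookbehind assertions (an assumption about your runtime, so check your targets), the condition from the question can be written directly as a split separator. This is a sketch, not a drop-in replacement for the trick above:

```javascript
// A whitespace character that is neither preceded by "<space>AND"
// nor followed by "AND<space>".
const separator = /(?<!\sAND)\s(?!AND\s)/;

const input = "split this but don't split this AND this";
const pieces = input.split(separator);

console.log(pieces);
// ["split", "this", "but", "don't", "split", "this AND this"]
```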

Regular Expression for changing chars of the words in string to underscore except First char

I'm trying to find a regular expression which modifies the single words in a string containing underscore except the first character.
Example: This is a Test. => T___ i_ a T___.
I've come up with (\w)\w*/g, which results in T i a T. But I don't know how to get the underscores in place.
Thanks.
This should work:
"This is a Test".replace(/\B\w/g, "_")
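Explanation: \B matches only where there is no word boundary, so this replaces every word character that is preceded by another word character, i.e. every character of a word except the first.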
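Wrapped in a function for reuse (the name maskWords is made up here), the \B\w approach looks like this. Note that digits and underscores count as word characters for \w:

```javascript
// Hypothetical helper: masks every word character except the first
// character of each word.
function maskWords(text) {
  return text.replace(/\B\w/g, '_');
}

console.log(maskWords('This is a Test.')); // "T___ i_ a T___."

// "x1" is treated as one word, so the digit is masked too.
console.log(maskWords('x1 y'));            // "x_ y"
```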
The naively correct version of your attempt would be
var wordMatch = /\b(\w)(\w+)/g;
var output = input.replace(wordMatch, function ($0, $1, $2) {
  // Array(n + 1).join('_') yields n underscores
  return $1 + (new Array($2.length + 1)).join('_');
});
However, this does not work with words that have accented characters, because \w only matches ASCII letters, digits and the underscore (a-z, A-Z, 0-9, _); the underscore, strictly speaking, is not a word character.
A more correct version would take set of Unicode ranges in place of \w:
var latinRanges = "\\u0041-\\u005a\\u0061-\\u007a\\u0100-\\u01bf\\u01c4-\\u024f";
wordMatch = new RegExp("(^|[^" + latinRanges + "])([" + latinRanges + "])([" + latinRanges + "]+)", "g");
var output = input.replace(wordMatch, function ($0, $1, $2, $3) {
  // $1 is the preceding non-letter (empty at the start of the string); keep it
  return $1 + $2 + (new Array($3.length + 1)).join('_');
});
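The ranges \u0041-\u005a and \u0061-\u007a cover the basic Latin letters, and \u0100-\u01bf and \u01c4-\u024f cover Latin Extended-A and most of Latin Extended-B (accented forms, upper- and lowercase forms). Letters in the Latin-1 Supplement block, such as é (\u00e9), are not included; covering them would need further ranges such as \u00c0-\u00d6, \u00d8-\u00f6 and \u00f8-\u00ff.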
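Where the runtime supports ES2018 Unicode property escapes and lookbehind (an assumption, not something the answer above relies on), the hand-written ranges can be swapped for \p{L}. The helper name maskLetters is invented for this sketch:

```javascript
// Hypothetical helper: masks every letter that directly follows another
// letter. \p{L} matches any Unicode letter and requires the u flag.
function maskLetters(text) {
  return text.replace(/(?<=\p{L})\p{L}/gu, '_');
}

// Escapes are used so the sample is unambiguous: Ç is \u00c7, è is \u00e8.
console.log(maskLetters('\u00c7a va tr\u00e8s bien')); // "Ç_ v_ t___ b___"
```

This treats precomposed accented letters as letters. Accents stored as separate combining marks are not matched by \p{L} and would need extra handling.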
You could do it like this:
> var s = 'This is a Test.'
> s.replace(/((?:^|\s)\w)(\w*)/g, function(x,y,z) {return y+z.replace(/./g, '_')});
'T___ i_ a T___.'
((?:^|\s)\w) captures the first word character together with the preceding whitespace character or the start of the string.
(\w*) captures the following zero or more word characters.
The whole match is passed to the function as the first parameter x, the text of the first capturing group as y, and the text of the second capturing group as z.
The whole match is then replaced by y, unchanged, followed by z.replace(/./g, '_'), which replaces each character of the second capturing group with the _ symbol.
Your regular expression, as you say, matches the word. To replace the letters with _, use the replace variant that takes a function parameter:
var sentence = "Now is the time for all good men";
var cached = sentence.replace(/(\w)(\w*)/g, function (_, initial, rest) {
  return initial + rest.replace(/./g, '_');
});

regex string replace

I am trying to do a basic string replace using a regular expression, but the answers I have found do not seem to help: they directly answer each person's unique requirement with little or no explanation.
I am using str = str.replace(/[^a-z0-9+]/g, ''); at the moment. But what I would like to do is allow all alphanumeric characters (a-z and 0-9) and also the '-' character.
Could you please answer this and explain how you concatenate expressions?
This should work :
str = str.replace(/[^a-z0-9-]/g, '');
Everything between the two / characters is the pattern you are looking for:
/ delimits your pattern, so you have one at the start and one at the end
[] defines a set of characters to test against one single character
^ (first inside the brackets) indicates that you want every character NOT in the set that follows
a-z matches any character from 'a' to 'z' inclusive
0-9 matches any digit from '0' to '9' inclusive (meaning any digit)
- the '-' character
g at the end is a flag saying that you do not want your regex to stop at the first character matching your pattern but to continue through the whole string
So here you say "every character not being a letter, a digit or a '-' will be removed from the string".
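A quick way to check that reading of the pattern is to run it on a sample. The helper name keepAlnumDash, the ignoreCase parameter and the sample string are made up for illustration. The example also shows that a-z alone is case-sensitive unless the i flag is added:

```javascript
// Hypothetical helper: removes every character that is not a letter,
// a digit or '-'. With ignoreCase, uppercase letters are kept as well.
function keepAlnumDash(text, ignoreCase) {
  const disallowed = ignoreCase ? /[^a-z0-9-]/gi : /[^a-z0-9-]/g;
  return text.replace(disallowed, '');
}

console.log(keepAlnumDash('Hello-World_2024!', false)); // "ello-orld2024"
console.log(keepAlnumDash('Hello-World_2024!', true));  // "Hello-World2024"
```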
Just change + to -:
str = str.replace(/[^a-z0-9-]/g, "");
You can read it as:
[^ ]: match NOT from the set
[^a-z0-9-]: match if not a-z, 0-9 or -
/ /g: do global match
More information:
https://developer.mozilla.org/en-US/docs/JavaScript/Guide/Regular_Expressions
Your character class (the part in the square brackets) says that you want to match anything except 0-9, a-z and +. You aren't explicit about how many characters you want to match, but I assume the + was meant as a quantifier, so that runs of one or more disallowed characters are replaced at once. It should read instead:
str = str.replace(/[^-a-z0-9]+/g, "");
Also, if you need to match upper-case letters along with lower case, you should use:
str = str.replace(/[^-a-zA-Z0-9]+/g, "");
str = str.replace(/\W/g, "");
This is a shorter form, but it is not equivalent: \W also removes '-', and it keeps uppercase letters and '_'.
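The two forms can be compared side by side on a made-up sample; \W is the complement of [A-Za-z0-9_], so it treats '-', '_' and uppercase letters differently from [^a-z0-9-]:

```javascript
const sample = 'my-File_name v2';

// Explicit class: keeps lowercase letters, digits and '-'.
const strict = sample.replace(/[^a-z0-9-]/g, '');

// \W shorthand: keeps letters of either case, digits and '_'.
const shorthand = sample.replace(/\W/g, '');

console.log(strict);    // "my-ilenamev2"
console.log(shorthand); // "myFile_namev2"
```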
We can use /[a-zA-Z]/g to select the lowercase and uppercase letters in a word or sentence and replace them.
var str = 'MM-DD-yyyy'
var modifiedStr = str.replace(/[a-zA-Z]/g, '_')
console.log(modifiedStr)
