JavaScript regular expressions to match no digits, whitespace and selected symbols

JavaScript regular expressions to match no digits, whitespace and selected symbols - javascript

Thanks for taking a look.
My goal is to come up with a regexp that will match input that contains no digits, whitespace or the symbols !#£$%^&*()+= or any other symbol I may choose.
I am however struggling to grasp precisely how regular expressions work.
I started out with the simple pattern /\D/, which from my understanding will match the first non-digit character it can find. This would match the string 'James' which is correct but also 'James1' which I don't want.
So, my understanding is that if I want to ensure that a pattern is not found anywhere in a given string, I use the ^ and $ characters, as in /^\D$/. Now because this will only match a single character that is not a digit, I needed to use + to specify that 1 or more digits should not be founds in the entire string, giving me the expression /^\D+$/. Brilliant, it no longer matches 'James1'.
Question 1
Is my reasoning up to this point correct?
The next requirement was to ensure no whitespace is in the given string. \s will match a single whitespace and [^\s] will match the first non-whitespace character. So, from my understanding I just had to add this to what I have already to match strings that contain no digits and no whitespace. Again, because [^\s] will only match a single non-white space character, I used + to match one or more whitespace characters, giving the new regexp of /^\D+[^\s]+$/.
This is where I got lost, as the expression now matches 'James1' or even 'James Smith25'. What? Massively confused at this point.
Question 2
Why is /^\D+[^\s]+$/ matching strings that contain spaces?
Question 3
How would I go about writing the regular expression I'm trying to solve?
While I am keen to solve the problem I am more interested in figuring where my understanding of regular expressions is lacking, so any explanations would be helpful.

Not quite; ^ and $ are actually "anchors" - they mean "start" and "end", it's actually a little more complicated, but you can consider them to mean the start and end of a line for now - look up the various modifiers on regular expressions if you're interested in learning more about this. Unfortunately ^ has an overloaded meaning; if used inside square brackets it means "not", which is the meaning you are already acquainted with. It's very important that you understand the difference between these two meanings and that the definition in your head actually applies only to character range matching!
Contributing further to your confusion is that \d means "a numerical digit" and \D means "not a numerical digit". Similarly \s means "a whitespace (space/tab/newline/etc.) character" and \S means "not a whitespace character."
It's worth noting that \d is effectively a shortcut for [0-9] (note that - has a special meaning inside square brackets), and \D is a shortcut for [^0-9].
The reason it's matching strings that contain spaces is that you've asked for "1+ non-numerical digits followed by 1+ non-space characters" - so it'll match lots of strings! I think that perhaps you don't understand that regular expressions match bits of strings, you're not adding constraints as you go, but rather building up bots of matchers that will match bits of corresponding strings.
/^[^\d\s!#£$%^&*()+=]+$/ is the answer you're looking for - I'd look at it like this:
i. [] - match a range of characters
ii. []+ - match one or more of that range of characters
iii. [^\d\s]+ - match one or more characters that do not match \d (numerical digit) or \s (whitespace)
iv. [^\d\s!#£$%^&*()+=]+ - here's a bunch of other characters I don't want you to match
v. ^[^\d\s!#£$%^&*()+=]+$ - now there are anchors applied, so this matcher has to apply to the whole line otherwise it fails to match
A useful website to explore regexs is http://regexr.com/3b9h7 - which I supply with my suggested solution as an example. Edit: Pruthvi Raj's link to debuggerx is awesome!

Is my reasoning up to this point correct?
Almost. /\D/ matches any character other than a digit, but not just the first one (if you use g option).
and [^\s] will match the first non-whitespace character
Almost, [^\s] will match any non-whitespace character, not just the first one (if you use g option).
/^\D+[^\s]+$/ matching strings that contain spaces?
Yes, it does, because \D matches a space (space is not a digit).
Why is /^\D+[^\s]+$/ matching strings that contain spaces?
Because \D+ in /^\D+[^\s]+$/can match spaces.
Conclusion:
Use
^[^\d\s!#£$%^&*()+=]+$
It will match strings that have no digits and spaces, and the symbols you do not allow.
Mind that to match a literal -, ] or [ with a character class, you either need to escape them, or use at the start or end of the expression. To play it safe, escape them.

Just insert every character you don't want to include in a negated character class as follows:
^[^\s\d!#£$%^&*()+=]*$
DEMO
Debuggex Demo
^ - start of the string
[^...] - matches one character that is not in `...`
\s - matches a whitespace (space, newline,tab)
\d - matches a digit from 0 to 9
* - a quantifier that repeats immediately preceeding element by 0 or more times
so the regex matches any string that has
1. string that has a beginning
2. containing 0 or more number of characters that is not whitesapce, digit, and all the symbols included in the character class ( In this example !#£$%^&*()+=) i.e., characters that are not included in the character class `[...]`
3.that has ending
NOTE:
If the symbols you don't want it to have also includes - , a hyphen, don't put it in between some other characters because it is a metacharacter in character class, put it at last of character class

Related

Simple regex with repeated unordered matches

I have this regex
/^[a-z]{1,}( (?=[a-z])){0,}(_(?=[a-z])){0,}[a-z]{0,}$/
I want to match
ag_b_cf_ajk
or
zva b c de
or
hh_b opxop a_b
so any character tokens separated by a single space or underscore.
(In the regex above, we have a literal space, which is legal, and we have look-aheads that ensure that a space or underscore is followed by a character).
The problem is, my above regex is only matching the first space or underscore, like so:
axz_be
axz be
but these fail
axz_be_j
axz be j
I believe I missing some concept with regexes in order to solve this as I have been trying for the last few hours!

It seems you can just use
^[a-z]+(?:[_ ][a-z]+)*$
See the regex demo
The regex matches
^ - start of string
[a-z]+ - one or more lowercase ASCII letters
(?:[_ ][a-z]+)* - zero or more sequences of:
[_ ] - a space or an underscore
[a-z]+ - one or more lowercase ASCII letters
$ - end of string
If the space or underscore must appear at least once, use the + quantifier instead of *:
^[a-z]+(?:[_ ][a-z]+)+$
^
To add a multicharacter alternative to the underscore and hyphen, you need to introduce another non-capturing group:
^[a-z]+(?:(?:[_ ]|\[])[a-z]+)+$
See another regex demo

Misbehaving Regex

I have been using the RegEx /[ -~]/i in JavaScript for a while now and found that it works well testing for any ASCII character including the space. Today I accidentally used /^[ -~]$/i and found much to my surprise that /^[ -~]$/i.test('Stackoverflow is great') failed owing to the space character. My understanding of Regexes is rather limited but even so I fail to see what I might be doing wrong here. Perhaps somone here can shed some light on what is happening?

You miss a quantifier, a + or *:
alert(/^[ -~]*$/i.test('Stackoverflow is great'));
Without the quantifier a character class just matches 1 symbol. You need that quantifier in this case because you added anchors that require matching at the beginning of the string (^) and at the string end ($).
Note that * means match 0 or more occurrences of the preceding subpattern, and + matches 1 or more occurrences.
And it is true as for what your regex matches as the hyphen creates a range between a space and a tilde:

RegExp extract specific string followed by any number with leading / trailing whitespace

I want to extract a string from another using JavaScript / RegExp.
Here is what I got:
var string = "wp-button wp-image-45 wp-label";
string.match(/(?:(?:.*)?\s+)?(wp-image-([0-9]+))(:?\s(?:.*)?)?/);
// returnes: ["wp-button ", "wp-image-45", "45", undefined]
I just want to have "wp-image-45", so:
(Optional) any character
(Optional) followed by whitespace
(Required) followed by "wp-image-"
(Required) followed by any number
(Optional) followed by whitespacy
(Optional) followed by any character
What is missing here? Is it just some kind of bracketing or more?
I also tried
string.match(/(?:(?:.*)?\s+)?(?=(wp-image-([0-9]+)))(?=(:?\s(?:.*)?)?)/)
Edit: In the end I just want to have the number. But I'd also make this step in between.

Regexps are not required to start matching at the beginning of the string, so your attempts to match whitespace and any character aren't necessary. Also, "any character" includes whitespace (except newlines in certain modes).
This should be all you need:
string.match(/\bwp-image-(\d+)\b/)
This will capture, for example, "wp-image-123" into matching group 0, and "123" into matching group 1.
\b means "word boundary", which ensures that you won't match "abcwp-image-123def". A word boundary is defined as any place where a non-word character is followed by a word character, or vice versa. A word character is consists of a letter, a number or an underscore.
Also, I used \d instead of [0-9] simply out of convenience. They have slightly different meaning (\d also matches characters considered numbers in other languages), but that won't make a difference in your case.

If all of that surrounding stuff is optional and all you want is the number then there's no point to matching for any of that stuff except for that "wp-image-" prefix, just do:
var string = "wp-button wp-image-45 wp-label";
string.match(/wp-image-([0-9]+)/);

/(\S)\1(\1)+/g matching all occurrences of three equal non-whitespace characters following each other

Its given: /(\S)\1(\1)+/g matches all occurrences of three equal non-whitespace characters following each other.
I don't understand why there is () around (\S) and 2nd (\1), but not around 1st (\1). Can anyone help in explaining how above regex works?
src: http://www.javascriptkit.com/javatutors/redev2.shtml
Thnx in advance.

The \S needs parentheses to capture its value, so you can refer back to the captured value with \1. \1 means "match the same text which capturing group #1 matched".
I believe there is a problem with this regex. You said you want to match "three equal non-whitespace characters". But the + will make this match 3 or more equal, consecutive non-whitespace characters.
The g on the end means "apply this regex over the entire input string, or globally".

The second set of parentheses is not necessary. It needlessly captures the repeated character a second time, while matching the same strings as this regex:
/(\S)\1\1+/g
Also, as #AlexD pointed out, the description should say that it matches at least three characters. If you replaced that regex with BONK in the string fooxxxxxxbar:
'fooxxxxxxbar'.replace(/(\S)\1\1+/g, 'BONK')
..you might expect the result to be fooBONKBONKbar from their description, because there are two sets of three 'x's. But in fact the result would be fooBONKbar; the first \1 matches the second 'x', and the \1+ matches the third 'x' and any 'x's that follow it. If they wanted to match just three characters, they should have left the + off.
I noticed several other sloppy descriptions like that, plus at least one outright error: \B is equivalent to (?!\b) (a position that's not a word boundary), not [^\b] (a character that's not a backspace). For that matter, their description of word boundaries--"the position between a word and a space"--is wrong, too. A word boundary isn't defined by any particular character, like a space--in fact, it can just as well be the absence of any character that creates one. The string:
Word
...starts with a word boundary because 'W' is a word character and, being first, it's not preceded by another word character. Similarly, the 'd' is not followed by another word character, so the end of the string is also a word boundary.
Also, a regex doesn't know from words, only word characters. The definition of a word character can vary depending on the regex flavor and Unicode or locale settings, but it always includes [A-Za-z0-9_] (ASCII letters and digits plus the underscore). A word boundary is simply a position that's between one of those characters and any other character (or no other character, as I explained earlier).
If you want to learn about regexes, I suggest you forget that site and start here instead: regular-expressions.info.

What are the meanings of these regular expressions in JavaScript?

1) ^[^\s].{1,20}$
2) ^[-/##&$*\w\s]+$
3) ^([\w]{3})$
Are there any links for more information?

^[^\s].{1,20}$
Matches any non-white-space character followed by between 1 and 20 characters. [^\s] could be replaced with \S.
^[-/##&$*\w\s]+$
Matches 1 or more occurances of any of these characters: -/##&$*, plus any word character (A-Ba-b0-9_) plus any white-space character.
^([\w]{3})$
Matches three word characters (A-Ba-b0-9_). This regular expression forms a group (with (...)), which is quite pointless because the group will always equal the aggregate match. Note that the [...] is redundant -- might as well just use \w without wrapping it in a character class.
More info: "Regular Expression Basic Syntax Reference"

1) match everything without space what have 1 to 20 chars.
2) match all this signs -/##&$* plus words and spaces, at last one char must be
3) match three words
here is excelent source of regex
http://www.regular-expressions.info/

Matches any string that starts with a non-whitespace character that's followed by at least one and up to 20 other characters before the end of the string.
Matches any string that contains one or more "word" characters (letters etc), whitespace characters, or any of "-/##&$*"
Matches a string with exactly 3 "word" characters

We Keep Coding

JavaScript is the programming language of the Web.