What's the difference between these regexes - javascript

I'm reading Ionic's source code. I came across this regex, and I'm pretty baffled by it.
([\s\S]+?)
Ok, it's grouping on every char that is either a white space, or non white space???
Why didn't they just do
(.+?)
Am I missing something?

The . matches any symbol but a newline. In order to make it match a newline, most languages provide a modifier (dotall, singleline). JavaScript, however, historically had no such modifier; the s (dotAll) flag only arrived with ES2018.
Thus, a work-around is to use a [\s\S] character class that will match any character, including a newline, because \s will match all whitespace and \S will match all non-whitespace characters. Similarly, one could use [\d\D] or [\w\W].
Also, there is a [^] pattern to match the same thing in JS, but since it is JavaScript-specific, the regexes containing this pattern are not portable between regex flavors.
The +? lazy quantifier matches 1 or more symbols conforming to the preceding subpattern, but as few as possible. Thus, it will match just 1 symbol if used like this, at the end of the pattern.
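To make the difference concrete, here is a small sketch. The sample string and the variable names are made up for illustration; only the patterns come from the discussion above.

```javascript
// Hypothetical sample input containing a newline.
const text = "first line\nsecond line";

// The dot stops at the newline, so only the first line is captured.
const dotMatch = text.match(/^(.+)/)[1];

// [\s\S] also matches the newline, so the whole string is captured.
const anyMatch = text.match(/^([\s\S]+)/)[1];

// [^] is the JavaScript-specific spelling of "any character".
const caretMatch = text.match(/^([^]+)/)[1];

// A lazy quantifier with nothing after it is satisfied by one character.
const lazyAtEnd = text.match(/([\s\S]+?)/)[1];

// With something after it, it expands only as far as it has to.
const lazyUntil = text.match(/([\s\S]+?)second/)[1];

console.log(JSON.stringify(dotMatch));   // "first line"
console.log(JSON.stringify(anyMatch));   // "first line\nsecond line"
console.log(JSON.stringify(caretMatch)); // "first line\nsecond line"
console.log(JSON.stringify(lazyAtEnd));  // "f"
console.log(JSON.stringify(lazyUntil));  // "first line\n"
```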

In many regex implementations "." doesn't match newlines. So they use "[\s\S]" as a little hack =)

A . matches everything but the newline character. This is actually a well known/documented problem with JavaScript. The \s (whitespace match) alongside its negation \S (non-whitespace match) provides a dotall match including the newline. Thus [\s\S] is commonly used instead of . when newlines need to be matched.

The RegEx they used includes more characters (essentially everything).
\s matches any whitespace character (space, tab, newline and so on).
\S matches any character that is not whitespace.
As Casimir notes:
. matches any character except newline (\n)

. matches any char except carriage return \r and new line \n
The shortest way to write [\s\S] (white space and non white space) is [^] (not nothing)

Related

Ignore newlines in a regex that doesn't care about order

I have a regex here at scriptular.com
/(?=.*net)(?=.*income)(?=.*total)(?=.*depreciation)/i
How do I make the regex successfully match the string?
Without the newline characters in the string, the regex would succeed. I could remove them... but I'd rather not.
1.) The dot matches any character besides newline. It won't skip over newlines, so the lookaheads fail if the desired words only occur on lines after the first one. Many regex flavors have a dotall or single-line s flag that makes the dot also match newlines, but JS regex historically did not (the s flag arrived with ES2018).
Workarounds are to use a character class that contains any character instead of the dot, such as [\s\S] (any whitespace character \s together with any non-whitespace character \S), [\w\W] (any word character together with any non-word character), or even [^] (not nothing).
2.) Anchor the lookaheads to ^ (start of string), since there is no need to repeat the lookaheads at every position in the string. This will drastically improve performance.
3.) Use lazy matching so that each lookahead is satisfied by the first occurrence of its word.
/^(?=[\s\S]*?net)(?=[\s\S]*?income)(?=[\s\S]*?total)(?=[\s\S]*?depreciation)/i
See demo at regex101 (dunno why this doesn't work in your demo tool)
Additionally you can put \b word boundaries around the words to make sure that, for example, net won't be matched inside brunet or network, so the regex becomes ^(?=[\s\S]*?\bnet\b)...
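A quick sketch of the three variants side by side. The input strings are hypothetical (the string from the question isn't shown here), and the full word-boundary pattern is my own expansion of the abbreviated one above, so treat it as an assumption.

```javascript
// Hypothetical multi-line input: the words sit on different lines, in no particular order.
const report = "Total revenue\nDepreciation expense\nNet\nincome";

// Original pattern: the dot cannot cross newlines.
const dotVersion = /(?=.*net)(?=.*income)(?=.*total)(?=.*depreciation)/i;

// Anchored, lazy, and using [\s\S] instead of the dot.
const anyCharVersion = /^(?=[\s\S]*?net)(?=[\s\S]*?income)(?=[\s\S]*?total)(?=[\s\S]*?depreciation)/i;

// Assumed expansion with \b around every word.
const wordBoundaryVersion = /^(?=[\s\S]*?\bnet\b)(?=[\s\S]*?\bincome\b)(?=[\s\S]*?\btotal\b)(?=[\s\S]*?\bdepreciation\b)/i;

console.log(dotVersion.test(report));          // false
console.log(anyCharVersion.test(report));      // true
console.log(wordBoundaryVersion.test(report)); // true

// "net" only occurs inside other words here.
const tricky = "brunet network\nincome total\ndepreciation";
console.log(anyCharVersion.test(tricky));      // true
console.log(wordBoundaryVersion.test(tricky)); // false
```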

JavaScript regular expressions to match no digits, whitespace and selected symbols

Thanks for taking a look.
My goal is to come up with a regexp that will match input that contains no digits, whitespace or the symbols !#£$%^&*()+= or any other symbol I may choose.
I am however struggling to grasp precisely how regular expressions work.
I started out with the simple pattern /\D/, which from my understanding will match the first non-digit character it can find. This would match the string 'James' which is correct but also 'James1' which I don't want.
So, my understanding is that if I want to ensure that a pattern is not found anywhere in a given string, I use the ^ and $ characters, as in /^\D$/. Now because this will only match a single character that is not a digit, I needed to use + to specify that 1 or more digits should not be found in the entire string, giving me the expression /^\D+$/. Brilliant, it no longer matches 'James1'.
Question 1
Is my reasoning up to this point correct?
The next requirement was to ensure no whitespace is in the given string. \s will match a single whitespace and [^\s] will match the first non-whitespace character. So, from my understanding I just had to add this to what I have already to match strings that contain no digits and no whitespace. Again, because [^\s] will only match a single non-whitespace character, I used + to match one or more non-whitespace characters, giving the new regexp of /^\D+[^\s]+$/.
This is where I got lost, as the expression now matches 'James1' or even 'James Smith25'. What? Massively confused at this point.
Question 2
Why is /^\D+[^\s]+$/ matching strings that contain spaces?
Question 3
How would I go about writing the regular expression I'm trying to solve?
While I am keen to solve the problem I am more interested in figuring where my understanding of regular expressions is lacking, so any explanations would be helpful.
Not quite; ^ and $ are actually "anchors": they mean "start" and "end". It's a little more complicated than that, but for now you can consider them to mean the start and end of a line; look up the various modifiers on regular expressions if you're interested in learning more about this. Unfortunately ^ has an overloaded meaning: used as the first character inside square brackets it means "not", which is the meaning you are already acquainted with. It's very important that you understand the difference between these two meanings, and that the "not" meaning in your head applies only to character range matching!
Contributing further to your confusion is that \d means "a numerical digit" and \D means "not a numerical digit". Similarly \s means "a whitespace (space/tab/newline/etc.) character" and \S means "not a whitespace character."
It's worth noting that \d is effectively a shortcut for [0-9] (note that - has a special meaning inside square brackets), and \D is a shortcut for [^0-9].
The reason it's matching strings that contain spaces is that you've asked for "1+ non-digit characters followed by 1+ non-space characters", so it'll match lots of strings! I think that perhaps you don't understand that regular expressions match bits of strings; you're not adding constraints as you go, but rather building up bits of matchers that will match bits of corresponding strings.
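One way to see which bit of the string each piece consumes is to wrap the pieces in capture groups. This is just an illustrative sketch; the groups and variable names are my additions, not part of the original pattern.

```javascript
// Same pattern as /^\D+[^\s]+$/, with each piece wrapped in a capture group.
const attempt = /^(\D+)([^\s]+)$/;

// \D+ happily consumes the space (a space is not a digit);
// [^\s]+ then only has to cover the trailing digits.
const spaced = "James Smith25".match(attempt);
console.log(JSON.stringify(spaced[1])); // "James Smith"
console.log(JSON.stringify(spaced[2])); // "25"

// With 'James1', \D+ takes the letters and [^\s]+ takes the digit.
const withDigit = "James1".match(attempt);
console.log(JSON.stringify(withDigit[1])); // "James"
console.log(JSON.stringify(withDigit[2])); // "1"
```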
/^[^\d\s!#£$%^&*()+=]+$/ is the answer you're looking for - I'd look at it like this:
i. [] - match a range of characters
ii. []+ - match one or more of that range of characters
iii. [^\d\s]+ - match one or more characters that do not match \d (numerical digit) or \s (whitespace)
iv. [^\d\s!#£$%^&*()+=]+ - here's a bunch of other characters I don't want you to match
v. ^[^\d\s!#£$%^&*()+=]+$ - now there are anchors applied, so this matcher has to apply to the whole line otherwise it fails to match
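A minimal check of that pattern against the inputs from the question, plus a couple of extra inputs I made up (a symbol and the empty string):

```javascript
// One negated character class, anchored at both ends.
const noDigitsSpacesOrSymbols = /^[^\d\s!#£$%^&*()+=]+$/;

const results = {
  plain: noDigitsSpacesOrSymbols.test("James"),          // true
  digit: noDigitsSpacesOrSymbols.test("James1"),         // false
  space: noDigitsSpacesOrSymbols.test("James Smith25"),  // false
  symbol: noDigitsSpacesOrSymbols.test("James!"),        // false
  empty: noDigitsSpacesOrSymbols.test(""),               // false, + needs at least one character
};

console.log(results);
```

Every character of the input has to pass the same single test, which is what turns "match a bit of the string" into "constrain the whole string".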
A useful website to explore regexes is http://regexr.com/3b9h7 - which I supply with my suggested solution as an example. Edit: Pruthvi Raj's link to Debuggex is awesome!
Is my reasoning up to this point correct?
Almost. /\D/ matches any character other than a digit, but not just the first one (if you use g option).
and [^\s] will match the first non-whitespace character
Almost, [^\s] will match any non-whitespace character, not just the first one (if you use g option).
/^\D+[^\s]+$/ matching strings that contain spaces?
Yes, it does, because \D matches a space (space is not a digit).
Why is /^\D+[^\s]+$/ matching strings that contain spaces?
Because \D+ in /^\D+[^\s]+$/ can match spaces.
Conclusion:
Use
^[^\d\s!#£$%^&*()+=]+$
It will match strings that contain no digits, no whitespace, and none of the symbols you do not allow.
Mind that to match a literal -, ] or [ inside a character class, you should escape them; a hyphen can alternatively be placed at the start or end of the class. To play it safe, escape them.
Just insert every character you don't want to include in a negated character class as follows:
^[^\s\d!#£$%^&*()+=]*$
^ - start of the string
[^...] - matches one character that is not in `...`
\s - matches a whitespace (space, newline,tab)
\d - matches a digit from 0 to 9
* - a quantifier that repeats the immediately preceding element 0 or more times
so the regex matches any string that
1. has a beginning
2. contains 0 or more characters that are not whitespace, digits, or any of the symbols included in the character class (in this example !#£$%^&*()+=), i.e., characters that are not listed in the character class `[^...]`
3. has an ending
NOTE:
If the symbols you don't want to allow also include a hyphen (-), don't put it between other characters, because there it is a metacharacter in a character class; put it at the end of the character class.
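A small sketch of why hyphen placement matters. The three tiny patterns below are made up for illustration and are not taken from the answers above.

```javascript
// Between two characters the hyphen forms a range:
// [+-=] covers every code point from "+" (0x2B) to "=" (0x3D), digits included.
const accidentalRange = /^[+-=]+$/;

// At the end of the class (or escaped) the hyphen is a literal character.
const hyphenAtEnd = /^[+=-]+$/;
const hyphenEscaped = /^[+\-=]+$/;

console.log(accidentalRange.test("5")); // true, "5" (0x35) falls inside the range
console.log(hyphenAtEnd.test("5"));     // false
console.log(hyphenEscaped.test("5"));   // false

console.log(hyphenAtEnd.test("+-="));   // true
console.log(hyphenEscaped.test("+-=")); // true
```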

RegEx works in tester not on site using JavaScript

I'm trying to write a RegEx that returns true if the string starts with / or http: and only allows alpha numeric characters, the dash and underscore. Any white space and any other special characters should fire a false response when tested.
Below works fine (except that it allows special characters, I have not figured out how to do that yet) when tested at https://www.regex101.com/#javascript. Unfortunately returns false when I implement it in my site and test it with /products/homedecor/tablecloths. What am I doing wrong and is there a better regEx to use that would accomplish my goals?
^(\\/|(?:http:))\S+[a-zA-Z0-9-_]+$
Keep unescaped hyphen at first or at last position in character class:
^(\/|(?:http:))[/.a-zA-Z0-9_-]+$
Or even simpler:
^(\/|http:)[/\w.-]+$
Since \w is same as [a-zA-Z0-9_]
To match URL you may need to match DOT and forward slash as well.
Just remove the \S+ from your regex and put the hyphen at the start or the end of the character class. Note that \S+ matches any non-space characters (including non-word characters).
^(\/|http:)[a-zA-Z0-9_-]+$
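Here is a rough check of the suggested patterns as regex literals. The test path comes from the question; the other inputs are hypothetical. I escaped the slash inside the character class to be on the safe side in a literal.

```javascript
// Variant that also allows "/" and "." after the prefix.
const withSlashAndDot = /^(\/|http:)[\/\w.-]+$/;

// Variant that allows only letters, digits, underscore and dash after the prefix.
const wordCharsOnly = /^(\/|http:)[a-zA-Z0-9_-]+$/;

const path = "/products/homedecor/tablecloths";

console.log(withSlashAndDot.test(path));              // true
console.log(withSlashAndDot.test("/has space"));      // false
console.log(withSlashAndDot.test("/bad!chars"));      // false
console.log(withSlashAndDot.test("ftp://example"));   // false

// Without "/" in the class, only a single path segment gets through.
console.log(wordCharsOnly.test("/products"));         // true
console.log(wordCharsOnly.test(path));                // false
```

So which variant fits depends on whether inner slashes should count as allowed characters.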

To the last tag (already in a string) RegEx

I do not know what I am doing wrong. I have this string that I want to replace
<?xml version="1.0" encoding="utf-8" ?>
<Sections>
<Section>
I am using regex to replace everything including <Section>, and leave the rest untouched.
arrayValues[index].replace("/[([.,\n,\s])*<Section>]/", "---");
What is wrong with my regex? Doesn't this mean replace every character, including new line and spaces, up to and including <Section> with ---?
First of all, you need to remove the quotes around your regex—if they're there, the argument won't be processed as a regex. JavaScript will see it as a string (because it is a string) and try to match it literally.
Now that that's taken care of, we can simplify your regex a bit:
arrayValues[index].replace(/[\s\S]*?<Section>/, "---");
[\s\S] gets around the lack of an s flag in older JavaScript engines (a handy option, supported by most languages and by JavaScript itself since ES2018, that enables . to match newlines). \s does match newlines (even without an s flag specified), so the character class [\s\S] tells the regex engine to match:
\s - a whitespace character, which could be a newline
OR
\S - a non-whitespace character
So you can think of [\s\S] as matching . (any character except a newline) or the literal \n (a newline). See Javascript regex multiline flag doesn't work for more.
? is used to make the initial [\s\S]* match non-greedy, so the regex engine will stop once it hits the first occurrence of <Section>.
arrayValues[index].replace("/[([.,\n,\s])*<Section>]/", "---");
What is wrong with my regex?
It's not a regex, it's a string literal. When replace is given a string, it searches for that exact text, slashes included, so nothing matches. Use a regex literal instead:
arrayValues[index].replace(/[\S\s]*<Section>/, "---");
Also, you have too many unnecessary characters in it. The [] around the whole thing builds a character class, which is not what you want. The capturing group () just wraps a character class, which can be repeated by itself. And a dot . inside a character class matches a literal dot instead of all characters.
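A short sketch of the difference between the string argument and the regex literal, and between the lazy and greedy versions. The sample documents are assumptions (the lines after <Section> are made up), as are the variable names.

```javascript
// Hypothetical document: the header from the question plus one made-up line.
const xml = '<?xml version="1.0" encoding="utf-8" ?>\n<Sections>\n<Section>\n<Item/>';

// A string pattern is searched for literally, so nothing is replaced.
const unchanged = xml.replace("/[\\s\\S]*?<Section>/", "---");

// A regex literal is actually treated as a pattern.
const replaced = xml.replace(/[\s\S]*?<Section>/, "---");

console.log(unchanged === xml);        // true
console.log(JSON.stringify(replaced)); // "---\n<Item/>"

// With two <Section> tags, lazy stops at the first one and greedy runs to the last.
const twoSections = "<Sections>\n<Section>\na\n<Section>\nb";
const lazy = twoSections.replace(/[\s\S]*?<Section>/, "---");
const greedy = twoSections.replace(/[\s\S]*<Section>/, "---");

console.log(JSON.stringify(lazy));   // "---\na\n<Section>\nb"
console.log(JSON.stringify(greedy)); // "---\nb"
```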

Javascript Regular Expression "Single Space Character"

I am learning JavaScript and I am analyzing existing code.
In my JS reference book, it says to search for a single space, use "\s"?
But I have come across the code
obj.match(/Kobe Bryant/);
Instead of using \s, it uses the actual space?
Why doesn't this generate an error?
The character class \s does not just contain the space character but also other Unicode white space characters. \s is equivalent to this character class:
[\t\n\v\f\r \u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]
No. It is perfectly legal to include a literal space in a regex.
However, it's not equivalent - \s will include any whitespace character, including tabs, non-breaking spaces, half-width spaces and other characters, whereas a literal space will only match the regular space character.
\s matches any whitespace character, including tabs etc. Sure you can use a literal space also without problems. Just like you can use [0-9] instead of \d to denote any digit. However, keep in mind that [0-9] is equivalent to \d whereas the literal space is a subset of \s.
In addition to normal spaces, \s matches different kinds of white space characters, including tabs and newline characters. That said, matching with a normal space is certainly valid, especially in your case where it seems you want to match a name, which is normally separated by a normal space.
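A tiny comparison of the two spellings. The test strings with a tab, a non-breaking space and a newline are made up for illustration.

```javascript
const literalSpace = /Kobe Bryant/;
const anyWhitespace = /Kobe\sBryant/;

// A plain space satisfies both patterns.
console.log(literalSpace.test("Kobe Bryant"));        // true
console.log(anyWhitespace.test("Kobe Bryant"));       // true

// Other whitespace characters only satisfy \s.
console.log(literalSpace.test("Kobe\tBryant"));       // false
console.log(anyWhitespace.test("Kobe\tBryant"));      // true
console.log(anyWhitespace.test("Kobe\u00a0Bryant"));  // true
console.log(anyWhitespace.test("Kobe\nBryant"));      // true
```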
