JS regular expression to find a substring surrounded by double quotes


I need to find a substring surrounded by double quotes, for example "test", "te\"st" or "", but not """ or "\". Which of the following is the best way to achieve this?
1) /".*"/g
2) /"[^"\\]*(?:\\[\S\s][^"\\]*)*"/g
3) /"(?:\\?[\S\s])*?"/g
4) /"([^"\\]*("|\\[\S\s]))+/g
I was asked this question yesterday during an interview, and would like to know the answer for future reference.

These expressions evaluate as follows:
Expression 1 matches:
An inverted comma
Greedily any character (other than a line break), including an inverted comma or a backslash
A final inverted comma.
This would match "test" some wrong text "text" as one string, and therefore fails
Expression 2 matches:
An inverted comma
Greedily as many characters as possible that are neither an inverted comma nor a backslash
Greedily as many sets as possible of:
Any character preceded by a backslash
Greedily as many characters as possible that are neither an inverted comma nor a backslash
A final inverted comma
So this collects all characters within the inverted commas in sets, broken by backslashes. An inverted comma preceded by a backslash is consumed as part of the escape pair, so it cannot end the match. This will work.
Expression 3 matches:
An inverted comma
As few sets as fit of:
Any one character preceded by an optional backslash
A final inverted comma
This collects all characters, optionally preceded by a backslash, but not greedily. It handles "te\"st", but because the backslash is optional, the engine can backtrack and treat a backslash as an ordinary character. It therefore also matches "\", which the question rules out, and so fails
Expression 4 matches:
An inverted comma
One or more sets of:
Greedily all characters that are neither an inverted comma nor a backslash
An inverted comma, or a backslash and any character
This will match "test"\x, and therefore fails
Conclusion:
From what I can tell, only expression 2 meets all the requirements. Expression 3 is simpler and works for ordinary strings such as "test" and "te\"st", but it also accepts "\". So I'd vote for two.
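As a quick check, the four candidates can be run against the examples from the question. This is only a sketch: the candidates object and the firstMatch helper are names made up for illustration, and the g flag is dropped because only the first match in each input is inspected.

```javascript
// The four candidate patterns from the question (g flag omitted on purpose).
const candidates = {
  1: /".*"/,
  2: /"[^"\\]*(?:\\[\S\s][^"\\]*)*"/,
  3: /"(?:\\?[\S\s])*?"/,
  4: /"([^"\\]*("|\\[\S\s]))+/,
};

// Hypothetical helper: returns the first match of pattern in text, or null.
function firstMatch(pattern, text) {
  const m = text.match(pattern);
  return m ? m[0] : null;
}

// Backslashes are doubled in the JS source; the comment shows the real text.
const inputs = [
  '"test" some wrong text "text"',
  '"te\\"st"', // "te\"st"
  '""',
  '"\\"',      // "\"
  '"test"\\x', // "test"\x
];

for (const text of inputs) {
  for (const [name, pattern] of Object.entries(candidates)) {
    console.log(`expression ${name} on ${text}: ${firstMatch(pattern, text)}`);
  }
}
```

Run this way, expression 2 returns null for "\" while expression 3 returns the whole input, which is worth keeping in mind when weighing simplicity against strictness.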

You could also get away with this simpler guy:
/("(\\"|[^"])+")/g
http://jsfiddle.net/b9chris/eMN2S/

Your grammar is a little unclear. I will assume that you want to find all strings of the form DQ [anything but DQ or \DQ]* DQ.
The regex for this is /"([^"\\]|\\"|\\[^"])*"/g (each backslash is doubled again only when the pattern is written as a string for new RegExp).
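The number of backslashes depends on whether the pattern is written as a regex literal or as a string handed to new RegExp. The sketch below, with a made-up sample input, writes the pattern both ways and compares them.

```javascript
// As a regex literal: one level of escaping.
const fromLiteral = /"([^"\\]|\\"|\\[^"])*"/g;

// As a string for new RegExp: every backslash is doubled again,
// because the string literal consumes one level of escaping first.
const fromString = new RegExp('"([^"\\\\]|\\\\"|\\\\[^"])*"', 'g');

// Both spellings should compile to the same pattern source.
console.log(fromLiteral.source === fromString.source);

// Sample input (hypothetical), which reads: say "te\"st" now
const sample = 'say "te\\"st" now';
console.log(sample.match(fromLiteral));
```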

Related

Replacements only in the first line with a regex

I need to transform a multiline string.
!a! b!
should become
.a. b.
And
!a! b!
c!
!d!
should become
.a. b.
c!
!d!
I approached it with a lookbehind:
str.replace(/(?<!\n)([^\n!]*)!+/g, '$1.')
It didn't work as intended:
.a. b.
c.
!d.
Splitting the string and transforming the first line seems straightforward. But is there a reliable way to do replacements only in the first line of a multiline string with a regex only?
I would also appreciate an explanation of what exactly goes wrong with my approach.
The question is not limited to the JS regex flavour, but that is the one I'm primarily interested in.

About the pattern you tried:
(?<!\n) Negative lookbehind, assert what is directly to the left is not a newline
([^\n!]*) Capture group 1, match 0+ times any char except a newline or !
!+ Match 1+ times ! (What you want to remove)
The pattern matches too much because it matches each part individually. Nothing restricts it to the first line: the lookbehind only rejects a match that starts directly after a newline, so a match can still start anywhere else on a later line, and every match is replaced with group 1 plus a dot.
Note that because the quantifier in ([^\n!]*) is 0+ times, the pattern also matches a single ! on its own, except when it is directly preceded by a newline.
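To see this in JavaScript, the individual matches of the attempted pattern can be listed with matchAll. This is a small sketch using the three-line sample from the question; the variable names are arbitrary, and lookbehind needs an ES2018-capable engine.

```javascript
// The three-line sample from the question.
const text = '!a! b!\nc!\n!d!';

// The pattern from the question, unchanged.
const attempt = /(?<!\n)([^\n!]*)!+/g;

// Collect the text of every separate match.
const matches = [...text.matchAll(attempt)].map((m) => m[0]);
console.log(matches);

// Each of those matches gets rewritten, not just the ones on line 1.
console.log(text.replace(attempt, '$1.'));
```

The list includes a lone ! from the second line and d! from the third, which is why those lines change as well.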
If you can make use of SKIP FAIL (a PCRE feature that JavaScript does not support), you can first match what you want to avoid, which in this case is a line that optionally starts with an exclamation mark and ends with an exclamation mark, with none in between.
After that, match all the other exclamation marks and replace them with a dot.
^!?[^\r\n!]*!$(*SKIP)(*FAIL)|!
Another option could be using 2 capturing groups.
The first group will match between the first set of exclamation marks, and the second group will match the whitespaces after followed by a char other than !.
Then match the ! at the end so it is not in the replacement
!([^\s!]+)!([^\S\r\n]+[^\s!])!
In the replacement use the 2 capturing groups with the dots
.$1.$2.
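For JavaScript specifically, two other routes seem workable. These are sketches that are not part of the answer above: one uses a replacement callback on the first line, the other a variable-length lookbehind, which assumes an ES2018-capable engine. The variable names are illustrative.

```javascript
// The three-line sample from the question.
const text = '!a! b!\nc!\n!d!';

// Option A: match only the first line, then replace inside it.
// Without the m flag, ^ anchors at the start of the string only.
const viaCallback = text.replace(/^[^\n]*/, (line) => line.replace(/!/g, '.'));

// Option B: a lookbehind that requires no newline between the start of
// the string and the current position, so only first-line "!" qualify.
const viaLookbehind = text.replace(/(?<=^[^\n]*)!/g, '.');

console.log(viaCallback);
console.log(viaLookbehind);
```

Option A is arguably easier to read; option B keeps everything in a single pattern, which is what the question asked for.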

Regex including unwanted digits

I'm trying to create an expression to collect anything but digits and the character *. I tried the expression \D[^*], but somehow it's retrieving the first digit after blank spaces. I tried this expression with the string 1234 1234 1234 **** and the matches were: ' 1', ' 1'. Can anyone tell me why the expression would collect the digits along with the blank spaces?
Thank you.

Your regex (\D[^*]) will match a space (that's not a digit) followed by a digit (that's not a star, *).
You can do several things; the easiest is to include \d in the character class. Then it will work, because both \d and * are excluded:
/[^\d*]/g
Now it will only match the spaces in your example.
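A short JavaScript check of both patterns against the string from the question (the variable name sample is arbitrary):

```javascript
// The test string from the question.
const sample = '1234 1234 1234 ****';

// Two-character pattern: a non-digit followed by a non-star.
const original = sample.match(/\D[^*]/g);

// Single negated class: one character that is neither a digit nor a star.
const fixed = sample.match(/[^\d*]/g);

console.log(original); // [ ' 1', ' 1' ]
console.log(fixed);    // [ ' ', ' ', ' ' ]
```

The third space is missing from the first result because the character after it is a star, so [^*] cannot match there.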

Square brackets in a regex define a set, and square brackets that start with ^ mean "not in the set".
The \d should also be inside the brackets:
[^\d*]
https://regex101.com/r/WizvVh/1

Regex: How do I remove the character BEFORE the matched string?

I am intercepting messages which contain the following characters:
*_-
However, whenever any one of these characters comes through, it will always be preceded by a \. The \ is just for formatting though and I want to remove it before sending it off to my server. I know how to easily create a regex which would remove this backslash from a single letter:
'omg\_bbq\_everywhere'.replace(/\\_/g, '')
And I recognize I could just do this operation 3 times: once for each character I want to remove the preceding backslash for. But how can I create a single regex which would detect all three characters and remove the preceding backslash in all 3 cases?

You can use a character class like [*_-].
To remove only the backslash before these characters:
document.body.innerHTML =
"omg\\-bbq\\*everywhere\\-".replace(/\\([*_-])/g, '$1');
When you place a subpattern into a capturing group ((...)), you capture that subtext into a numbered buffer, and then you can reference it with a $1 backreference (1 because there is only one (...) in the pattern.)
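The same replacement without the DOM assignment, as a sketch that can be run in Node. The sample message is made up for illustration.

```javascript
// Sample input (hypothetical), which reads: omg\_bbq\*everywhere\-
const message = 'omg\\_bbq\\*everywhere\\-';

// Match a backslash plus one of * _ - and keep only the captured character.
const cleaned = message.replace(/\\([*_-])/g, '$1');
console.log(cleaned); // omg_bbq*everywhere-

// A backslash in front of any other character is left untouched.
const untouched = 'a\\b'.replace(/\\([*_-])/g, '$1');
console.log(untouched);
```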

This is a good time to use a lookahead. Specifically, you want to match the backslash and then use a positive lookahead to check for any of those characters.
Ignoring the code, the raw regex you want is:
\\(?=[*_-])
A literal backslash, immediately followed by one of these characters: *_-
So now you are matching only the backslash. The lookahead is a zero-length assertion, so it doesn't consume anything, but it sets a requirement that "for this to be a valid match, it needs to be followed by [*_-]".
Atomic groups (a separate feature, not needed here): http://www.regular-expressions.info/atomic.html
Lookaround statements: http://www.regular-expressions.info/lookaround.html
Positive and negative lookahead and lookbehind matches are available.
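Applied in JavaScript, the lookahead version might look like the sketch below. The function name stripEscapes and the sample strings are assumptions for illustration. Because only the backslash is matched, the replacement is simply the empty string.

```javascript
// Hypothetical helper: drops a backslash only when * _ or - follows it.
function stripEscapes(message) {
  return message.replace(/\\(?=[*_-])/g, '');
}

// Reads: omg\_bbq\*everywhere\-
console.log(stripEscapes('omg\\_bbq\\*everywhere\\-')); // omg_bbq*everywhere-

// Reads: C:\temp (the backslash is followed by "t", so it stays)
console.log(stripEscapes('C:\\temp'));
```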

JavaScript regular expressions to match no digits, whitespace and selected symbols

Thanks for taking a look.
My goal is to come up with a regexp that will match input that contains no digits, whitespace or the symbols !#£$%^&*()+= or any other symbol I may choose.
I am however struggling to grasp precisely how regular expressions work.
I started out with the simple pattern /\D/, which from my understanding will match the first non-digit character it can find. This would match the string 'James' which is correct but also 'James1' which I don't want.
So, my understanding is that if I want to ensure that a pattern is not found anywhere in a given string, I use the ^ and $ characters, as in /^\D$/. Now because this will only match a single character that is not a digit, I needed to use + to specify that 1 or more digits should not be found in the entire string, giving me the expression /^\D+$/. Brilliant, it no longer matches 'James1'.
Question 1
Is my reasoning up to this point correct?
The next requirement was to ensure no whitespace is in the given string. \s will match a single whitespace and [^\s] will match the first non-whitespace character. So, from my understanding I just had to add this to what I have already to match strings that contain no digits and no whitespace. Again, because [^\s] will only match a single non-whitespace character, I used + to match one or more non-whitespace characters, giving the new regexp /^\D+[^\s]+$/.
This is where I got lost, as the expression now matches 'James1' or even 'James Smith25'. What? Massively confused at this point.
Question 2
Why is /^\D+[^\s]+$/ matching strings that contain spaces?
Question 3
How would I go about writing the regular expression I'm trying to solve?
While I am keen to solve the problem I am more interested in figuring where my understanding of regular expressions is lacking, so any explanations would be helpful.

Not quite. ^ and $ are actually "anchors": they mean "start" and "end". It's a little more complicated than that, but you can consider them to mean the start and end of a line for now; look up the various modifiers (flags) on regular expressions if you're interested in learning more. Unfortunately ^ has an overloaded meaning: used as the first character inside square brackets it means "not", which is the meaning you are already acquainted with. It's very important to understand the difference between these two meanings, and that the definition in your head applies only to character class matching!
Contributing further to your confusion is that \d means "a numerical digit" and \D means "not a numerical digit". Similarly \s means "a whitespace (space/tab/newline/etc.) character" and \S means "not a whitespace character."
It's worth noting that \d is effectively a shortcut for [0-9] (note that - has a special meaning inside square brackets), and \D is a shortcut for [^0-9].
The reason it's matching strings that contain spaces is that you've asked for "1+ non-digit characters followed by 1+ non-space characters", so it'll match lots of strings! I think that perhaps you don't understand that regular expressions match bits of strings: you're not adding constraints as you go, but rather building up bits of matchers that will match corresponding bits of strings.
/^[^\d\s!#£$%^&*()+=]+$/ is the answer you're looking for - I'd look at it like this:
i. [] - match a range of characters
ii. []+ - match one or more of that range of characters
iii. [^\d\s]+ - match one or more characters that do not match \d (numerical digit) or \s (whitespace)
iv. [^\d\s!#£$%^&*()+=]+ - here's a bunch of other characters I don't want you to match
v. ^[^\d\s!#£$%^&*()+=]+$ - now there are anchors applied, so this matcher has to apply to the whole line otherwise it fails to match
A useful website to explore regexes is http://regexr.com/3b9h7, which I supply with my suggested solution as an example. Edit: Pruthvi Raj's link to Debuggex is awesome!
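The difference between the attempted pattern and the suggested one can be checked directly in JavaScript. A small sketch using the names from the question as inputs; the variable names attempt and suggested are arbitrary.

```javascript
// The pattern from the question: 1+ non-digits, then 1+ non-whitespace.
const attempt = /^\D+[^\s]+$/;

// The suggested pattern: every character must be outside the listed set.
const suggested = /^[^\d\s!#£$%^&*()+=]+$/;

for (const input of ['James', 'James1', 'James Smith25']) {
  console.log(input, attempt.test(input), suggested.test(input));
}
```

For 'James Smith25', \D+ can consume "James Smith" (the space is a non-digit) and [^\s]+ then consumes "25", which is why the attempted pattern accepts it.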

Is my reasoning up to this point correct?
Almost. /\D/ matches any character other than a digit, but not just the first one (if you use g option).
and [^\s] will match the first non-whitespace character
Almost, [^\s] will match any non-whitespace character, not just the first one (if you use g option).
Why is /^\D+[^\s]+$/ matching strings that contain spaces?
Because \D+ in /^\D+[^\s]+$/ can match spaces (a space is not a digit).
Conclusion:
Use
^[^\d\s!#£$%^&*()+=]+$
It will match strings that contain no digits, no whitespace, and none of the symbols you do not allow.
Mind that to match a literal -, ] or [ inside a character class, you either need to escape them or place them where they cannot be misread (for example, - at the start or end of the class). To play it safe, escape them.

Just insert every character you don't want to include in a negated character class as follows:
^[^\s\d!#£$%^&*()+=]*$
^ - start of the string
[^...] - matches one character that is not in `...`
\s - matches a whitespace (space, newline,tab)
\d - matches a digit from 0 to 9
* - a quantifier that repeats the immediately preceding element 0 or more times
so the regex matches any string that
1. has a beginning
2. contains 0 or more characters that are not whitespace, digits, or any of the symbols included in the character class (in this example !#£$%^&*()+=), i.e., characters that are not listed in the character class `[...]`
3. has an ending
NOTE:
If the symbols you want to exclude also include a hyphen (-), don't put it between other characters, because there it is a metacharacter that defines a range. Put it at the end of the character class.
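The effect of hyphen placement is easy to check in JavaScript. In this sketch the three classes and their variable names are made up for illustration; they contain the same three symbols in different arrangements.

```javascript
// "!-=" in the middle of a class is a range from "!" (U+0021) to "=" (U+003D),
// which happens to include the digits.
const asRange = /[!-=]/;

// With the hyphen last, the class is just "!", "=" and a literal "-".
const hyphenLast = /[!=-]/;

// Escaping the hyphen works in any position.
const hyphenEscaped = /[!\-=]/;

console.log(asRange.test('5'));       // true: "5" falls inside the range
console.log(hyphenLast.test('5'));    // false
console.log(hyphenLast.test('-'));    // true
console.log(hyphenEscaped.test('5')); // false
console.log(hyphenEscaped.test('-')); // true
```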

RegExp to match hashtag at the beginning of the string or after a space

I have looked through previous questions and answers, however they do not solve the following:
https://stackoverflow.com/questions/ask#notHashTag
The closest I got to is this: (^#|(?:\s)#)(\w+), which finds the hashtag in half the necessary cases and also includes the leading space in the returned text. Here are all the cases that need to be matched:
#hashtag
a #hashtag
a #hashtag world
cool.#hashtag
##hashtag, but only until the comma and starting at second hash
#hashtag#hashtag two separate matches
And these should be skipped:
https://stackoverflow.com/questions/ask#notHashTag
Word#notHashTag
#ab is too short to be a hashtag, 3 characters minimum

This should work for everything but #hashtag#duplicates. JS didn't support lookbehind when this was written (it was added in ES2018), so matching that case with this pattern alone probably isn't possible.
\B#\w{3,}
\B is designed to match only between two word characters or two non-word characters. Since # is a non-word character, this forces the match to be preceded by a space or punctuation, or the beginning of the string.
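A sketch that runs \B#\w{3,} over shortened versions of the cases from the question. The findTags helper name is an assumption for illustration.

```javascript
const hashtag = /\B#\w{3,}/g;

// Hypothetical helper: all hashtags found in text (empty array if none).
function findTags(text) {
  return text.match(hashtag) || [];
}

const cases = [
  '#hashtag',
  'a #hashtag world',
  'cool.#hashtag',
  '##hashtag, rest',
  '#hashtag#hashtag',
  'https://stackoverflow.com/questions/ask#notHashTag',
  'Word#notHashTag',
  '#ab is too short',
];

for (const text of cases) {
  console.log(text, '->', findTags(text));
}
```

Only the first tag of #hashtag#hashtag is found, since the second # sits right after a word character, where \B cannot match.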

Try this regex:
(?:^|[\s.])(#+\w{3,})(#+\w{3,})?
Online Demo: http://regex101.com/r/kG1nD5
