Regex splitting on newline outside of quotes - javascript

I want to split a stream of data on new lines that are NOT within double quotes. The stream contains rows of data, where each row is separated by a newline. However, the rows of data can potentially contain newlines within double quotes. These newlines do not signify that the next row of data has started, so I want to ignore them.
So the data might look something like this:
Row 1: bla bla, 12345, ...
Row 2: "bla
bla", 12345, ...
Row 3: bla bla, 12345, ...
I tried using the regex from a similar post about splitting on commas not found within double quotes (Splitting on comma outside quotes) by replacing the comma with the newline character:
\n(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)
This regex doesn't match where I'd expect it to though. Am I missing something?

Here are two ways of doing that.
#1
You can match the regular expression
[^"\r\n]+(?:"[^"]*"[^"\r\n]+)*
The expression can be broken down as follows.
[^"\r\n]+ # match one or more characters other than those in the
# character class
(?: # begin non-capture group
"[^"]*" # match double-quote followed by zero or more characters
# other than a double-quote, followed by a double-quote
[^"\r\n]+ # match one or more characters other than those in the
# character class
)* # end non-capture group and execute it zero or more times
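As a rough illustration of approach #1, the sketch below applies the expression with the global flag to rows modelled on the question's sample. The variable names and the sample string are assumptions made for this example. Note that, as written, the pattern expects every row to begin and end with at least one character that is not a double-quote, so rows that start or finish with a quoted field would need a different pattern.

```javascript
// Sample input modelled on the question: row 2 has a newline inside quotes.
const sampleRows =
  'Row 1: bla bla, 12345\nRow 2: "bla\nbla", 12345\nRow 3: bla bla, 12345';

// Approach #1: match whole rows instead of splitting on the separators.
const rowPattern = /[^"\r\n]+(?:"[^"]*"[^"\r\n]+)*/g;
const matchedRows = sampleRows.match(rowPattern) ?? [];

console.log(matchedRows.length); // 3
console.log(JSON.stringify(matchedRows[1])); // "Row 2: \"bla\nbla\", 12345"
```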
#2
Matching line terminators that are not between double-quotes is equivalent to matching line terminators that are preceded, from the beginning of the string, by an even number of double quotes. You can match such line terminators with the following regular expression (with the multi-line flag not set, so that ^ matches the beginning of the string, not the beginning of a line).
/(?<=^[^"]*(?:"[^"]*"[^"]*)*)\r?\n/
JavaScript's regex engine (which impressively supports variable-length lookbehinds) performs the following operations.
(?<= : begin positive lookbehind
^ : match beginning of string (not line)
[^"]* : match 0+ chars other than '"'
(?: : begin non-capture group
"[^"]*" : match '"', 0+ chars other than '"', '"'
[^"]* : match 0+ chars other than '"'
)* : end non-capture group and execute 0+ times
) : end positive lookbehind
\r?\n : match line terminator
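A minimal sketch of approach #2, assuming an engine with lookbehind support (ES2018 or later). The sample string and variable names are invented for the example; the first row ends with \r\n to exercise the optional \r.

```javascript
// Hypothetical chunk of the stream; row 2 contains a newline inside quotes.
const streamChunk =
  'Row 1: bla bla, 12345\r\nRow 2: "bla\nbla", 12345\nRow 3: bla bla, 12345';

// Approach #2: split on line terminators preceded by an even number of quotes.
const lineBreakOutsideQuotes = /(?<=^[^"]*(?:"[^"]*"[^"]*)*)\r?\n/;
const splitRows = streamChunk.split(lineBreakOutsideQuotes);

console.log(splitRows.length); // 3
console.log(JSON.stringify(splitRows[1])); // "Row 2: \"bla\nbla\", 12345"
```

Because the lookbehind is anchored at the start of the string, each candidate line terminator is checked against everything before it, so this may become slow on long inputs.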

Related

Regex to capture first group in brackets before a match. Here: hexcolor within {my text}{#F00} [duplicate]

I have a text like this;
[Some Text][1][Some Text][2][Some Text][3][Some Text][4]
I want to match [Some Text][2] with this regex;
/\[.*?\]\[2\]/
But it returns [Some Text][1][Some Text][2]
How can I match only [Some Text][2]?
Note: Some Text can contain any character, including [ and ], and the numbers in square brackets can be any number, not only 1 and 2. The Some Text that I want to match can be at the beginning of the line, and there can be multiple Some Texts.
The \[.*?\]\[2\] pattern works like this:
\[ - finds the leftmost [ (as the regex engine processes the string input from left to right)
.*? - matches any 0+ chars other than line break chars, as few as possible, but as many as needed for a successful match, as there are subsequent patterns, see below
\]\[2\] - ][2] substring.
So, the .*? gets expanded upon each failure until it finds the leftmost ][2]. Note that lazy quantifiers do not guarantee the "shortest" matches.
Solution
Instead of a .*? (or .*) use negated character classes that match any char but the boundary char.
\[[^\]\[]*\]\[2\]
Here, .*? is replaced with [^\]\[]* - 0 or more chars other than ] and [.
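The difference between the two patterns can be checked with a short sketch like the one below, which runs both against the string from the question. The variable names are assumptions made for the example.

```javascript
const bracketText = '[Some Text][1][Some Text][2][Some Text][3][Some Text][4]';

// Lazy dot: starts at the leftmost "[" and expands until "][2]" is found.
const lazyMatch = bracketText.match(/\[.*?\]\[2\]/)[0];

// Negated character class: cannot cross a "[" or "]" boundary.
const negatedMatch = bracketText.match(/\[[^\]\[]*\]\[2\]/)[0];

console.log(lazyMatch);    // [Some Text][1][Some Text][2]
console.log(negatedMatch); // [Some Text][2]
```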
Other examples:
Strings between angle brackets: <[^<>]*> matches <...> with no < and > inside
Strings between parentheses: \([^()]*\) matches (...) with no ( and ) inside
Strings between double quotation marks: "[^"]*" matches "..." with no " inside
Strings between curly braces: \{[^{}]*} matches {...} with no { and } inside
In other situations, when the starting pattern is a multichar string or complex pattern, use a tempered greedy token, (?:(?!start).)*?. To match abc 1 def in abc 0 abc 1 def, use abc(?:(?!abc).)*?def.
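A small sketch of the tempered greedy token on the string from the sentence above, compared with a plain lazy dot. The variable names are made up for the example.

```javascript
const temperedSample = 'abc 0 abc 1 def';

// Plain lazy dot: begins at the first "abc" and runs on to "def".
const plainLazy = temperedSample.match(/abc.*?def/)[0];

// Tempered token: each consumed char must not start another "abc".
const tempered = temperedSample.match(/abc(?:(?!abc).)*?def/)[0];

console.log(plainLazy); // abc 0 abc 1 def
console.log(tempered);  // abc 1 def
```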
You could try the below regex,
(?!^)(\[[A-Z].*?\]\[\d+\])

regex to replace regular quotes with curly quotes

I have a block of text where the opening and closing quotes are the same:
"Hey", How are you? "Hey there"... “Some more text” and some more "here".
Please note that the quote character is " and not “ ” these characters
(["'])(?:(?=(\\?))\2.)*?\1
I want to replace each opening " character with “
so that the text looks like this:
“Hey", How are you? “Hey there"... “Some more text” and some more “here".
Then, in a second pass, I can simply find the leftover " occurrences and replace them with ”,
which would give the expected output:
“Hey”, How are you? “Hey there”... “Some more text” and some more “here”.
My preference would be for the solution given by @WiktorStribiżew in a comment on the question, but I wish to give an alternative solution that may be of interest to some readers.
The second replacement of the remaining (trailing) double-quotes (i.e., ASCII 34) is straightforward, so I will not discuss that.
You could match leading double-quotes with the following regular expression, and then replace each match with “:
"(?=(?:(?:[^"]*"){2})*[^"]*"[^"]*$)
This regex is based on the observation that we want to identify all double-quotes that are followed later in the string by an odd number of double-quotes (assuming the string contains an even number of double-quotes).
The regular expression can be broken down as follows.
" # match a double-quote (dq)
(?= # begin a positive lookahead
(?: # begin a non-capture group
(?: # begin a non-capture group
[^"]*" # match 0+ chars other than dq then match dq
){2} # end non-capture group and execute it twice
)* # end non-capture group and execute it 0+ times
[^"]*"[^"]* # match dq preceded and followed by 0+ non-dq chars
$ # match end of string
) # end positive lookahead
If the data set is large it may be advisable to perform some benchmarking to see if execution speed is satisfactory.
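The two passes could be sketched as follows, using the sample text from the question. The variable names are assumptions for the example, and the second pass is the plain replacement of the remaining straight quotes described above.

```javascript
const quotedText =
  '"Hey", How are you? "Hey there"... “Some more text” and some more "here".';

// Pass 1: a straight quote followed by an odd number of straight quotes
// is an opening quote.
const openingQuote = /"(?=(?:(?:[^"]*"){2})*[^"]*"[^"]*$)/g;
const afterFirstPass = quotedText.replace(openingQuote, '“');

// Pass 2: every straight quote that is left must be a closing quote.
const curlyText = afterFirstPass.replace(/"/g, '”');

console.log(curlyText);
// “Hey”, How are you? “Hey there”... “Some more text” and some more “here”.
```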

Need a regex with some validations for address field

Allowed characters are [a-zA-Z0-9- /.#,] but
A blank must precede the pound sign.
There must not be a blank immediately before or after a dash.
Address must not begin with #, -, or /.
Address must not end with #, -, or /.
A slash must be surrounded in numerics.
Triple alphas are not allowed immediately following a numeric.
No single characters in the address field with the exception of N, S, E and W
Can anyone suggest how to do this? Any help is greatly appreciated. Thanks in advance.
In regexes, it's usually easier to think of what's allowed, rather than what's forbidden. For instance:
Blank must precede the pound sign is better rephrased as "blank followed by pound sign is allowed", so one of the components will be: ( #)
Must not start/end with #-/. is something like ^[a-zA-Z0-9]...[a-zA-Z0-9]$ (with a more generous pattern in the middle).
If you're trying to validate addresses, as in postal addresses, consider whether this is a useful thing to do; there are many surprising quirks with addresses. What problem are you trying to solve? Is it worth rejecting some valid addresses in order to solve that problem?
As others have commented, post what you've already tried and what didn't work.
Use one of the interactive regex tools like regex101 to help, especially if you have examples of texts that should and shouldn't match.
I have assumed a "blank" is a space and have not implemented requirement #7 because I do not understand what it means. Once #7 has been clarified I will attempt to amend the following regular expression.
^(?!#|.*[^ ]#)(?!.*(?: -|- ))(?!.*(?:\D\/|\/\D))(?!.*\d[A-Za-z]{3})(?![/-])[a-zA-Z0-9- /.#,]*(?<![#/-])$
Notice that the regex contains 5 negative lookaheads ((?!...)), three of which begin by matching zero or more characters other than line terminators ((?!.*...)), one negative lookbehind at the end ((?<![#/-])), and no capture groups. The negative lookaheads implement the following assertions (in order):
the pound sign, '#', must be preceded by a space;
a hyphen cannot be preceded or followed by a space;
forward slashes must be preceded and followed by a digit;
digits may not be followed by 3 letters; and
the string may not begin with '/' or '-'.
Note that the first of these requirements ensures the string does not begin with a pound sign.
JavaScript's regex engine performs the following operations.
^ : match beginning of string
(?! : begin negative lookahead
# : match '#'
| : or
.*[^ ]# : match 0+ chars then char other than a space then '#'
) : end negative lookahead
(?! : begin negative lookahead
.* : match 0+ chars other than newlines
(?: -|- ) : match ' -' or '- '
) : end negative lookahead
(?! : begin negative lookahead
.* : match 0+ chars other than newlines
(?: : begin a non-capture group
\D\/ : match a non-digit followed by '/'
| : or
\/\D : match '/' followed by a non-digit
) : end non-capture group
) : end negative lookahead
(?! : begin negative lookahead
.* : match 0+ chars other than newlines
\d[A-Za-z]{3} : match a digit, then 3 letters
) : end negative lookahead
(?! : begin negative lookahead
[/-] : match '/' or '-'
) : end negative lookahead
[a-zA-Z0-9- /.#,]* : match 0+ chars in char class
(?<![#/-]) : negative lookbehind: previous char is not '#', '/' or '-'
$ : match end of string
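To see the assertions in action, the sketch below runs the expression against a few hypothetical addresses, each chosen to exercise one rule. The sample addresses and variable names are assumptions made for this example, and requirement #7 is still not covered.

```javascript
const addressPattern =
  /^(?!#|.*[^ ]#)(?!.*(?: -|- ))(?!.*(?:\D\/|\/\D))(?!.*\d[A-Za-z]{3})(?![/-])[a-zA-Z0-9- /.#,]*(?<![#/-])$/;

// Hypothetical sample addresses, one per rule.
const sampleAddresses = [
  '12 Main St #4', // accepted: '#' is preceded by a space
  '1/2 Main St',   // accepted: '/' is surrounded by digits
  '12 Main St#4',  // rejected: '#' is not preceded by a space
  '12 - Main St',  // rejected: space next to a hyphen
  '12 Main St/A',  // rejected: '/' is not surrounded by digits
  '12abc Street',  // rejected: three letters right after a digit
  '-12 Main St',   // rejected: begins with '-'
  '12 Main St-',   // rejected: ends with '-'
];

const addressResults = sampleAddresses.map((address) => addressPattern.test(address));

sampleAddresses.forEach((address, i) => {
  console.log(JSON.stringify(address), addressResults[i]);
});
```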

Remove all special characters except for # symbol from string in JavaScript

I have a string:
“Gazelles were mentioned by #JohnSmith while he had $100 in his pocket and screamed W#$#%#$!!!!"
I need:
“Gazelles were mentioned by #JohnSmith while he had 100 in his pocket and screamed"
How to remove all special characters from string EXCEPT the # symbol. I tried:
str.replace(/[^\w\s]/gi, '')
If you want to keep a # only when it is followed by a word character (keeping the W is also fine) and also want to remove newlines, you could change the \s to a character class that matches only spaces or tabs, [ \t].
Add the # to the negated character class, and use an alternation with a negative lookahead to match a # only when it is not followed by a word character.
[^\w \t#]+|#(?!\w)
[^\w \t#]+ Match 1+ times any char except a word char, space or tab
| Or
#(?!\w) Match a # not directly followed by a word char
In the replacement use an empty string.
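Applied to the string from the question, the replacement could look like this sketch (the variable names are assumptions). As noted, the W is kept and the surrounding quote characters are removed.

```javascript
const rawText =
  '“Gazelles were mentioned by #JohnSmith while he had $100 in his pocket and screamed W#$#%#$!!!!"';

// Remove runs of chars that are not word chars, spaces, tabs or '#',
// and remove any '#' that is not directly followed by a word char.
const cleanedText = rawText.replace(/[^\w \t#]+|#(?!\w)/g, '');

console.log(cleanedText);
// Gazelles were mentioned by #JohnSmith while he had 100 in his pocket and screamed W
```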

Regex match a dollar, without a backslash before it

I want to match only a dollar symbol without a backslash immediately before, as demonstrated below:
$not\$yes $
^.........^
So far, I have [^\\]\$, but this doesn't match any dollar that begins a line. The dollar could be the first symbol in the document, so matching a newline would not work. How do I match this? Is the regex I have so far even right?
You could use an alternation with the ^ anchor in order to match the $ character literally if it is the first character in the string or if it follows a character that is not a backslash.
/(?:^|[^\\])\$/
Explanation:
(?: - Start of a non-capturing group that is used to group the alternation.
^|[^\\] - Alternation that matches the start of the string using the ^ anchor or match a non-\ character
) - Close the non-capturing group that was used to group ^|[^\\]
\$ - The $ character literally
In other words, the ^ anchor will match the start of the string; while [^\\] will match anything but a backslash. The pipe | acts as an "or" operator that will match the start of the string or anything but a backslash (i.e., ^|[^\\]).
So in the string you provided, the first and last $ characters would be matched.
Use a negative lookbehind assertion
(?<!\\)\$
In Action: https://regex101.com/r/dA8aA1/1
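The two answers can be compared with a short sketch on the question's sample string. The variable names are assumptions, and the lookbehind version needs an engine with ES2018 lookbehind support.

```javascript
// The question's sample, written as a JS string literal: $not\$yes $
const dollarSample = '$not\\$yes $';

// Alternation: the match also includes the character before the dollar, if any.
const viaAlternation = [...dollarSample.matchAll(/(?:^|[^\\])\$/g)].map((m) => m[0]);

// Lookbehind: only the dollar itself is matched, so we record its index.
const viaLookbehind = [...dollarSample.matchAll(/(?<!\\)\$/g)].map((m) => m.index);

console.log(viaAlternation); // [ '$', ' $' ]
console.log(viaLookbehind);  // [ 0, 10 ]
```

Since the alternation consumes the preceding character, two adjacent dollars such as $$ yield only one match with it, whereas the lookbehind matches both.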
