Regex to match instances of X unless they're within comments

Regex to match instances of X unless they're within comments - javascript

I have this regex that matches an 'if' statement if it follows a comment:
(?:(\/\*[\w\s\']*)|(\/\/[\w\s\']*))if
What I need is to find all instances of 'if' as long as they don't match the regex above, so if the 'if' statement is within comments ignore it, otherwise match it. Is this possible with Regex alone? This feels like its very close.

How about first cleaning the code from comments using REGEX and then make a second run on the clean code?
I did it in SQL using REGEXP_REPLACE with a PCRE dialect but I'm sure you'll get the hang of it
SQL/Regex Challenge/Puzzle: How to remove comments from SQL code (by using SQL query)?
P.s.
You might also want to remove string using the same concept.

Related

Regular expression matching multiple entries, spanning multiple lines [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?

Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:

Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except a whitespace and this is exactly what you want.

Well, ".*" is a greedy selector. You make it non-greedy by using ".*?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ".*?". This means that if for instance nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, #"Project name: (?<name>.*?) J\d+").Groups["name"].Value;

I would also recommend you experiment with regular expressions using "Expresso" - it's a utility a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people unexprienced with regex might not be familiar with, in a way that it would be easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm

(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*

Regex for repeated sub strings in a lengthy string [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?

Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:

Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except a whitespace and this is exactly what you want.

Well, ".*" is a greedy selector. You make it non-greedy by using ".*?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ".*?". This means that if for instance nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, #"Project name: (?<name>.*?) J\d+").Groups["name"].Value;

I would also recommend you experiment with regular expressions using "Expresso" - it's a utility a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people unexprienced with regex might not be familiar with, in a way that it would be easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm

(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*

Regular expression: matching word boundaries [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?

Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:

Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except a whitespace and this is exactly what you want.

Well, ".*" is a greedy selector. You make it non-greedy by using ".*?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ".*?". This means that if for instance nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, #"Project name: (?<name>.*?) J\d+").Groups["name"].Value;

I would also recommend you experiment with regular expressions using "Expresso" - it's a utility a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people unexprienced with regex might not be familiar with, in a way that it would be easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm

(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*

Highlighting HTML parts matching a perl regex - do it in server side perl or client side javascript?

A Perl CGI application is providing a search function. The application writes matching snippets to the HTML page. Now I would like to highlight the matches inside the snippets. I could use something like
s/($searchregex)/<span class="highlight">$1<\/span>/gi
to highlight the matches. This is working fine for text only cases, but breaks sometimes with snippets containing itself HTML tag, e.g. for links or images with references. In failing cases the above replacement is destroying the HTML links by inserting the span tag inside the href value.
At the moment I am seeing three possible solutions:
Write a regex that is not replacing matches inside of html tags, e.g. inside <>. I am not aware how to write a replacement regex for this case. Is there a perl regex to allow this replacement and how does it look like?
Write a regex that replaces all wrong replacements of the above replacement. This would fix the wrong span tags inside the href.
Use Javascript to highlight the matches inside the resulting DOM tree. Possible ways using jQuery are outlined in highlight html with matching text. Even normal Javascript may be enough JavaScript’s Regular Expression Flavor. There are special jQuery plugins for highlighting highlight regular expressions , too. I am new to Javascript so some more advise is appreciated, too.
What is the preferable solution? The best way would to it as 1. - but that seems not possible. So the remaining question is: Do the work in an ugly way on the server side or introduce Javascript to solve the problem in a cleaner way on the client side.

in perl with a lookahead after pattern
s/($searchregex)(?=[^>]*<)/<span class="highlight">$1<\/span>/gi
or shorter
s/$searchregex(?=[^>]*<)/<span class="highlight">$&<\/span>/gi
but maybe you will need to read the whole file in a string or change the input record separator ($/) to '<', because the regexp matches the pattern if it's followed by a sequence of any character except '>' and by '<' because will not match if ($/="\n" and there is a newline between pattern and next '<'.

You could use an HTML parser on the server side, which is the correct tool for the job you are doing.
Or you could do it with javascript as you say, which I prefer myself as it is more versatile, and could lead to more interactivity, although you would probably be facing a similar issue to what you are facing now (just that you have moved it to the client side).
It is actually a more complex question than it first appears. Without more information, it is impossible to try to come up with a better solution.
One good solution would be to traverse the DOM tree and match against each text node, but you have a problem then that you would not match text that spans several text nodes - for example "John the Con Johnson" would not match the search for "John the Con" as they would be in separate nodes. This might or might not be a problem for you, depending on your use case.

Finding beginning and end quotations

I'm starting to write a code syntax highlighter in JavaScript, and I want to highlight text that is in quotes (both "s and 's) in a certain color. I need it be able to not be messed up by one of one type of quote being in the middle of a pair of the other quotes as well, but i'm really not sure where to even start. I'm not sure how I should go about finding the quotes and then finding the correct end quote.

Unless you're doing this for the challenge, have a look at Google Code Prettify.
For your problem, you could read up on parsing (and lexers) at Wikipedia. It's a huge topic and you'll find that you'll come upon bigger problems than parsing strings.
To start, you could use regular expressions (although they rarely have the accuracy of a true lexer.) A typical regular expression for matching a string is:
/"(?:[^"\\]+|\\.)*"/
And then the same for ' instead of ".
Otherwise, for a character-by-character parser, you would set some kind of state that you're in a string once you hit ", then when you hit " that is not preceded by an uneven amount of backslashes (an even amount of backslashes would escape eachother), you exit the string.

You can find quotes using regular expressions but if you're writing a syntax highlighter then the only reliable way is to step through the code, character by character, and decide what to do from there.
E.g. of a Regex
/("|')((?:\\\1|.)+?)\1/g
(matches "this" and 'this' and "thi\"s")

use stack.. if unmatched quote found push it.. if match found pop

I did it with a single regular expression in php using backwards references. JS does not support it and i think that's what you need if you really want to detect undefined backslashes.

We Keep Coding

JavaScript is the programming language of the Web.