I am currently trying to parse out comments in a particular format using JavaScript. While I have a basic understanding of regular expressions, with this one I seem to have reached my current limit. This is what I am trying to do:
The comments
//
This is a
multiline comment
Code here
//
This is another
comment
Again, code here
The regex currently looks like this:
\/\/\n(\s+[\s\S]+)
\/\/\n matches the // sequence including the newline.
Since I am interested in the comments, I am opening a capture group.
\s+ matches the indentation. I could probably be a bit more precise by only accepting tabs or spaces in a particular count – for me this is not relevant
[\s\S] is supposed to match the actual words and the spaces between the words.
This seems to currently match the whole file, which is not what I want. What I now can't wrap my head around is how to solve this?
I think my problem is related to me not knowing how to think about regexes. Is it like a program that matches line per line, so I need to work more on the quantifiers? Or is there maybe a way to stop at lines only consisting of a newline? When I try to match for the newline character, I of course receive matches at each line ending, which is not helpful.
You may use
/^\/\/((?:\r?\n[^\S\r\n].*)*)/gm
See the regex demo
Details
^ - start of a line (due to m modifier, ^ also matches line start positions)
\/\/ - a // string
((?:\r?\n[^\S\r\n].*)*) - Capturing group 1: zero or more repetitions of
\r?\n - a CRLF or LF line ending
[^\S\r\n] - any whitespace but CR and LF
.* - the rest of the line.
JS demo:
var text = "//\n This is a\n multiline comment\n\nCode here\n\n\n//\n This is another\n comment\n\nAgain, code here";
var regex = /^\/\/((?:\r?\n[^\S\r\n].*)*)/gm, m, results=[];
while (m = regex.exec(text)) {
results.push(m[1].trim());
}
console.log(results);
Related
I am trying to match each section of a string starting with one or more + and ending whenever it finds another instance of it or the end of the sample.
The sample is the following:
+ Horizontal Rules
Lorem ipsum
+ Lists
Everything about lists
++ Bulleted Lists
Make a list element by starting a line with an asterisk. To increase the indent put extra spaces
before the asterisk.
+ Another title
content
The regex I'm currently using (/^(\++) (.*?)(?=^\++)/gms) will split the sample into different matches. But in this case, the last section isn't matched, since the look-ahead doesn't find anything:
+ Another title
content
I tried using \z, which is supposed to match the end of the sample in a multiline regex :
^(\++) (.*?)(?=^\++|\z)
But unfortunately, the JavaScript regex engines haven't implemented this basic feature, making it look for a literal z instead.
Is there any way to replicate \z behavior ?
I've tried the following regex without any concrete results : ^(\+)+ (.*?)(?=^\++)|(?:$(?!.*))
You may use
/^(\++) (.*?)(?=^\+|$(?!.))/gms
The $(?!.) part matches the end of a line that is not followed with any char (. with s modifier matches any char).
However, it is more efficient to unroll the lazy dot matching part to grab whole lines that do not start with +:
/^(\++) (.*(?:\r?\n(?!\+).*)*)/gm
See the regex demo. Note the absence of the s flag.
Details
^ - start of a line (due to m flag)
(\++) - Capturing group 1: one or more + symbols
- a space
(.*(?:\r?\n(?!\+).*)*) - Capturing group 2:
.* - 0+ chars other than line break chars, as many as possible
(?:\r?\n(?!\+).*)* - 0 or more repetitions of
\r?\n(?!\+) - a CRLF or LF line ending that is not immediately followed with + (if you want, you may make it safer if you use (?!\++ ))
.* - 0+ chars other than line break chars, as many as possible
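To check that the unrolled pattern also picks up the final section, here is a small sketch that runs it over a shortened version of the sample. The names sample, unrolled and sections are illustrative and not from the original post, and matchAll assumes an ES2020-capable engine.

```javascript
// Shortened version of the sample from the question (illustrative data).
const sample = [
  "+ Horizontal Rules",
  "Lorem ipsum",
  "+ Lists",
  "Everything about lists",
  "++ Bulleted Lists",
  "Make a list element",
  "+ Another title",
  "content",
].join("\n");

// Unrolled pattern: a heading line, then any following lines
// that do not start with "+".
const unrolled = /^(\++) (.*(?:\r?\n(?!\+).*)*)/gm;

const sections = [...sample.matchAll(unrolled)].map((m) => ({
  level: m[1].length, // number of leading "+" characters
  body: m[2],         // heading text plus the lines that follow it
}));

console.log(sections.length); // 4
console.log(sections[sections.length - 1]); // the last section is captured too
```

Since no lookahead for the end of input is involved, the last section needs no special handling here.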
I want to split up a string (sentence) in an array of words and keep the delimiters.
I have found and I am currently using this regex for this:
[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)
An explanation can be found here: http://regex101.com/
This works exactly as I want it to and effectively makes a string like
This is a sentence.
To an array of
["This", "is", "a", "sentence."]
The problem here is that it does not include spaces nor newlines. I want the string to be parsed as words as it already does but I also want the corresponding space and or newline character to belong to the previous word.
I have read about positive lookahead that should look for future characters (space and or newline) but still take them into account when extracting the word. Although this might be the solution I have failed to implement it.
If it makes any difference I am using JavaScript and the following code:
//save the regex -- g modifier to get all matches
var reg = /[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)/g;
//define variable for holding matches
var matches;
//loop through each match
while(matches = reg.exec(STRING_HERE)){
//the word without spaces or newlines
console.log(matches[0]);
}
The code works but as I said, it does not include spaces and newline characters.
You can try something simpler:
str.split(/\b(?!\s)/);
However, note that non-word characters (e.g. a full stop) will be treated as another word:
"This is a sentence.".split(/\b(?!\s)/);
// [ "This ", "is ", "a ", "sentence", "." ]
To fix that, you can use a character class with the characters that shouldn't begin another word:
str.split(/\b(?![\s.])/);
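For comparison with the output above, here is the same sentence run through the second pattern. This is only a quick sketch; the variable names are made up for the example.

```javascript
const sentence = "This is a sentence.";

// \b alone would also split before each space and before the full stop;
// the lookahead rejects boundaries that are followed by whitespace or ".".
const words = sentence.split(/\b(?![\s.])/);

console.log(words); // [ "This ", "is ", "a ", "sentence." ]
```

Other trailing punctuation (for example "!" or "?") would have to be added to the character class in the same way.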
function split_string(str){
var arr = str.split(" ");
var last_i = arr.length - 1;
for(var i=0; i<last_i; i++){
arr[i]+=" ";
}
return arr;
}
It may be as simple as this:
var sentence = 'This is a sentence.';
sentence = sentence.split(' ').join(' ||');
sentence = sentence.split('\n').join('\n||');
var matches = sentence.split('||');
Note that I use 2 pipes as a delimiter, but of course you can use anything as long as it's unique.
Also note that I only split \n as a newline, but you may add \r\n or whatever you want to split as well.
General Solution
To keep the delimiters conjoined in the results, the regex needs to be a zero-width match. In other words, the regex can be thought of as matching the point between a delimiter and non-delimiter, rather than matching the delimiters themselves. This can be achieved with zero-width matching expressions, matching before, at, or after the split point (at most one each); let's call these A, B, and C. Sometimes a single sub-expression will do it, others you'll need two; offhand, I can't think of a case where you'd need three.
Lookarounds in general, not only lookaheads, are the perfect candidates for this purpose: lookbehinds ((?<=...)) to match before the split point, and lookaheads ((?=...)) after. That's the essence of this approach. Positive or negative lookarounds can be used. The one pitfall is that lookbehinds are relatively new to JS regexes, so not all browsers or other JS engines will support them (at the time of writing, current versions of Firefox, Chrome, Opera, Edge, and node.js did; Safari did not). If you need to support a JS engine that doesn't support lookbehinds, you might still be able to write and use a regex that matches at-and-after (BC).
To have the delimiters appear at the end of each match, put them in A. To have them at the start, in C. Fortunately, JS regexes do not place restrictions on lookbehinds, so simply wrapping the delimiter regex in the positive lookaround markers should be all that's required for delimiters. If the delimiters aren't so simple (i.e. context-sensitive), it might take a little more work to write the regex, which doesn't need to match the entire delimiter.
Paired with the delimiter pattern, you'll need to write a pattern that matches the start (for C) or end (for A) of the non-delimiter. This step is likely the one that will require the most additional work.
The at-split-point match, B
will often (always?) be a simple boundary, such as \b.
Specific Solution
If spaces are the only delimiters, and they're to appear at the end of each match, the delimiter pattern would be (?<=\s), in A. However, there are some cases not covered in the problem description. For example, should words separated by only punctuation (e.g. "x.y") be split? Which side of a split point should quotation marks and hyphens appear, if any? Should they count as punctuation? Another option for the delimiter is to match (after) all non-word characters, in which case A would be (?<=\W).
Since the split-point is at a word boundary, B could be \b.
Since the start of a match is a word character, (?=\w) will suffice for C.
Any two of those three should suffice. One that is perhaps clearest in meaning (and splits at the most points) is /(?<=\W)(?=\w)/, which can be translated as "split at the start of each word". \b could be added, if you find it more understandable, though it has no functional effect: /(?<=\W)\b(?=\w)/.
Note Oriol's excellent solutions are given by B=\b and (C=(?!\s) or C=(?![\s.])).
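As a rough illustration of the A/B/C combinations described above, the sketch below splits a two-line string with the lookbehind form (A and C) and with a lookbehind-free form (B and C). The sample string and variable names are assumptions made for the example, and the first pattern needs an engine with lookbehind support.

```javascript
const sentence = "This is a sentence.\nAnd another one.";

// A = (?<=\W): the previous character is a non-word character (the delimiter).
// C = (?=\w):  the next character starts a word.
const withLookbehind = sentence.split(/(?<=\W)(?=\w)/);

// B = \b, C = (?=\w): a word boundary followed by a word character,
// i.e. the start of a word, written without a lookbehind.
const withoutLookbehind = sentence.split(/\b(?=\w)/);

console.log(withLookbehind);
// [ "This ", "is ", "a ", "sentence.\n", "And ", "another ", "one." ]
console.log(withoutLookbehind);
// same result for this input
```

Both keep spaces, newlines and trailing punctuation attached to the preceding word.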
Additional
As a point of interest, there would be a simpler solution for this particular case if JS regexes supported TCL word boundaries: \m matches only at the start of a word, so str.split(/\m/) would split exactly at the start of each word. (\m is equivalent to (?<=\W)(?=\w).)
If you want to include the whitespace after the word, the regex \S+\s* should work.
const s = `This is a sentence.
This is another sentence.`;
console.log(s.match(/\S+\s*/g))
I'm trying to return a count of all words NOT between square brackets. So given ..
[don't match these words] but do match these
I get a count of 4 for the last four words.
This works in .net:
\b(?<!\[)[\w']+(?!\])\b
but it won't work in Javascript because it doesn't support lookbehind
Any ideas for a pure js regex solution?
Ok, I think this should work:
\[[^\]]+\](?:^|\s)([\w']+)(?!\])\b|(?:^|\s)([\w']+)(?!\])\b
You can test it here:
http://regexpal.com/
If you need an alternative with text in square brackets coming after the main text, it could be added as a second alternative and the current second one would become third.
It's a bit complicated but I can't think of a better solution right now.
If you need to do something with the actual matches you will find them in the capturing groups.
UPDATE:
Explanation:
So, we've got two options here:
\[[^\]]+\](?:^|\s)([\w']+)(?!\])\b
This is saying:
\[[^\]]+\] - match everything in square brackets (don't capture)
(?:^|\s) - followed by line start or a space. Looking at it now, the caret doesn't make sense here, so this can become just \s
([\w']+) - match all following word characters, as long as the next character is not the closing bracket ((?!\])). This lookahead is probably also unnecessary now, so let's try removing it
\b - and match word boundary
2 (?:^|\s)([\w']+)(?!\])\b
If you cannot find the option 1 - do just the word matching, without looking for square brackets as we ensured with the first part that they are not here.
Ok, so I removed all the things that we don't need (they stayed there because I tried quite a few options before it worked:-) and the revised regex is the one below:
\[[^\]]+\]\s([\w']+)(?!\])\b|(?:^|\s)([\w']+)\b
I would use something like \[[^\]]*\] to remove the words between square brackets, and then explode by spaces the returned string to count the remaining words.
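A minimal sketch of that remove-then-split idea, assuming words are separated by whitespace. The function name countWordsOutsideBrackets is made up for the example.

```javascript
// Counts whitespace-separated words that are not inside [square brackets].
function countWordsOutsideBrackets(text) {
  return text
    .replace(/\[[^\]]*\]/g, " ") // drop every [bracketed] group
    .split(/\s+/)                 // "explode" on runs of whitespace
    .filter(Boolean)              // discard empty strings left at the edges
    .length;
}

console.log(
  countWordsOutsideBrackets("[don't match these words] but do match these")
); // 4
```

Nested or unbalanced brackets are not handled by this pattern.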
Chris, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a general question about how to exclude patterns in regex.)
Here's our simple regex (see it at work on regex101, looking at the Group captures in the bottom right panel):
\[[^\]]*\]|(\b\w+\b)
The left side of the alternation matches complete [bracketed groups]. We will ignore these matches. The right side matches and captures words to Group 1, and we know they are the right words because they were not matched by the expression on the left.
This program shows how to use the regex (see the count result in the online demo):
<script>
var subject = '[match ye not these words] but do match these';
var regex = /\[[^\]]*\]|(\b\w+\b)/g;
var group1Caps = [];
var match = regex.exec(subject);
// put Group 1 captures in an array
while (match != null) {
if( match[1] != null ) group1Caps.push(match[1]);
match = regex.exec(subject);
}
document.write("<br>*** Number of Matches ***<br>");
document.write(group1Caps.length);
</script>
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
I'm parsing text that is many repetitions of a simple pattern. The text is in the format of a script for a play, like this:
SAMPSON
I mean, an we be in choler, we'll draw.
GREGORY
Ay, while you live, draw your neck out o' the collar.
I'm currently using the pattern ([A-Z0-9\s]+)\s*\:?\s*[\r\n](.+)[\r\n]{2}, which works fine (explanation below) except for when the character's speech has line breaks in it. When that happens, the character's name is captured successfully but only the first line of the speech is captured.
Turning on Single-line mode (to include line breaks in .) just creates one giant match.
How can I tell the (.+) to stop when it finds the next character name and end the match?
I'm iterating over each match individually (JavaScript), so the name must be available to the next match.
Ideally, I would be able to match all characters until the entire pattern is repeated.
Pattern explained:
The first group matches a character's name (allowing capital letters, numbers, and whitespace), (with a trailing colon and whitespace optional).
The second group (character's speech) begins on a new line and captures any characters (except, problematically, line breaks and characters after them).
The pattern ends (and starts over) after a blank line.
Consider going a different direction with this. You really want to split a larger dialogue on any line that contains a name. You can do this with a regular expression still (replace the regex with whatever will match the "speaker" line):
results = "Insert script here".split(/^([A-Z]+)$/m)
On a standards-compliant implementation (note the m flag, so that ^ and $ match at line boundaries), your example text will end up in an array like so:
results[0] = ""
results[1] = "SAMPSON"
results[2] = "I mean, an we be in choler, we'll draw.
"
results[3] = "GREGORY"
results[4] = "Ay, while you live, draw your neck out o' the collar. "
A caveat is that most browsers are spotty on the standard here. You can use the library XRegExp to get cross platform behaviour.
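Here is a sketch of how the split result could be paired up into speaker/speech records, including speeches that span several lines. The function name parseScript and the speaker pattern are assumptions for this example: it treats any line made up only of capital letters, digits and spaces (with an optional trailing colon) as a speaker line, so an all-caps line inside a speech would be misread as a name.

```javascript
// Hypothetical helper: splits a script on speaker lines and pairs each
// name with the text that follows it.
function parseScript(script) {
  // With a capturing group, split() keeps the speaker names in the result:
  // [textBeforeFirstName, name1, speech1, name2, speech2, ...]
  const parts = script.split(/^([A-Z][A-Z0-9 ]*):?[ \t]*$/m);
  const lines = [];
  for (let i = 1; i + 1 < parts.length; i += 2) {
    lines.push({ speaker: parts[i].trim(), speech: parts[i + 1].trim() });
  }
  return lines;
}

const script =
  "SAMPSON\nI mean, an we be in choler,\nwe'll draw.\n\n" +
  "GREGORY\nAy, while you live, draw your neck out o' the collar.\n";

console.log(parseScript(script));
// two entries; the first speech keeps its internal line break
```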
Okay, I did a little tinkering and found something that works. It isn't super elegant, but it does the job.
([A-Z0-9\s]+)\s*\:?\s*[\r\n]((.+[\r\n]?.*)+)[\r\n]{2}
I modified the last capture group to allow endless repetitions of arbitrary text, a new line, and more arbitrary text. Since two line breaks in a row aren't allowed, the pattern ends after the speech.
I finally managed to get it to match only what you wanted, i.e.
- the name of the character, allowing for whitespaces and the colon
- and, optionally multiline with linebreaks, the text associated with the person
You would need to do findAll using this regex - it is case sensitive:
((?:[A-Z]{2,}\s*:?\s*)+)\s+((?![A-Z]{2,}\s*:?\s*).+?[.?!]\s*)+
Explanation:
((?:[A-Z]{2,}\s*:?\s*)+) - the first group captures the upper case name of the person - it will match 'GREGOR' as well as 'MANFRED THE GREATEST:'
\s+ - at least one whitespace character
Then repeat at least once:
(?![A-Z]{2,}\s*:?\s*) - look ahead to check that the next text is not the upper case character name
.+?[.?!]\s* - match everything until you find a character that ends a sentence [.?!] and optionally whitespaces
How do you split a long piece of text into separate lines? Why does this return line1 twice?
/^(.*?)$/mg.exec('line1\r\nline2\r\n');
["line1", "line1"]
I turned on the multi-line modifier to make ^ and $ match beginning and end of lines. I also turned on the global modifier to capture all lines.
I wish to use a regex split and not String.split because I'll be dealing with both Linux \n and Windows \r\n line endings.
arrayOfLines = lineString.match(/[^\r\n]+/g);
As Tim said, it is both the entire match and the capture. It appears regex.exec(string) returns on finding the first match regardless of the global modifier, whereas string.match(regex) honours global.
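The difference can be seen side by side in the sketch below, which uses the string from the question. The variable names are illustrative.

```javascript
const text = "line1\r\nline2\r\n";

// A single exec() call returns one match: [whole match, capture group 1].
const single = /^(.*?)$/mg.exec(text);
console.log(single[0], single[1]); // line1 line1

// Calling exec() repeatedly on the same global regex walks through all matches.
const lineRe = /[^\r\n]+/g;
const viaExec = [];
let m;
while ((m = lineRe.exec(text)) !== null) {
  viaExec.push(m[0]);
}

// match() with the g flag collects every whole match in one call.
const viaMatch = text.match(/[^\r\n]+/g);

console.log(viaExec);  // [ "line1", "line2" ]
console.log(viaMatch); // [ "line1", "line2" ]
```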
Use
result = subject.split(/\r?\n/);
Your regex returns line1 twice because line1 is both the entire match and the contents of the first capturing group.
I am assuming following constitute newlines
\r followed by \n
\n followed by \r
\n present alone
\r present alone
Please Use
var re=/\r\n|\n\r|\n|\r/g;
arrayofLines=lineString.replace(re,"\n").split("\n");
for an array of all Lines including the empty ones.
OR
Please Use
arrayOfLines = lineString.match(/[^\r\n]+/g);
For an array of non empty Lines
Even simpler regex that handles all line ending combinations, even mixed in the same file, and removes empty lines as well:
var lines = text.split(/[\r\n]+/g);
With whitespace trimming:
var lines = text.trim().split(/\s*[\r\n]+\s*/g);
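A quick comparison of the two variants on a made-up input with mixed line endings. One thing to watch: without the trim, a trailing line break still leaves one empty string at the end of the array.

```javascript
const text = "  first\r\n\r\n second \nthird\n";

// Runs of line breaks are collapsed, but the final "\n" still
// produces a trailing empty string.
const plain = text.split(/[\r\n]+/);
console.log(plain); // [ "  first", " second ", "third", "" ]

// Trimming first removes leading/trailing whitespace, and \s* on both
// sides of the line breaks strips the padding around each line.
const trimmed = text.trim().split(/\s*[\r\n]+\s*/);
console.log(trimmed); // [ "first", "second", "third" ]
```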
Unicode Compliant Line Splitting
Unicode® Technical Standard #18 defines what constitutes line boundaries. That same section also gives a regular expression to match all line boundaries. Using that regex, we can define the following JS function that splits a given string at any line boundary (preserving empty lines as well as leading and trailing whitespace):
const splitLines = s => s.split(/\r\n|(?!\r\n)[\n-\r\x85\u2028\u2029]/)
I don't understand why the negative look-ahead part ((?!\r\n)) is necessary, but that is what is suggested in the Unicode document.
The above document recommends to define a regular expression meta-character for matching all line ending characters and sequences. Perl has \R for that. Unfortunately, JavaScript does not include such a meta-character. Alas, I could not even find a TC39 proposal for that.
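For a feel of what the function above does, here is a small usage sketch with a made-up string that mixes several of the line terminators covered by the pattern. The function is repeated so the snippet runs on its own.

```javascript
// Same function as above, repeated so that this snippet is self-contained.
const splitLines = (s) => s.split(/\r\n|(?!\r\n)[\n-\r\x85\u2028\u2029]/);

// CRLF, LF, CR, LINE SEPARATOR, PARAGRAPH SEPARATOR, NEL, then an empty line.
const mixed = "a\r\nb\nc\rd\u2028e\u2029f\x85g\n\nh";

console.log(splitLines(mixed));
// [ "a", "b", "c", "d", "e", "f", "g", "", "h" ]
```

The empty string between "g" and "h" shows that empty lines are preserved, as stated above.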
First replace all \r\n with \n, then String.split.
http://jsfiddle.net/uq55en5o/
var lines = text.match(/^.*((\r\n|\n|\r)|$)/gm);
I have done something like this. Above link is my fiddle.