JS regex to split by line - javascript

How do you split a long piece of text into separate lines? Why does this return line1 twice?
/^(.*?)$/mg.exec('line1\r\nline2\r\n');
["line1", "line1"]
I turned on the multi-line modifier to make ^ and $ match beginning and end of lines. I also turned on the global modifier to capture all lines.
I wish to use a regex split and not String.split because I'll be dealing with both Linux \n and Windows \r\n line endings.

arrayOfLines = lineString.match(/[^\r\n]+/g);
As Tim said, line1 is both the entire match and the capture. regex.exec(string) returns only one match per call, even with the global modifier (which merely makes repeated calls resume from lastIndex), whereas string.match(regex) honours the global modifier and returns all matches at once.
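The difference between the two calls can be sketched as follows. This is a minimal illustration using the question's sample string; the variable names are made up for the example.

```javascript
const text = 'line1\r\nline2\r\n';

// One exec() call yields one match: index 0 is the whole match,
// index 1 is the first capturing group. That is why "line1" shows up twice.
const single = /^(.*?)$/mg.exec(text);

// With the g flag, repeated exec() calls on the same regex object
// resume from lastIndex, so a loop collects every match.
const re = /[^\r\n]+/g;
const viaExec = [];
let m;
while ((m = re.exec(text)) !== null) {
  viaExec.push(m[0]);
}

// match() with the g flag returns all whole matches in a single call.
const viaMatch = text.match(/[^\r\n]+/g);

console.log(JSON.stringify(single));   // ["line1","line1"]
console.log(JSON.stringify(viaExec));  // ["line1","line2"]
console.log(JSON.stringify(viaMatch)); // ["line1","line2"]
```

The loop uses a pattern that never matches the empty string; an exec() loop over a pattern that can match empty text would need to advance lastIndex by hand.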

Use
result = subject.split(/\r?\n/);
Your regex returns line1 twice because line1 is both the entire match and the contents of the first capturing group.
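As a quick check of the split above, here is a small sketch run against both line-ending styles. The sample strings are assumptions for illustration.

```javascript
const unixText = 'line1\nline2\n';
const windowsText = 'line1\r\nline2\r\n';

// \r?\n matches a bare LF as well as a CRLF pair.
const splitUnix = unixText.split(/\r?\n/);
const splitWindows = windowsText.split(/\r?\n/);

console.log(JSON.stringify(splitUnix));    // ["line1","line2",""]
console.log(JSON.stringify(splitWindows)); // ["line1","line2",""]
```

Note the trailing empty string: it appears whenever the text ends with a line terminator.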

I am assuming the following constitute newlines:
\r followed by \n
\n followed by \r
\n present alone
\r present alone
Please use
var re=/\r\n|\n\r|\n|\r/g;
arrayOfLines=lineString.replace(re,"\n").split("\n");
for an array of all lines, including the empty ones.
Or
please use
arrayOfLines = lineString.match(/[^\r\n]+/g);
for an array of non-empty lines.
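The two options above behave differently on blank lines, which a short sketch makes visible. The input string is made up for the example.

```javascript
const lineString = 'a\r\n\r\nb\rc\n';

// Option 1: normalize every terminator to \n, then split. Empty lines survive.
const allLines = lineString.replace(/\r\n|\n\r|\n|\r/g, '\n').split('\n');

// Option 2: collect runs of non-terminator characters. Empty lines vanish.
// match() returns null when nothing matches, hence the "|| []" fallback.
const nonEmptyLines = lineString.match(/[^\r\n]+/g) || [];
const noMatch = ''.match(/[^\r\n]+/g);

console.log(JSON.stringify(allLines));      // ["a","","b","c",""]
console.log(JSON.stringify(nonEmptyLines)); // ["a","b","c"]
console.log(noMatch);                       // null
```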

Even simpler regex that handles all line ending combinations, even mixed in the same file, and removes empty lines as well:
var lines = text.split(/[\r\n]+/g);
With whitespace trimming:
var lines = text.trim().split(/\s*[\r\n]+\s*/g);
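One edge case worth checking: when the text itself starts or ends with a line terminator, the plain split still produces empty strings at the ends. The sketch below, with an invented sample, compares the two variants and adds a filter as one possible way to drop those leftovers.

```javascript
const text = '\r\n  first  \r\n\r\n  second \n';

// Plain split: interior blank lines collapse, but the edges leave "" entries.
const plain = text.split(/[\r\n]+/);

// Trimmed variant: trim() removes the edge terminators before splitting.
const trimmed = text.trim().split(/\s*[\r\n]+\s*/);

// Keeping per-line whitespace while dropping the empty edge entries.
const cleaned = plain.filter(line => line !== '');

console.log(JSON.stringify(plain));   // ["","  first  ","  second ",""]
console.log(JSON.stringify(trimmed)); // ["first","second"]
console.log(JSON.stringify(cleaned)); // ["  first  ","  second "]
```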

Unicode Compliant Line Splitting
Unicode® Technical Standard #18 defines what constitutes line boundaries. That same section also gives a regular expression to match all line boundaries. Using that regex, we can define the following JS function that splits a given string at any line boundary (preserving empty lines as well as leading and trailing whitespace):
const splitLines = s => s.split(/\r\n|(?!\r\n)[\n-\r\x85\u2028\u2029]/)
I don't understand why the negative look-ahead part ((?!\r\n)) is necessary, but that is what is suggested in the Unicode document 🤷‍♂️.
The above document recommends defining a regular expression meta-character for matching all line ending characters and sequences. Perl has \R for that. Unfortunately, JavaScript does not include such a meta-character. Alas, I could not even find a TC39 proposal for that.
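To see which characters the function treats as line boundaries, here is a small usage sketch. The sample string is an assumption chosen to include CRLF, LINE SEPARATOR (U+2028), NEXT LINE (U+0085), vertical tab and a trailing LF.

```javascript
const splitLines = s => s.split(/\r\n|(?!\r\n)[\n-\r\x85\u2028\u2029]/);

// [\n-\r] is the range U+000A..U+000D: LF, VT, FF and CR.
const sample = 'a\r\nb\u2028c\x85d\ve\n';
const parts = splitLines(sample);

console.log(JSON.stringify(parts));                  // ["a","b","c","d","e",""]
console.log(JSON.stringify(splitLines('x\r\ny')));   // ["x","y"]
console.log(JSON.stringify(splitLines(' x \n\ny'))); // [" x ","","y"]
```

The second call shows that a CRLF pair counts as one boundary rather than two, and the third that empty lines and surrounding whitespace are preserved.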

First replace all \r\n with \n, then String.split.

http://jsfiddle.net/uq55en5o/
var lines = text.match(/^.*((\r\n|\n|\r)|$)/gm);
I have done something like this. The link above is my fiddle.
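Unlike the split-based answers, this match keeps each line's terminator attached to the line. A minimal sketch with a made-up mixed-ending string:

```javascript
const text = 'a\r\nb\nc\rd';

// Each match is one line plus whatever terminator followed it (if any).
const lines = text.match(/^.*((\r\n|\n|\r)|$)/gm);

console.log(JSON.stringify(lines)); // ["a\r\n","b\n","c\r","d"]

// Because nothing is discarded, joining the pieces restores the input.
console.log(lines.join('') === text); // true
```

If the text ends with a terminator, the result also contains a final empty string for the empty last line.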

Related

Javascript RegEx: Matching indented comments

I am currently trying to parse out comments in a particular format using JavaScript. While I have a basic understanding of regular expressions, with this one I seem to have reached my current limit. This is what I am trying to do:
The comments
//
This is a
multiline comment
Code here
//
This is another
comment
Again, code here
For the regex, it currently looks like this:
\/\/\n(\s+[\s\S]+)
\/\/\n matches the // sequence including the new line.
Since I am interested in the comments, I am opening a capture group.
\s+ matches the indentation. I could probably be a bit more precise by only accepting tabs or spaces in a particular count – for me this is not relevant
[\s\S] is supposed to match the actual words and the spaces between the words.
This seems to currently match the whole file, which is not what I want. What I now can't wrap my head around is how to solve this?
I think my problem is related to me not knowing how to think about regexes. Is it like a program that matches line per line, so I need to work more on the quantifiers? Or is there maybe a way to stop at lines only consisting of a newline? When I try to match for the newline character, I of course receive matches at each line ending, which is not helpful.
You may use
/^\/\/((?:\r?\n[^\S\r\n].*)*)/gm
See the regex demo
Details
^ - start of a line (due to m modifier, ^ also matches line start positions)
\/\/ - a // string
((?:\r?\n[^\S\r\n].*)*) - Capturing group 1: zero or more repetitions of
\r?\n - a CRLF or LF line ending
[^\S\r\n] - any whitespace but CR and LF
.* - the rest of the line.
JS demo:
var text = "//\n This is a\n multiline comment\n\nCode here\n\n\n//\n This is another\n comment\n\nAgain, code here";
var regex = /^\/\/((?:\r?\n[^\S\r\n].*)*)/gm, m, results=[];
while (m = regex.exec(text)) {
  results.push(m[1].trim());
}
console.log(results);

How to capture the line delimiting character(s) in a platform independent way?

The following expression
/^\S+\s*$/m.exec("a\nb\n")[0]
returns just "a" but not the line delimiter, although \s should match \n.
By experimenting I found out that the following expression does somehow what I want:
/^\S+\s*$\n\r?/m.exec("a\nb\n")[0]
But now the regular expression is platform dependent.
How to include the line delimiting character(s) into the match in a platform independent way?
Sorry but I got it myself.
Instead of matching the end of line it seems to be better to match the beginning of the next line. This:
/^\S+\s*^/m.exec("a\nb\n")[0]
returns "a\n".
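A short sketch of how this behaves with both kinds of line endings. The last two lines show an alternative that is not from the answer above: spelling out the terminator as an alternation and capturing it, which is offered here only as one possible variant.

```javascript
// Matching up to the start of the next line works for LF and CRLF alike.
const withLf = /^\S+\s*^/m.exec('a\nb\n')[0];
const withCrLf = /^\S+\s*^/m.exec('a\r\nb\r\n')[0];

// Caveat: \s* is greedy, so following blank lines are swallowed too.
const withBlank = /^\S+\s*^/m.exec('a\n\nb')[0];

// Alternative (assumption, not from the answer): capture the delimiter itself.
const m = /^\S+(\r\n|\n|\r)/m.exec('a\r\nb\r\n');
const delimiter = m[1];

console.log(JSON.stringify(withLf));    // "a\n"
console.log(JSON.stringify(withCrLf));  // "a\r\n"
console.log(JSON.stringify(withBlank)); // "a\n\n"
console.log(JSON.stringify(delimiter)); // "\r\n"
```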

Can't figure out what this JS code means

I've been romping through a piece of JS I came across online and can't figure out what this piece of code means.
global$string$newLines = function(a) {
return a.replace(/(\r\n|\r|\n)/g, "\n");
},
I'm specifically wondering about the piece /(\r\n|\r|\n)/g
Also - Is this machine generated code? Is that why the variable name is full of $s?
They are regular expressions.
\r = find a carriage return character
\n = find a newline character
The /g flag (g only) means find all matches.
http://www.w3schools.com/jsref/jsref_obj_regexp.asp
So the code finds every \r\n, lone \r, or lone \n and replaces it with \n.
They are whitespace characters, so they are not visible.
It's a regular expression for replacing newline characters.
There are different types of new line characters inserted by various browsers/editors/OSes etc.
\n is the default on all (true) Unix systems, where \r has no special meaning; C, C++, Java, etc. adopted this convention.
\r is from the days of Mac before it was a Unix system, while the two-character sequence \r\n is the Windows way.
The /g flag makes the search global, telling the regular expression to scan the entire string rather than stop at the first match.
So the code uses a regular expression to find every possible equivalent of a newline and replace it with the de facto standard, '\n'.
This is just a regular expression used to replace Carriage Returns and New Line characters with new line characters.
Your Regex: /(\r\n|\r|\n)/g
Explanation:
1st Capturing group (\r\n|\r|\n)
1st Alternative: \r\n
\r matches a carriage return (ASCII 13)
\n matches a line-feed (newline) character (ASCII 10)
2nd Alternative: \r
\r matches a carriage return (ASCII 13)
3rd Alternative: \n
\n matches a line-feed (newline) character (ASCII 10)
g modifier: global. Give all matches (i.e. don't return on the first match).
PS: Check out regex101.com to generate such explanations for any regex.
The code replaces carriage-return/new-line combinations with a single newline.
The $s in the variable name are produced by several JavaScript compilers out there. Developers will often break their code up into namespaces of the form global.string.newline, for example. But when we want to run that code on a client, it's safer and more efficient to turn this object-within-an-object-within-an-object into a single variable. Usually, the JavaScript compiler will go one step further and then turn this long variable name into some short unique sequence, but it will also preserve this intermediate form for easier debugging.
It is a regex that replaces each carriage return, new line, or carriage return + new line in a string with a new line.
/(\r\n|\r|\n)/g
The /g at the end means global, i.e. throughout the string and not just the first occurrence.
Working Fiddle
JS Code:
var global$string$newLines = function (a) {
  return a.replace(/(\r\n|\r|\n)/g, "\n");
};
function abc() {
  var text = document.getElementById("test").value;
  console.log(global$string$newLines(text));
}
HTML Code:
<textarea id="test"></textarea>
<button id="testClick" onclick="abc()">Click</button>
This is a regular expression replace that means:
Find any occurrence of either:
\r\n
\r
\n
And replace it with \n.
Comments:
The /g means it will match all findings, not only the first occurrence.
The third option, replacing \n by \n, is redundant as it has no effect.
Doc of the replace and link to regex: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace
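The answers above can be checked with a small sketch. The function name normalizeNewlines and the sample string are made up for illustration. It also shows two details: dropping the \n alternative gives the same result, and the order of the alternatives matters, because \r\n must be tried before a lone \r.

```javascript
// Same pattern as in the question, minus the unused capturing group.
const normalizeNewlines = s => s.replace(/\r\n|\r|\n/g, '\n');

const mixed = 'a\r\nb\rc\nd';
const normalized = normalizeNewlines(mixed);

// Replacing \n with \n changes nothing, so this is equivalent.
const withoutLf = mixed.replace(/\r\n|\r/g, '\n');

// With \r tried first, a CRLF pair is matched as \r and then \n,
// turning one line break into two.
const wrongOrder = mixed.replace(/\r|\r\n|\n/g, '\n');

console.log(JSON.stringify(normalized)); // "a\nb\nc\nd"
console.log(JSON.stringify(withoutLf));  // "a\nb\nc\nd"
console.log(JSON.stringify(wrongOrder)); // "a\n\nb\nc\nd"
```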

javascript replace() not replacing text containing literal \r\n strings

This bit of JavaScript strips hidden characters such as carriage returns and linefeeds just fine:
value = value.replace(/[\r\n]*/g, "");
But when the content actually contains the literal text \r\n, how do I strip it without affecting the r's and n's in my content? I've tried this code:
value = value.replace(/[\\r\\n]+/g, "");
on this bit of text:
{"client":{"werdfasreasfsd":"asdfRasdfas\r\nMCwwDQYJKoZIhvcNAQEBBQADGw......
I end up with this:
{"cliet":{"wedfaseasfsd":"asdfRasdfasMCwwDQYJKoZIhvcNAQEBBQADGw......
Side note: It leaves the upper case versions of R and N alone because I didn't include the /i flag at the end, and that's OK in this case.
What do I do to just remove \r\n text found in the string?
If you want to match literal \r and literal \n then you should use the following:
value = value.replace(/(?:\\[rn])+/g, "");
You might think that matching literal \r and \n with [\\r\\n] is the right way to do it and it is a bit confusing but it won't work and here is why:
Remember that in character classes, each single character represents a single letter or symbol, it doesn't represent a sequence of characters, it is just a set of characters.
So the character class [\\r\\n] actually matches the literal characters \, r and n as separate letters and not as sequences.
Edit: If you want to remove all carriage returns \r, newlines \n and also literal \r and \n, then you could use:
value = value.replace(/(?:\\[rn]|[\r\n]+)+/g, "");
About (?:): it is a non-capturing group. By default, whatever you put into a usual group () gets captured into a numbered variable that you can use elsewhere inside the regular expression itself, or later in the matches array.
(?:) prevents capturing the value and causes less overhead than (); for more info see this article.
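The difference between the character class and the group is easiest to see side by side. In this sketch the sample value is invented: it contains the literal four characters \r\n (written as \\r\\n in the JavaScript source) followed by a real CR LF pair.

```javascript
const value = 'client\\r\\nname\r\n';

// Character class: removes every backslash, every r and every n individually.
const classResult = value.replace(/[\\r\\n]+/g, '');

// Group: removes only a backslash that is followed by r or n.
const groupResult = value.replace(/(?:\\[rn])+/g, '');

// Combined: removes the literal sequences and the real CR/LF characters.
const bothResult = value.replace(/(?:\\[rn]|[\r\n]+)+/g, '');

console.log(JSON.stringify(classResult)); // "clietame\r\n"
console.log(JSON.stringify(groupResult)); // "clientname\r\n"
console.log(JSON.stringify(bothResult));  // "clientname"
```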
To just remove them, this seems to work for me:
value = value.replace(/[\r\n]/g, "");
You don't need the * after the character set because the g flag solves that for you.
Note, this will remove all \r or \n chars whether they are in this exact sequence or not.
Working demo of this option: http://jsfiddle.net/jfriend00/57GtJ/
If you want to remove these characters only when in this exact sequence (e.g. only when a \r is directly followed by a \n), you could use this:
value = value.replace(/\r\n/g, "");
Working demo of this option: http://jsfiddle.net/jfriend00/Ta3sn/
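A minimal sketch contrasting the two options on a made-up string that mixes a CRLF pair with a lone \n and a lone \r:

```javascript
const value = 'a\r\nb\nc\rd';

// Character class: every CR and every LF goes, in any combination.
const anyOrder = value.replace(/[\r\n]/g, '');

// Exact sequence: only the CRLF pair goes; the lone \n and \r stay.
const crlfOnly = value.replace(/\r\n/g, '');

console.log(JSON.stringify(anyOrder)); // "abcd"
console.log(JSON.stringify(crlfOnly)); // "ab\nc\rd"
```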
If you have text with a lot of \r\n and want to preserve the line breaks, for example as HTML, try replacing each one with <br>:
value = value.replace(/(?:\\[rn]|[\r\n])/g,"<br>")
http://jsfiddle.net/57GtJ/63/

Why is this regular expression so slow?

I am trying to trim leading and trailing whitespace and newlines from a string. The newlines are written as \n (two separate characters, backslash and n). In other words, it is the literal text \n, not an actual CR or LF character.
For example, this:
\n \nRight after this is a perfectly valid newline:\nAnd here is the second line. \n
Should become this:
Right after this is a perfectly valid newline:\nAnd here is the second line.
I came up with this solution:
text = text
  .replace(/^(\s*(\\n)*)*/, '') // Beginning
  .replace(/(\s*(\\n)*)*$/, ''); // End
These patterns match just fine according to RegexPal.
However, the second pattern (matching the end of the string) takes a very long time — about 32 seconds in Chrome on a string with only a couple of paragraphs and a few trailing spaces. The first pattern is quite fast (milliseconds) on the same string.
Here is a CodePen to demonstrate it.
Why is it so slow? Is there a better way to go about this?
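The reason it takes so long is that you have a * quantifying a group that itself contains two more *. The inner pieces can match the empty string or divide a run of whitespace in many different ways, so whenever the match cannot reach the end of the string, the engine tries a huge number of combinations before giving up (catastrophic backtracking).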
A good explanation can be found in the PHP manual, but I don't think JavaScript supports once-only subpatterns.
I would suggest this regex instead:
text = text.replace(/^(?:\s|\\n)+|(?:\s|\\n)+$/g,"");
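Applied to the example from the question, the suggested regex can be checked like this. The variable names are made up; note that each literal \n in the text is written as \\n in the JavaScript source.

```javascript
const raw = '\\n \\nRight after this is a perfectly valid newline:\\nAnd here is the second line. \\n';

// Each alternative is a single quantified group whose two branches
// (\s and a literal backslash-n) can never match the same character,
// so there is only one way to consume any given run.
const trimmed = raw.replace(/^(?:\s|\\n)+|(?:\s|\\n)+$/g, '');

console.log(trimmed);
// Right after this is a perfectly valid newline:\nAnd here is the second line.
```

The literal \n in the middle of the text is left alone because it is neither at the start nor at the end of the string.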
Not a good answer but one workaround would be to reverse the string and also reverse \n to n\ in the regular expression (for beginning), apply it, then reverse the string back.
