I want to remove new lines from text, except when the sentence ends in a double space (I'm using JavaScript).
This:
This
is
a
test.
Should turn to this:
This is a test.
But this:
This
is //there is a double space here
a//but not here
test.
Should turn to this:
This is
a test.
My approach so far: I can replace multiple spaces followed by a new line with a single new line:
var doubleSpaceNewline = /(\s){2,}\n/g;
text = text.replace(doubleSpaceNewline, '\n');
But then how do I remove the newlines, without removing the one I want to remain?
I would prefer to remove all new lines except newlines preceded by double or more spaces, THEN replace double space + newline with single new line.
I need a regex that will match \s+ except when (\s){2,}\n. Can't seem to be able to combine both.
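text = text.split("  \n").join('****************');
text = text.split("\n").join(' ');
text = text.split('****************').join("\n");
Is this what you're after? It doesn't use regex, and it's a somewhat simpler procedure: protect each double-space newline with a placeholder, turn the remaining newlines into spaces, then restore the protected newlines. split/join is used because replace() with a string pattern only replaces the first occurrence. This assumes the placeholder never appears in the text itself.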
To find "one newline if not preceded by 2 or more spaces" (judging by the {2,} in your code) with a regular expression, you can use a negative lookbehind, which JavaScript supports since ES2018. The pattern is
(?<!\s{2,})\n
and then replace as usual.
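Putting the two steps together, a minimal sketch might look like the following. It assumes an engine with lookbehind support and that "double space" means two or more literal space characters; unwrapLines is a hypothetical name, not something from the question.

```javascript
// Hypothetical helper: joins wrapped lines, keeping only the line breaks
// that are marked by two or more trailing spaces.
function unwrapLines(text) {
  return text
    // A newline NOT preceded by two or more spaces becomes a single space.
    .replace(/(?<! {2,})\n/g, ' ')
    // The remaining "spaces + newline" markers collapse to a bare newline.
    .replace(/ {2,}\n/g, '\n');
}

console.log(JSON.stringify(unwrapLines('This\nis\na\ntest.')));   // "This is a test."
console.log(JSON.stringify(unwrapLines('This\nis  \na\ntest.'))); // "This is\na test."
```

The order matters: the marked newlines have to survive the first pass before their trailing spaces are stripped in the second.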
I am currently trying to parse out comments in a particular format using JavaScript. While I have a basic understanding of regular expressions, with this one I seem to have reached my current limit. This is what I am trying to do:
The comments
//
This is a
multiline comment
Code here
//
This is another
comment
Again, code here
The regex currently looks like this:
\/\/\n(\s+[\s\S]+)
\/\/\n matches the // sequence including the new line.
Since I am interested in the comments, I am opening a capture group.
\s+ matches the indentation. I could probably be a bit more precise by only accepting tabs or spaces in a particular count – for me this is not relevant
[\s\S] is supposed to match the actual words and the spaces between the words.
This currently seems to match the whole file, which is not what I want. What I can't wrap my head around is how to solve this.
I think my problem is related to me not knowing how to think about regexes. Is it like a program that matches line per line, so I need to work more on the quantifiers? Or is there maybe a way to stop at lines only consisting of a newline? When I try to match for the newline character, I of course receive matches at each line ending, which is not helpful.
You may use
/^\/\/((?:\r?\n[^\S\r\n].*)*)/gm
See the regex demo
Details
^ - start of a line (due to m modifier, ^ also matches line start positions)
\/\/ - a // string
((?:\r?\n[^\S\r\n].*)*) - Capturing group 1: zero or more repetitions of
\r?\n - a CRLF or LF line ending
[^\S\r\n] - any whitespace but CR and LF
.* - the rest of the line.
JS demo:
var text = "//\n This is a\n multiline comment\n\nCode here\n\n\n//\n This is another\n comment\n\nAgain, code here";
var regex = /^\/\/((?:\r?\n[^\S\r\n].*)*)/gm, m, results=[];
while (m = regex.exec(text)) {
results.push(m[1].trim());
}
console.log(results);
So I'm trying to parse a string similar to the way StackOverflow's tags work. So letters and numbers are allowed, but everything else should be stripped. Also spaces should be replaced with hyphens, but only if they are inside the word and not have disallowed characters before them.
This is what I have right now:
label = label.trim();
label = label.toLowerCase();
label = label.replace(/[^A-Za-z0-9\s]/g,'');
label = label.replace(/ /g, '-');
This works but with a few caveats, for example this:
/ this. is-a %&&66 test tag . <-- (4 spaces here, the arrow and this text is not part of the test string)
Becomes:
-this-is-a66-test-tag----
Expected:
this-is-a66-test-tag
I looked at this to get what I have now:
How to remove everything but letters, numbers, space, exclamation and question mark from string?
But like I said it doesn't fully give me what I'm looking for.
How do I tweak my code to give me what I want?
You need to make 2 changes:
Since you do not replace all whitespace with the first replace you need to replace all whitespace chars with the second regex (so, a plain space must be replaced with \s, and even better, with \s+ to replace multiple consecutive occurrences),
To get rid of leading/trailing hyphens in the end, use trim() after the first replace.
So, the actual fix will look like
var label = " / this. is-a %&&66 test tag . ";
label = label.replace(/[^a-z0-9\s]/ig,'')
.trim()
.replace(/\s+/g, '-')
.toLowerCase();
console.log(label); // => this-isa-66-test-tag
Note that if you add - to the first regex, /[^a-z0-9\s-]/ig, you will also keep the original hyphens in the output and it will look like this-is-a-66-test-tag for the current test case.
Use trim just before replacing all spaces with hyphens.
You can use this function:
function tagit(label) {
  label = label.toLowerCase().replace(/[^A-Za-z0-9\s]/g, '');
  return label.trim().replace(/ /g, '-');
}
var str = 'this. is-a %&&66 test tag .';
console.log(tagit(str));
//=> "this-isa-66-test-tag"
This bit of JavaScript strips hidden characters such as carriage returns and line feeds just fine:
value = value.replace(/[\r\n]*/g, "");
But when the content contains the literal text \r\n, how do I strip it without affecting the r's and n's in my content? I've tried this code:
value = value.replace(/[\\r\\n]+/g, "");
on this bit of text:
{"client":{"werdfasreasfsd":"asdfRasdfas\r\nMCwwDQYJKoZIhvcNAQEBBQADGw......
I end up with this:
{"cliet":{"wedfaseasfsd":"asdfRasdfasMCwwDQYJKoZIhvcNAQEBBQADGw......
Side note: It leaves the upper case versions of R and N alone because I didn't include the /i flag at the end, and that's OK in this case.
What do I do to just remove \r\n text found in the string?
If you want to match literal \r and literal \n then you should use the following:
value = value.replace(/(?:\\[rn])+/g, "");
You might think that [\\r\\n] is the right way to match a literal \r and a literal \n. It is a bit confusing, but it won't work, and here is why:
Remember that in character classes, each single character represents a single letter or symbol, it doesn't represent a sequence of characters, it is just a set of characters.
So the character class [\\r\\n] actually matches the literal characters \, r and n as separate letters and not as sequences.
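The difference can be seen side by side in a small sketch. The sample string is a shortened piece of the text from the question, and String.raw is used only so that the backslashes stay literal; the variable names are illustrative.

```javascript
// The sample contains a literal backslash + r and backslash + n, not real CR/LF.
const sample = String.raw`werdfasreasfsd\r\nMC`;

// Character class: removes every "\", "r" and "n" as individual characters.
const viaClass = sample.replace(/[\\r\\n]+/g, "");

// Group with a two-character sequence: removes only "\r" and "\n" as text.
const viaSequence = sample.replace(/(?:\\[rn])+/g, "");

console.log(viaClass);    // wedfaseasfsdMC
console.log(viaSequence); // werdfasreasfsdMC
```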
Edit: If you want to remove all carriage returns \r, newlines \n, and also the literal text \r and \n, then you could use:
value = value.replace(/(?:\\[rn]|[\r\n]+)+/g, "");
About (?:): it is a non-capturing group. By default, whatever you put into a regular group () gets captured into a numbered group that you can reference elsewhere in the regular expression itself, or later in the matches array.
(?:) prevents capturing the value and causes less overhead than (), for more info see this article.
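As a rough illustration of what "captured" means in the matches array, here is a small sketch with a made-up input string; the variable names are hypothetical.

```javascript
// Both patterns match the same text; only the number of captured groups differs.
const capturing = Array.from("2024-05".match(/(\d+)-(\d+)/));
const nonCapturing = Array.from("2024-05".match(/(?:\d+)-(\d+)/));

console.log(capturing);    // full match, then "2024" and "05"
console.log(nonCapturing); // full match, then only "05"
```

With (?:) the first group still takes part in matching, but it no longer occupies a slot in the result.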
To just remove them, this seems to work for me:
value = value.replace(/[\r\n]/g, "");
You don't need the * after the character set because the g flag solves that for you.
Note, this will remove all \r or \n chars whether they are in this exact sequence or not.
Working demo of this option: http://jsfiddle.net/jfriend00/57GtJ/
If you want to remove these characters only when they appear in this exact sequence (i.e. only when a \r is directly followed by a \n), you could use this:
value = value.replace(/\r\n/g, "");
Working demo of this option: http://jsfiddle.net/jfriend00/Ta3sn/
If your text contains a lot of \r\n sequences and you want to keep them as visible line breaks, try this:
value.replace(/(?:\\[rn]|[\r\n])/g,"<br>")
http://jsfiddle.net/57GtJ/63/
I want to remove the spaces at the beginning of each line.
Each line of my data starts with a run of spaces, so the data appears in the middle. I want to remove those leading spaces from every line.
tmp = tmp.replace(/(<([^>]+)>)/g,"")
How can I add the ^\s condition into that replace()?
To remove all leading spaces:
str = str.replace(/^ +/gm, '');
The regex is quite simple - one or more spaces at the start. The more interesting bits are the flags - /g (global) to replace all matches and not just the first, and /m (multiline) so that the caret matches the beginning of each line, and not just the beginning of the string.
Working example: http://jsbin.com/oyeci4
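The effect of the /m flag is easiest to see by running the same pattern with and without it. This is a small sketch on a made-up string; the variable names are illustrative.

```javascript
const indented = "   first\n      second\nthird";

// Without /m, ^ only matches at the very start of the string.
const withoutM = indented.replace(/^ +/g, "");

// With /m, ^ matches at the start of every line.
const withM = indented.replace(/^ +/gm, "");

console.log(JSON.stringify(withoutM)); // "first\n      second\nthird"
console.log(JSON.stringify(withM));    // "first\nsecond\nthird"
```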
var text = " this is a string \n"+
" \t with a much of new lines \n";
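text = text.replace(/^\s*/gm, '');
This supports multiple whitespace characters of different types, including tabs. Note that replace() returns a new string, so the result has to be assigned, and that \s also matches line breaks, so blank lines are removed as well.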
If all you need is to remove one space, then this regex is all you need:
^\s
So in JavaScript:
yourString = yourString.replace(/^\s/gm, "");
How do you split a long piece of text into separate lines? Why does this return line1 twice?
/^(.*?)$/mg.exec('line1\r\nline2\r\n');
["line1", "line1"]
I turned on the multi-line modifier to make ^ and $ match beginning and end of lines. I also turned on the global modifier to capture all lines.
I wish to use a regex split and not String.split because I'll be dealing with both Linux \n and Windows \r\n line endings.
arrayOfLines = lineString.match(/[^\r\n]+/g);
As Tim said, line1 is both the entire match and the capture. It appears regex.exec(string) returns after finding the first match regardless of the global modifier, whereas string.match(regex) honours the global flag.
Use
result = subject.split(/\r?\n/);
Your regex returns line1 twice because line1 is both the entire match and the contents of the first capturing group.
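The three behaviours discussed in this thread can be compared in one short sketch, using the input string from the question. The variable names are illustrative.

```javascript
const input = 'line1\r\nline2\r\n';

// exec returns one match at a time: element 0 is the whole match,
// element 1 is the first capturing group, hence "line1" twice.
const execResult = Array.from(/^(.*?)$/mg.exec(input));

// match with the g flag returns every match and no capture groups.
const matched = input.match(/[^\r\n]+/g);

// split keeps empty pieces, so the trailing line ending yields "".
const pieces = input.split(/\r?\n/);

console.log(execResult); // line1, line1
console.log(matched);    // line1, line2
console.log(pieces);     // line1, line2, and an empty string
```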
I am assuming the following constitute newlines:
\r followed by \n
\n followed by \r
\n present alone
\r present alone
Please use
var re = /\r\n|\n\r|\n|\r/g;
arrayOfLines = lineString.replace(re, "\n").split("\n");
for an array of all lines, including the empty ones.
OR
Please use
arrayOfLines = lineString.match(/[^\r\n]+/g);
for an array of non-empty lines.
Even simpler regex that handles all line ending combinations, even mixed in the same file, and removes empty lines as well:
var lines = text.split(/[\r\n]+/g);
With whitespace trimming:
var lines = text.trim().split(/\s*[\r\n]+\s*/g);
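A quick sketch on a made-up string shows how the two variants differ. Note that the plain split still produces a trailing empty string when the text ends with a line break, which the trim() in the second variant avoids. The variable names are illustrative.

```javascript
const text = "  a  \r\n\r\n  b \n c  \n";

// Runs of CR/LF act as one separator, but surrounding spaces are kept
// and the final line break leaves an empty last element.
const plain = text.split(/[\r\n]+/);

// Trimming first and absorbing whitespace around the separators
// leaves only the bare line contents.
const trimmed = text.trim().split(/\s*[\r\n]+\s*/);

console.log(JSON.stringify(plain));   // ["  a  ","  b "," c  ",""]
console.log(JSON.stringify(trimmed)); // ["a","b","c"]
```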
Unicode Compliant Line Splitting
Unicode® Technical Standard #18 defines what constitutes line boundaries. That same section also gives a regular expression to match all line boundaries. Using that regex, we can define the following JS function that splits a given string at any line boundary (preserving empty lines as well as leading and trailing whitespace):
const splitLines = s => s.split(/\r\n|(?!\r\n)[\n-\r\x85\u2028\u2029]/)
I don't understand why the negative look-ahead part ((?!\r\n)) is necessary, but that is what the Unicode document suggests.
The above document recommends defining a regular expression meta-character for matching all line ending characters and sequences. Perl has \R for that. Unfortunately, JavaScript does not include such a meta-character. Alas, I could not even find a TC39 proposal for that.
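For illustration, here is the function applied to a made-up string that mixes several of these line boundaries. The range \n-\r in the character class covers LF, VT, FF and CR; the mixed variable is a hypothetical sample.

```javascript
const splitLines = s => s.split(/\r\n|(?!\r\n)[\n-\r\x85\u2028\u2029]/);

// CRLF, LF, lone CR, LINE SEPARATOR, NEXT LINE, then two LFs in a row.
const mixed = "a\r\nb\nc\rd\u2028e\x85f\n\ng";

console.log(splitLines(mixed));
// a, b, c, d, e, f, an empty string (the blank line), g
```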
First replace all \r\n with \n, then String.split.
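A minimal sketch of that two-step approach, on a made-up input, could look like this. It only normalizes Windows \r\n endings; lone \r characters are left as they are.

```javascript
const source = "one\r\ntwo\nthree";

// Step 1: normalize Windows line endings to \n.
const normalized = source.replace(/\r\n/g, "\n");

// Step 2: a plain string split is now enough.
const lines = normalized.split("\n");

console.log(lines); // one, two, three
```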
http://jsfiddle.net/uq55en5o/
var lines = text.match(/^.*((\r\n|\n|\r)|$)/gm);
I have done something like this; the link above is my fiddle.