Javascript RegEx to find first line of each paragraph

Javascript RegEx to find first line of each paragraph - javascript

This is probably a simple one but I'm very much a regex novice.
I'm looking to select the first line of every paragraph within a textarea on a page using a regular expression. After thinking I was there I have hit a problem.
Using http://gskinner.com/RegExr/ I came up with this:
/\r\r.*\r/g
but then I place that into my javascript and ran it on the page:
var headingsArr = document.getElementById("text").value.match(/\r\r.*\r/g);
and the array returns null.
Have I got the regular expression right and if so where am I going wrong when using it in my javascript!?
Thank you

This depends on what your newline characters are. I think you may better go for
/(?:\r\n|[\r\n]){2}.*(?:\r\n|[\r\n])/g
I know in Regexr only a \r is a newline. But in Windows normally \r\n is used, but under .*ix its normally only the \n.
So (?:\r\n|[\r\n]) is an alternation, it tries at first to match \r\n if this is not found it matches either \r or \n.

For the sake of future searchers, if you just need to style the first line of text, you can use the css pseudoclass ::first-line :
textarea::first-line {
background-color: yellow;
}
http://www.w3schools.com/cssref/sel_firstline.asp

Related

How to append string after matching field with regex

I want to append a word after <body> tag, it should not modify/replace anything other than just append a word. I have done something like this, is it valid do empty parenthesis fir second capture group will match everything?
/(<body[^>]*>)()/, `$1${my_variable}$2`)

The second capture group, designed to capture nothing, will match "nothing" - it will form a match immediately after your closed body tag. There's nothing wrong with doing this for the regex, though you might want to be wary of using [^>]* - this negated character class will gladly match across lines and grab as much input as it can. Handy for matching multi-line tags, but often very dangerous.
Also, if you're on linux and for some reason have > symbols in filenames (which is valid!) your regex will break horribly, as shown here.
That being said, valid regex or not, it's usually a bad idea to use regex with html, since HTML isn't a regular language. Also, you could accidentally summon Cthulhu.

let page = "<html><body>Some info</body></html>";
page.replace("<body>", `<body>${my_variable}`);
or
page.replace(/<body>|<BODY>/, `<body>${my_variable}`);
If in the broweser you can also use document.querySelector("body").innerHTML
Also depending on which framework you're using there are better ways to accomplish this.

javascript regex replace multiline strings [duplicate]

var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre.*?<\/pre>/gm );
alert(arr); // null
I'd want the PRE block be picked up, even though it spans over newline characters. I thought the 'm' flag does it. Does not.
Found the answer here before posting. SInce I thought I knew JavaScript (read three books, worked hours) and there wasn't an existing solution at SO, I'll dare to post anyways. throw stones here
So the solution is:
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[\s\S]*?<\/pre>/gm );
alert(arr); // <pre>...</pre> :)
Does anyone have a less cryptic way?
Edit: this is a duplicate but since it's harder to find than mine, I don't remove.
It proposes [^] as a "multiline dot". What I still don't understand is why [.\n] does not work. Guess this is one of the sad parts of JavaScript..

DON'T use (.|[\r\n]) instead of . for multiline matching.
DO use [\s\S] instead of . for multiline matching
Also, avoid greediness where not needed by using *? or +? quantifier instead of * or +. This can have a huge performance impact.
See the benchmark I have made: https://jsben.ch/R4Hxu
Using [^]: fastest
Using [\s\S]: 0.83% slower
Using (.|\r|\n): 96% slower
Using (.|[\r\n]): 96% slower
NB: You can also use [^] but it is deprecated in the below comment.

[.\n] does not work because . has no special meaning inside of [], it just means a literal .. (.|\n) would be a way to specify "any character, including a newline". If you want to match all newlines, you would need to add \r as well to include Windows and classic Mac OS style line endings: (.|[\r\n]).
That turns out to be somewhat cumbersome, as well as slow, (see KrisWebDev's answer for details), so a better approach would be to match all whitespace characters and all non-whitespace characters, with [\s\S], which will match everything, and is faster and simpler.
In general, you shouldn't try to use a regexp to match the actual HTML tags. See, for instance, these questions for more information on why.
Instead, try actually searching the DOM for the tag you need (using jQuery makes this easier, but you can always do document.getElementsByTagName("pre") with the standard DOM), and then search the text content of those results with a regexp if you need to match against the contents.

You do not specify your environment and version of JavaScript (ECMAScript), and I realise this post was from 2009, but just for completeness:
With the release of ECMA2018 we can now use the s flag to cause . to match \n (see https://stackoverflow.com/a/36006948/141801).
Thus:
let s = 'I am a string\nover several\nlines.';
console.log('String: "' + s + '".');
let r = /string.*several.*lines/s; // Note 's' modifier
console.log('Match? ' + r.test(s)); // 'test' returns true
This is a recent addition and will not work in many current environments, for example Node v8.7.0 does not seem to recognise it, but it works in Chromium, and I'm using it in a Typescript test I'm writing and presumably it will become more mainstream as time goes by.

Now there's the s (single line) modifier, that lets the dot matches new lines as well :)
\s will also match new lines :D
Just add the s behind the slash
/<pre>.*?<\/pre>/gms

[.\n] doesn't work, because dot in [] (by regex definition; not javascript only) means the dot-character. You can use (.|\n) (or (.|[\n\r])) instead.

I have tested it (Chrome) and it's working for me (both [^] and [^\0]), by changing the dot (.) with either [^\0] or [^] , because dot doesn't match line break (See here: http://www.regular-expressions.info/dot.html).
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[^\0]*?<\/pre>/gm );
alert(arr); //Working

In addition to above-said examples, it is an alternate.
^[\\w\\s]*$
Where \w is for words and \s is for white spaces

[\\w\\s]*
This one was beyond helpful for me, especially for matching multiple things that include new lines, every single other answer ended up just grouping all of the matches together.

Greedy Regex for varying number of new line [duplicate]

var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre.*?<\/pre>/gm );
alert(arr); // null
I'd want the PRE block be picked up, even though it spans over newline characters. I thought the 'm' flag does it. Does not.
Found the answer here before posting. SInce I thought I knew JavaScript (read three books, worked hours) and there wasn't an existing solution at SO, I'll dare to post anyways. throw stones here
So the solution is:
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[\s\S]*?<\/pre>/gm );
alert(arr); // <pre>...</pre> :)
Does anyone have a less cryptic way?
Edit: this is a duplicate but since it's harder to find than mine, I don't remove.
It proposes [^] as a "multiline dot". What I still don't understand is why [.\n] does not work. Guess this is one of the sad parts of JavaScript..

DON'T use (.|[\r\n]) instead of . for multiline matching.
DO use [\s\S] instead of . for multiline matching
Also, avoid greediness where not needed by using *? or +? quantifier instead of * or +. This can have a huge performance impact.
See the benchmark I have made: https://jsben.ch/R4Hxu
Using [^]: fastest
Using [\s\S]: 0.83% slower
Using (.|\r|\n): 96% slower
Using (.|[\r\n]): 96% slower
NB: You can also use [^] but it is deprecated in the below comment.

[.\n] does not work because . has no special meaning inside of [], it just means a literal .. (.|\n) would be a way to specify "any character, including a newline". If you want to match all newlines, you would need to add \r as well to include Windows and classic Mac OS style line endings: (.|[\r\n]).
That turns out to be somewhat cumbersome, as well as slow, (see KrisWebDev's answer for details), so a better approach would be to match all whitespace characters and all non-whitespace characters, with [\s\S], which will match everything, and is faster and simpler.
In general, you shouldn't try to use a regexp to match the actual HTML tags. See, for instance, these questions for more information on why.
Instead, try actually searching the DOM for the tag you need (using jQuery makes this easier, but you can always do document.getElementsByTagName("pre") with the standard DOM), and then search the text content of those results with a regexp if you need to match against the contents.

You do not specify your environment and version of JavaScript (ECMAScript), and I realise this post was from 2009, but just for completeness:
With the release of ECMA2018 we can now use the s flag to cause . to match \n (see https://stackoverflow.com/a/36006948/141801).
Thus:
let s = 'I am a string\nover several\nlines.';
console.log('String: "' + s + '".');
let r = /string.*several.*lines/s; // Note 's' modifier
console.log('Match? ' + r.test(s)); // 'test' returns true
This is a recent addition and will not work in many current environments, for example Node v8.7.0 does not seem to recognise it, but it works in Chromium, and I'm using it in a Typescript test I'm writing and presumably it will become more mainstream as time goes by.

Now there's the s (single line) modifier, that lets the dot matches new lines as well :)
\s will also match new lines :D
Just add the s behind the slash
/<pre>.*?<\/pre>/gms

[.\n] doesn't work, because dot in [] (by regex definition; not javascript only) means the dot-character. You can use (.|\n) (or (.|[\n\r])) instead.

I have tested it (Chrome) and it's working for me (both [^] and [^\0]), by changing the dot (.) with either [^\0] or [^] , because dot doesn't match line break (See here: http://www.regular-expressions.info/dot.html).
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[^\0]*?<\/pre>/gm );
alert(arr); //Working

In addition to above-said examples, it is an alternate.
^[\\w\\s]*$
Where \w is for words and \s is for white spaces

[\\w\\s]*
This one was beyond helpful for me, especially for matching multiple things that include new lines, every single other answer ended up just grouping all of the matches together.

Allowing new line characters in javascript str.replace

This question is similar to "Allowing new line characters in javascript regex"
but the solution /m not runs with str.replace. You can test the code below at this page
<p id="demo"><i>I need to TRIM the italics here,
despite this line.</i>
</p>
<button onclick="myFunction()">Try it</button>
<script>
function myFunction()
{
var str=document.getElementById("demo").innerHTML;
var n=str.replace(/^(\s*)<i>(.+)<\/i>(\s*)$/m,"$1$2$3"); //tested also /s
alert(str)
document.getElementById("demo").innerHTML=n;
}
</script>

This answer is mostly to give you some insight into why your current approach does not work, and how you generally solve it.
The reason m doesn't help is that the other answer is wrong. This is not what m does. m simply makes the anchors match line beginnings and endings in addition to the string beginnings and endings. Some regex flavors have s for what you want to accomplish, but not ECMAScript. The simplest thing (and general solution) is to replace . (which matches everything except line breaks) with [\s\S] (which matches whitespace and non-whitespace, i.e. everything).
However, Casimir's approach is better in your case, as it avoids some other problems like greediness. Of course, as Casimir said, if there are tags in between the opening and closing <i> tags, then the approach will not work. In that case, something like <i>([\s\S]+?)</i> might be an option, but that's still not the full solution, in case you have nested i-tags or attributes in the opening tag, or capitalized I-tags and whatnot.
All in all, using regex to parse HTML is wrong! You should really use DOM manipulation. Especially, since you are using Javascript - THE language for DOM manipulation. What you should really do is traverse the DOM for all i tags in your demo element, and replace them with their inner HTML.

A way to avoid problems with newlines is to not use the dot, example:
var n=str.replace(/<i>([^<]+)<\/i>/,"$1");
I have replaced the dot by [^<] (all that is not a <, that include newlines)
the m modifier is not needed here, and you don't need to capture white characters too.
Note that my solution suppose that you don't have any < between <i> and </i>
In the other case, when you have nested tags for example, you can use this trick to avoid lazy quantifier:
var n=str.replace(/<i>((?:[^<]+|<+(?!\/i>)+)<\/i>/,"$1");

JavaScript/jQuery removing character 160 from a node's text() value - Regex

$('#customerAddress').text().replace(/\xA0/,"").replace(/\s+/," ");
Going after the value in a span (id=customerAddress) and I'd like to reduce all sections of whitespace to a single whitespace. The /\s+/ whould work except this app gets some character 160's between street address and state/zip
What is a better way to write this? this does not currently work.
UPDATE:
I have figured out that
$('.customerAddress').text().replace(/\s+/g," ");
clears the 160s and the spaces.
But how would I write a regex to just go after the 160s?
$('.customerAddress').text().replace(String.fromCharCode(160)," ");
didn't even work.
Note: I'm testing in Firefox / Firebug

Regarding just replacing char 160, you forgot to make a global regex, so you are only replacing the first match. Try this:
$('.customerAddress').text()
.replace(new RegExp(String.fromCharCode(160),"g")," ");
Or even simpler, use your Hex example in your question with the global flag
$('.customerAddress').text().replace(/\xA0/g," ");

\s does already contain the character U+00A0:
[\t\n\v\f\r \u00a0\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u3000]
But you should add the g modifier to replace globally:
$('#customerAddress').text().replace(/\s+/g, " ")
Otherwise only the first match will be replaced.

Sorry if I'm being obvious (or wrong), but doesn't .text() when called w/o parameters just RETURNS the text? I mean, I don't know if you included the full code or just an excerpt, but to really replace the span you should do it like:
var t = $('#customerAddress').text().replace(/\xA0/,"").replace(/\s+/," ");
$('#customerAddress').text(t);
Other than that, the regex for collapsing the spaces seems OK, I'm just not sure about the syntax of your non-printable char there.

We Keep Coding

JavaScript is the programming language of the Web.

Javascript RegEx to find first line of each paragraph - javascript

For the sake of future searchers, if you just need to style the first line of text, you can use the css pseudoclass ::first-line : textarea::first-line { background-color: yellow; } http://www.w3schools.com/cssref/sel_firstline.asp

Related

How to append string after matching field with regex

javascript regex replace multiline strings [duplicate]

Greedy Regex for varying number of new line [duplicate]

Allowing new line characters in javascript str.replace

JavaScript/jQuery removing character 160 from a node's text() value - Regex

Categories

Resources