Regex match all except a pattern - javascript

I need a little help. I already tried to practice in several ways, but it didn't work as expected. For example, this one.
I want to match all single words except the pattern <br> in JS.
So I tried
(?!<br>)[\s\S]
(?!<|b|r|>)[\s\S]
The problem I have is that the negative lookahead (?!...) only seems to reject the first character, <, and not the entire pattern <br>. Conversely, a plain <br> matches every <br> but none of the other words. How can I make the lookahead treat <br> as a whole?
Thank you so much!
Here is what I am trying.

The regular expression you are looking for might look like this:
([^>]|<(?!br>)[^>]+>)+(?=<br>|$)
It should work for any tag, try replacing br by p in the above pattern.
Regex101 link
However, it would be much easier, more readable, and faster to use:
content.split('<br>').filter(x => x.length)
Hope it helps.
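As a hedged sketch of both ideas (the function names and the sample string are made up for illustration): the first helper uses the split approach, and the second keeps a lookahead-based pattern but pairs it with an alternative that consumes each <br>, so that the scan cannot restart in the middle of a tag and return pieces like "br>two".

```javascript
// Hypothetical helper: split on the literal tag and drop empty pieces.
function piecesBySplit(content) {
  return content.split("<br>").filter((piece) => piece.length > 0);
}

// Hypothetical helper: either consume a whole <br> (no capture), or capture
// a run of characters in which no position starts a <br>.
function piecesByRegex(content) {
  const pattern = /<br>|((?:(?!<br>)[\s\S])+)/g;
  const pieces = [];
  for (const match of content.matchAll(pattern)) {
    // Group 1 is undefined when the <br> alternative matched.
    if (match[1] !== undefined) pieces.push(match[1]);
  }
  return pieces;
}

const sample = "one<br>two<br><br>three"; // illustrative input
console.log(piecesBySplit(sample));
console.log(piecesByRegex(sample));
```

Both calls should give the same three pieces for this input; the regex version is only worth the extra complexity when the separator itself needs to be a pattern.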

Related

How to append string after matching field with regex

I want to append a word after the <body> tag; it should not modify or replace anything else, just append a word. I have done something like this. Is it valid to use empty parentheses for the second capture group, or will it match everything?
/(<body[^>]*>)()/, `$1${my_variable}$2`)
The second capture group, designed to capture nothing, will match "nothing" - it will form a match immediately after your closed body tag. There's nothing wrong with doing this for the regex, though you might want to be wary of using [^>]* - this negated character class will gladly match across lines and grab as much input as it can. Handy for matching multi-line tags, but often very dangerous.
Also, if you're on Linux and for some reason have > symbols in filenames (which is valid!), your regex will break horribly, as shown here.
That being said, valid regex or not, it's usually a bad idea to use regex with html, since HTML isn't a regular language. Also, you could accidentally summon Cthulhu.
let page = "<html><body>Some info</body></html>";
page = page.replace("<body>", `<body>${my_variable}`); // replace returns a new string, so assign it
or
page = page.replace(/<body>|<BODY>/, `<body>${my_variable}`);
If you are in the browser, you can also use document.querySelector("body").innerHTML
Also depending on which framework you're using there are better ways to accomplish this.
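A minimal sketch of the string-based approach, assuming the page really is available as a string. The function name insertAfterBodyTag is hypothetical. It keeps the single capture-free match of the opening tag and uses a replacer function, so that any $ sequences in the inserted text are taken literally instead of being read as replacement patterns.

```javascript
// Hypothetical helper: insert `text` right after the first opening <body> tag.
// The i flag accepts <BODY>, and [^>]* allows attributes on the tag.
function insertAfterBodyTag(page, text) {
  return page.replace(/<body[^>]*>/i, (openingTag) => openingTag + text);
}

const page = '<html><BODY class="x">Some info</body></html>';
console.log(insertAfterBodyTag(page, "$1 hi "));
```

The empty second capture group from the question is not needed here: appending to the matched tag has the same effect.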

Make a regex that spans over multiple lines?

I have looked at the flags and I couldn't find what I am looking for. Basically, if I am searching for:
aba
It should totally ignore the new lines, so the following things are valid:
a
b
a
a
b
a
ab
a
Edit: I am aiming for something a bit more elegant than putting \s? after every character in the regex (that is doable when the pattern is a constant, but if it is a range then I have no idea whatsoever).
/a\s*b\s*a/
Place whitespace possibilities between each letter.
The simple case
For your example where the exact letters are aba, I would go with
a\s*b\s*a
See demo
The more intricate case
In a comment, you ask about an expression such as [a-z]{1,5}, where you presumably want to inject potential spaces between the letters. For this, I would go with
(?:[a-z]\s*){1,5}
See demo
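If typing \s* between every letter by hand is the sticking point, the pattern can be generated. This is only a sketch for literal search strings; the names escapeRegExp and looseLiteral are hypothetical helpers, not built-ins.

```javascript
// Escape characters that are special in a regex so they match literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a regex that matches `literal` with optional whitespace
// (including newlines) between any two characters.
function looseLiteral(literal) {
  const parts = Array.from(literal, (ch) => escapeRegExp(ch));
  return new RegExp(parts.join("\\s*"));
}

const aba = looseLiteral("aba");
console.log(aba.source);
console.log(aba.test("a\nb\na"), aba.test("ab\na"), aba.test("a b c"));
```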
:)) It's an interesting problem. For these situations I use another method.
First I remove all line ending chars:
someText = someText.replace(/(\r\n|\n|\r)/gm,"");
Then use a normal regex on the result.
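Put together, the two steps might look like the sketch below (the sample text is made up). One caveat of this approach: any match positions refer to the flattened copy, not to the original text.

```javascript
const someText = "a\r\nb\na"; // illustrative input with mixed line endings

// Step 1: remove every line ending.
const flattened = someText.replace(/\r\n|\n|\r/g, "");

// Step 2: search with an ordinary regex.
const found = /aba/.test(flattened);
console.log(flattened, found);
```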

Regular Expression: exclude html tags from "content"

One friend asked me this and as my knowledge on RegExp is not so good yet here I am.
How can I exclude the HTML tags from this string?
re<br>na<br>to<br>galvao
I've tried some RegExp but it didn't work as I was expecting.
(.*)<.*>(.*)
But this RegExp gets the first < and the last >.
Any ideas?
This is a quick way to do it:
var content = "re<br>na<br>to<br>galvao";
content = content.replace(/<[^>]*>/g,'');
You could use a non-greedy match. According to the answer to this question, in JavaScript the non-greedy quantifier is *?
So, assuming this is the only problem with your regex, it should work with
(.*?)<.*?>(.*?)
Match all html tags with this regex:
<("[^"]*?"|'[^']*?'|[^'">])*>
see demo here: http://regex101.com/r/fA0oT4
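To make the greedy versus non-greedy difference concrete, here is a small sketch run against the string from the question (the variable names are illustrative):

```javascript
const content = "re<br>na<br>to<br>galvao";

// Greedy: .* runs to the last > it can reach.
const greedy = content.match(/<.*>/)[0];

// Lazy: .*? stops at the first > it finds.
const lazy = content.match(/<.*?>/)[0];

// Negated class with the g flag: remove every tag.
const stripped = content.replace(/<[^>]*>/g, "");

console.log(greedy, lazy, stripped);
```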

Allowing new line characters in javascript str.replace

This question is similar to "Allowing new line characters in javascript regex"
but the /m solution does not work with str.replace. You can test the code below at this page
<p id="demo"><i>I need to TRIM the italics here,
despite this line.</i>
</p>
<button onclick="myFunction()">Try it</button>
<script>
function myFunction()
{
var str=document.getElementById("demo").innerHTML;
var n=str.replace(/^(\s*)<i>(.+)<\/i>(\s*)$/m,"$1$2$3"); //tested also /s
alert(str)
document.getElementById("demo").innerHTML=n;
}
</script>
This answer is mostly to give you some insight into why your current approach does not work, and how you generally solve it.
The reason m doesn't help is that the other answer is wrong. This is not what m does. m simply makes the anchors match line beginnings and endings in addition to the string beginnings and endings. Some regex flavors have s for what you want to accomplish; ECMAScript did not when this was written, although ES2018 later added an s (dotAll) flag. The simplest thing (and the general solution, which works in every engine) is to replace . (which matches everything except line breaks) with [\s\S] (which matches whitespace and non-whitespace, i.e. everything).
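A short sketch of the difference, using a string that stands in for the innerHTML from the question (the variable names are illustrative):

```javascript
const html = "<i>I need to TRIM the italics here,\ndespite this line.</i>\n";

// With m, the dot still cannot cross the line break, so nothing is replaced.
const withDot = html.replace(/<i>(.+)<\/i>/m, "$1");

// [\s\S] matches any character, line breaks included.
const withClass = html.replace(/<i>([\s\S]+?)<\/i>/, "$1");

// ES2018+ engines only: the s flag lets the dot match line breaks too.
const withDotAll = html.replace(/<i>(.+?)<\/i>/s, "$1");

console.log(withDot === html, withClass === withDotAll);
```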
However, Casimir's approach is better in your case, as it avoids some other problems like greediness. Of course, as Casimir said, if there are tags in between the opening and closing <i> tags, then the approach will not work. In that case, something like <i>([\s\S]+?)</i> might be an option, but that's still not the full solution, in case you have nested i-tags or attributes in the opening tag, or capitalized I-tags and whatnot.
All in all, using regex to parse HTML is wrong! You should really use DOM manipulation, especially since you are using JavaScript - THE language for DOM manipulation. What you should really do is traverse the DOM for all i tags in your demo element, and replace them with their inner HTML.
A way to avoid problems with newlines is to not use the dot, example:
var n=str.replace(/<i>([^<]+)<\/i>/,"$1");
I have replaced the dot with [^<] (anything that is not a <, which includes newlines).
The m modifier is not needed here, and you don't need to capture the whitespace either.
Note that my solution assumes that you don't have any < between <i> and </i>.
Otherwise, for example when you have nested tags, you can use this trick to avoid a lazy quantifier:
var n=str.replace(/<i>((?:[^<]+|<+(?!\/i>))+)<\/i>/,"$1");
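A runnable sketch of both patterns with balanced parentheses, assuming short inputs (the sample strings are made up). In the second pattern, a < is only allowed inside the group when it does not start the closing </i>.

```javascript
// No < allowed between the tags; newlines are fine.
const simple = /<i>([^<]+)<\/i>/;

// Allows nested tags: a run of non-< characters, or a < not followed by /i>.
const nested = /<i>((?:[^<]+|<+(?!\/i>))+)<\/i>/;

const plain = "<i>one\ntwo</i>".replace(simple, "$1");
const withTags = "<i>a <b>bold</b> c</i>".replace(nested, "$1");
console.log(plain, withTags);
```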

Need help in regex pattern

I have this regex pattern
\=[a-zA-Z\.\:\[\]_\(\)\&\$\%#\-\#\!0-9;=\?/\+\xBF\~]+[?\s+|?>]
and I have this HTML
1.esc#xyz.com
2.johnross#zys.com
3.johnross#wen.com
Here the problem is,
I need to avoid the first and second ones, because their attribute values contain white space and are still valid attributes.
But only the third one is working, as it doesn't have white space.
That means nothing should be selected by the above pattern.
here is direct link to test
http://regexr.com?31r61
Please help!
Thanks,
EDIT:
If you just want to match unquoted attributes, this should work:
[<\s]+[\w]+(=[^\"][^\s>]*)
Kind of inelegant but let me know if that does what you want.
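A quick check of that pattern. The two sample links are hypothetical, since the original markup is not shown above: one has a quoted href containing a space, the other an unquoted href.

```javascript
const unquotedAttr = /[<\s]+[\w]+(=[^"][^\s>]*)/;

// Hypothetical sample markup, not taken from the question.
const quotedLink = '<a href="mailto:a@example.com?subject=Hello world">1</a>';
const unquotedLink = "<a href=mailto:a@example.com>3</a>";

const quotedHit = unquotedAttr.test(quotedLink);
const unquotedHit = unquotedLink.match(unquotedAttr);
console.log(quotedHit, unquotedHit && unquotedHit[1]);
```

On these samples only the unquoted attribute is reported. Other inputs may behave differently, for instance a quoted value that itself contains something like " x=1".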
Which pattern are you trying to match? All three? And if so, which portion? The subject or the email? If you're just trying to match the subject, try using this as the pattern to match:
\=\"mailto:[^?]*\?subject=([^\"]*)\"\>
That will return a match where the group is the subject itself.
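For example, against a hypothetical link (the markup below is an assumption, not the asker's HTML), the capture group holds the subject text:

```javascript
const subjectPattern = /="mailto:[^?]*\?subject=([^"]*)">/;

// Hypothetical sample markup.
const link = '<a href="mailto:a@example.com?subject=Hello world">1</a>';

const subjectMatch = link.match(subjectPattern);
const subject = subjectMatch ? subjectMatch[1] : null;
console.log(subject);
```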
That is a wicked character class....
Why don't you try something a bit more reasonable? Try this...
\=".*?(?<!\\)"
That will match anything in the quotes after href, if that's what you're trying to get. If you're looking for more than that, this regex can easily be modified.
