Regular expression to replace HTML content - javascript

I am trying to replace HTML content with a regular expression.
From
test test ZZZ<SPAN>ZZZ test test</SPAN>
to
test test AAA<SPAN>AAA test test</SPAN>
Note that only words outside HTML tags are replaced from ZZZ to AAA.
Any ideas? Thanks a lot in advance.

You could walk all nodes, replacing text in text ones (.nodeType == 3):
Something like:
element.find('*:contains(ZZZ)').contents().each(function () {
if (this.nodeType === 3)
this.nodeValue = this.nodeValue.replace(/ZZZ/g,'AAA')
})
Or same without jQuery:
function replaceText(element, from, to) {
    for (var child = element.firstChild; child !== null; child = child.nextSibling) {
        if (child.nodeType === 3)        // text node: replace in its value
            child.nodeValue = child.nodeValue.replace(from, to);
        else if (child.nodeType === 1)   // element node: recurse into its children
            replaceText(child, from, to);
    }
}
replaceText(element, /ZZZ/g, 'AAA');

The best idea in this case is most certainly to not use regular expressions to do this. At least not on their own. JavaScript surely has a HTML Parser somewhere?
If you really must use regular expressions, you could try to look for every instance of ZZZ that is followed by a "<" before any ">". That would look like
ZZZ(?=[^>]*<)
This might break horribly if the code contains HTML comments or script blocks, or is not well formed.
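As a rough sketch of how that lookahead behaves (the variable names and the two extra test strings are my own, not from the question), it can be tried on the sample input:

```javascript
// Lookahead from the answer: ZZZ followed by "<" before any ">".
var outsideTags = /ZZZ(?=[^>]*<)/g;

var sample = 'test test ZZZ<SPAN>ZZZ test test</SPAN>';
var replaced = sample.replace(outsideTags, 'AAA');

// Hypothetical extra inputs to show where the pattern works and where it stops.
var inAttribute = '<SPAN title="ZZZ">ZZZ</SPAN>'.replace(outsideTags, 'AAA');
var trailingText = '<SPAN>ZZZ</SPAN> ZZZ'.replace(outsideTags, 'AAA');

console.log(replaced);     // test test AAA<SPAN>AAA test test</SPAN>
console.log(inAttribute);  // <SPAN title="ZZZ">AAA</SPAN>
console.log(trailingText); // <SPAN>AAA</SPAN> ZZZ
```

The last line shows one more limitation: text after the final tag has no "<" following it, so it is never matched.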

Assuming a well-formed html document with outer/enclosing tags like <html>, I would think the easiest way would be to look for the > and < signs:
/(\>[^\>\<]*)ZZZ([^\>\<]*\<)/$1AAA$2/
If you're dealing with HTML fragments that may not have enclosing tags, it gets a little more complicated: you'd have to allow for the start and end of the string.
Example JS (sorry, missed the tag):
alert('test test ZZZ<SPAN>ZZZ test test</SPAN>'.replace(/(\>[^\>\<]*)ZZZ([^\>\<]*\<)/g, "$1AAA$2"));
Explanation: for each match that
starts with >: \>
follows with any number of characters that are neither > nor <: [^\>\<]*
then has "ZZZ"
follows with any number of characters that are neither > nor <: [^\>\<]*
and ends with <: \<
Replace with
everything before the ZZZ, marked with the first capture group (parentheses): $1
AAA
everything after the ZZZ, marked with the second capture group (parentheses): $2
Using the "g" (global) option to ensure that all possible matches are replaced.
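For fragments without enclosing tags, one possible variant (my own sketch, not part of the answer above) lets the start of the string stand in for > and the end of the string stand in for <. The names strict and loose are just for illustration:

```javascript
// Pattern from the answer: the text run must sit between ">" and "<".
var strict = /(>[^<>]*)ZZZ([^<>]*<)/g;
// Assumed variant: also accept start of string before and end of string after.
var loose = /((?:^|>)[^<>]*)ZZZ([^<>]*(?:<|$))/g;

var fragment = 'test test ZZZ<SPAN>ZZZ test test</SPAN>';

var strictResult = fragment.replace(strict, '$1AAA$2');
var looseResult = fragment.replace(loose, '$1AAA$2');
var tailResult = '<b>x</b> ZZZ'.replace(loose, '$1AAA$2');

console.log(strictResult); // test test ZZZ<SPAN>AAA test test</SPAN>
console.log(looseResult);  // test test AAA<SPAN>AAA test test</SPAN>
console.log(tailResult);   // <b>x</b> AAA
```

Both patterns consume the whole text run around ZZZ, so only one occurrence per text run is replaced in a single pass.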

Try this:
var str = '<DIV>ZZZ test test</DIV>test test ZZZ';
var rpl = 'ZZZ'; // the text to search for
console.log(str.replace(new RegExp(rpl + "(?=[^>]*<)", "gi"), "XXX"));
// only the ZZZ inside <DIV> is followed by a "<", so the trailing ZZZ is left as is

Have you tried this:
replace:
>([^<>]*)(ZZZ)([^<>]*)<
with:
>$1AAA$3<
But beware all the savvy suggestions in the post linked in the first comment to your question!

Regex excluding matches wrapped in specific bbcode tags

I'm trying to replace double quotes with curly quotes, except when the text is wrapped in certain tags, like [quote] and [code].
Sample input
[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]
<p>"Why no goodbye?" replied [b]Bob[/b]. "It's always Hello!"</p>
Expected output
[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]
<p>“Why no goodbye?” replied [b]Bob[/b]. “It's always Hello!”</p>
I figured how to elegantly achieve what I want in PHP by using (*SKIP)(*F), however my code will be run in javascript, and the javascript solution is less than ideal.
Right now I'm splitting the string at those tags, running the replace, then putting the string together:
var o = 3;
a = a
.split(/(\[(?<first>(?:icode|quote|code))[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/(?:\k<first>)\])/i)
.map(function(x,i) {
if (i == o-1 && x) {
x = '';
}
else if (i == o && x)
{
x = x.replace(/(?![^<]*>|[^\[]*\])"([^"]*?)"/gi, '“$1”')
o = o+3;
}
return x;
}).join('');
Javascript Regex Breakdown
Inside split():
(\[(?<first>icode|quote|code)[^\]]*?\](?:.)*?\[\/(\k<first>)\]) - captures the pattern inside parentheses:
\[(?<first>quote|code|icode)[^\]]*?\] - a [quote], [code], or [icode] opening tag, with or without parameters like =html, eg [code=html]
(?:[\s]*?.)*? - any 0+ (as few as possible) occurrences of any char (.), preceded or not by whitespace, so it doesn't break if the opening tag is followed by a line break
[\s]*? - 0+ whitespaces
\[\/(\k<first>)\] - [/quote], [/code], or [/icode] closing tags. Matches the text captured in the (?<first>) group. Eg: if it's a quote opening tag, it'll be a quote closing tag
Inside replace():
(?![^<]*>|[^\[]*\])"([^"]*?)" - captures text inside double quotes:
(?![^<]*>|[^\[]*\]) - negative lookahead, looks for characters (that aren't < or [) followed by either > or ] and discards them, so it won't match anything inside bbcode and html tags. Eg: [spoiler="Name"] or <span style="color: #24c4f9">. Note that matches wrapped in tags are left untouched.
" - literal opening double quotes character.
([^"]*?) - any 0+ character, except double quotes.
" - literal closing double quotes character.
SPLIT() REGEX DEMO: https://regex101.com/r/Ugy3GG/1
That's awful, because the replace is executed multiple times.
Meanwhile, the same result can be achieved with a single PHP regex. The regex I wrote was based on Match regex pattern that isn't within a bbcode tag.
(\[(?<first>quote|code|icode)[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/(\k<first>)\])(*SKIP)(*F)|(?![^<]*>|[^\[]*\])"([^"]*?)"
PHP Regex Breakdown
(\[(?<first>quote|code|icode)[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/(\k<first>)\])(*SKIP)(*F) - matches the pattern inside capturing parentheses just like javascript split() above, then (*SKIP)(*F) make the regex engine omit the matched text.
| - or
(?![^<]*>|[^\[]*\])"([^"]*?)" - captures text inside double quotes in the same way javascript replace() does
PHP DEMO: https://regex101.com/r/fB0lyI/1
The beauty of this regex is that it only needs to be run once. No splitting and joining of strings. Is there a way to implement it in javascript?

Because JS lacks backtracking verbs you will need to consume those bracketed chunks but later replace them as is. By obtaining the second side of the alternation from your own regex the final regex would be:
\[(quote|i?code)[^\]]*\][\s\S]*?\[\/\1\]|(?![^<]*>|[^\[]*\])"([^"]*)"
But the tricky part is using a callback function with replace() method:
str.replace(regex, function($0, $1, $2) {
return $1 ? $0 : '“' + $2 + '”';
})
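The ternary operator above returns $0 (the whole match) if the first capturing group matched; otherwise it encloses the second capturing group's value in curly quotes and returns that.
Note: this may fail on some inputs, for example nested [quote] blocks, because the lazy match stops at the first closing tag.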
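Putting the regex and the callback together, a minimal runnable sketch could look like this. The function name curlyQuotes and the gi flags are my own choices; the sample input is the one from the question:

```javascript
// First alternative: a whole [quote]/[code]/[icode] block (kept as is).
// Second alternative: a double-quoted string outside of HTML/BBCode tags.
var regex = /\[(quote|i?code)[^\]]*\][\s\S]*?\[\/\1\]|(?![^<]*>|[^\[]*\])"([^"]*)"/gi;

function curlyQuotes(str) {
  return str.replace(regex, function (whole, tagName, quoted) {
    // tagName is undefined when the second alternative matched
    return tagName ? whole : '“' + quoted + '”';
  });
}

var input = '[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]\n' +
            '<p>"Why no goodbye?" replied [b]Bob[/b]. "It\'s always Hello!"</p>';
var expected = '[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]\n' +
               '<p>“Why no goodbye?” replied [b]Bob[/b]. “It\'s always Hello!”</p>';

var output = curlyQuotes(input);
console.log(output === expected); // true
```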

Nested markup is hard to parse with regular expressions, and with JS's RegExp in particular. Complex regular expressions are also hard to read, maintain, and debug. If your needs are simple, such as a tag content replacement with some banned tags excluded, consider a simple code-based alternative to run-on RegExps:
function curly(str) {
var excludes = {
quote: 1,
code: 1,
icode: 1
},
xpath = [];
return str.split(/(\[[^\]]+\])/) // breakup by tag markup
.map(x => { // for each tag and content:
if (x[0] === "[") { // tag markup:
if (x[1] === "/") { // close tag
xpath.pop(); // remove from current path
} else { // open tag
xpath.push(x.slice(1).split(/\W/)[0]); // add to current path
} //end if open/close tag
} else { // tag content
if (xpath.every(tag =>!excludes[tag])) x = x.replace(/"/g, function repr() {
return (repr.z = !repr.z) ? "“" : "”"; // flip flop return value (naive)
});
} //end if markup or content?
return x;
}) // end term map
.join("");
} /* end curly() */
var input = `[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]
<p>"Why no goodbye?" replied [b]Bob[/b]. "It's always Hello!"</p>`;
var wants = `[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]
<p>“Why no goodbye?” replied [b]Bob[/b]. “It's always Hello!”</p>`;
curly(input) == wants; // true
To my eyes, even though it is a bit longer, code allows documentation, indentation, and explicit naming that make these sorts of semi-complicated logical operations easier to understand.
If your needs are more complex, use a true BBCode parser for JavaScript and map/filter/reduce its model as needed.

How to replace text containing string with a HTML tag?

I am trying to work out how to replace any instance of text on a page matching a certain phrase with an HTML tag, while keeping the contents the same.
bold("Hello")
The code above should be replaced with:
<b>Hello</b>
Here is the code that I am currently using:
node.data = node.data.replace('bold("','<b>');
node.data = node.data.replace('")','</b>');
The problem with this is that rather than creating a <b> tag, it converts the < and > into their HTML entities - &lt; &gt;.
Also, it seems impractical to replace all instances of ") with </b>.
What would be the best way to achieve this?
I would appreciate any help with this, thank you!

I assume you don't literally mean all text on a page, but rather all text inside of a given element. DucFilans is right on the money with regex; I'll slightly modify his answer to provide a complete working solution:
var parserRules = [
{ pattern: /bold\("(.*?)"\)/g, replacement: '<b>$1</b>' },
{ pattern: /italics\("(.*?)"\)/g, replacement: '<i>$1</i>' }
];
document.querySelectorAll('.parse').forEach(function(tag) {
var inner = tag.innerHTML;
parserRules.forEach(function(rule) {
inner = inner.replace(rule.pattern, rule.replacement)
});
tag.innerHTML = inner;
});
<div class='parse'>
<div>bold("Hello") world <p> Rose italics("pink") slow bold("!") </p> </div>
<div>bold("astiago") cheeeeeeese!</div>
</div>
You can define the regex patterns and replacement rules in the parserRules array, and then any element marked with the 'parse' class will be parsed. One warning is that this will replace all of the subelements, so if you are relying on listeners attached to subelements or something similar you'll need to take care of that as well.
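To try the rules without a page, the same idea can be sketched on a plain string. The helper name applyRules and the sample text are assumptions of mine; the two rules are copied from above:

```javascript
var parserRules = [
  { pattern: /bold\("(.*?)"\)/g, replacement: '<b>$1</b>' },
  { pattern: /italics\("(.*?)"\)/g, replacement: '<i>$1</i>' }
];

// Apply every rule in order to the text and return the resulting HTML string.
function applyRules(text, rules) {
  return rules.reduce(function (out, rule) {
    return out.replace(rule.pattern, rule.replacement);
  }, text);
}

var html = applyRules('bold("Hello") world, Rose italics("pink") slow bold("!")', parserRules);
console.log(html); // <b>Hello</b> world, Rose <i>pink</i> slow <b>!</b>
```

The result is a string of markup, so it has to be assigned to innerHTML (as in the snippet above) rather than to a text node's data, which is what produced the escaped entities in the question.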

A regex can handle this replacement:
var str = 'bold("Hello")';
var regex = /bold\("(.*?)"\)/;
str = str.replace(regex, "<b>$1</b>");
console.log(str);

javascript replace with regexp has strange behaviour

Maybe someone can give me a hint...
I have the following code and experience a strange behaviour in javascript (node.js):
var a = "img{http://my.image.com/imgae.jpg} img{http://my.image.com/imgae.jpg}"
var html = a.replace(/img\{(.*)\}/g, '<img src="$1" class="image">');
//result: <img src="http://my.image.com/imgae.jpg" class="image"">
As you can see, the occurrence in the string (a markup thing) is replaced by an img tag with source as expected.
But now something strange. In the markup are probably several elements of type img{src}
var a = "img{http://my.image.com/imgae.jpg} some text between img{http://my.image.com/imgae.jpg}"
var html = a.replace(/img\{(.*)\}/g, '<img src="$1" class="image">');
//result: <img src="http://my.image.com/imgae.jpghttp://my.image.com/imgae.jpg" class="image"">
The result is strange. in $1 all matches are stored and accumulated... And there is only one image tag.
I am confused...

Try: a.replace(/img\{(.*?)\}/g, '<img src="$1" class="image">');
Adding ? after the quantifier makes it non-greedy, so it stops at the first closing brace.

Use this to stop at the first closing curly bracket.
var html = a.replace(/img{([^}]*)}/g, '<img src="$1" class="image">');

I think it's probably more important that you understand how this is working. .* can be a dangerous regular expression if you don't understand what it will do, because it is greedy and will consume as much as it can; some linters will warn against it.
So if you break down your regex, you will find that the img\{ part matches the start of the string, (.*) matches http://my.image.com/imgae.jpg} some text between img{http://my.image.com/imgae.jpg, and the final \} matches the closing }, because this is the longest string that matches the expression.
The best solution is to use ([^}]*), which matches anything except }, because you know that nothing between the image braces {} will be a closing brace.
You can test your regex to see what it is matching:
var reg = /img\{(.*)\}/
var a = "img{http://my.image.com/imgae.jpg} img{http://my.image.com/imgae.jpg}"
var groups = a.match(reg)
// without the g flag, match() also returns the capture groups,
// so we can see what the first group matched
// groups[1] === "http://my.image.com/imgae.jpg} img{http://my.image.com/imgae.jpg"
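To compare the greedy, lazy, and negated-class versions side by side, here is a small sketch. The shortened file names and the simplified replacement tag are placeholders of mine:

```javascript
var markup = 'img{a.jpg} some text img{b.jpg}';
var tag = '<img src="$1">';

var greedy = markup.replace(/img\{(.*)\}/g, tag);      // .* runs to the last }
var lazy = markup.replace(/img\{(.*?)\}/g, tag);       // .*? stops at the first }
var negated = markup.replace(/img\{([^}]*)\}/g, tag);  // [^}]* cannot cross a }

console.log(greedy);  // <img src="a.jpg} some text img{b.jpg">
console.log(lazy);    // <img src="a.jpg"> some text <img src="b.jpg">
console.log(negated); // <img src="a.jpg"> some text <img src="b.jpg">
```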

Regular expression in javascript to match outside of XML tags

I want to find all matches of "a" in <span class="get">habbitant morbi</span> triastbbitique, except for the "a" inside tags (see below, where each wanted "a" is marked with *).
<span class="get">h*a*bbit*a*nt morbi</span> tri*a*stbbitique.
Once I find them, I want to replace them while keeping the original tags.
This expression doesn't work:
var variable = "a";
var reg = new RegExp("[^<]."+variable+".[^>]$",'gi');

I would recommend not using a regular expression to parse HTML; HTML is not a regular language, and you will experience pain for all but simple cases.
Your question is still a bit unclear, but let me try rephrasing to see if I have it right:
You'd like to get all matches of a given string in a HTML document, except for matches in <tag> bodies?
Assuming you're using jQuery or similar:
// Let the browser parse it for you:
var container = document.createElement('div')
container.innerHTML = '<span class="get">habbitant morbi</span> triastbbitique'
var doc_text = $(container).text()
// And then you can just regex away normally:
doc_text.match(/a/gi)
(Even better would be to use DOMParser, but that doesn't have wide browser support yet)
If you're in Node, then you want to look for a library that helps you parse HTML nodes (like jsdom), and then just splat out all the text nodes.

Note that this question isn't about parsing. This is lexing. Something that regex are regularly and properly used for.
If you want to go with regex there are a couple of ways you could do this.
A simple hack lookahead like:
a(?![^<>]*>)
Note that this won't properly handle < and > that are quoted inside tags or left unescaped outside of tags.
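A quick sketch of the lookahead hack on the string from the question (wrapping each hit in asterisks via $& is my choice, to mirror the question's notation):

```javascript
// "a" that is NOT followed by a ">" before the next "<" or ">", i.e. not inside a tag.
var outsideTag = /a(?![^<>]*>)/gi;

var source = '<span class="get">habbitant morbi</span> triastbbitique';
var marked = source.replace(outsideTag, '*$&*');

// A bare ">" in the text fools the lookahead: this "a" is wrongly skipped.
var fooled = 'a > b'.replace(outsideTag, '*$&*');

console.log(marked); // <span class="get">h*a*bbit*a*nt morbi</span> tri*a*stbbitique
console.log(fooled); // a > b
```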
A full blown tokenizer of the form:
(expression for tag|comments|etc)|(stuff outside that that i'm interested in)
Replaced with a function that does different things depending on which part was matched. If $1 matched, it would be replaced by itself; if $2 matched, replace it with *$2*.
The full tokenizer way is of course not a trivial task, the spec isn't small.
But if simplifying to only match the basic tags, ignore CDATA, comments, script/style tags, etc, you could use the following:
var str = '<span class="a <lal> a" attr>habbitant 2 > morbi. 2a < 3a</span> triastbbitique';
var re = /(<[a-z\/](?:"[^"]*"|'[^']*'|[^'">]+)*>)|(a)/gi;
var res = str.replace(re, function(m, tag, a){
return tag ? tag : "*" + a + "*";
});
Result:
<span class="a <lal> a" attr>h*a*bbit*a*nt 2 > morbi. 2*a* < 3*a*</span> tri*a*stbbitique
This handles messy tags, quotes and unescaped </> in the HTML.
Couple examples of tokenizing HTML tags with regex (which should translate fine to JS regex):
Remove on* JS event attributes from HTML tags
Regex to allow only set of HTML Tags and Attributes

replace special characters within a block using regex

I need to replace all < and > inside a [code] block. I do NOT want to select and replace the whole content of [code]; I just want to select the < and > inside it, temporarily replace them with other characters, do other replacements, and then turn them back into < and > inside [code].
The solution that I use:
replace(/<(?=[^\[]*\[\/code\])/gi,"&_lt_;");
replace(/>(?=[^\[]*\[\/code\])/gi,"&_gt_;");
DO OTHER REPLACEMENT/CUSTOMIZATION HERE
replace(/&_lt_;/gi,"<");
replace(/&_gt_;/gi,">");
The only problem is that if the content of [code] contains the character [, nothing before that character in the block is replaced. How can I fix this?
Example that works:
<b>
[code]
<form action="nd.php" method="post">
<b>
<strong>
[/code]
<b>
Example that does not work:
<b>
[code]
<form action="nd.php" method="post">
<b>
$_POST[
<strong>
[/code]
<b>
EDIT: please only provide simple regex replace solution. I can not use callback function for this issue.

The accepted-answer for the linked question doesn't work for me for the "example that works". However, the other answer does - it also works for the "example that does not work" (there was a typo though).
Try the following regex:
/(\[code\][\s\S]*?\[\/code\])|<[\s\S]*?>/g
In the replace() function, you would use:
.replace(/(\[code\][\s\S]*?\[\/code\])|<[\s\S]*?>/g, '$1');
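As a small sketch of what that call does (the single-line sample is a condensed version of the question's failing example, and the variable names are mine): [code] blocks are captured and put back through $1, while any other tag is matched without a capture and so replaced by an empty string.

```javascript
var codeOrTag = /(\[code\][\s\S]*?\[\/code\])|<[\s\S]*?>/g;

// Condensed version of the "does not work" example, including the stray "[".
var post = '<b>[code]<form>$_POST[<strong>[/code]<b>';
var stripped = post.replace(codeOrTag, '$1');

console.log(stripped); // [code]<form>$_POST[<strong>[/code]
```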
EDIT
If I understand correctly, your end-goal is to keep all of the content within [code][/code] the same - but be able to do replacements on all HTML tags that are outside of these tags (which may or may not mean to fully strip the characters)?
If this is the case, there is no need for a long list of regexes; The above regex can be used (with a slight modification) and it can cover many cases. Combine the regex/replace with a callback function to handle your extra replacements:
var replaceCallback = function(match) {
// if the match's first characters are '[code]', we have a '[code][/code]' block
if (match.substring(0, 6) == '[code]') {
// do any special replacements on this block; by default, return it untouched
return match;
}
// the match you now have is an HTML tag; it can be `<tag>` or `</tag>`
// do any special replacements; by default, return an empty string
return '';
}
str = str.replace(/(\[code\][\s\S]*?\[\/code\])|(<[\s\S]*?>)/g, replaceCallback);
The one regex modification was to add a group around the html-tag section (the second part of the regex). This will allow it to be passed to the callback function.
UPDATE ([code] isn't literal)
Per a comment, I've realized that the tag [code] isn't literal - you want to cover all BBCode-style tags. This is just as easy as the above example (even easier in the callback). Instead of the word code in the regex, you can use [a-z]+ to cover all alphabetical characters. Then, inside the callback you can just check the very first character; if it's a [, you're in a code block - otherwise you have an HTML tag that's outside a code block:
var replaceCallback = function(match) {
// if the match's first character is '[', we have a '[code][/code]' block
if (match.substring(0, 1) == '[') {
// do any special replacements on this block; by default, return it untouched
return match;
}
// the match you now have is an HTML tag; it can be `<tag>` or `</tag>`
// do any special replacements; by default, return an empty string
return '';
}
str = str.replace(/(\[[a-z]+\][\s\S]*?\[\/[a-z]+\])|(<[\s\S]*?>)/gi, replaceCallback);
Also note that I added an i to the regex's options to ignore case (otherwise you'll need [a-zA-Z] to handle capital letters).

Here's my edited answer. Sorry again.
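str = str.replace(/(\[code\])([\s\S]*?)(\[\/code\])/g, function(a, b, c, d) {
    // [\s\S]*? also matches line breaks, so multi-line [code] blocks are covered;
    // escape the angle brackets only inside the block
    return b + c.replace(/</g, '&lt;').replace(/>/g, '&gt;') + d;
});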
