replace special characters within a block using regex

I need to replace every < and > inside a [code] block. I do NOT want to select and replace the whole content of [code]; I only want to match the < and > characters inside it, temporarily replace them with other characters, do some other replacements, and then turn them back into < and > inside [code].
The solution I use:
replace(/<(?=[^\[]*\[\/code\])/gi,"&_lt_;");
replace(/>(?=[^\[]*\[\/code\])/gi,"&_gt_;");
DO OTHER REPLACEMENT/CUSTOMIZATION HERE
replace(/&_lt_;/gi,"<");
replace(/&_gt_;/gi,">");
The only problem is that if the content between [code] and [/code] contains a [ character, the replacement does not work for anything before that character in the block. How can I fix this?
Example that works:
<b>
[code]
<form action="nd.php" method="post">
<b>
<strong>
[/code]
<b>
Example that does not work:
<b>
[code]
<form action="nd.php" method="post">
<b>
$_POST[
<strong>
[/code]
<b>
EDIT: please only provide a simple regex replace solution. I cannot use a callback function for this issue.

The accepted-answer for the linked question doesn't work for me for the "example that works". However, the other answer does - it also works for the "example that does not work" (there was a typo though).
Try the following regex:
/(\[code\][\s\S]*?\[\/code\])|<[\s\S]*?>/g
In the replace() function, you would use:
.replace(/(\[code\][\s\S]*?\[\/code\])|<[\s\S]*?>/g, '$1');
EDIT
If I understand correctly, your end-goal is to keep all of the content within [code][/code] the same - but be able to do replacements on all HTML tags that are outside of these tags (which may or may not mean to fully strip the characters)?
If this is the case, there is no need for a long list of regexes; The above regex can be used (with a slight modification) and it can cover many cases. Combine the regex/replace with a callback function to handle your extra replacements:
var replaceCallback = function(match) {
  // if the match's first characters are '[code]', we have a '[code][/code]' block
  if (match.substring(0, 6) == '[code]') {
    // do any special replacements on this block; by default, return it untouched
    return match;
  }
  // the match you now have is an HTML tag; it can be `<tag>` or `</tag>`
  // do any special replacements; by default, return an empty string
  return '';
};
str = str.replace(/(\[code\][\s\S]*?\[\/code\])|(<[\s\S]*?>)/g, replaceCallback);
The one regex modification was to add a group around the html-tag section (the second part of the regex). This will allow it to be passed to the callback function.
UPDATE ([code] isn't literal)
Per a comment, I've realized that the tag [code] isn't literal - you want to cover all BBCode style tags. This is just-as-easy as the above example (even easier in the callback). Instead of the word code in the regex, you can use [a-z]+ to cover all alphabetical characters. Then, inside the callback you can just check the very first character; if it's a [, you're in a code block - otherwise you have an HTML tag that's outside a code block:
var replaceCallback = function(match) {
  // if the match's first character is '[', we have a '[code][/code]' block
  if (match.substring(0, 1) == '[') {
    // do any special replacements on this block; by default, return it untouched
    return match;
  }
  // the match you now have is an HTML tag; it can be `<tag>` or `</tag>`
  // do any special replacements; by default, return an empty string
  return '';
};
str = str.replace(/(\[[a-z]+\][\s\S]*?\[\/[a-z]+\])|(<[\s\S]*?>)/gi, replaceCallback);
Also note that I added an i to the regex's options to ignore case (otherwise you'll need [a-zA-Z] to handle capital letters).
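Since the question's edit rules out callbacks, here is a minimal callback-free sketch. It assumes the tag is the literal [code] and that blocks are never nested. The question's lookahead uses [^\[]*, which stops at any [ character, and that is why $_POST[ breaks it. The variant below lets the lookahead cross any character as long as a new [code] opening tag does not start there. The function names protectCodeBlocks and restoreCodeBlocks are made up for illustration.

```javascript
// Hypothetical helper: swap < and > for placeholders, but only inside [code]...[/code].
// The lookahead succeeds only if [/code] can be reached without first
// running into another [code] opening tag.
function protectCodeBlocks(str) {
  return str
    .replace(/<(?=(?:(?!\[code\])[\s\S])*?\[\/code\])/gi, "&_lt_;")
    .replace(/>(?=(?:(?!\[code\])[\s\S])*?\[\/code\])/gi, "&_gt_;");
}

// Hypothetical helper: turn the placeholders back into < and >.
function restoreCodeBlocks(str) {
  return str.replace(/&_lt_;/g, "<").replace(/&_gt_;/g, ">");
}

var sample = [
  "<b>",
  "[code]",
  '<form action="nd.php" method="post">',
  "<b>",
  "$_POST[",
  "<strong>",
  "[/code]",
  "<b>"
].join("\n");

var guarded = protectCodeBlocks(sample);
// ... do the other replacements on `guarded` here ...
var restored = restoreCodeBlocks(guarded);
console.log(guarded);
```

The lookahead rescans the text after every < or >, so this is best suited to short inputs; for large documents a single-pass callback is likely cheaper.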

Here's my edited answer. Sorry again.
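str = str.replace(/(\[code\])([\s\S]*?)(\[\/code\])/g, function(a, b, c, d) {
  // escape the angle brackets inside the block only; [\s\S] lets the block span several lines
  return b + c.replace(/</g, '&lt;').replace(/>/g, '&gt;') + d;
});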

Related

Regex excluding matches wrapped in specific bbcode tags

I'm trying to replace double quotes with curly quotes, except when the text is wrapped in certain tags, like [quote] and [code].
Sample input
[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]
<p>"Why no goodbye?" replied [b]Bob[/b]. "It's always Hello!"</p>
Expected output
[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]
<p>“Why no goodbye?” replied [b]Bob[/b]. “It's always Hello!”</p>
I figured how to elegantly achieve what I want in PHP by using (*SKIP)(*F), however my code will be run in javascript, and the javascript solution is less than ideal.
Right now I'm splitting the string at those tags, running the replace, then putting the string together:
var o = 3;
a = a
.split(/(\[(?<first>(?:icode|quote|code))[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/(?:\k<first>)\])/i)
.map(function(x,i) {
if (i == o-1 && x) {
x = '';
}
else if (i == o && x)
{
x = x.replace(/(?![^<]*>|[^\[]*\])"([^"]*?)"/gi, '“$1”')
o = o+3;
}
return x;
}).join('');
Javascript Regex Breakdown
Inside split():
(\[(?<first>icode|quote|code)[^\]]*?\](?:.)*?\[\/(\k<first>)\]) - captures the pattern inside parentheses:
\[(?<first>quote|code|icode)[^\]]*?\] - a [quote], [code], or [icode] opening tag, with or without parameters like =html, eg [code=html]
(?:[\s]*?.)*? - any 0+ (as few as possible) occurrences of any char (.), preceded or not by whitespace, so it doesn't break if the opening tag is followed by a line break
[\s]*? - 0+ whitespaces
\[\/(\k<first>)\] - [/quote], [/code], or [/icode] closing tags. Matches the text captured in the (?<first>) group. Eg: if it's a quote opening tag, it'll be a quote closing tag
Inside replace():
(?![^<]*>|[^\[]*\])"([^"]*?)" - captures text inside double quotes:
(?![^<]*>|[^\[]*\]) - negative lookahead, looks for characters (that aren't < or [) followed by either > or ] and discards them, so it won't match anything inside bbcode and html tags. Eg: [spoiler="Name"] or <span style="color: #24c4f9">. Note that matches wrapped in tags are left untouched.
" - literal opening double quotes character.
([^"]*?) - any 0+ character, except double quotes.
" - literal closing double quotes character.
SPLIT() REGEX DEMO: https://regex101.com/r/Ugy3GG/1
That's awful, because the replace is executed multiple times.
Meanwhile, the same result can be achieved with a single PHP regex. The regex I wrote was based on Match regex pattern that isn't within a bbcode tag.
(\[(?<first>quote|code|icode)[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/(\k<first>)\])(*SKIP)(*F)|(?![^<]*>|[^\[]*\])"([^"]*?)"
PHP Regex Breakdown
(\[(?<first>quote|code|icode)[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/(\k<first>)\])(*SKIP)(*F) - matches the pattern inside capturing parentheses just like javascript split() above, then (*SKIP)(*F) make the regex engine omit the matched text.
| - or
(?![^<]*>|[^\[]*\])"([^"]*?)" - captures text inside double quotes in the same way javascript replace() does
PHP DEMO: https://regex101.com/r/fB0lyI/1
The beauty of this regex is that it only needs to be run once. No splitting and joining of strings. Is there a way to implement it in javascript?

Because JS lacks backtracking verbs you will need to consume those bracketed chunks but later replace them as is. By obtaining the second side of the alternation from your own regex the final regex would be:
\[(quote|i?code)[^\]]*\][\s\S]*?\[\/\1\]|(?![^<]*>|[^\[]*\])"([^"]*)"
But the tricky part is using a callback function with replace() method:
str.replace(regex, function($0, $1, $2) {
return $1 ? $0 : '“' + $2 + '”';
})
Above ternary operator returns $0 (whole match) if first capturing group exists otherwise it encloses second capturing group value in curly quotes and returns it.
Note: this may fail in different cases.
See live demo here
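For reference, the pieces above can be put together and run against the sample input from the question. This is only a sketch: the names quoteRegex and curlyQuotes are made up for illustration, and the g and i flags are an assumption since the answer shows the pattern without flags.

```javascript
// First alternative: a whole [quote]/[code]/[icode] block, consumed so it can be returned as is.
// Second alternative: a double-quoted run that is not inside an HTML or BBCode tag.
var quoteRegex = /\[(quote|i?code)[^\]]*\][\s\S]*?\[\/\1\]|(?![^<]*>|[^\[]*\])"([^"]*)"/gi;

function curlyQuotes(str) {
  return str.replace(quoteRegex, function (whole, tagName, quoted) {
    // tagName is undefined when the second alternative matched
    return tagName ? whole : "\u201C" + quoted + "\u201D";
  });
}

var input = '[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]\n' +
  '<p>"Why no goodbye?" replied [b]Bob[/b]. "It\'s always Hello!"</p>';

console.log(curlyQuotes(input));
```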

Nested markup is hard to parse with regular expressions, and with JS's RegExp in particular. Complex regular expressions are also hard to read, maintain, and debug. If your needs are simple (a tag content replacement with some banned tags excluded), consider a simple code-based alternative to run-on RegExps:
function curly(str) {
  var excludes = {
      quote: 1,
      code: 1,
      icode: 1
    },
    xpath = [];
  return str.split(/(\[[^\]]+\])/) // breakup by tag markup
    .map(x => { // for each tag and content:
      if (x[0] === "[") { // tag markup:
        if (x[1] === "/") { // close tag
          xpath.pop(); // remove from current path
        } else { // open tag
          xpath.push(x.slice(1).split(/\W/)[0]); // add to current path
        } //end if open/close tag
      } else { // tag content
        if (xpath.every(tag => !excludes[tag])) x = x.replace(/"/g, function repr() {
          return (repr.z = !repr.z) ? "“" : "”"; // flip flop return value (naive)
        });
      } //end if markup or content?
      return x;
    }) // end term map
    .join("");
} /* end curly() */
var input = `[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]
<p>"Why no goodbye?" replied [b]Bob[/b]. "It's always Hello!"</p>`;
var wants = `[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]
<p>“Why no goodbye?” replied [b]Bob[/b]. “It's always Hello!”</p>`;
curly(input) == wants; // true
To my eyes, even though it is a bit longer, code allows documentation, indentation, and explicit naming that make these sorts of semi-complicated logical operations easier to understand.
If your needs are more complex, use a true BBCode parser for JavaScript and map/filter/reduce its model as needed.

Regex example to match pseudo element's content property

I am trying to parse the pseudo selector content in javascript.
Html content can be
content: counter(item)" " attr(data) "" counter(item1,decimal) url('test.jpeg') "hi" attr(xyz);
To parse this content i am using below regex (logic of matching parenthesis copied from internet )
counter\((?:[^)(]+|\((?:[^)(]+|\([^)(]*\))*\))*\)
This selects every counter with "(", but a counter cannot have nested parentheses (as far as I know; correct me if I am wrong). I am using the same regex to select the other content as well.
Attr : attr\((?:[^)(]+|\((?:[^)(]+|\([^)(]*\))*\))*\)
Quotes: openQuote\((?:[^)(]+|\((?:[^)(]+|\([^)(]*\))*\))*\)
String: anything inside double/single quotes: (current regex is not working ".*")
I have below questions here
1. Regex to match single parenthesis (no nested parenthesis is possible in pseudo selector content property)
2.Single regex that will match the counter, attribute , url and string content in the given order (order is important because i want to replace them later with evaluated values)
Please let me know if any more information is required from my side.
Thanks

Your first regex does indeed match nested parentheses (but not escaped parentheses). Is that desirable?
Without nesting or escaping, these become much simpler.
Here's a variant of your first regex that ignores nesting possibilities:
counter\([^)]*\)
It matches a literal counter( and then zero or more non-close-parentheses, then finally a close parenthesis. (Full explanations of your first regex and my simpler version at regex101.)
I believe that answers your first question, though if you're literally looking for a "regex to match [a] single parenthesis," that's just [()], which will match either an open or a close parenthesis character. You could alternatively explicitly match \( or \) if you know which one you want to match.
Matching quotes (without regard to nesting or escaped quotes) is similarly easy:
"[^"]*"
This matches a literal double quote character ("), then zero or more non-doublequote characters, then another literal double quote character.
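As a quick check, the two simplified patterns can be run against the value from the question. This is a sketch only; the variable names are made up, and the g flag is added so that match() returns every occurrence.

```javascript
// The content value from the question, held in a template literal so both quote styles fit.
var value = `counter(item)" " attr(data) "" counter(item1,decimal) url('test.jpeg') "hi" attr(xyz)`;

var counters = value.match(/counter\([^)]*\)/g);
var strings = value.match(/"[^"]*"/g);

console.log(counters); // [ 'counter(item)', 'counter(item1,decimal)' ]
console.log(strings);  // [ '" "', '""', '"hi"' ]
```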
Your second request was for a "single regex that will match the counter, attribute , url and string content in the given order (order is important because i want to replace them later with evaluated values)."
I'm not sure how you intend to get the CSS content property's value, given that it typically belongs to an ::after or ::before pseudo-element, and pseudo-elements are not part of the DOM, but here's some dummy code populating it so we can manipulate it:
var css = `content: counter(item)" " attr(data) "" counter(item1,decimal) url('test.jpeg') "hi" attr(xyz); color:red;`;
// harvest last `content` property (this is tricked by `content: "content: blah"`)
var content = css.match(/.*\bcontent:\s*([^;"']*(?:"[^"]*"[^;"']*|'[^']*'[^;"']*)*)/);
if (content) {
  var part_re = /(?:"([^"]*)"|'([^']*)'|(?:counter|attr|url)\(([^)]*)\))/g;
  var part; // declared explicitly so the loop does not create an implicit global
  while ((part = part_re.exec(content[1])) !== null) { // parse on just the value
    if (part[0].match(/^"/)) { /* do stuff to part[1] */ }
    else if (part[0].match(/^'/)) { /* do stuff to part[2] */ }
    else if (part[0].match(/^counter/)) { /* do stuff to part[3] */ }
    else if (part[0].match(/^attr/)) { /* do stuff to part[3] */ }
    else if (part[0].match(/^url/)) { /* do stuff to part[3] */ }
    // silently skips other values, like `open-quote` or `counters(name, string)`
  }
}
The first regex (the one assigned to content) extracts the last content property from the CSS (last because it'll override previous instances, though note the fact that this'll stupidly extract content: blah from content: "content: blah"). After finding the last instance of a word break and then content:, it absorbs any whitespace and then matches the rest of the line until a semicolon, double quote, or single quote. A non-capture group allows for any content between double quotes or single quotes, much in the same way we matched quotes near the top of this answer. (Full explanation of this CSS content regex at regex101.)
The second regex (assigned to part_re) is in a while loop so we can work on each individual value in the content property in order. It matches double-quoted strings or single-quoted strings or certain named values (counter or attr or url). See the conditionals and comments for where the values' data are stored. Full explanation of this value parsing regex at regex101 (see "Match Information" in the middle of the right column to see how I'm storing the values' data).
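Because the question wants the values in order so they can be replaced later, one possible variation is to collect the parts into an ordered list first. The sketch below is an assumption layered on the answer's pattern: it adds a capture group for the function name, and the name tokenizeContent and the {type, value} shape are invented for illustration.

```javascript
// Hypothetical tokenizer: returns the parts of a `content` value in source order.
function tokenizeContent(value) {
  // group 1: double-quoted text, group 2: single-quoted text,
  // group 3: function name, group 4: raw function arguments
  var partRe = /"([^"]*)"|'([^']*)'|(counter|attr|url)\(([^)]*)\)/g;
  var tokens = [];
  var m;
  while ((m = partRe.exec(value)) !== null) {
    if (m[1] !== undefined) {
      tokens.push({ type: "string", value: m[1] });
    } else if (m[2] !== undefined) {
      tokens.push({ type: "string", value: m[2] });
    } else {
      tokens.push({ type: m[3], value: m[4] });
    }
  }
  return tokens;
}

var tokens = tokenizeContent(
  `counter(item)" " attr(data) "" counter(item1,decimal) url('test.jpeg') "hi" attr(xyz)`
);
console.log(tokens.map(function (t) { return t.type; }).join(" "));
// counter string attr string counter url string attr
```

The comparison against undefined matters here: an empty string such as "" is a valid match whose captured text is falsy. Like the answer's loop, this silently skips anything it does not recognise, such as open-quote.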

Replace words of text area

I have made a javascript function to replace some words with other words in a text area, but it doesn't work. I have made this:
function wordCheck() {
var text = document.getElementById("eC").value;
var newText = text.replace(/hello/g, '<b>hello</b>');
document.getElementById("eC").innerText = newText;
}
When I alert the variable newText, the console says that the variable doesn't exist.
Can anyone help me?
Edit:
Now it replaces the words, but with the literal text <b>hello</b>, whereas I want the word to appear in bold. Is there a solution?

Update:
In response to your edit, about your wanting to see the word "hello" show up in bold. The short answer to that is: it can't be done. Not in a simple textarea, at least. You're probably looking for something more like an online WYSIWYG editor, or at least an RTE (Rich Text Editor). There are a couple of them out there, like tinyMCE, for example, which is a decent WYSIWYG editor. A list of RTEs and HTML editors can be found here.
First off: As others have already pointed out: a textarea element's contents is available through its value property, not the innerText. You get the contents alright, but you're trying to update it through the wrong property: use value in both cases.
If you want to replace all occurrences of a string/word/substring, you'll have to resort to using a regular expression, using the g modifier. I'd also recommend making the matching case-insensitive, to replace "hello", "Hello" and "HELLO" all the same:
var txtArea = document.querySelector('#eC');
txtArea.value = txtArea.value.replace(/(hello)/gi, '<b>$1</b>');
As you can see: I captured the match, and used it in the replacement string, to preserve the caps the user might have used.
But wait, there's more:
What if, for some reason, the input already contains <b>Hello</b>, or contains a word containing the string "hello" like "The company is called hellonearth?" Enter conditional matches (aka lookaround assertions) and word boundaries:
txtArea.value = txtArea.value.replace(/(?!>)\b(hello)\b(?!<)/gi, '<b>$1</b>');
fiddle
How it works:
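(?!>): meant to match the rest only if it isn't preceded by a > char. Note that, as written, this is a negative look-ahead: it inspects the character that follows the current position (the h of hello), so it always succeeds. To really test the preceding character you need a negative look-behind, (?<!>), which ES2018+ engines support (be more specific if you want, and use (?<!<b>))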
\b: a word boundary, to make sure we're not matching part of a word
(hello): match and capture the string literal, provided (as explained above) it is not preceded by a > and there is a word boundary
(?!<): same as above, only now we don't want to find a matching </b>, so you can replace this with the more specific (?!<\/b>)
/gi: modifiers, or flags, that affect the entire pattern: g for global (meaning this pattern will be applied to the entire string, not just a single match). The i tells the regex engine the pattern is case-insensitive, ie: h matches both the upper and lowercase character.
The replacement string <b>$1</b>: when the replacement string contains $n substrings, where n is a number, they are treated as backreferences. A regex can group matches into various parts, each group has a number, starting with 1, depending on how many groups you have. We're only grouping one part of the pattern, but suppose we wrote:
'foobar hello foobar'.replace(/(hel)(lo)/g, '<b>$1-$2</b>');
The output would be "foobar <b>hel-lo</b> foobar", because we've split the match up into 2 parts, and added a dash in the replacement string.
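To make the lookaround discussion concrete, here is a small sketch of the "don't wrap it twice" idea. It assumes an engine with look-behind support (ES2018 or later), and the function name boldHello is made up for illustration.

```javascript
// Wrap whole-word "hello" (any case) in <b>...</b>, unless it is already
// directly wrapped: not preceded by <b> and not followed by </b>.
function boldHello(text) {
  return text.replace(/(?<!<b>)\b(hello)\b(?!<\/b>)/gi, "<b>$1</b>");
}

var once = boldHello("Hello, <b>hello</b> hellonearth");
console.log(once); // <b>Hello</b>, <b>hello</b> hellonearth
```

Running the function a second time on its own output leaves the string unchanged, which is the point of the two assertions.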
I think I'll leave the introduction to RegExp at that... even though we've only scratched the surface, I think it's quite clear now just how powerful regex's can be. Put some time and effort into learning more about this fantastic tool, it is well worth it.

If <textarea>, then you need to use .value property.
document.getElementById("eC").value = newText;
And, as Barmar mentioned, replace() with a string pattern replaces only the first occurrence. To replace every occurrence, you need to use a simple regex. Note that I removed the quotes. /g means global replace.
var newText = text.replace(/hello/g, '<b>hello</b>');
But if you want to really bold your text, you need to use content editable div, not text area:
<div id="eC" contenteditable></div>
So then you need to access innerHTML:
function wordCheck() {
var text = document.getElementById("eC").innerHTML;
var newText = text.replace(/hello/g, '<b>hello</b>');
newText = newText.replace(/<b><b>/g,"<b>");//These two lines are there to prevent <b><b>hello</b></b>
newText = newText.replace(/<\/b><\/b>/g,"</b>");
document.getElementById("eC").innerHTML = newText;
}

Regular expression in javascript to match outside of XML tags

I want to find all matches of "a" in <span class="get">habbitant morbi</span> triastbbitique, except any "a" inside tags (see below: the matches I want are marked with *).
<span class="get">h*a*bbit*a*nt morbi</span> tri*a*stbbitique.
If I find them, I want to replace them and also I want to save original tags.
This expression doesn't work:
var variable = "a";
var reg = new RegExp("[^<]."+variable+".[^>]$",'gi');

I would recommend to not use a regular expression to parse HTML; it's not a regular grammar, and you will experience pain for all but simple cases.
Your question is still a bit unclear, but let me try rephrasing to see if I have it right:
You'd like to get all matches of a given string in a HTML document, except for matches in <tag> bodies?
Assuming you're using jQuery or similar:
// Let the browser parse it for you:
var container = document.createElement('div') // createElement needs a tag name
container.innerHTML = '<span class="get">habbitant morbi</span> triastbbitique'
var doc_text = $(container).text()
// And then you can just regex away normally:
doc_text.match(/a/gi)
(Even better would be to use DOMParser, but that doesn't have wide browser support yet)
If you're in Node, then you want to look for some libraries that help you parse HTML nodes (like jsdom), and then just splat out all the text nodes.

Note that this question isn't about parsing. This is lexing. Something that regex are regularly and properly used for.
If you want to go with regex there are a couple of ways you could do this.
A simple hack lookahead like:
a(?![^<>]*>)
Note that this won't properly handle < and > characters that are quoted inside tags or left unescaped outside of tags.
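Applied to the question's input, the lookahead hack could look like the sketch below. The helper names escapeRegExp and markOutsideTags are made up for illustration; escaping the search term is an added precaution for terms that contain regex metacharacters.

```javascript
// Escape regex metacharacters so an arbitrary search term can be embedded in a pattern.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Wrap every occurrence of `needle` in asterisks, skipping occurrences that are
// followed by a ">" with no "<" or ">" in between (that is, those inside a tag).
function markOutsideTags(html, needle) {
  var re = new RegExp(escapeRegExp(needle) + "(?![^<>]*>)", "gi");
  return html.replace(re, "*$&*");
}

var marked = markOutsideTags('<span class="get">habbitant morbi</span> triastbbitique', "a");
console.log(marked);
// <span class="get">h*a*bbit*a*nt morbi</span> tri*a*stbbitique
```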
A full blown tokenizer of the form:
(expression for tag|comments|etc)|(stuff outside that that i'm interested in)
Replaced with a function that does different things depending on which part was matched. If $1 matched, it is replaced by itself; if $2 matched, it is replaced with *$2*.
The full tokenizer way is of course not a trivial task, the spec isn't small.
But if simplifying to only match the basic tags, ignore CDATA, comments, script/style tags, etc, you could use the following:
var str = '<span class="a <lal> a" attr>habbitant 2 > morbi. 2a < 3a</span> triastbbitique';
var re = /(<[a-z\/](?:"[^"]*"|'[^']*'|[^'">]+)*>)|(a)/gi;
var res = str.replace(re, function(m, tag, a){
return tag ? tag : "*" + a + "*";
});
Result:
<span class="a <lal> a" attr>h*a*bbit*a*nt 2 > morbi. 2*a* < 3*a*</span> tri*a*stbbitique
This handles messy tags, quotes and unescaped </> in the HTML.
Couple examples of tokenizing HTML tags with regex (which should translate fine to JS regex):
Remove on* JS event attributes from HTML tags
Regex to allow only set of HTML Tags and Attributes

Regular expression to replace HTML content

I am trying to replace HTML content with regular expression.
from
test test ZZZ<SPAN>ZZZ test test</SPAN>
to
test test AAA<SPAN>AAA test test</SPAN>
note that only words outside HTML tags are replaced from ZZZ to AAA.
Any idea? Thanks a lot in advance.

You could walk all nodes, replacing text in text ones (.nodeType == 3):
Something like:
element.find('*:contains(ZZZ)').contents().each(function () {
if (this.nodeType === 3)
this.nodeValue = this.nodeValue.replace(/ZZZ/g,'AAA')
})
Or same without jQuery:
function replaceText(element, from, to) {
  for (var child = element.firstChild; child !== null; child = child.nextSibling) {
    if (child.nodeType === 3) // text node: replace in place (on child, not this)
      child.nodeValue = child.nodeValue.replace(from, to);
    else if (child.nodeType === 1) // element node: recurse into its children
      replaceText(child, from, to);
  }
}
replaceText(element, /ZZZ/g, 'AAA');

The best idea in this case is most certainly to not use regular expressions to do this. At least not on their own. JavaScript surely has a HTML Parser somewhere?
If you really must use regular expressions, you could try to look for every instance of ZZZ that is followed by a "<" before any ">". That would look like
ZZZ(?=[^>]*<)
This might break horribly if the code contains HTML comments or script blocks, or is not well formed.
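A small sketch of this lookahead in use, including one case where it leaves text alone. The function name replaceBeforeNextTag is made up for illustration.

```javascript
// Replace ZZZ only where a "<" can be reached without crossing a ">",
// i.e. where the next tag delimiter after the match opens a tag.
function replaceBeforeNextTag(html) {
  return html.replace(/ZZZ(?=[^>]*<)/g, "AAA");
}

console.log(replaceBeforeNextTag("test test ZZZ<SPAN>ZZZ test test</SPAN>"));
// test test AAA<SPAN>AAA test test</SPAN>

console.log(replaceBeforeNextTag('<a title="ZZZ">ZZZ</a>'));
// <a title="ZZZ">AAA</a>

console.log(replaceBeforeNextTag("<DIV>ZZZ</DIV> ZZZ"));
// <DIV>AAA</DIV> ZZZ
```

The last call shows a limitation: a trailing ZZZ with no tag after it is not replaced, because the lookahead requires a following "<".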

Assuming a well-formed html document with outer/enclosing tags like <html>, I would think the easiest way would be to look for the > and < signs:
/(\>[^\>\<]*)ZZZ([^\>\<]*\<)/$1AAA$2/
If you're dealing with HTML fragments that may not have enclosing tags, it gets a little more complicated, you'd have to allow for start of string and end of string
Example JS (sorry, missed the tag):
alert('test test ZZZ<SPAN>ZZZ test test</SPAN>'.replace(/(\>[^\>\<]*)ZZZ([^\>\<]*\<)/g, "$1AAA$2"));
Explanation: for each match that
starts with >: \>
follows with any number of characters that are neither > nor <: [^\>\<]*
then has "ZZZ"
follows with any number of characters that are neither > nor <: [^\>\<]*
and ends with <: \<
Replace with
everything before the ZZZ, marked with the first capture group (parentheses): $1
AAA
everything after the ZZZ, marked with the second capture group (parentheses): $2
Using the "g" (global) option to ensure that all possible matches are replaced.

Try this:
var str = '<DIV>ZZZ test test</DIV>test test ZZZ';
var rpl = 'ZZZ'; // word to replace; the sample string has no href attribute to read it from
console.log(str.replace(new RegExp(rpl + "(?=[^>]*<)", "gi"), "XXX"));

have you tried this:
replace:
>([^<>]*)(ZZZ)([^<>]*)<
with:
>$1AAA$3<
but beware all the savvy suggestions in the post linked in the first comment to your question!
