How to detect and remove invalid attributes using Javascript Regex?

Here's a sample HTML I have (Actual HTML is pretty big & complex and I am not posting it for the sake of simplicity):
<!DOCTYPE html>
<html>
<head>
</head>
<body style="background-color: #000000;font-family:'Open Sans'">
<div class:'abc' id="cde"></div>
<div class:"abc" id="fed"></div>
<div class:abc id="ce"></div>
<div class:"abc"><p class="content" autocomplete> I am some text which might contain attribute:"invalid value" and I must not be removed</p></div>
</body>
</html>
The goal is to remove the invalid attributes from the HTML without disturbing the rest of the markup. An invalid attribute is anything other than attribute="value", attribute=value, attribute='value' or a bare attribute (e.g. <input id="abc" type="text" value="test" disabled>), and the regex should remove it. This content cannot be loaded into the DOM, so please suggest regex-based solutions only.
For starters, I am trying /[a-zA-Z]+:"?'?[a-zA-Z]+"?'?/gi but I know I am nowhere even close!
Here's the fiddle for you to play with.
Expected output:
<!DOCTYPE html>
<html>
<head>
</head>
<body style="background-color: #000000;font-family:'Open Sans'">
<div id="cde"></div>
<div id="fed"></div>
<div id="ce"></div>
<div ><p class="content" autocomplete> I am some text which might contain attribute:"invalid value" and I must not be removed</p></div>
</body>
</html>
Update 1:
The regex should work with any element, not just divs. The divs were only given as an example; it should work with span etc. too.
No detection of unclosed tag is needed. We just need to remove attribute:value/attribute;value/attribute:"value" (basically anything other than the valid attributes supported) etc if they are inside <element>.

I'd go with .replace function twice:
var html = `<!DOCTYPE html>
<html>
<head>
</head>
<body style="background-color: #000000;font-family:'Open Sans'">
<div class:'abc' id="cde"></div>
<div class:"abc" id="fed"></div>
<div class:abc id="ce"></div>
<div class:"abc"><p class="content" autocomplete required blah=blah> I am some text which might contain attribute:"invalid value" and I must not be removed</p></div>
</body>
</html>`;
var htmlCleaned = html.replace(/(<\w+)(\s[^>]*)>+/g, function($m, $1, $2) {
return $1 + $2.replace(/\s*?(\s?\w+(?:=(?:'[^'\\]*(?:\\.[^'\\]*)*'|"[^"\\]*(?:\\.[^"\\]*)*"|\w+)|(?!\S)))|\s*\S+/g, '$1') + ">";
});
console.log(htmlCleaned)

Though generally not advisable, you could use two expressions on the HTML string: one to find the elements that potentially contain such attributes, and one to actually remove the attributes in question:
var html = `<!DOCTYPE html>
<html>
<head>
</head>
<body style="background-color: #000000;font-family:'Open Sans'">
<div class:'abc' id="cde"></div>
<div class:"abc" id="fed"></div>
<div class:abc id="ce"></div>
<div class:"abc"><p class="content" autocomplete> I am some text which might contain attribute:"invalid value" and I must not be removed</p></div>
<!-- another one here -->
<div class:'abc defg' id="ce"></div>
</body>
</html>`;
var cleaned = html.replace(/<(?:(?!>).)*\b\w+:['"]?\w+['"]?.*?>/g, function(match) {
return match.replace(/\s+\w+:(?:(?:'[^']*')|(?:"[^"]*")|\w+)\s*(?!\w)/g, '');
});
console.log(cleaned);
Broken down, this says for the first expression (demo on regex101.com):
< # <
(?:(?!>).)* # anything where > is not immediately ahead
\b\w+: # a word boundary, 1+ word characters and :
['"]? # quotes, optional
\w+ # another 1+ word characters
['"]? # as above
.*? # anything else lazily afterwards
> # >
... and for the second (inner) one:
\s+\w+: # 1+ whitespaces, 1+ word characters and :
(?: # non-capturing group
(?:'[^']*') # '...'
| # or
(?:"[^"]*") # "..."
| # or
\w+ # 1+ word characters
)
\s*(?!\w) # 0+ whitespaces, make sure there's no
# word character ahead
Note that this won't take into account something like data-attribute='some weird <> characters here: """'> or data-key="hey, i'am \"escaped, yippeh!">, which are both totally valid.
If you expect such input, really use a parser instead.
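As a rough illustration of the token-based idea behind these answers, the sketch below treats the attribute text of each opening tag as a list of quote-aware tokens and keeps only those shaped like name, name=value, name='value' or name="value". The function name stripInvalidAttributes and both token patterns are assumptions made for this example and do not come from any answer above. Like the expressions above, it stops at the first >, so a quoted value containing > is still not handled.

```javascript
// Hypothetical helper: keeps only well-formed attributes inside opening tags.
function stripInvalidAttributes(html) {
  // Opening tag: name, attribute text (lazy), optional self-closing slash.
  const openingTag = /<([a-zA-Z][\w-]*)(\s[^>]*?)(\/?)>/g;
  // One token = a run of non-space characters; quoted parts may contain spaces.
  const token = /(?:[^\s"']+|"[^"]*"|'[^']*')+/g;
  // Valid shapes: name | name=value | name='value' | name="value".
  const validAttribute = /^[\w-]+(?:=(?:"[^"]*"|'[^']*'|[^\s"'=<>`]+))?$/;

  return html.replace(openingTag, (match, tagName, attrText, slash) => {
    const tokens = attrText.match(token) || [];
    const kept = tokens.filter((t) => validAttribute.test(t));
    return (
      "<" + tagName +
      (kept.length ? " " + kept.join(" ") : "") +
      (slash ? " /" : "") +
      ">"
    );
  });
}

console.log(stripInvalidAttributes(`<div class:'abc' id="cde"></div>`));
```

Text between tags is never touched, because only spans that start with < followed by a letter are rewritten. Whitespace inside the rewritten tags is normalised to single spaces.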

This is quite a task, and it deserves a dedicated library. To identify invalid attributes you first need to identify the valid tags, which is not easy or clear-cut either. For example, what should happen when a tag is not closed? Should void elements like input be treated as closed? Should href be allowed as an attribute of div? And so on.
This is nearly impossible with a plain regexp. Even if it were possible, the expression would either miss cases or be too complex to maintain.
Just hand it to a library that does this for you, e.g. https://github.com/dave-kennedy/clean-html

This code parses your HTML, splits out the relevant parts, and checks whether the attributes are valid.
You could probably make it more efficient, as it loops multiple times, but this way it is easier to understand in its component parts.
That said, don't use this code. If you can't parse your element into the DOM, figure out a way; if you are in Node, you can parse it as XML and work with the nodes to ensure everything works correctly.
My little console app doesn't display the autocomplete attribute, but it's there in the string.
This code will probably fail in a production environment!
const html = document.querySelector('#input').innerHTML
const isElement = x =>
/^<.*>$/.test(x)
const isValidAttribute = x =>
/^(([a-zA-Z-]+)=?((?:\"|\')[^\'\"]*(?:\"|\'))*|\w+)$/.test(x)
const similarToAttribute = x =>
/=.*((?:\"|\').*(?:\"|\'))/.test(x)
const isOpeningOrClosingBracket = x =>
/(^<|>$)/.test(x)
const output =
html
// .replace(/(\n|\r)+/gm, '') // uncomment to remove new lines
.split(/(<[^>]+>)/) // split the elements
.filter(x => x !== "") // remove empty elements
.map( x => !isElement(x)
? x // it's not an element node, return it
: x.split(/(<\w+|>|\s)/) // split the parts of elements
.filter(x => x !== " " && x !== "") // remove empty elements
.reduce((acc, x) => {
return isOpeningOrClosingBracket(x) || isValidAttribute(x)
? acc.concat(x) // return valid components
: acc // failed check, dont return the attribute
}, [])
)
.map(x => Array.isArray(x) // arrays are elements
? x.slice(0, x.length - 1).join(' ') + x[x.length -1] // join the element string
: x // return anything else
)
.join('') // join the entire array into a string
const div = document.createElement('section')
div.innerHTML = output
console.log(output)
console.log(div)
/* UNIT TESTS */
expect('string is valid element format', () => {
assert(isElement('<div>')).equal(true)
assert(isElement('</div>')).equal(true)
assert(isElement('not an element')).equal(false)
})
expect('string is valid attribute format', () => {
assert(isValidAttribute('class="thing"')).equal(true)
assert(isValidAttribute('class:\'abc\'="thing"')).equal(false)
assert(isValidAttribute('class:\'abc\'="thing"')).equal(false)
assert(isValidAttribute('autocomplete')).equal(true)
})
expect('string has similar properties to an attribute', () => {
assert(similarToAttribute('this is not an attribute')).equal(false)
assert(similarToAttribute('class:\'abc\'="thing"')).equal(true)
assert(similarToAttribute('class:\'abc\'="thing"')).equal(true)
})
expect('string is opening or closing tag', () => {
assert(isOpeningOrClosingBracket('<div')).equal(true)
assert(isOpeningOrClosingBracket('>')).equal(true)
assert(isOpeningOrClosingBracket('class="thing"')).equal(false)
})
<script src="https://codepen.io/synthet1c/pen/KyQQmL.js"></script>
<pre id="input">
<div class:'abc' id="cde"></div>
<div class:"abc" id="fed"></div>
<div class:abc id="ce"></div>
<div class:"abc"><p class="content" autocomplete> I am some text which might contain attribute:"invalid value" and I must not be removed</p></div>
</pre>

Use This (?<=(div ))[a-zA-Z]+:"?'?[a-zA-Z]+"?'?
Demo
In jsfiddle

Related

replaceAll() in JavaScript failed to find </em> in HTML page

I am not familiar with JavaScript and html. But I tried to implement a function using JavaScript.
I want to replace all <em> and </em> in a html page. So I insert a piece of javascript code in the page:
function rep()
{
document.body.innerHTML
= document.body.innerHTML
.replaceAll("<em>", "_");
document.body.innerHTML
= document.body.innerHTML
.replaceAll("</em>", "_");
}
window.onload=rep()
<!DOCTYPE html>
<html lang="en">
<!-- ... -->
<article>
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1 post-container">
<p>(Weierstrass) 设 $z_{0}$ 是 $f$ 的本性奇点,那么对任意 $A \in \mathbb{C}<em>{\infty}$, 必存在趋于 $z</em>{0}$ 的点列 $\left{z_{n}\right}$, 使得 $\lim <em>{n \rightarrow \infty} f\left(z</em>{n}\right)=A$.</p>
</div>
</div>
</div>
<!-- ... -->
</html>
It succeeded in replacing <em> with "_", but all </em> did not change. What's wrong with the code?
Thank you!

Let's see what happens when browsers see invalid html like:
test</em>
console.log(document.body.innerHTML)
test</em>
The above prints test (and the script)
That's because the browser strips invalid structures when parsing
When you do
document.body.innerHTML
= document.body.innerHTML
.replaceAll("<em>", "_");
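This replaces every <em> correctly, but when the result is assigned back to innerHTML the browser re-parses it and drops the now unmatched </em> closing tags, so the second replaceAll has nothing left to find.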
This will work on the other hand:
document.body.innerHTML = document.body.innerHTML
.replaceAll("<em>", "_")
.replaceAll("</em>", "_");
<em>test</em>

It may be better to use the available DOM methods for this.
Pick up all the em elements with querySelectorAll.
For each element create a text node. Bookend the element's original text content with underscores, and add that to the text node. Use replaceWith to replace the em element with the text node.
const ems = document.querySelectorAll('em');
ems.forEach(em => {
const text = `_${em.textContent}_`;
const node = document.createTextNode(text);
em.replaceWith(node);
});
<p>(Weierstrass) 设 $z_{0}$ 是 $f$ 的本性奇点,那么对任意 $A \in \mathbb{C}<em>{\infty}$, 必存在趋于 $z</em>{0}$ 的点列 $\left{z_{n}\right}$, 使得 $\lim <em>{n \rightarrow \infty} f\left(z</em>{n}\right)=A$.</p>
<ul>
<li><em>This is some italicised text</em></li>
<li>And this is not.</li>
<li><em>But this is</em>.</li>
</ul>
Additional documentation
querySelectorAll
replaceWith
forEach
Template/string literals

Processing html with regexes or string functions is a bad idea (html is not a string), but if you must, it should be done like this:
let html = document.body.innerHTML
html = html.replace(...)
html = html.replace(...) etc
document.body.innerHTML = html
In other words, do not use a partially processed string to set innerHTML.
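A minimal sketch of that advice, assuming the goal from the question (turning both <em> and </em> into underscores): all replacements run on a plain string, and the DOM is written only once at the end. The helper name underscoreEmTags is hypothetical.

```javascript
// Hypothetical helper: works purely on a string, so the browser never
// gets a chance to re-parse a half-processed document in between steps.
function underscoreEmTags(html) {
  let result = html;
  result = result.split("<em>").join("_");  // opening tags
  result = result.split("</em>").join("_"); // closing tags
  return result;
}

console.log(underscoreEmTags("a <em>b</em> c"));

// In a browser, the single write-back would then be:
// document.body.innerHTML = underscoreEmTags(document.body.innerHTML);
```

split/join is used here only to avoid depending on replaceAll being available; replaceAll or a global regex would serve equally well.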

Simpler, but not efficient:
document.body.innerHTML = document.body.innerHTML.replace(/<em>|<\/em>/g, '_');
Result:
//body before: <em>test</em>
//body after: _test_
The regex passes over the entire body and replaces every <em> or </em> occurrence with _. Note that replace returns a new string, so the result has to be assigned back to innerHTML.
The g (global) flag makes it replace every occurrence rather than only the first. The m (multiline) flag is not needed here, because it only changes how ^ and $ match.

regex to replace "<p><br/></p>" string with empty string- Javascript [duplicate]

This question already has answers here:
How to keep Quill from inserting blank paragraphs (`<p><br></p>`) before headings with a 10px top margin?
(3 answers)
Closed 1 year ago.
I have some HTML as a string
var str= "<p><br/></p>"
How do I strip the p tags from this string using JS.
here is what I have tried so far:
str.replace(/<p[^>]*>(?:\s|&nbsp;)*<\/p>/, "") // o/p: <p><br></p>'
str.replace("/<p[^>]*><\\/p[^>]*>/", "")// o/p: <p><br></p>'
str.replace(/<p><br><\/p>/g, "")// o/p: <p><br></p>'
all of them return me same str as above, expected o/p is:
str should be ""
what im doing wrong here?
Thanks

You probably should not be using RegExp to parse HTML - it's not particularly useful with (X)HTML-style markup as there are way too many edge cases.
Instead, parse the HTML as you would an element in the DOM, then compare the trim()med innerText value of each <p> with a blank string, and remove those that are equal:
var str = "<p><br/></p><p>This paragraph has text</p>"
var ele = document.createElement('body');
ele.innerHTML = str;
[...ele.querySelectorAll('p')].forEach(para => {
if (para.innerText.trim() === "") ele.removeChild(para);
});
console.log(ele.innerHTML);

You should be able to use the following expression: <p[^>]*>(&nbsp;|\s+|<br\s*\/?>)*<\/p>
The expression above looks at the content enclosed in <p>...</p> and matches it against &nbsp;, whitespace (\s+) and <br> (and its / variations).
I think you were mostly there with /<p[^>]*>(?:\s|&nbsp;)*<\/p>/. The ?: only makes the group non-capturing, so it is harmless; what was missing is an additional case for <br>.
const str = `
<p><br></p>
<p><br/></p>
<p><br /></p>
<p> <br/> </p>
<p>&nbsp;</p>
<p> </p>
<p><br/> </p>
<p>
<br>
</p><!-- multiline -->
<p><br/> don't replace me</p>
<p>don't replace me</p>
`;
const exp = /<p[^>]*>(&nbsp;|\s+|<br\s*\/?>)*<\/p>/g;
console.log(str.replace(exp, ''));
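One more thing that may explain why every attempt in the question appeared to leave str unchanged: JavaScript strings are immutable, so replace returns a new string instead of modifying str. The sketch below shows this with a pattern of the same shape as the one above; the variable names are only illustrative.

```javascript
// Same idea as the expression above: <p> containing only &nbsp;, whitespace or <br>.
const emptyParagraph = /<p[^>]*>(?:&nbsp;|\s|<br\s*\/?>)*<\/p>/g;

let str = "<p><br/></p>";

// The return value is discarded here, so str still holds the original markup.
str.replace(emptyParagraph, "");
console.log(JSON.stringify(str));

// Keeping the returned string is what actually produces the cleaned value.
const cleaned = str.replace(emptyParagraph, "");
console.log(JSON.stringify(cleaned));
```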

UBB Code [textarea] - do not replace \n by <br> within tags [textarea][/textarea]

I currently load a value from my database straight into a hidden textarea.
<textarea name="text" id="text" style="visibility:hidden">
[textarea]Content showing raw [b]HTML[/b] or any other code
Including line breaks </a>[/textarea]
</textarea>
From there I pick up the textarea's content and run it through several replace calls with some simple JavaScript, like
<script type="text/javascript">
document.addEventListener('DOMContentLoaded', function parser() {
post_text=post_text.replace(/\r?\n/g, "<br>");
post_text=post_text.replace(/\[size=1\]/g, "<span style=\"font-size:80%\">");
post_text=post_text.replace(/\[url=(.+?)\](.+?)\[\/url\]/g, "$2 <img src=\"images/link.gif\" style=\"border:0px\">");
post_text=post_text.replace(/\[url\](.+?)\[\/url\]/g, "$1 <img src=\"images/link.gif\" style=\"border:0px\">");
document.getElementById('vorschau').innerHTML = post_text;
}, false);
</script>
<div id="vorschau"></div>
to render it into HTML which is then parsed by the Browser, so I do all the formatting of the entries on the Frontend/client side.
However, the textarea may also contain such an UBB tag:
[textarea]Content showing raw [b]HTML[/b] or any other code
Including line breaks </a>[/textarea]
I currently just replace the textarea UBB elements like any other content
post_text=post_text.replace(/\[textarea\]/g, "<textarea id=\"codeblock\" style=\"width:100%;min-height:200px;\">");
post_text=post_text.replace(/\[\/textarea\]/g, "</textarea>");
The issue with this is that my other code
post_text=post_text.replace(/\r?\n/g, "<br>");
post_text=post_text.replace(/\</g, "&lt;");
post_text=post_text.replace(/\>/g, "&gt;");
Does not skip the content within the [textarea][/textarea] elements resulting in a textarea filled with this:
Content showing raw <b>HTML</b> or any other code<br>Including line breaks </a>
Above example
So how do I prevent anything within [textarea][/textarea] (which can occur more than once in id="text") from being replaced?

What you might do, is use a dynamic pattern that captures from [textarea] till [/textarea] in group 1, and use an alternation to match what you want to replace.
Then use a callback function for replace. Check if group 1 exists, and if it does return it unmodified. If it does not, we have a match outside of the text area.
An example of the pattern with the alternation and match for <
(\[textarea][^]*\[\/textarea])|<
(\[textarea][^]*\[\/textarea]) Capture group 1, match from [textarea] till [/textarea]
| Or
< Match literally
Regex demo
Note to double escape the backslash in the RegExp constructor.
(Assuming this is the right order of replacements:)
const replacer = (text, find, replace) => text.replace(
new RegExp(`(\\[textarea][^]*\\[\\/textarea])|${find}`, "g"),
(m, g1) => g1 ? g1 : replace
);
document.addEventListener('DOMContentLoaded', function parser() {
let post_text = document.getElementById('text').value;
post_text = post_text.replace(/\[size=1]/g, "<span style=\"font-size:80%\">");
post_text = post_text.replace(/\[url=(.+?)](.+?)\[\/url\]/g, "$2 <img src=\"images/link.gif\" style=\"border:0px\">");
post_text = post_text.replace(/\[url](.+?)\[\/url]/g, "$1 <img src=\"images/link.gif\" style=\"border:0px\">");
post_text = replacer(post_text, "\\r?\\n", "<br>");
post_text = replacer(post_text, "<", "&lt;");
post_text = replacer(post_text, ">", "&gt;");
post_text = post_text.replace(/\[textarea]/g, "<textarea id=\"codeblock\" style=\"width:100%;min-height:200px;\">");
post_text = post_text.replace(/\[\/textarea]/g, "</textarea>");
document.getElementById('vorschau').innerHTML = post_text;
}, false);
<textarea name="text" id="text" rows="10" cols="60">
[textarea]Content showing raw [b]HTML[/b] or any other code
Including line breaks </a>[/textarea]
< here and > here and
</textarea>
<div id="vorschau"></div>
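The callback approach can also be tried outside the browser as a pure function. The sketch below is a variant under stated assumptions rather than the answer's exact code: the helper names are hypothetical, it uses a lazy [\s\S]*? so that several [textarea] blocks are protected separately, and it assumes the intent is to turn < and > into &lt; and &gt; before newlines become <br>, so that the inserted <br> tags are not escaped themselves.

```javascript
// Hypothetical helper: replaces `find` (a regex source string) everywhere
// except inside [textarea]...[/textarea] blocks.
function replaceOutsideTextarea(text, find, replacement) {
  const pattern = new RegExp(
    "(\\[textarea\\][\\s\\S]*?\\[\\/textarea\\])|" + find,
    "g"
  );
  // When group 1 matched, we are inside a protected block: return it unchanged.
  return text.replace(pattern, (match, block) =>
    block !== undefined ? match : replacement
  );
}

// Assumed order: escape angle brackets first, then convert newlines.
function renderOutsideTextarea(text) {
  let out = text;
  out = replaceOutsideTextarea(out, "<", "&lt;");
  out = replaceOutsideTextarea(out, ">", "&gt;");
  out = replaceOutsideTextarea(out, "\\r?\\n", "<br>");
  return out;
}

console.log(renderOutsideTextarea("a<b\n[textarea]x<y\nz[/textarea]\nc>d"));
```

With a greedy quantifier, everything between the first [textarea] and the last [/textarea] would be treated as one protected block, including any text between two separate blocks.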

Select All Tags with Data Name - Get Value - Set Class

I have some HTML where I've dynamically printed a bunch of elements, some containing a specific data attribute. Because my templating language can't efficiently make use of regular expressions, I need to use JavaScript (or jQuery) to select the data values, build a string, then add that string as a class to that original element.
Example of HTML:
<div class="item" data-ses-cat="This Cool Thing (Yes)"></div>
Example of Desired HTML after JavaScript:
<div class="item this-cool-thing-yes" data-ses-cat="This Cool Thing (Yes)"></div>
I just need to add a class to all tags that contain data-ses-cat then get the value for that data attribute, run regex, then add that new string as a class.
I feel like it should be fairly simple, but I haven't touched a lot of JQuery in a while.
Thanks for any help!

Remove every character that is not alphanumeric or a space, then lowercase it, then split on space, and join on dash.
$('.item[data-ses-cat]').each(function(){
var newClass = $(this).data('ses-cat')
.replace( /[^a-zA-Z0-9 ]/g, '' )
.toLowerCase()
.split( ' ' )
.join( '-' );
this.classList.add( newClass );
});
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div class="item" data-ses-cat="This Cool Thing (Yes)">Test</div>
And from your comments, here is a version that uses arrow functions.
$('.item[data-ses-cat]').each((index, element)=>{
var newClass = $(element).data('ses-cat')
.replace( /[^a-zA-Z0-9 ]/g, '' )
.toLowerCase()
.split( ' ' )
.join( '-' );
element.classList.add( newClass );
});
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div class="item" data-ses-cat="This Cool Thing (Yes)">Test</div>

A vanilla JS version of the code would look something like this:
function processElement(element) {
const clazz =
element.dataset.sesCat.toLowerCase()
.replace(/[()]/g, '') // Remove brackets (each "(" or ")" individually).
.replace(/ /g, '-'); // Replace spaces with dashes.
element.classList.add(clazz);
}
const sesCatElements = document.querySelectorAll('[data-ses-cat]');
sesCatElements.forEach(processElement);
Of course, you can tweak your RegExp exactly how you want it.
Here is some info on how Dataset API works.
And this, is how you work with CSS class names.
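The string transformation itself can be isolated from the DOM and checked on its own. This is a minimal sketch, assuming the desired class is the lowercased value with everything except letters, digits and spaces removed and spaces turned into dashes; toClassName is a hypothetical helper name.

```javascript
// Hypothetical helper: turns an attribute value into a dash-separated class name.
function toClassName(label) {
  return label
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "") // drop everything except letters, digits, whitespace
    .trim()                      // avoid leading/trailing dashes
    .split(/\s+/)                // collapse runs of whitespace
    .join("-");
}

console.log(toClassName("This Cool Thing (Yes)"));

// In a browser it could be applied like this:
// document.querySelectorAll("[data-ses-cat]").forEach((el) => {
//   const name = toClassName(el.dataset.sesCat);
//   if (name) el.classList.add(name); // classList.add rejects an empty string
// });
```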

How to take the last character of string split using innerHTML

I'm trying to alert the last character of a string split, using innerHTML, but it's showing nothing in alert box.
this is my code
Html
<html>
<head>
<title>JavaScript basic animation</title>
<script type="text/javascript">
</script>
<script type="text/javascript" src="myfunction_2.js"></script>
</head> <body>
<div id="target">w3resource </div>
<button onclick="shubham()">click</button>
</body>
</html>
Function
function shubham()
{
var x=document.getElementById('target').innerHTML;
var y=x.split('');
var z=y[0];
var m=y[9];
var n=y[1]
var last=y[y.length-1]
alert(last);
}
it works properly if I take var x as
var x='w3resource';
but i need to take x value as
var x=document.getElementById('target').innerHTML;
so what should i do for this???

You need to use textContent instead of innerHTML. innerHTML gets you the actual HTML markup, including the tag angled brackets (<>), whereas textContent will give you just the text.
var x=document.getElementById('target').textContent.trim();

Your code does exactly what it should: it alerts the last character of the #target element's content (which is a whitespace character in your case).
If you changed <div id="target">w3resource </div> to <div id="target">w3resource</div> (removed the space at the end) the result would be 'e'.
If you want to find the very last text character you would have to use:
function shubham() {
// Element reference
const element = document.getElementById('target');
// Text without spaces at the beginning and the end
const text = element.innerText.trim();
// Get the last character
const lastCharacter = text[text.length - 1];
// Alert the last character
alert(lastCharacter);
}
<div id="target">w3resource </div>
<button onclick="shubham()">click</button>

I see that you have a space in the target div:
<div id="target">w3resource </div>
Hence the last character is a blank space. Remove the blank spaces and it should work; use the function below:
function shubham3()
{
var x=document.getElementById('target').innerHTML.replace(/ /g,'');
var y=x.split('');
var z=y[0];
var m=y[9];
var n=y[1]
var last=y[y.length-1]
alert(last);
}
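Removing every space, as the function above does, would also delete spaces inside the text. A minimal sketch of a narrower variant, assuming only leading and trailing whitespace should be ignored (lastVisibleCharacter is a hypothetical helper name), could look like this:

```javascript
// Hypothetical helper: last character of the text, ignoring surrounding whitespace.
function lastVisibleCharacter(text) {
  const trimmed = text.trim();
  // An empty or whitespace-only string has no last character to report.
  return trimmed.length > 0 ? trimmed[trimmed.length - 1] : "";
}

console.log(lastVisibleCharacter("w3resource "));

// In a browser:
// alert(lastVisibleCharacter(document.getElementById("target").textContent));
```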
