Regex using js to strip js from html


I'm using jQuery to sort a column of emails, though they are base64 encoded in js... so I need a regex command to ignore the <script>.*?</script> tags and only sort what is after them (within the <noscript> tags).
Column HTML
<td>
<script type="text/javascript">
document.write(Base64.decode('PG5vYnI+PGEgaHJlZj0ibWFpbHRvOmJpY2VAdWNzYy5lZHUiIHRpdGxlPSJiaWNlQHVjc2MuZWR1Ij5iaWNlPC9hPjwvbm9icj48YnIgLz4K'));
</script>
<noscript>username</noscript>
</td>
Regex that needs some love
a.replace(/<script.*?<\/script>(.*?)/i,"$1");

Assuming that the structure of the html doesn't change, you can use this:
$(a).contents().filter(function(){
return this.nodeType === 3
}).eq(1).text();
It gets all the text nodes, picks the one at index 1, and returns its text value.
And if you want to stick with regexp, here's one:
a.replace(/(<script type="text\/javascript">[^>]+>|<noscript>.*<\/noscript>)/ig,"");
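If the goal is to keep only what sits inside the <noscript> tags, a capture group may be simpler than deleting everything around it. The following is a minimal sketch, assuming the cell HTML is available as a string; noscriptText is a hypothetical helper name, not something from the question.

```javascript
// Hypothetical helper: returns the text inside the first <noscript> element,
// or an empty string when there is none. [\s\S] is used instead of "." so the
// match can span line breaks.
function noscriptText(html) {
  var match = /<noscript>([\s\S]*?)<\/noscript>/i.exec(html);
  return match ? match[1].trim() : '';
}

// Sample cell modelled on the column HTML in the question.
var cell = '<td>\n<script type="text/javascript">\ndocument.write("x");\n</script>\n' +
           '<noscript>username</noscript>\n</td>';

console.log(noscriptText(cell)); // "username"
```

The returned string can then be used as the sort key for the row.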

I know this isn't exactly what you're asking for (though I'm a little confused what you're asking for, to be honest...), but have you looked at using document.getElementsByTagName('noscript')? This function returns an array-like collection (an HTMLCollection), the first element of which will be your noscript element.
Also, I'm not really clear on your overall approach to this problem, but it seems like you're misunderstanding the purpose of a noscript element. The content of a noscript element is only rendered when scripting is unsupported or disabled in the browser, which means the only time noscript content would be displayed to the user is when the JavaScript that you're using to modify the noscript content wouldn't run.
Perhaps you could clarify what exactly you're trying to do?

Related

How to replace html tags with our own tags

In my MVC web application, I have a text area inside a View where the user enters HTML text. In that text, I want to replace HTML tags with my own custom tags.
For Example:
HTML tag:
<input type='text' name='MyList.First_Name' data-val='true' data-val-required='Please enter first name' />
Replace with:
[~TextFieldTag|MyList.First_Name|||0|data-val=>true|data-val-required=>Please enter first name|~]
Can anyone suggest what is the best approach to do this?

I was going to recommend a simple string replacement at first, but given the seemingly complicated nature of your replacements, that might not be the best approach.
Probably the best approach would be to take the HTML and convert it to DOM elements, which can be done simply by assigning it to an element's innerHTML:
document.querySelector('button').addEventListener('click', () => {
document.querySelector('#renderSpace').innerHTML = document.querySelector('textarea').value;
});
Add some HTML:
<textarea></textarea>
<button>Press Me</button>
<div id="renderSpace"></div>
You can position the area you render it inside of off-screen so users don't actually see it.
From there, I would walk the DOM tree (basically, start at the root, then look at all of its children, then their children, etc., recursively), reading off any properties that you deem appropriate and then writing your string replacement as you go along.
That does require that they have entered valid HTML (which is generally a requirement, but can be difficult to rely on users to enter), so you'll want to have some good, user-friendly, error handling in there.
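The tree walk described above could look roughly like the sketch below. It assumes each node exposes tagName, attributes and children the way DOM elements do; the output format produced by describe is a simplified stand-in and not the exact custom-tag syntax from the question. Plain objects are used in place of real elements so the sketch also runs outside a browser.

```javascript
// Depth-first walk: visit a node, then each of its element children.
function walk(node, visit) {
  visit(node);
  var kids = node.children || [];
  for (var i = 0; i < kids.length; i++) {
    walk(kids[i], visit);
  }
}

// Hypothetical formatter: "[~tagname|attr=>value|...|~]".
// The real custom-tag format has more fields than this.
function describe(el) {
  var parts = [el.tagName.toLowerCase()];
  var attrs = el.attributes || [];
  for (var i = 0; i < attrs.length; i++) {
    parts.push(attrs[i].name + '=>' + attrs[i].value);
  }
  return '[~' + parts.join('|') + '|~]';
}

// Plain objects standing in for the rendered DOM tree.
var root = {
  tagName: 'DIV',
  attributes: [],
  children: [{
    tagName: 'INPUT',
    attributes: [{ name: 'type', value: 'text' }, { name: 'data-val', value: 'true' }],
    children: []
  }]
};

var out = [];
walk(root, function (el) { out.push(describe(el)); });
console.log(out.join('\n'));
```

In a browser, the same walk could be started from document.querySelector('#renderSpace') instead of the root object.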

How do I allow <img> and <a> tags for innerHTML, but no others? (Making a forum)

I am currently programming a forum using only javascript (No JQuery please). I am doing very well, however, there is one issue I would love help with.
Currently I am getting the post from a database, assigning it to variable MainPost, and then attaching it to a div via a text node:
var theDiv = document.getElementById("MainBody");
var content = document.createTextNode(MainPost);
theDiv.appendChild(content);
This is working quite well, however, I would LOVE to be able to do this:
document.getElementById("MainBody").innerHTML += MainPost;
But I know this would allow people to use ANY html tag they want, even something like "script" followed by javascript code. This would be bad for business, obviously, but I do like the idea of allowing posters to use the "img" tag as well as the "a href" tags. Is there a way to somehow disable all tags except these two for the innerHTML?
Thank you all so much for any help you can offer.

Ok, the first thought that came to my mind when I read this question was to find a regular expression to exclude a specific string in a word. Simple search gave a lot of results from SO.
Starting point - To remove all the HTML tags from a string (from this answer):
var regex = /(<([^>]+)>)/ig
, body = "<p>test</p>"
, result = body.replace(regex, "");
console.log(result);
To exclude a string you would do something like this (again from all the source mentioned above):
(?!StringToBeExcluded)
Since you want to exclude the <a href and <img tags, a suitable regex in your case could be:
(<(?![\/]?a)(?![\/]?img)([^>]+)>)
Explanation :
Think of it as three pieces in succession:
(?![\/]?a) : Negative lookahead asserting that what follows is not the string "a", optionally prefixed by a forward slash (this should take care of the a href tags). Note that it also skips any other tag whose name starts with "a", such as <abbr>.
(?![\/]?img) : Same as the first, except that it looks for the string "img". I don't know why I allowed the </img> tag. Yes, <img> doesn't have a closing tag. You could remove the [\/]? bit from it to fix this.
([^>]+) : Matches one or more characters that are not >, i.e. the rest of the tag, whether it is an opening or a closing tag.
All of these pieces lie between < and >. You might want to try a regex demo that I've created incorporating them to take care of ignoring all HTML elements except the image and link tags.
Sidenote - I haven't thoroughly given this regex a try. Feel free to play around with it and tweak it according to your needs. In any case, I hope this gets you started in the right direction.
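As one possible tweak, the two lookaheads can be merged and given a word boundary so that only tags named exactly a or img are kept. This is a sketch rather than a tested sanitizer: stripExceptLinksAndImages is a hypothetical name, and the pattern only looks at tag names, so attributes such as onerror or javascript: URLs on the allowed tags pass through unchanged.

```javascript
// Removes every tag except <a ...>, </a> and <img ...>.
// \b stops the lookahead from treating <abbr> or <article> as <a>.
function stripExceptLinksAndImages(html) {
  return html.replace(/<(?!\/?(?:a|img)\b)[^>]+>/gi, '');
}

var post = '<p>hi <a href="x">l</a> <img src="y"> <abbr>t</abbr></p><script>z</script>';

// Tags are removed, but the text between them (including "z") stays.
console.log(stripExceptLinksAndImages(post));
// hi <a href="x">l</a> <img src="y"> tz
```

Because the script body survives as plain text and attributes are not inspected, this is better suited to tidying markup than to protecting a forum from hostile input.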

Use jquery to only insert a start tag or end tag

I'm learning that using
replaceWith('<section>')
or
after('<section>')
Will actually insert the full element in each case:
<section></section>
And that when using end tags
replaceWith('</section>')
such calls seem to be ignored.
Is there some way to disable this behavior? I need to at one point in the DOM insert a start tag, and at another point insert an end tag.
I can't get wrapAll() to work either. I think it probably has something to do with the things being wrapped not all being siblings.

I don't believe jQuery is going to allow this kind of behavior, because it would let you invalidate your markup.
When dealing with the DOM, those pairs of tags are treated as "nodes". They are grouped up as objects with attributes and values, and many other objects rely on them, so simply "moving the text" of the closing tag isn't possible.
Why not grab all of the stuff you want in "the middle" of your half tags, create a new element, and then place your "filling" into it with append? Something like this:
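var theFilling = $('ul#theFilling'),
    theCookieCrust = document.createElement('section');
// appendTo() moves the list into the new section, so no separate remove() is needed
theFilling.appendTo(theCookieCrust);
// attach the section (with the list inside it) to the page
$(theCookieCrust).appendTo('body');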

JavaScript Library/Function to find Unclosed HTML Tags

I am currently looking for a solution to find and list out any unclosed HTML tags from an arbitrary slice of raw HTML. I don't feel like this should be an awful problem, but I cannot seem to find something that does it in JS. Unfortunately, this needs to be client-side since it is being used for rendering annotations to HTML pages. Obviously, annotations are somewhat nasty business, since they select or apply formatting that may apply to only part of an HTML element (i.e., a markup overlaid onto an existing HTML markup).
One simple use-case is where you might want to only render part of an HTML page, but then inject the rest later. For example, imagine a hypothetical segment:
<p>This is my text <StartDelayedInject/> with a comment I added. </p>
<p> But it doesn't exist until now. </p> <StopDelayedInject/>
I'll be doing some pre-processing to rebuild the HTML so that I wrap partial elements into span-type elements that apply the appropriate formatting. Initially this would be parsed in the form:
<p><span>This is my text</span></p>
After some user action, it would then be modified to a form such as:
<p><span>This is my text</span><span>with a comment I added.</span></p>
<p>But it doesn't exist until now.</p>
This is a very simplified example case (obviously things like ul elements and tables get hairier), but gives the general principle. However, to do this effectively, I need to be able to check a segment of HTML and figure out there are tags that have opened (but not closed). If I know that information, I can wrap the last unterminated text data into a span, close the unclosed tag, and know to return to that point to inject the remainder of the content when needed. However, I need to know the tags that were still open, so that when I inject or modify another segment of content, I can make sure to put it in the right place (e.g., get "with a comment I added." in the first paragraph).
From my understanding of context-free grammars, this should be a relatively trivial task. Each time you open/enter or close/exit a tag, you could just keep a stack of the tags opened but not yet closed. With that said, I'd much rather use a library that's a bit more of a mature solution than make a naive parser for that purpose. I'd assume there's some JS HTML parser around that would do this, right? Plenty of them know how to close tags, so clearly at some point they calculated this.

The problem is that JavaScript only has access to the html in two ways:
In a sense that each element is an object with properties and methods created by the browser on page load.
In a sense that it is a string of text.
Using the first method of interfacing with html, there is no way to detect unclosed tags as you only have access to the objects that the browser creates for you after it parses the html.
Using the second method, you would have to run the entire string of html through an html parser. Some people might assume you could do it simply with regexp, however, this is not feasible. I refer you to this fantastic stackoverflow question.
Even if you found a really robust html parser to use, you would still run into the problem created by the fact that, before your JavaScript even touches it, the browser will have attempted to parse the potentially broken html and there could be errors everywhere.
Edit:
If you like the parser idea, John Resig created this example one you might want to reference.

Not perfect but here's my quick method for checking for mismatch between open/close tags:
function find_unclosed_tags(str) {
str = str.toLowerCase();
var tags = ["a", "span", "div", "ul", "li", "h1", "h2", "h3", "h4", "h5", "h6", "p", "table", "tr", "td", "b", "i", "u"];
var mismatches = [];
tags.forEach(function(tag) {
var pattern_open = '<'+tag+'( |>)';
var pattern_close = '</'+tag+'>';
var diff_count = (str.match(new RegExp(pattern_open,'g')) || []).length - (str.match(new RegExp(pattern_close,'g')) || []).length;
if(diff_count != 0) {
mismatches.push("Open/close mismatch for tag " + tag + ".");
}
});
return mismatches;
}
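The stack idea mentioned in the question can also be sketched directly. Unlike the count-based check above, it reports which tags are still open and in what order. This is a rough sketch under several assumptions: openTags is a hypothetical name, comments and script contents are not handled, and a > inside an attribute value will confuse the tag pattern.

```javascript
// Void elements never take a closing tag, so they are never pushed.
var VOID = { area: 1, base: 1, br: 1, col: 1, embed: 1, hr: 1, img: 1,
             input: 1, link: 1, meta: 1, source: 1, track: 1, wbr: 1 };

// Returns the names of tags that are still open at the end of the fragment,
// outermost first.
function openTags(html) {
  var stack = [];
  var re = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g;
  var m;
  while ((m = re.exec(html)) !== null) {
    var name = m[2].toLowerCase();
    if (VOID[name]) continue;           // <br>, <img>, ... never stay open
    if (!m[1]) {                        // opening tag
      stack.push(name);
      continue;
    }
    var idx = stack.lastIndexOf(name);  // closing tag: find its opener
    if (idx !== -1) stack.length = idx; // drop it and anything opened inside it
  }
  return stack;
}

console.log(openTags('<p>This is my text <span>with')); // [ 'p', 'span' ]
console.log(openTags('<p>a</p><ul><li>b<br>'));         // [ 'ul', 'li' ]
```

A closing tag with no matching opener is simply ignored here; a stricter version could collect those as errors instead.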

Strip <script> tags from innerHTML using Prototype

Using Prototype, I'm trying to extract a piece of text from the DOM - this would normally be a simple $().innerHTML job, but the HTML is nested slightly.
<td class="time-record">
<script type="text/javascript">
//<![CDATA[
document.write('XXX ago'.gsub('XXX', i18n_time_ago_in_words(1229311439000)));
//]]>
</script>
about 11 months ago by <span class="author"><strong>Justin</strong></span>
</td>
In this case, innerHTML is going to pick up the JavaScript, which will cause all sorts of problems.
What's the best/efficient/fastest way to extract about 11 months ago by <span class="author"><strong>Justin</strong></span> without the JavaScript?

Use innerHTML, and run it through stripScripts:
var html = $$('td.time-record')[0].innerHTML.stripScripts()
That would be useful for grabbing the html of the single cell. A more general solution that does the same but for all td.time-record elements would be:
$$('td.time-record').pluck('innerHTML').invoke('stripScripts');
which would return to you an array of each cell's html (with <script> elements removed) that you could then .join('') or iterate over.

I don't use Prototype's stripScripts or stripTags, as they're trivial, naïve regex hacks that don't get anywhere near handling all possible markup constructs correctly. For a simple case like this you can probably get away with stripScripts, but using these functions for anything security-sensitive is a mistake.
Personally I'd simply remove the script element from the DOM before taking the innerHTML. Once an inline script has been executed there's no reason you need to keep the HTMLScriptElement in the document.
$$('.time-record script').invoke('remove');
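To illustrate why regex-based stripping should not be relied on for anything security-sensitive, here is a small sketch. naiveStripScripts is a hypothetical stand-in written for this example; it is not Prototype's actual stripScripts implementation, only a pattern in the same spirit.

```javascript
// Hypothetical naive stripper: removes <script ...>...</script> blocks by regex.
function naiveStripScripts(html) {
  return html.replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '');
}

// The simple, well-formed case works as hoped.
console.log(naiveStripScripts('<td><script>document.write("x")</script>about 11 months ago</td>'));
// <td>about 11 months ago</td>

// Removing the inner block splices the outer pieces back into a script tag.
console.log(naiveStripScripts('<scr<script></script>ipt>alert(1)</script>'));
// <script>alert(1)</script>

// Script carried in an attribute is not touched at all.
console.log(naiveStripScripts('<img src="x" onerror="alert(1)">'));
// <img src="x" onerror="alert(1)">
```

Removing the script element from the DOM, as suggested above, avoids this class of problem for the trusted markup in the question.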
