JS Regexp: get the inline javascripts from html - javascript

I need to get all script tags from an HTML string, separating the inline scripts from the "linked" scripts. By inline scripts I mean script tags without the src attribute.
Here is how I get the "linked scripts":
<script(.)+src=(.)+(/>|</script>)
So: <script, followed by one or more of any character, followed by src=, followed by one or more of any character, followed by /> or </script>.
This works as expected.
Now I want to get all the script tags without the src attribute, that is, those with some JavaScript code between <script .....> and </script>, but I can't figure out how to do that. I have only just started learning regular expressions, so the help of a more experienced regex guru is needed :)
UPDATE
Ok, so dear downvoters. I have the HTML code for a whole page in a variable. I want to extract the script tags from it. How can I do that, using jQuery for example?
var dom = $(html);
console.log(dom.find('script'));
will not work. So, what is the way to accomplish that?
UPDATE 2
I don't need to solve this problem with regex, but since I am learning regular expressions now, I thought I would try. I am open to any other solution.

Create a DOM element using document.createElement, then set its innerHTML to the contents of your HTML string. This will automatically parse your HTML using the browser's built-in parser and fill your newly-created element with children.
var dummyDoc = document.createElement("html");
dummyDoc.innerHTML = "<body><script>alert('foo');</script></body>"; // or myInput.value
var dom = $(dummyDoc);
var scripts = dom.find('script');
(I only use jQuery because you do so in your question. This is certainly also possible without jQuery.)

If you are in a position where no DOM access is available (Node.js?), you'd be forced to use regex. Here is a solution that worked for me in similar circumstances:
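function scrapeInlineScripts(sHtml) {
    // Turn every opening tag into a closing tag, then split on closing tags:
    // the odd-numbered pieces are the script bodies.
    var a = sHtml.split(/<script[^>]*>/).join('</script>').split('</script>'),
        s = '';
    for (var n = 1; n < a.length; n += 2) {
        s += a[n];
    }
    return s;
}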
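To also separate the inline scripts from the linked ones, as the question asks, a regex-based sketch could look like the following. The function name splitScripts and the shape of its return value are made up for illustration, and like any regex approach it assumes reasonably well-formed markup: it will be confused by a ">" inside an attribute value or by the text "</script>" inside a string literal.

```javascript
// Hypothetical helper (not from the answers above): collects script
// elements from an HTML string and sorts them into inline and linked.
function splitScripts(html) {
    // Group 1: the attribute text of the opening tag. Group 2: the body.
    var scriptRe = /<script\b([^>]*)>([\s\S]*?)<\/script\s*>/gi;
    // A real src attribute starts the attribute text or follows whitespace,
    // so attributes such as data-src are not mistaken for it.
    var srcRe = /(?:^|\s)src\s*=/i;
    var result = { inline: [], linked: [] };
    var match;
    while ((match = scriptRe.exec(html)) !== null) {
        if (srcRe.test(match[1])) {
            result.linked.push(match[0]); // keep the whole tag
        } else {
            result.inline.push(match[2]); // keep only the code
        }
    }
    return result;
}
```

Calling splitScripts(html).inline then gives an array with the code of each inline script, and .linked gives the full tags of the scripts that have a src attribute.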

Related

How to convert html correctly in javascript?

I need to convert snippets of text that contain html tags into plain text using Javascript / Node.Js.
I currently use the String.js library for that, but the problem is that when it removes the tags (using the strip_tags() function), no line break is left between the elements.
E.g.
<div>Some text</div><div>another text</div>
becomes
Some textanother text
Do you know how I could get rid of this problem? Maybe another library?
Thanks!
Try using Cheerio. It exposes a jQuery-like interface on the server side. Then it's just:
var html = $(htmlstring).html();
Then just traverse the DOM for whatever elements you want and call $(element).text();
Hi, here is a very simple solution to your problem using a regular expression.
In this case we remove all tags except br tags. If you want, you can then replace the br tags with something else, such as \n or \t.
I hope this helps.
Cheers!
var html = "<div>Some text</div><div>another text</div><br />test<div>10</div>";
// The lookahead (?!br\b) skips <br> tags; every other tag is removed.
var removeHtmlTags = html.replace(/<(?!br\b)[^>]+>/ig, "");
console.log(removeHtmlTags);
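If the goal is plain text that keeps the line breaks, one possible variation is to turn <br> tags and the closing tags of block elements into newlines before stripping the rest. This is only a sketch: the function name htmlToText and the list of block elements are assumptions, and HTML entities such as &amp; are not decoded.

```javascript
// Hypothetical helper: converts an HTML snippet to plain text and keeps
// a line break where a <br> or the end of a block element was.
function htmlToText(html) {
    return html
        .replace(/<br\s*\/?>/gi, '\n')                  // <br>, <br/>, <br />
        .replace(/<\/(div|p|li|h[1-6]|tr)\s*>/gi, '\n') // end of a block element
        .replace(/<[^>]+>/g, '')                        // every remaining tag
        .replace(/\n{2,}/g, '\n')                       // collapse blank lines
        .trim();
}

console.log(htmlToText('<div>Some text</div><div>another text</div>'));
```

For the snippet from the question this prints "Some text" and "another text" on separate lines.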

Can't make Bootstrap tooltip work when creating elements with javascript

I'm pretty new to jquery in particular and js in general, so I hope I didn't make a silly mistake.
I have the following js code:
var speechText = "Purpose of use:<br/>";
speechText += "<script>$(function(){$(\".simple\").tooltip();});</script>";
speechText += "simple use";
speechElement.innerHTML = speechText;
This function changes the content of an element on the html page.
When calling this function everything works, including displaying the link "simple use", but the tooltip doesn't appear.
I tried writing the exact thing on the html document itself, and it worked.
What am I missing?
First, script tags inserted into the DOM as text via innerHTML will not execute. See "Can scripts be inserted with innerHTML?". As background: by default, script tags are parsed and executed synchronously during page load, in document order, so earlier script tags run before later ones. This behaviour can be altered with the defer and async attributes, which are best explained in David Walsh's post.
You can, however, create a script node, assign it attributes and content, and append it to a node in the DOM (or to a node that is later inserted into the DOM), as suggested by an answer to the aforementioned SO question (here: SO answer).
Second, you don't need to inject that piece of JavaScript into the DOM at all; you can call the plugin directly after the string concatenation. For example, you might refactor your code like this:
JavaScript
var speechText = "Purpose of use:<br/>";
speechText += "simple use";
speechElement.innerHTML = speechText;
$(function(){ $(".simple").tooltip(); });

jQuery parse HTML without loading images

I load HTML from other pages to extract and display data from that page:
$.get('http://example.org/205.html', function (html) {
console.log( $(html).find('#c1034') );
});
That does work but because of the $(html) my browser tries to load images that are linked in 205.html. Those images do not exist on my domain so I get a lot of 404 errors.
Is there a way to parse the page like $(html) but without loading the whole page into my browser?
Actually if you look in the jQuery documentation it says that you can pass the "owner document" as the second argument to $.
So what we can then do is create a virtual document so that the browser does not automatically load the images present in the supplied HTML:
var ownerDocument = document.implementation.createHTMLDocument('virtual');
$(html, ownerDocument).find('.some-selector');
Use regex and remove all <img> tags
html = html.replace(/<img[^>]*>/g,"");
Sorry for resuscitating an old question, but this is the first result when searching for how to try to stop parsed html from loading external assets.
I took Nik Ahmad Zainalddin's answer; however, it has a weakness: anything between two separate <script> elements gets wiped out.
<script>
</script>
Inert text
<script>
</script>
In the above example Inert text would be removed along with the script tags. I ended up doing the following instead:
html = html.replace(/<\s*(script|iframe)[^>]*>(?:[^<]*<)*?\/\1>/g, "").replace(/(<(\b(img|style|head|link)\b)(([^>]*\/>)|([^\7]*(<\/\2[^>]*>)))|(<\bimg\b)[^>]*>|(\b(background|style)\b=\s*"[^"]*"))/g, "");
Additionally I added the capability to remove iframes.
Hope this helps someone.
Parsing HTML in the following way will load the images automatically.
var wrapper = document.createElement('div'),
html = '.....';
wrapper.innerHTML = html;
If you use DOMParser to parse the HTML, the images will not be loaded automatically. See https://github.com/panzi/jQuery-Parse-HTML/blob/master/jquery.parsehtml.js for details.
You could either use jQuery's remove() method to remove the image elements
console.log( $(html).find('img').remove().end().find('#c1034') );
or remove them from the HTML string. Something like
console.log( $(html.replace(/<img[^>]*>/g,"")) );
Regarding background images, you could do something like this:
$(html).filter(function() {
return $(this).css('background-image') !== '';
}).remove();
The following regex removes all occurrences of <img>, <head>, <link>, <script>, and <style> elements, as well as background and style attributes, from the data string returned by the Ajax load.
html = html.replace(/(<(\b(img|style|script|head|link)\b)(([^>]*\/>)|([^\7]*(<\/\2[^>]*>)))|(<\bimg\b)[^>]*>|(\b(background|style)\b=\s*"[^"]*"))/g,"");
Test regex: https://regex101.com/r/nB1oP5/1
I wish there were a better workaround (other than using a regex replace).
Instead of removing all img elements altogether, you can use the following regex to delete all src attributes instead:
html = html.replace(/src="[^"]*"/ig, "");
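Note that a pattern which matches src="..." anywhere in the string also affects <script> and <iframe> tags. A narrower variation, sketched here under the assumption that only <img> tags should be touched, renames the attribute to data-src so that the original URL can still be read later. The function name disableImageSources is made up for this example, and the sketch assumes no ">" appears inside attribute values.

```javascript
// Hypothetical helper: renames src to data-src, but only inside <img> tags,
// so the browser has nothing to load when the HTML is parsed.
function disableImageSources(html) {
    return html.replace(/<img\b[^>]*>/gi, function (tag) {
        // Require whitespace before src so an existing data-src is untouched.
        return tag.replace(/(\s)src(\s*=)/gi, '$1data-src$2');
    });
}
```

The result can then be passed to $(html) as in the question, and the image URLs remain available through the data-src attribute.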

Javascript beginner: how to replace a href text if it matches a specified string?

When someone posts a link to another page on my website, I'd like to shorten the a href text from something like: http://mywebsite.com/posts/8 to /posts/8 or http://mywebsite.com/tags/8 to /tags/8. Since I'm learning javascript I don't want to depend on a library like prototype or jquery. Is it recommended to use javascript's replace method?
I found w3schools' page here but my code was replacing all instances of the string, not just the href text.
Here's what I have so far:
<script type="text/javascript" charset="utf-8">
var str="http://www.mywebsite.com";
document.write(str.replace("http://www.", ""));
</script>
str = str.replace(/^http:\/\/www\.mywebsite\.com/, "");
someElement.appendChild(document.createTextNode(str));
Note that you're introducing a Cross-Site Scripting vulnerability by directly calling document.write with user input (you could also say you're not treating the URL http://<script>alert('XSS');</script> correctly).
Instead of using document.write, replace someElement in the above code with the element in your page that should contain the user content. Note that this code cannot run at the JavaScript top level; it should instead be called when the load event fires.
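As a sketch of how the shortening itself could be written without a library, the function below strips the scheme and host only when the text really starts with your own site's address. The name shortenOwnLink and its host parameter are assumptions made for this example.

```javascript
// Hypothetical helper: shortens link text that points to your own site,
// e.g. "http://mywebsite.com/posts/8" becomes "/posts/8".
function shortenOwnLink(text, host) {
    // Escape regex metacharacters so the dots in the host match literally.
    var escapedHost = host.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    // Optional "www.", and the host must be followed by "/" or the end of
    // the string, so "mywebsite.com.evil.org" is left alone.
    var prefix = new RegExp('^https?://(?:www\\.)?' + escapedHost + '(?=/|$)', 'i');
    var shortened = text.replace(prefix, '');
    return shortened === '' ? '/' : shortened;
}

console.log(shortenOwnLink('http://mywebsite.com/posts/8', 'mywebsite.com'));
```

To apply it to a page, you could loop over the links and assign the result to each link's text, for example link.textContent = shortenOwnLink(link.textContent, 'mywebsite.com'), which changes only the visible text and leaves the href itself untouched.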

How to embed HTML via JS embed code

I need to give the user a snippet of js code that will insert some HTML code into the page.
I'm wondering what the best method to do so is. Should I use document.write, should I just create all the HTML elements via DOM programmatically?
Is it possible to use a js library? I can see conflicts occurring if the webpage the code is embedded in already contains the library.
Using a library is probably too heavyweight, inserting DOM elements is very verbose and document.write may not work if the target site uses the application/xhtml+xml content type. I think your best bet is to construct one element using document.createElement and then setting innerHTML on that.
A suggestion:
Insert this DIV wherever you want the output to appear:
<div id="uniqueTargetID" style="display: none;"></div>
Then at bottom of page have this:
<script src="snippet.js"></script>
This file (remotely hosted or otherwise) could output simple content this way:
var html = [];
html.push('<h1>This is a title</h1>');
html.push('<p>So then she said, thats not a monkey, its a truck!</p>');
html.push('<p>You shoulda seen his face...</p>');
var target = document.getElementById('uniqueTargetID');
target.innerHTML = html.join('');
target.style.display = 'block';
I would avoid using document.write() if you can help it.
Javascript::
//to avoid global bashing
(function () {
    var target = document.getElementById('ScriptName'),
        parent = target.parentElement,
        oput = document.createElement('div');
    oput.innerHTML = "<p>Some Content</p>";
    parent.insertBefore(oput, target);
}());
HTML to give to client/people::
<script type="text/javascript" id="ScriptName" src="/path/to/ScriptName.js"></script>
ScriptName should be something unique to your script.
If it's a simple insertion, you can use plain JS; if you want to provide more complex functionality, you can use a library. The best choice in that case is a library that does not extend built-in objects (Array, Function, String), to prevent conflicts (jQuery in noConflict mode, YUI, etc.).
In any case, it is better to avoid document.write; instead, set the innerHTML of an existing element or create a new one.
