Is there an easy way to take a string of html in JavaScript and strip out the html?
If you're running in a browser, then the easiest way is just to let the browser do it for you...
function stripHtml(html)
{
let tmp = document.createElement("DIV");
tmp.innerHTML = html;
return tmp.textContent || tmp.innerText || "";
}
Note: as folks have noted in the comments, this is best avoided if you don't control the source of the HTML (for example, don't run this on anything that could've come from user input). For those scenarios, you can still let the browser do the work for you - see Saba's answer on using the now widely-available DOMParser.
myString.replace(/<[^>]*>?/gm, '');
Simplest way:
jQuery(html).text();
That retrieves all the text from a string of html.
I would like to share an edited version of Shog9's approved answer.
As Mike Samuel pointed out in a comment, that function can execute inline JavaScript code.
But Shog9 is right when saying "let the browser do it for you..."
So here is my edited version, using DOMParser:
function strip(html){
let doc = new DOMParser().parseFromString(html, 'text/html');
return doc.body.textContent || "";
}
Here is the code to test the inline JavaScript:
strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")
Also, it does not request resources (like images) on parse:
strip("Just text <img src='https://assets.rbl.ms/4155638/980x.jpg'>")
As an extension to the jQuery method, if your string might not contain HTML (eg if you are trying to remove HTML from a form field)
jQuery(html).text();
will return an empty string if there is no HTML
Use:
jQuery('<p>' + html + '</p>').text();
instead.
Update:
As has been pointed out in the comments, in some circumstances this solution will execute JavaScript contained within html. If the value of html could be influenced by an attacker, use a different solution.
Converting HTML for Plain Text emailing keeping hyperlinks (a href) intact
The above function posted by hypoxide works fine, but I was after something that would convert HTML created in a web rich-text editor (for example FCKEditor), clearing out all the HTML but leaving the links, because I wanted both an HTML and a plain text version to build the correct parts of an SMTP email.
After a long time searching Google, my colleagues and I came up with this using the regex engine in JavaScript:
str='this string has <i>html</i> code i want to <b>remove</b><br>Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br><p>Now back to normal text and stuff</p>';
str=str.replace(/<br>/gi, "\n");
str=str.replace(/<p\b[^>]*>/gi, "\n");
str=str.replace(/<a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
str=str.replace(/<(?:.|\s)*?>/g, "");
The str variable starts out like this:
this string has <i>html</i> code i want to <b>remove</b><br>Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br><p>Now back to normal text and stuff</p>
and then after the code has run it looks like this:-
this string has html code i want to remove
Link Number 1 -> BBC (Link->http://www.bbc.co.uk) Link Number 1
Now back to normal text and stuff
As you can see, all the HTML has been removed and the links have been preserved, with the hyperlinked text still intact. I have also replaced the <p> and <br> tags with \n (newline char) so that some visual formatting is retained.
To change the link format (e.g. BBC (Link->http://www.bbc.co.uk) ), just edit the $2 (Link->$1), where $1 is the href URL/URI and $2 is the hyperlinked text. With the links directly in the body of the plain text, most mail clients convert them so the user can click on them.
Hope you find this useful.
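The same steps could be wrapped in one function, roughly as sketched below. The name htmlToPlainText is made up for this example, and it assumes href values are double-quoted. It uses [^>]* instead of .* so that a match cannot run past the end of a tag. Treat it as a sketch for simple editor output, not a general HTML parser.

```javascript
// Hypothetical helper: converts simple HTML to plain text, keeping link targets.
// Assumes href attributes are double-quoted.
function htmlToPlainText(html) {
  return html
    // <br>, <br/> and <br /> become newlines
    .replace(/<br\s*\/?>/gi, "\n")
    // opening <p ...> tags become newlines ([^>]* stops at the end of the tag)
    .replace(/<p\b[^>]*>/gi, "\n")
    // <a href="URL">text</a> becomes "text (Link->URL)"
    .replace(/<a\b[^>]*?href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, "$2 (Link->$1)")
    // drop every remaining tag
    .replace(/<[^>]*>/g, "");
}

console.log(htmlToPlainText('see <a href="http://www.bbc.co.uk">BBC</a><br>bye'));
```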
An improvement to the accepted answer.
function strip(html)
{
var tmp = document.implementation.createHTMLDocument("New").body;
tmp.innerHTML = html;
return tmp.textContent || tmp.innerText || "";
}
This way something running like this will do no harm:
strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")
Firefox, Chromium and Explorer 9+ are safe.
Opera Presto is still vulnerable.
Also images mentioned in the strings are not downloaded in Chromium and Firefox saving http requests.
This should do the work on any Javascript environment (NodeJS included).
const text = `
<html lang="en">
<head>
<style type="text/css">*{color:red}</style>
<script>alert('hello')</script>
</head>
<body><b>This is some text</b><br/></body>
</html>`;
// Remove style tags and content
text.replace(/<style[^>]*>.*<\/style>/gm, '')
// Remove script tags and content
.replace(/<script[^>]*>.*<\/script>/gm, '')
// Remove all opening, closing and orphan HTML tags
.replace(/<[^>]+>/gm, '')
// Remove leading spaces and repeated CR/LF
.replace(/([\r\n]+ +)+/gm, '');
I altered Jibberboy2000's answer to include several <BR /> tag formats, remove everything inside <SCRIPT> and <STYLE> tags, format the resulting HTML by removing multiple line breaks and spaces and convert some HTML-encoded code into normal. After some testing it appears that you can convert most of full web pages into simple text where page title and content are retained.
In the simple example,
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<!--comment-->
<head>
<title>This is my title</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style>
body {margin-top: 15px;}
a { color: #D80C1F; font-weight:bold; text-decoration:none; }
</style>
</head>
<body>
<center>
This string has <i>html</i> code i want to <b>remove</b><br>
In this line <a href="http://www.bbc.co.uk">BBC</a> with link is mentioned.<br/>Now back to &quot;normal text&quot; and stuff using &lt;html encoding&gt;
</center>
</body>
</html>
becomes
This is my title
This string has html code i want to remove
In this line BBC (http://www.bbc.co.uk) with link is mentioned.
Now back to "normal text" and stuff using <html encoding>
The JavaScript function and test page look like this:
function convertHtmlToText() {
var inputText = document.getElementById("input").value;
var returnText = "" + inputText;
//-- remove BR tags and replace them with line break
returnText=returnText.replace(/<br>/gi, "\n");
returnText=returnText.replace(/<br\s\/>/gi, "\n");
returnText=returnText.replace(/<br\/>/gi, "\n");
//-- remove P and A tags but preserve what's inside of them
returnText=returnText.replace(/<p.*>/gi, "\n");
returnText=returnText.replace(/<a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 ($1)");
//-- remove all inside SCRIPT and STYLE tags
returnText=returnText.replace(/<script.*>[\w\W]{1,}(.*?)[\w\W]{1,}<\/script>/gi, "");
returnText=returnText.replace(/<style.*>[\w\W]{1,}(.*?)[\w\W]{1,}<\/style>/gi, "");
//-- remove all else
returnText=returnText.replace(/<(?:.|\s)*?>/g, "");
//-- get rid of more than 2 multiple line breaks:
returnText=returnText.replace(/(?:(?:\r\n|\r|\n)\s*){2,}/gim, "\n\n");
//-- get rid of more than 2 spaces:
returnText = returnText.replace(/ +(?= )/g,'');
//-- get rid of html-encoded characters:
returnText=returnText.replace(/&nbsp;/gi," ");
returnText=returnText.replace(/&amp;/gi,"&");
returnText=returnText.replace(/&quot;/gi,'"');
returnText=returnText.replace(/&lt;/gi,'<');
returnText=returnText.replace(/&gt;/gi,'>');
//-- return
document.getElementById("output").value = returnText;
}
It was used with this HTML:
<textarea id="input" style="width: 400px; height: 300px;"></textarea><br />
<button onclick="convertHtmlToText()">CONVERT</button><br />
<textarea id="output" style="width: 400px; height: 300px;"></textarea><br />
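The last step of that function turns a handful of HTML entities back into characters. Here is a sketch of that step on its own, with one assumed change: &amp; is decoded last, so that text such as &amp;lt; comes out as the literal &lt; rather than being decoded twice. The name decodeBasicEntities is hypothetical, and only the five entities handled above are covered.

```javascript
// Hypothetical helper: decodes only the five entities handled in the answer above.
function decodeBasicEntities(text) {
  return text
    .replace(/&nbsp;/gi, " ")
    .replace(/&quot;/gi, '"')
    .replace(/&lt;/gi, "<")
    .replace(/&gt;/gi, ">")
    // decode &amp; last so "&amp;lt;" ends up as the literal text "&lt;"
    .replace(/&amp;/gi, "&");
}

console.log(decodeBasicEntities("Tom &amp; Jerry &lt;3"));
```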
var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");
This is a regex version, which is more resilient to malformed HTML, like:
Unclosed tags
Some text <img
"<", ">" inside tag attributes
Some text <img alt="x > y">
Newlines
Some <a
href="http://google.com">
The code
var html = '<br>This <img alt="a>b" \r\n src="a_b.gif" />is > \nmy<>< > <a>"text"</a'
var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");
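To see how that pattern copes with the three cases listed above, here is a small sketch. TAG_PATTERN and stripTags are names chosen for this example, not part of the answer.

```javascript
// Same pattern as above: quoted attribute values are consumed whole,
// and an unclosed tag is allowed to run to the end of the string.
const TAG_PATTERN = /<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g;

function stripTags(html) {
  return html.replace(TAG_PATTERN, "");
}

// Unclosed tag
console.log(JSON.stringify(stripTags("Some text <img")));
// ">" inside an attribute value
console.log(JSON.stringify(stripTags('Some text <img alt="x > y">')));
// Newline inside a tag
console.log(JSON.stringify(stripTags('Some <a\nhref="http://google.com">link</a>')));
```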
from CSS tricks:
https://css-tricks.com/snippets/javascript/strip-html-tags-in-javascript/
const originalString = `
<div>
<p>Hey that's <span>somthing</span></p>
</div>
`;
const strippedString = originalString.replace(/(<([^>]+)>)/gi, "");
console.log(strippedString);
Another, admittedly less elegant solution than nickf's or Shog9's, would be to recursively walk the DOM starting at the <body> tag and append each text node.
var bodyContent = document.getElementsByTagName('body')[0];
var result = appendTextNodes(bodyContent);
function appendTextNodes(element) {
var text = '';
// Loop through the childNodes of the passed in element
for (var i = 0, len = element.childNodes.length; i < len; i++) {
// Get a reference to the current child
var node = element.childNodes[i];
// Append the node's value if it's a text node
if (node.nodeType == 3) {
text += node.nodeValue;
}
// Recurse through the node's children, if there are any
if (node.childNodes.length > 0) {
text += appendTextNodes(node);
}
}
// Return the final result
return text;
}
If you want to keep the links and the structure of the content (h1, h2, etc.) then you should check out TextVersionJS. You can use it with any HTML, although it was created to convert an HTML email to plain text.
The usage is very simple. For example in node.js:
var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
var textVersion = createTextVersion(yourHtml);
Or in the browser with pure js:
<script src="textversion.js"></script>
<script>
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
var textVersion = createTextVersion(yourHtml);
</script>
It also works with require.js:
define(["textversionjs"], function(createTextVersion) {
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
var textVersion = createTextVersion(yourHtml);
});
const htmlParser= new DOMParser().parseFromString("<h6>User<p>name</p></h6>" , 'text/html');
const textString= htmlParser.body.textContent;
console.log(textString)
A lot of people have answered this already, but I thought it might be useful to share the function I wrote that strips HTML tags from a string but allows you to include an array of tags that you do not want stripped. It's pretty short and has been working nicely for me.
function removeTags(string, array){
return array ? string.split("<").filter(function(val){ return f(array, val); }).map(function(val){ return f(array, val); }).join("") : string.split("<").map(function(d){ return d.split(">").pop(); }).join("");
function f(array, value){
return array.map(function(d){ return value.includes(d + ">"); }).indexOf(true) != -1 ? "<" + value : value.split(">")[1];
}
}
var x = "<span><i>Hello</i> <b>world</b>!</span>";
console.log(removeTags(x)); // Hello world!
console.log(removeTags(x, ["span", "i"])); // <span><i>Hello</i> world!</span>
For an easier solution, try this: https://css-tricks.com/snippets/javascript/strip-html-tags-in-javascript/
var StrippedString = OriginalString.replace(/(<([^>]+)>)/ig,"");
It is also possible to use the fantastic htmlparser2 pure JS HTML parser. Here is a working demo:
var htmlparser = require('htmlparser2');
var body = '<p><div>This is </div>a <span>simple </span> <img src="test"></img>example.</p>';
var result = [];
var parser = new htmlparser.Parser({
ontext: function(text){
result.push(text);
}
}, {decodeEntities: true});
parser.write(body);
parser.end();
result.join('');
The output will be This is a simple example.
See it in action here: https://tonicdev.com/jfahrenkrug/extract-text-from-html
This works in both node and the browser if you pack your web application using a tool like webpack.
I made some modifications to the original Jibberboy2000 script.
Hope it'll be useful for someone.
str = '**ANY HTML CONTENT HERE**';
str=str.replace(/<\s*br\/*>/gi, "\n");
str=str.replace(/<\s*a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
str=str.replace(/<\s*\/*.+?>/ig, "\n");
str=str.replace(/ {2,}/gi, " ");
str=str.replace(/\n+\s*/gi, "\n\n");
After trying all of the answers mentioned, I found that most if not all of them had edge cases and couldn't completely support my needs.
I started exploring how php does it and came across the php.js lib which replicates the strip_tags method here: http://phpjs.org/functions/strip_tags/
function stripHTML(my_string){
var charArr = my_string.split(''),
resultArr = [],
htmlZone = 0,
quoteZone = 0;
for( var x=0; x < charArr.length; x++ ){
switch( charArr[x] + htmlZone + quoteZone ){
case "<00" : htmlZone = 1;break;
case ">10" : htmlZone = 0;resultArr.push(' ');break;
case '"10' : quoteZone = 1;break;
case "'10" : quoteZone = 2;break;
case '"11' :
case "'12" : quoteZone = 0;break;
default : if(!htmlZone){ resultArr.push(charArr[x]); }
}
}
return resultArr.join('');
}
Accounts for > inside attributes and <img onerror="javascript"> in newly created dom elements.
usage:
clean_string = stripHTML("string with <html> in it")
demo:
https://jsfiddle.net/gaby_de_wilde/pqayphzd/
demo of top answer doing the terrible things:
https://jsfiddle.net/gaby_de_wilde/6f0jymL6/1/
Here's a version which sorta addresses @MikeSamuel's security concern:
function strip(html)
{
try {
var doc = document.implementation.createDocument('http://www.w3.org/1999/xhtml', 'html', null);
doc.documentElement.innerHTML = html;
return doc.documentElement.textContent||doc.documentElement.innerText;
} catch(e) {
return "";
}
}
Note, it will return an empty string if the HTML markup isn't valid XML (i.e., tags must be closed and attributes must be quoted). This isn't ideal, but it does avoid the potential security exploit.
If you need to handle markup that is not valid XML, you could try using:
var doc = document.implementation.createHTMLDocument("");
but that isn't a perfect solution either for other reasons.
I think the easiest way is to just use Regular Expressions as someone mentioned above. Although there's no reason to use a bunch of them. Try:
stringWithHTML = stringWithHTML.replace(/<\/?[a-z][a-z0-9]*[^<>]*>/ig, "");
The code below allows you to retain some HTML tags while stripping all others:
function strip_tags(input, allowed) {
allowed = (((allowed || '') + '')
.toLowerCase()
.match(/<[a-z][a-z0-9]*>/g) || [])
.join(''); // making sure the allowed arg is a string containing only tags in lowercase (<a><b><c>)
var tags = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi,
commentsAndPhpTags = /<!--[\s\S]*?-->|<\?(?:php)?[\s\S]*?\?>/gi;
return input.replace(commentsAndPhpTags, '')
.replace(tags, function($0, $1) {
return allowed.indexOf('<' + $1.toLowerCase() + '>') > -1 ? $0 : '';
});
}
I just needed to strip out the <a> tags and replace them with the text of the link.
This seems to work great.
htmlContent= htmlContent.replace(/<a.*href="(.*?)">/g, '');
htmlContent= htmlContent.replace(/<\/a>/g, '');
The accepted answer works fine mostly, however in IE if the html string is null you get the "null" (instead of ''). Fixed:
function strip(html)
{
if (html == null) return "";
var tmp = document.createElement("DIV");
tmp.innerHTML = html;
return tmp.textContent || tmp.innerText || "";
}
A safer way to strip the html with jQuery is to first use jQuery.parseHTML to create a DOM, ignoring any scripts, before letting jQuery build an element and then retrieving only the text.
function stripHtml(unsafe) {
return $($.parseHTML(unsafe)).text();
}
Can safely strip html from:
<img src="unknown.gif" onerror="console.log('running injections');">
And other exploits.
nJoy!
const strip=(text) =>{
return (new DOMParser()?.parseFromString(text,"text/html"))
?.body?.textContent
}
const value=document.getElementById("idOfEl").value
const cleanText=strip(value)
With jQuery you can simply retrieve it by using
$('#elementID').text()
I have created a working regular expression myself:
str=str.replace(/(<\?[a-z]*(\s[^>]*)?\?(>|$)|<!\[[a-z]*\[|\]\]>|<!DOCTYPE[^>]*?(>|$)|<!--[\s\S]*?(-->|$)|<[a-z?!\/]([a-z0-9_:.])*(\s[^>]*)?(>|$))/gi, '');
simple 2 line jquery to strip the html.
var content = "<p>checking the html source </p><p> </p><p>with </p><p>all</p><p>the html </p><p>content</p>";
var text = $(content).text();//It gets you the plain text
console.log(text);//check the data in your console
cj("#text_area_id").val(text);//set your content to text area using text_area_id
Long story short, I have a website made under Wix.com editor, and coding was made possible a few months ago.
I have set up a custom comment box, so users can post their comments, and read others'.
Now the thing is, the "comment Input" takes plain text, and whenever a link is posted, it is displayed as plain text, no color, no clickability.
I want code that 'reads' the list of comments and converts every piece of text that begins with 'https', 'http' or 'www' into an orange, clickable link (opening in a new tab).
Any solution please ?
Thanks !
I have tried many things such as :
$w('#text95').html =
(/((http:|https:)[^\s]+[\w])/g, '$1').replace;
text95 = the displayed comments (it is a text that repeats itself for as many comments as there are)
It looks like your replace syntax is wrong. Try something like this. I'm pretty sure this will work.
function linkify(inputText) {
var replacedText, replacePattern1, replacePattern2, replacePattern3;
//URLs starting with http://, https://, or ftp://
replacePattern1 = /(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gim;
replacedText = inputText.replace(replacePattern1, '<a href="$1" target="_blank">$1</a>');
//URLs starting with "www." (without // before it, or it'd re-link the ones done above).
replacePattern2 = /(^|[^\/])(www\.[\S]+(\b|$))/gim;
replacedText = replacedText.replace(replacePattern2, '$1<a href="http://$2" target="_blank">$2</a>');
//Change email addresses to mailto: links.
replacePattern3 = /(([a-zA-Z0-9\-\_\.])+@[a-zA-Z\_]+?(\.[a-zA-Z]{2,6})+)/gim;
replacedText = replacedText.replace(replacePattern3, '<a href="mailto:$1">$1</a>');
return replacedText;
}
Calling it with:
$w('#text95').innerHTML = linkify($w('#text95').html);
Here is my answer (improved Version including video links).
See also this Codepen here.
const convertLinks = ( input ) => {
let text = input;
const linksFound = text.match( /(?:www|https?)[^\s]+/g );
const aLink = [];
if ( linksFound != null ) {
for ( let i=0; i<linksFound.length; i++ ) {
let replace = linksFound[i];
if ( !( linksFound[i].match( /(http(s?)):\/\// ) ) ) { replace = 'http://' + linksFound[i] }
let linkText = replace.split( '/' )[2];
if ( linkText.substring( 0, 3 ) == 'www' ) { linkText = linkText.replace( 'www.', '' ) }
if ( linkText.match( /youtu/ ) ) {
let youtubeID = replace.split( '/' ).slice(-1)[0];
aLink.push( '<div class="video-wrapper"><iframe src="https://www.youtube.com/embed/' + youtubeID + '" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></div>' )
}
else if ( linkText.match( /vimeo/ ) ) {
let vimeoID = replace.split( '/' ).slice(-1)[0];
aLink.push( '<div class="video-wrapper"><iframe src="https://player.vimeo.com/video/' + vimeoID + '" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe></div>' )
}
else {
aLink.push( '<a href="' + replace + '" target="_blank">' + linkText + '</a>' );
}
text = text.split( linksFound[i] ).map(item => { return aLink[i].includes('iframe') ? item.trim() : item } ).join( aLink[i] );
}
return text;
}
else {
return input;
}
}
This replaces long and clumsy links within plain texts to short clickable links within that text. (And also wraps videos in responsive iframes)
Example:
This clumsy link https://stackoverflow.com/questions/49634850/javascript-convert-plain-text-links-to-clickable-links/52544985#52544985 is very clumsy and this http://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/split is not much better. This one www.apple.com is nice but www can be removed.
Becomes:
This clumsy link stackoverflow.com is very clumsy and this developer.mozilla.org is not much better. This one apple.com is nice but www can be removed.
The linkified text then displays as follows:
This clumsy link stackoverflow.com is very clumsy and this developer.mozilla.org is not much better. This one apple.com is nice but www can be removed.
I'm not sure what $w is or if you can really assign the html like that, but I'm guessing this is jQuery since the $ most commonly refers to the jQuery object.
Your try was close; it would be:
$('#text95').html($('#text95').html().replace(/((http:|https:)[^\s]+[\w])/g, '<a href="$1">$1</a>'));
Try it:
$('#text95').html($('#text95').html().replace(/((http:|https:)[^\s]+[\w])/g, '<a href="$1">$1</a>'));
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id=text95>
stuff and stuff and http://ww.stuff.com stuff
</div>
I really like the solution by @philipeachille. It’s lightweight and does the essentials. However, it has a couple of issues I needed to address:
if a link is followed immediately by a punctuation mark, the punctuation is included in the link
if the same link is included more than once, the logic gets confused
some links don’t start with either www or http, for example microsoft.com
I derived the following from his code, fixing these issues and omitting the video embedding stuff, which I didn’t want:
const linkify = t => {
const isValidHttpUrl = s => {
let u
try {u = new URL(s)}
catch (_) {return false}
return u.protocol.startsWith("http")
}
const m = t.match(/(?<=\s|^)[a-zA-Z0-9-:/]+\.[a-zA-Z0-9-].+?(?=[.,;:?!-]?(?:\s|$))/g)
if (!m) return t
const a = []
m.forEach(x => {
const [t1, ...t2] = t.split(x)
a.push(t1)
t = t2.join(x)
const y = (!(x.match(/:\/\//)) ? 'https://' : '') + x
if (isNaN(x) && isValidHttpUrl(y))
a.push('<a href="' + y + '" target="_blank">' + y.split('/')[2] + '</a>')
else
a.push(x)
})
a.push(t)
return a.join('')
}
To explain the main regular expression:
(?<=\s|^) looks behind (before) the link to determine where the link starts, which is either any white space or the beginning of the string
[a-zA-Z0-9-:/]+\.[a-zA-Z0-9] matches the start of a link – a pattern like xxx.x or even xxx://xxx.x
[a-zA-Z0-9-:/]+ a combination of one or more letters, numbers, hyphens, colons or slashes
\. followed immediately by a dot
[a-zA-Z0-9] followed immediately by another letter or number
.+? matches the rest of the link.
(?=[.,;:?!-]?(?:\s|$)) looks ahead (after) the link to determine where the link ends
?= positive lookahead
(?:\s|$) the link is ended either by any white space or by the end of the string
[.,;:?!-]? unless the white space or end of string is immediately preceded by one of these seven punctuation marks, in which case this punctuation mark ends the link.
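As a rough sketch of how the main pattern behaves on its own, the snippet below only lists the candidate matches, before any URL validation. LINK_PATTERN and findLinkCandidates are names made up for this example. It assumes an engine with lookbehind support, such as a recent Node or browser. A plain decimal like 2.25 would still show up as a candidate here, which is why the full function filters with isNaN and the URL check afterwards.

```javascript
// The pattern explained above, unchanged.
const LINK_PATTERN = /(?<=\s|^)[a-zA-Z0-9-:/]+\.[a-zA-Z0-9-].+?(?=[.,;:?!-]?(?:\s|$))/g;

// Hypothetical helper: returns every substring the pattern treats as a possible link.
function findLinkCandidates(text) {
  return text.match(LINK_PATTERN) || [];
}

// Trailing punctuation is left out of the match.
console.log(findLinkCandidates("Visit apple.com, then http://google.com."));
// → [ 'apple.com', 'http://google.com' ]
```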
Here is a snippet if you’d like to try some different blocks of text to see how they get linkified:
const linkify = t => {
const isValidHttpUrl = s => {
let u
try {u = new URL(s)}
catch (_) {return false}
return u.protocol.startsWith("http")
}
const m = t.match(/(?<=\s|^)[a-zA-Z0-9-:/]+\.[a-zA-Z0-9-].+?(?=[.,;:?!-]?(?:\s|$))/g)
if (!m) return t
const a = []
m.forEach(x => {
const [t1, ...t2] = t.split(x)
a.push(t1)
t = t2.join(x)
const y = (!(x.match(/:\/\//)) ? 'https://' : '') + x
if (isNaN(x) && isValidHttpUrl(y))
a.push('<a href="' + y + '" target="_blank">' + y.split('/')[2] + '</a>')
else
a.push(x)
})
a.push(t)
return a.join('')
}
document.querySelectorAll('.linkify-this').forEach(o => {
o.innerHTML = linkify(o.innerHTML)
})
<p class="linkify-this">
Any links I put into this paragraph will be linkified, such as apple.com, http://google.com and www.facebook.com.
</p>
<p class="linkify-this">
https://microsoft.com will be matched even at the start of the text.
</p>
<p class="linkify-this">
If I refer to a domain name suffix only, such as .com or .co.uk, it won't be linkified, only complete domain names like https://www.gov.uk will be linkified.
</p>
<p class="linkify-this">
Some links contain numbers, like w3.org, but we don't want straight decimal numbers like 2.25 to be linkified. We also want to ignore non-http URLs like ftp://some.host.com and injection attempts like https://x.com"style="color:red".
</p>
Update
Following the comment from @newbie.user88 about the lack of injection protection, I thought it wise to add validation of each potential URL by attempting to construct a URL object with it. Credit to @pavlo for the logic.
I corrected errors in philipeachille's code because the youtubeID parameter was not correct. I also corrected direct YouTube links.
convertLinks = input => {
let text = input;
const aLink = [];
const linksFound = text.match(/(?:www|https?)[^\s]+/g);
if (linksFound != null) {
for (let i = 0; i < linksFound.length; i++) {
let replace = linksFound[i];
if (!(linksFound[i].match(/(http(s?)):\/\//))) {
replace = 'http://' + linksFound[i]
}
let linkText = replace.split('/')[2];
if (linkText.substring(0, 3) == 'www') {
linkText = linkText.replace('www.', '')
}
if (linkText.match(/youtu/)) {
const youtubeID = replace.split('/').slice(-1)[0].split('=')[1];
if (youtubeID === undefined || youtubeID === '') {
aLink.push('<a href="' + replace + '" target="_blank">' + linkText + '</a>');
} else {
aLink.push('<span class="video-wrapper"><iframe src="https://www.youtube.com/embed/' + youtubeID + '" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></span>');
}
} else {
aLink.push('<a href="' + replace + '" target="_blank">' + linkText + '</a>');
}
text = text.split(linksFound[i]).map(item => {
return aLink[i].includes('iframe') ? item.trim() : item
}).join(aLink[i]);
}
return text;
}
else {
return input;
}
};
Usage:
const text = 'Hello. This is a link https://www.google.com and this is youtube video https://www.youtube.com/watch?v=O-hnSlicxV4';
convertLinks(text);
If a string contains a URL anywhere, convert that URL into a link.
I tried the above code but it did not work properly for me. After adding some conditions it worked.
Thank you for helping me out, @user9590073.
function convertLink(inputText) {
var replacePattern1, replacePattern2, replacePattern3;
//URLs starting with http://, https://, or ftp://
replacePattern1 = /(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gim;
if (replacePattern1.test(inputText))
inputText = inputText.replace(replacePattern1, '<a href="$1" target="_blank">$1</a>');
//URLs starting with "www." (without // before it, or it'd re-link the ones done above).
replacePattern2 = /(^|[^\/])(www\.[\S]+(\b|$))/gim;
if (replacePattern2.test(inputText))
inputText = inputText.replace(replacePattern2, '$1<a href="http://$2" target="_blank">$2</a>');
//Change email addresses to mailto: links.
replacePattern3 = /(([a-zA-Z0-9\-\_\.])+@[a-zA-Z\_]+?(\.[a-zA-Z]{2,6})+)/gim;
if (replacePattern3.test(inputText))
inputText = inputText.replace(replacePattern3, '<a href="mailto:$1">$1</a>');
return inputText;
}
And then I pass my text into the convertLink function to open my modal with a clickable URL.
$modalBody.find('div.news-content').html('<p>' + convertLink(response.NewsContent) + '</p>');
Here is a version (just for http/s and ftp links) that doesn't replace the url with a link if it looks like it's already in a link (or rather that is preceded with a " or ')
function linkifyBareHttp(inputText){
//URLs starting with http://, https://, or ftp://
const replacePattern1 = /\b(?<!(\'|\"))(((https?|ftp):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*[-A-Z0-9+&##\/%=~_|]))/gim;
return inputText.replace(replacePattern1, '');
}
here is a little tester, showing what kind of things it handles:
function testLinkify(){
console.log('starting test');
test(`https://example.com`, ``);
test(`\nhttps://example.com`,`\n`);
test(``,``);
test(` https://example.com`,` `);
test(`https://example.com\nhttps://example.net BAZ`,`\n BAZ`);
}
function test(input,expect){
const testFunction = linkifyBareHttp;
const output = testFunction(input);
console.log (output === expect ? 'PASS':'FAIL');
console.log(` INPUT: ${input}`);
if(output !== expect) {
console.log(`EXPECT: ${expect}`);
console.log(`OUTPUT: ${output}`)
}
}
$(".hkt-chat-chatbot-paragraph-msg").map(function() {
$(this).html(linkify($(this).text().replace(/[\u00A0-\u9999<>\&]/g, function(i) { return '&#'+i.charCodeAt(0)+';';})))
});
function linkify(inputText) {
var replacedText, replacePattern1, replacePattern2, replacePattern3;
//URLs starting with http://, https://, or ftp://
replacePattern1 = /(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gim;
replacedText = inputText.replace(replacePattern1, '<a href="$1" target="_blank">$1</a>');
//URLs starting with "www." (without // before it, or it'd re-link the ones done above).
replacePattern2 = /(^|[^\/])(www\.[\S]+(\b|$))/gim;
replacedText = replacedText.replace(replacePattern2, '$1<a href="http://$2" target="_blank">$2</a>');
//Change email addresses to mailto: links.
replacePattern3 = /(([a-zA-Z0-9\-\_\.])+@[a-zA-Z\_]+?(\.[a-zA-Z]{2,6})+)/gim;
replacedText = replacedText.replace(replacePattern3, '<a href="mailto:$1">$1</a>');
return replacedText;
}
.hkt-chat-chatbot-paragraph-msg{
border:1px solid green;
padding:5px;
margin-bottom:5px;
}
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div class="hkt-chat-chatbot-paragraph-msg" >
www.google.com
https://bing.com
ftp://192.138.1.1:80/home
normal text
paragraph click on next link http://example.com
</div>
<div class="hkt-chat-chatbot-paragraph-msg" >
normal text
</div>
<div class="hkt-chat-chatbot-paragraph-msg" >
www.google.com
https://bing.com
ftp://192.138.1.1:80/home
normal text
paragraph click on next link http://example.com
</div>
This worked for me.
I had some HTML entities that needed to be preserved, so I first get the text(), convert those characters to their encoded form, then add the hyperlinks and replace the initial text with the new HTML.
I am using jQuery and Regex to search a text string for http or https and convert the string to a URL. I need the code to skip the string if it starts with a quote.
below is my code:
// Get the content
var str = jQuery(this).html();
// Set the regex string
var exp = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;
var replaced_text = str.replace(exp, function(url) {
clean_url = url.replace(/https?:\/\//gi,'');
return '' + clean_url + '';
})
jQuery(this).html(replaced_text);
Here is an example of my issue:
Text The School of Computer Science and Informatics. She blogs at http://www.wordpress.com and can be found on Twitter #Abcdef.
The current code successfully finds the text that starts with http or https and converts it to a URL but it also converts the twitter URL. I need to ignore the text if it starts with a quote or is within an a tag, etc...
Any help is much appreciated
What about adding [^"'] to the exp variable?
var exp = /(\b[^"'](https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;
Snippet:
// Get the content
var str = jQuery("#text2replace").html();
// Set the regex string
var exp = /(\b[^"'](https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;
var replaced_text = str.replace(exp, function(url) {
clean_url = url.replace(/https?:\/\//gi,'');
return '' + clean_url + '';
})
jQuery("#text2replace").html(replaced_text);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="text2replace">
The School of Computer Science and Informatics. She blogs at http://www.wordpress.com and can be found on Twitter #Abcdef.
</div>
If you really just want to ignore the quotation marks, this could help:
var replaced_text = $("#selector").html().replace(/([^"])(\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*[-A-Z0-9+&##\/%=~_|])/ig, '$1$2');
This works for me:
This will recognize urls and convert them to hyperlinks, but will ignore urls, wrapped in " (quotes).
See the code below or this jsfiddle for a working example.
Example HTML:
<ul class="js-replaceUrls">
<li>
www.link-only-www.com
</li>
<li>
http://link-starts-with-HTTP.com
</li>
<li>
https://www.link-starts-with-https-and-www.com
</li>
<a href="https://link-starts-with-https.com">
Link in anchor tag
</a>
</ul>
RegEX:
/(([a-z]+:\/\/)?(([a-z0-9\-]+\.)+([a-z]{2}|aero|arpa|biz|com|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel|local|internal))(:[0-9]{1,5})?(\/[a-z0-9_\-\.~]+)*(\/([a-z0-9_\-\.]*)(\?[a-z0-9+_\-\.%=&]*)?)?(#[a-zA-Z0-9!$&'()*+.=-_~:#/?]*)?)(\s+|$)/gmi
jQuery:
// RECOGNIZE URLS AND CONVERT THEM TO HYPERLINKS
// Ignore if hyperlink is found in HTML attr, like "href"
$('.js-replaceUrls').each(function(){
// GET THE CONTENT
var str = $(this).html();
// SET THE REGEX STRING
var regex = /(([a-z]+:\/\/)?(([a-z0-9\-]+\.)+([a-z]{2}|aero|arpa|biz|com|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel|local|internal))(:[0-9]{1,5})?(\/[a-z0-9_\-\.~]+)*(\/([a-z0-9_\-\.]*)(\?[a-z0-9+_\-\.%=&]*)?)?(#[a-zA-Z0-9!$&'()*+.=-_~:#/?]*)?)(\s+|$)/gmi;
// REPLACE PLAIN TEXT LINKS BY HYPERLINKS
var replaced_text = str.replace(regex, "<a href='$1' class='js-link'>$1</a>");
// ECHO LINK
$(this).html(replaced_text);
});
// DEFINE URLS WITHOUT "http" OR "https"
var linkHasNoHttp = $(".js-link:not([href*=http],[href*=https])");
// ADD "http://" TO "href"
$(linkHasNoHttp).each(function() {
var linkHref = $(this).attr("href");
$(this).attr("href" , "http://" + linkHref);
});
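Note that the [href*=http] selector matches "http" anywhere in the href, so a scheme-less link such as www.example.com/http-guide would be skipped. A small sketch of a stricter check, assuming you only want to prepend http:// when the href has no scheme at all (ensureScheme is a hypothetical name):

```javascript
// Hypothetical helper: prepend "http://" only when the href does not already
// start with a scheme such as "http://", "https://" or "ftp://".
function ensureScheme(href) {
  var hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(href);
  return hasScheme ? href : 'http://' + href;
}
```

It could be used inside the .each() callback above in place of the unconditional "http://" + linkHref.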
How could I go about replacing a string:
Hello my name is <a href='/max'>max</a>!
<script>alert("DANGEROUS SCRIPT INJECTION");</script>
with
Hello my name is <a href='/max'>max</a>!
&lt;script&gt;alert("DANGEROUS SCRIPT INJECTION");&lt;/script&gt;
I can easily have all the <,> replaced with &lt;,&gt; with:
string = string.replace(/</g, "&lt;").replace(/>/g, "&gt;");
but I still want to be able to have <a> links.
I have also looked into preventing script injection with:
var html = $(string.bold());
html.find('script').remove();
But I want to be able to still read the script tags rather than them being removed.
One approach to this problem is to use a regular expression with a strict negative lookahead that only lets through anchors that follow a certain format very closely.
Let's say you want to only allow links that exactly follow this example:
<a href="http://...">text</a>
and
<a href="https://...">text</a>
Build a regular expression that matches only "<" characters that are not followed by this valid pattern (negative lookahead):
<(?!a href="https?:\/\/\w[\w.-\/\?#]+">\w+<\/a>)
One problem with this regular expression is that if you match it against your entire string, the < of the closing anchor tag (</a>) will still match, so if you replace every match with &lt; you will break the anchor after all.
You can allow all closing </a> tags by adding an alternative to the negative lookahead:
<(?!a href="https?:\/\/\w[\w.-\/\?#]+">\w+<\/a>|\/a>)
Perhaps someone else has a better solution for that sub-problem.
Here is the final string.replace:
string = string.replace(/<(?!a href="https?:\/\/\w[\w.-\/\?#]+">\w+<\/a>|\/a>)/g, '&lt;');
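To see what this does in practice, here is a small sketch that wraps the expression above in a function (escapeExceptAnchors is a hypothetical name). It assumes the link uses double quotes and an absolute http(s) URL, as the pattern requires:

```javascript
// Hypothetical wrapper around the negative-lookahead expression above.
// Every "<" is escaped unless it opens an anchor of the exact allowed form
// or belongs to a closing </a> tag. ">" characters are left untouched.
function escapeExceptAnchors(input) {
  return input.replace(/<(?!a href="https?:\/\/\w[\w.-\/\?#]+">\w+<\/a>|\/a>)/g, '&lt;');
}
```

With this sketch, a link written as <a href="http://example.com/max">max</a> survives, while the single-quoted, relative link from the question (<a href='/max'>max</a>) has its opening < escaped, because it does not match the strict format.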
Note: All these input checks must always be done on the server side as well. A client-side check can simply be circumvented, and malicious data will then reach your server despite the check.
This code snippet should do the trick. You can add additional tag names you wish to let pass as HTML tags in the array allowedTagNames.
// input
var html = "Hello my name is <a href='/max'>max</a>! <script>alert('DANGEROUS SCRIPT INJECTION');</script>";
var allowedTagNames = ["a"];
// output
var processedHTML = "";
var processingStart = 0;
// this block finds the next tag and processes it
while (true) {
var tagStart = html.indexOf("<", processingStart);
if (tagStart === -1) { break; }
var tagEnd = html.indexOf(">", tagStart);
if (tagEnd === -1) { break; }
var tagNameStart = tagStart + 1;
if (html[tagNameStart] === "/") {
// for closing tags
++tagNameStart;
}
// we expect there to be either a whitespace or a > after the tagName
var tagNameEnd = html.indexOf(" ", tagNameStart);
if (tagNameEnd === -1 || tagNameEnd > tagEnd) {
tagNameEnd = tagEnd;
}
var tagName = html.slice(tagNameStart, tagNameEnd);
// copy in text which is between this tag and the end of last tag
processedHTML += html.slice(processingStart, tagStart);
if (allowedTagNames.indexOf(tagName) === -1) {
processedHTML += "&lt;" + html.slice(tagStart + 1, tagEnd) + "&gt;";
} else {
processedHTML += html.slice(tagStart, tagEnd + 1);
}
processingStart = tagEnd + 1;
}
// copy the rest of input which wasn't processed
processedHTML += html.slice(processingStart);
NOTE: it won't work if there's a < or > inside a property of a tag.
For example: <a href=">">
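The same allow-list idea can also be sketched with a single replace and a callback, which may be easier to reuse. This is only an illustration: escapeDisallowedTags is a hypothetical name, it shares the limitation noted above for < or > inside attribute values, and it does not inspect the attributes of allowed tags (an allowed <a> could still carry an onclick handler or a javascript: URL).

```javascript
// Hypothetical helper: escape every tag whose name is not in allowedTagNames.
// The tag name is everything after "<" or "</" up to whitespace, "/" or ">".
function escapeDisallowedTags(html, allowedTagNames) {
  return html.replace(/<\/?([^\s\/>]+)[^>]*>/g, function (tag, name) {
    if (allowedTagNames.indexOf(name.toLowerCase()) !== -1) {
      return tag; // allowed tag: keep it as real HTML
    }
    // anything else: show the tag as text
    return tag.replace(/</g, '&lt;').replace(/>/g, '&gt;');
  });
}
```

Calling escapeDisallowedTags(html, ["a"]) on the input string from the snippet above keeps the link and turns both script tags into visible text.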
You can use capture groups and lookarounds in Regex to achieve this
string = string.replace(/<((?!a )[^>]*)>/g, "&lt;$1&gt;").replace(/&lt;\/a&gt;/g, "</a>");
The first part converts every HTML tag except anchor start tags (<a ...>) from <tag> to &lt;tag&gt;. The second part converts the escaped anchor end tags from &lt;/a&gt; back to </a>.
If you want to replace only the <script... tags, the following code will do the trick ( you can run it in browser console ) and all other tags will not be changed. In my sample I added an extra line just to demonstrate how it works with multiple <script... tags inside.
let s = "Hello my name is <a href='/max'>max</a>!<script>alert(\"DANGEROUS SCRIPT INJECTION\");</script>";
s += "Hello my name is <a href='/bob'>bob</a>!<script>alert(\"DANGEROUS SCRIPT INJECTION\");</script>";
s.match(/<script.*?<\/script>/g).forEach(scr => s = s.replace(scr, scr.replace(/</g, "&lt;").replace(/>/g, "&gt;")));
console.log(s);
// OUTPUT: Hello my name is <a href='/max'>max</a>!&lt;script&gt;alert("DANGEROUS SCRIPT INJECTION");&lt;/script&gt;Hello my name is <a href='/bob'>bob</a>!&lt;script&gt;alert("DANGEROUS SCRIPT INJECTION");&lt;/script&gt;
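One caveat with the one-liner: s.match() returns null when the string contains no script tag, so the forEach call throws. A hedged variant using replace with a callback avoids that, and [\s\S] also lets a script body span several lines (escapeScriptBlocks is a hypothetical name):

```javascript
// Hypothetical helper: escape only <script>...</script> blocks, leaving all
// other markup untouched. Strings without any script tag pass through as-is.
function escapeScriptBlocks(s) {
  return s.replace(/<script[\s\S]*?<\/script>/gi, function (block) {
    return block.replace(/</g, '&lt;').replace(/>/g, '&gt;');
  });
}
```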
I'm attempting to duplicate the original img tag's functionality in a custom img tag that will be added to the pagedown converter.
e.g. I want to copy the original behavior:
![image_url][1] [1]: http://lolink.com gives <img src="http://lolink.com">
into a custom one:
?[image_url][1] [1]: http://lolink.com gives <img class="lol" src="http://lolink.com">
Looking at the docs, the only way to do this seems to be the preBlockGamut hook, adding another "block level structure". I attempted this and got an Uncaught Error: Recursive call to converter.makeHtml
here's the code of me messing around with it:
converter.hooks.chain("preBlockGamut", function (text, dosomething) {
return text.replace(/(\?\[(.*?)\][ ]?(?:\n[ ]*)?\[(.*?)\])()()()()/g, function (whole, inner) {
return "<img src=" + dosomething(inner) + ">";
});
});
I'm not very experienced with hooks, so what would I do to fix it? Thanks.
UPDATE: I found out that _DoImages runs after preSpanGamut, so I will use that instead of preBlockGamut.
Figured it out! The solution is very clunky and involves editing the source code, because I am very bad at regex and the _DoImages() function relies on a lot of internal functions that only exist in the source.
solution:
All edits are made to the Markdown.Converter.js file.
Search (Ctrl+F) for _DoImages. You will find it in two places: the call inside _RunSpanGamut and the function definition itself. The solution is simple: copy the _DoImages function and its related helper into new functions that mimic the originals, then edit them to taste.
Next to the _DoImages function, add:
function _DoPotatoImages(text) {
text = text.replace(/(\?\[(.*?)\][ ]?(?:\n[ ]*)?\[(.*?)\])()()()()/g, writePotatoImageTag);
text = text.replace(/(\?\[(.*?)\]\s?\([ \t]*()<?(\S+?)>?[ \t]*((['"])(.*?)\6[ \t]*)?\))/g, writePotatoImageTag);
return text;
}
function writePotatoImageTag(wholeMatch, m1, m2, m3, m4, m5, m6, m7) {
var whole_match = m1;
var alt_text = m2;
var link_id = m3.toLowerCase();
var url = m4;
var title = m7;
if (!title) title = "";
if (url == "") {
if (link_id == "") {
link_id = alt_text.toLowerCase().replace(/ ?\n/g, " ");
}
url = "#" + link_id;
if (g_urls.get(link_id) != undefined) {
url = g_urls.get(link_id);
if (g_titles.get(link_id) != undefined) {
title = g_titles.get(link_id);
}
}
else {
return whole_match;
}
}
alt_text = escapeCharacters(attributeEncode(alt_text), "*_[]()");
url = escapeCharacters(url, "*_");
var result = "<img src=\"" + url + "\" alt=\"" + alt_text + "\"";
title = attributeEncode(title);
title = escapeCharacters(title, "*_");
result += " title=\"" + title + "\"";
result += " class=\"p\" />";
return result;
}
If you compare the new _DoPotatoImages() function with the original _DoImages(), you will notice that I edited the regex to use an escaped question mark \? instead of the normal exclamation mark !
Also notice that writePotatoImageTag uses g_urls and g_titles, which are internal to the converter source; this is why the change has to be made inside the file.
After that, add text = _DoPotatoImages(text); to the _RunSpanGamut function (MAKE SURE YOU ADD IT BEFORE THE text = _DoAnchors(text); LINE BECAUSE THAT FUNCTION WILL OVERRIDE IMAGE TAGS) and now you should be able to write ?[image desc](url) along with ![image desc](url)
done.
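For experimenting with the ?[alt](url) syntax outside the converter, here is a standalone sketch. It is not part of pagedown and does not touch g_urls or g_titles, so it only handles the inline form, not reference-style links. The names renderClassedImages and escapeAttribute are hypothetical, and the regex is a simplified version of the inline one above with the empty capture groups removed:

```javascript
// Hypothetical attribute encoder for the sketch below.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical standalone version: turns ?[alt](url "title") into an img tag
// with the given class. Regular ![alt](url) images are left untouched.
function renderClassedImages(text, className) {
  var inline = /\?\[(.*?)\]\s?\([ \t]*<?(\S+?)>?[ \t]*(?:(['"])(.*?)\3[ \t]*)?\)/g;
  return text.replace(inline, function (whole, alt, url, quote, title) {
    return '<img src="' + escapeAttribute(url) + '"' +
      ' alt="' + escapeAttribute(alt) + '"' +
      ' title="' + escapeAttribute(title || '') + '"' +
      ' class="' + escapeAttribute(className) + '" />';
  });
}
```

This makes it easy to check the regex against sample input before copying the change into Markdown.Converter.js.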
The full line (not only the regex) in Markdown.Converter.js goes like this:
text = text.replace(/(!\[(.*?)\][ ]?(?:\n[ ]*)?\[(.*?)\])()()()()/g, writeImageTag);
so check the function writeImageTag. There you can see how the regex matching text is replaced with a full img tag.
You can change the almost-last line before its return from
result += " />";
to
result += ' class="lol" />';
Thanks for the edit to the main post.
I see what you mean now.
It is a bit weird how it uses empty capture groups to specify tags, but if it works, it works.
It looks like you would need to add on an extra () onto the regex string, then specify m8 as a new extra variable to be passed into the function, and then specify it as class = m8; like the other variables at the top of the function.
Then where it says var result =, instead of class =\"p\" you would just put class + title=\"" + .......