Convert HTML to plain text keeping links, bold and italic in JavaScript

I am configuring an API to send an email using the content of a publication as the body of the email. The text editor used for the publication saves the text as HTML, so I need to convert it into plain text. Other questions gave me a solution for that, but I would also like to keep the bold text, the italics and the links from the original. This is what I have:
Body of a test publication:
This is bold text.This is regular text.This is italic.This is a link.
Then in the script I have the following function:
function htmlToText(html) {
  // remove source line breaks and tabs
  html = html.replace(/\n/g, "");
  html = html.replace(/\t/g, "");
  // turn HTML breaks and table cells into line breaks and tabs
  html = html.replace(/<\/td>/g, "\t");
  html = html.replace(/<\/table>/g, "\n");
  html = html.replace(/<\/tr>/g, "\n");
  html = html.replace(/<\/p>/g, "\n");
  html = html.replace(/<\/div>/g, "\n");
  // closing heading tags are </h1> to </h6>, never </h>
  html = html.replace(/<\/h[1-6]>/g, "\n");
  html = html.replace(/<br>/g, "\n");
  html = html.replace(/<br( )*\/>/g, "\n");
  html = html.replace(/<a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
  // parse html into text
  var dom = (new DOMParser()).parseFromString('<!doctype html><body>' + html, 'text/html');
  return dom.body.textContent;
}
That gives me some plain text with nice line breaks, but I was wondering if I could get the bold, italic and links.
Thanks.

I had some time on my hands and played around. This is what I came up with:
const container = document.getElementById("container");
const copy = document.createElement("div");
// collapse source line breaks to spaces and drop tabs
copy.innerHTML = container.innerHTML.replace(/\n/g, " ").replace(/\t+/g, "");
const tags = { // tagName: [<prefix>, <postfix>, <sequence-number>]
  B: ["**", "**", 1],
  I: ["*", "*", 2],
  H2: ["##", "\n", 3],
  P: ["\n", "\n", 4],
  DIV: ["", "\n", 5],
  TD: ["", "\t", 6]
};
[...copy.querySelectorAll(Object.keys(tags).join(","))]
  .sort((a, b) => tags[a.tagName][2] - tags[b.tagName][2])
  .forEach(e => {
    const [a, b] = tags[e.tagName];
    // the first cell of each table row starts a new line
    e.innerHTML = (e.matches("TD:first-child") ? "\n" : a) + e.innerHTML + b;
  });
console.log(copy.textContent.replace(/^ */mg, ""));
<div id="container">
<H2>Second level heading</H2>
<div><div>
A <b>first div</b> with a
link (abc) and a
<p>paragraph having itself another link (def) in it.</p>
</div>
</div>
And here is some more <i>"lost" text</i> ...
<table>
<tr><td>one</td><td><b>two</b></td><td>three</td></tr>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>d</td><td>e</td><td>f</td></tr>
</table>
</div>
Instead of using regexp to "parse" the html I chose to actually treat it in a DOM way: I create a new div element (copy) into which I insert the original .innerHTML. For particular element types I then define some pre- and postfixes that should surround the original .innerHTML. These are stored in tags and applied on the freshly created div element.
This is done by selecting all of the "special" elements (as specified by the tags-keys) and processing them in a given sequential order. Afterwards I simply return the .textContent of the modified copy element.
Plain text cannot render bold or italic decoration, so I used Markdown-style markers instead (* for italics, ** for bold).
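The tags table above only adds a fixed prefix and postfix, so it cannot carry a link's URL over into the text. One possible way to keep links is a separate pass before the text is extracted. The sketch below is a minimal string-based version: `linksToText` is a hypothetical helper name, and it assumes simple anchors with a double-quoted href. Unlike the `<a.*href=...` pattern in the question, it uses `[^>]*` and lazy quantifiers, so two links on one line are not merged into a single match.

```javascript
// Hypothetical helper: rewrite <a href="URL">label</a> as "label (URL)".
// Assumes double-quoted href values and no nested anchors.
function linksToText(html) {
  return html.replace(
    /<a\b[^>]*?\bhref="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi,
    "$2 ($1)"
  );
}

const sample =
  'See <a href="https://a.example">one</a> and ' +
  '<a class="x" href="https://b.example">two</a>.';
console.log(linksToText(sample));
// → See one (https://a.example) and two (https://b.example).
```

Running this before the DOM step leaves the URL as ordinary text, so `textContent` keeps it. For markup that is not under your control, walking `copy.querySelectorAll("a")` would likely be more robust than a regular expression.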

Related

How to remove content within &lt; and &gt; in JavaScript

I have content that contains a string of elements along with images, e.g.:
var str = "<p>&lt;img src=\"v\"&gt;fwefwefw&lt;/img&gt;</p><p><br></p><p><br></p>";
The text between &lt; and &gt; is a dirty tag, and I would like to remove it along with the content inside it. The tag is generated dynamically, so it could be any tag, e.g. <div>, <a>, <h1>, etc.
The expected output: <p></p><p><br></p><p><br></p>
However, with this code I am only able to remove the tags, not the content inside them:
str.replaceAll(/&lt;.*?&gt;/g, "");
It renders like this, which is not what I am looking for:
<p>fwefwefw</p><p><br></p><p><br></p><p><br></p>
How can I remove the &lt; &gt; tags along with their content, so that I get rid of the dirty tags and the text inside them?
fiddle: https://jsfiddle.net/3rozjn8m/
thanks
A safe way is to use a DOM parser and visit each text node, so that each text can be cleaned separately. This way you can be certain the DOM structure is not altered, only the texts:
let str = "<p>&lt;img src=\"v\"&gt;fwefwefw&lt;/img&gt;</p><p><br></p><p><br></p>";
let doc = new DOMParser().parseFromString(str, "text/html");
// visit text nodes only; the parser has already decoded &lt; and &gt; inside them
let walk = doc.createTreeWalker(doc.body, NodeFilter.SHOW_TEXT);
let node = walk.nextNode();
while (node) {
  node.nodeValue = node.nodeValue.replace(/<.*>/gs, "");
  node = walk.nextNode();
}
let clean = doc.body.innerHTML;
console.log(clean);
This will also work when you have more than one <p> element that has such content.
Remove the question mark, so that the match is greedy and runs from the first &lt; to the last &gt;:
var str = "<p>&lt;img src=\"v\"&gt;fwefwefw&lt;/img&gt;</p><p><br></p><p><br></p>";
console.log(str.replaceAll(/&lt;.*&gt;/g, ""));
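The difference between the lazy and the greedy pattern can be checked on the string alone, without a DOM. The sketch below assumes the dirty markup reaches the script entity-encoded (as `&lt;...&gt;`), which is what the question's title suggests; `removePaired` is a hypothetical helper added for illustration. It also shows a limit of the greedy pattern: with two dirty tags in one string it removes the clean text between them as well, which a pattern that pairs each opening tag with its own closing tag avoids.

```javascript
const str =
  '<p>&lt;img src="v"&gt;fwefwefw&lt;/img&gt;</p><p><br></p><p><br></p>';

// Lazy: each &lt;...&gt; is matched on its own, so the inner text survives.
const lazy = str.replace(/&lt;.*?&gt;/g, "");
// Greedy: one match from the first &lt; to the last &gt;.
const greedy = str.replace(/&lt;.*&gt;/g, "");

console.log(lazy);   // → <p>fwefwefw</p><p><br></p><p><br></p>
console.log(greedy); // → <p></p><p><br></p><p><br></p>

// Hypothetical helper: pair &lt;name ...&gt; with its own &lt;/name&gt;.
function removePaired(s) {
  return s.replace(/&lt;(\w+)\b.*?&gt;.*?&lt;\/\1&gt;/gs, "");
}

const two = "<p>&lt;b&gt;x&lt;/b&gt; keep &lt;i&gt;y&lt;/i&gt;</p>";
console.log(two.replace(/&lt;.*&gt;/g, "")); // → <p></p>
console.log(removePaired(two));              // → <p> keep </p>
```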

replacing text from a paste when looping over html elements

I am trying to replace html links (and eventually other elements) with bbcode when a user does a paste from a document (like gdocs or libre office). So we are dealing with rich html already formatted (which is why it needs to copy HTML and not text).
Essentially, I want to be able to copy stuff pre-written from a document into a textarea on my website without having to manually write BBCode tags in the original document (as it's messy for proof-reading).
Thanks to the help in "Adjust regex to ignore anything else inside link HTML tags" I have gotten most of the way there, but I am stuck on replacing the found tags within the original text.
Here's what I have:
function fragmentFromString(strHTML) {
  return document.createRange().createContextualFragment(strHTML);
}
$('textarea').on('paste', function(e) {
  e.preventDefault();
  var text = (e.originalEvent || e).clipboardData.getData('text/html') || prompt('Paste something..');
  var fragment = fragmentFromString(text);
  var aTags = Array.from(fragment.querySelectorAll('a'));
  aTags.forEach(a => {
    text = text.replace(a, "[url="+a.href+"]"+a.textContent+"[/url]");
  });
  window.document.execCommand('insertText', false, text);
});
You can see it loops over the found a tags and I am essentially trying to replace them from the original text with the new stuff.
Here's an example of the type of content that could be pasted (this is a single link from google docs):
<span style="font-size:14.666666666666666px;font-family:Arial;color:#1155cc;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;">Link test</span>
Expected to be replaced with:
[url=https://www.test.com]Link test[/url]
So I want that HTML replaced, with the BBCode within the original text that's then sent to the textarea from the paste.
The aTags forEach does not do what you want: replace() converts the element it is given to a string instead of matching the element's markup in the text. You need to create a new text node and replace the existing anchor tag with it, then insert the text of the modified fragment rather than the original string:
aTags.forEach(a => {
  var new_text = document.createTextNode("[url=" + a.href + "]" + a.textContent + "[/url]");
  a.parentNode.insertBefore(new_text, a);
  a.parentNode.removeChild(a);
});
window.document.execCommand('insertText', false, fragment.textContent);
This replaces every a tag with its BBCode text.
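Put together with the paste handler from the question, the flow might look like the sketch below. It is only an outline under assumptions: `anchorToBBCode` and `fragmentToBBCode` are hypothetical helper names, plain DOM calls stand in for the jQuery binding, and `replaceWith` is used as a shorter equivalent of the `insertBefore`/`removeChild` pair.

```javascript
// Hypothetical helper: build the BBCode for one anchor-like object.
function anchorToBBCode(a) {
  return "[url=" + a.href + "]" + a.textContent + "[/url]";
}

// Hypothetical helper: swap every <a> in a fragment for BBCode text and
// return the resulting plain text. Needs a browser DOM.
function fragmentToBBCode(fragment) {
  fragment.querySelectorAll("a").forEach(a => {
    a.replaceWith(anchorToBBCode(a)); // a string argument becomes a text node
  });
  return fragment.textContent;
}

// Browser-only wiring, skipped where no DOM is available.
if (typeof document !== "undefined") {
  document.querySelectorAll("textarea").forEach(area => {
    area.addEventListener("paste", e => {
      const html = e.clipboardData.getData("text/html");
      if (!html) return; // keep the default plain-text paste
      e.preventDefault();
      const fragment = document.createRange().createContextualFragment(html);
      document.execCommand("insertText", false, fragmentToBBCode(fragment));
    });
  });
}
```

Note that `a.href` returns the resolved URL, which may differ slightly from the attribute as written (for example by a trailing slash); `a.getAttribute("href")` returns the original text.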

How can i get the string including HTML element after highlighting the text in HTML using Javascript?

Say, I highlighted this text The Title is Superman and Batman in my page.
How can I get the text including its HTML elements?
Based on my example, I should get this:
The Title is <i>Superman</i> and <i>Batman</i>
Use jQuery's .html() method to get an element's content including its markup.
HTML:
<div id="test">this is my <i>test</i></div>
JS:
$('#test').html()
Give the elements a class or an id and select the values through it.
HTML:
<div class="test"><i>Superman</i></div>
<div class="test"><i>Batman</i></div>
JS:
$('.test').html()
Since everyone is requiring OP to use jQuery, here's the native JS equivalent. You can select the html content of an element like so :
var html = document.getElementById('text-container').innerHTML;
You might want to redisplay the HTML from the container in different forms, e.g. as rendered HTML markup, as text, or as HTML-encoded text. By that I mean text with HTML entities (e.g. &gt; for >, the greater-than sign). Here are the methods for producing each type of output:
Here's a variable for the subsequent code:
var target = document.getElementById('text-output'); // for later
1. HTML in a container element
Output: Rendered HTML
Javascript:
target.innerHTML = html;
2. Text in a container element
Output: Text, HTML entities encoded
Javascript:
// a text node is never parsed as markup, so the tags are displayed literally
var text = document.createTextNode(html);
target.appendChild(text);
3. HTML in a textarea element
Output: Text, HTML entities non-encoded
Javascript:
yourTextArea.value = html;
4. Text in a textarea element
Output: Text, HTML entities encoded
Javascript:
// The virtual container automatically encodes entities when its .innerHTML
// method is called after appending a textnode.
var virtualContainer = document.createElement('div');
var text = document.createTextNode(html);
virtualContainer.appendChild(text);
yourTextArea.value = virtualContainer.innerHTML;
Demo: http://jsbin.com/mozibezi/1/edit
PS: It is impossible to display the output from #4 in a non-form input.
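The entity encoding in methods 2 and 4 can also be written as a plain string function, which is handy where no DOM is available. This is a minimal sketch: `encodeEntities` is a hypothetical name, and it only covers `&`, `<` and `>`, which approximates what the browser does when it serializes a text node.

```javascript
// Hypothetical helper: encode the characters that would otherwise be
// read as markup.
function encodeEntities(text) {
  return text
    .replace(/&/g, "&amp;") // must run first so later entities are not re-encoded
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

const sample = "The Title is <i>Superman</i> & <i>Batman</i>";
console.log(encodeEntities(sample));
// → The Title is &lt;i&gt;Superman&lt;/i&gt; &amp; &lt;i&gt;Batman&lt;/i&gt;
```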

Getting non-html text from CKeditor

In the "insert news" section of my application, I use a substring of the news content as the news summary. To get the news content from users I use CKEditor, and for the summary I use the substring method to take a certain length of that content. But CKEditor gives me text with HTML tags rather than plain text, so when I apply substring the summary gets messed up. How do I get the raw text from this control?
I read this, but I can't use the getText() method.
Try code like this:
CKEDITOR.instances.editor1.document.getBody().getText();
It works fine for me. You can test it on http://ckeditor.com/demo. It's not ideal (text in table cells is joined together without spaces), but may be enough for your needs.
EDIT (20 Dec 2017): The CKEditor 4 demo was moved to https://ckeditor.com/ckeditor-4/ and uses different editor names, so the new code to execute is:
CKEDITOR.instances.ckdemo.document.getBody().getText();
Note also that this works in the "Article editor"; in the "Inline editor" you need to get the text of a different element:
CKEDITOR.instances.editor1.editable().getText();
do it like this
//getSnapshot() retrieves the "raw" HTML, without tabs, linebreaks etc
var html=CKEDITOR.instances.YOUR_TEXTAREA_ID.getSnapshot();
var dom=document.createElement("DIV");
dom.innerHTML=html;
var plain_text=(dom.textContent || dom.innerText);
alert(plain_text);
Voilà, grab the portion of plain_text you want.
UPDATE / EXAMPLE
Add this JavaScript:
<script type="text/javascript">
function createTextSnippet() {
//example as before, replace YOUR_TEXTAREA_ID
var html=CKEDITOR.instances.YOUR_TEXTAREA_ID.getSnapshot();
var dom=document.createElement("DIV");
dom.innerHTML=html;
var plain_text=(dom.textContent || dom.innerText);
//create and set a 128 char snippet to the hidden form field
var snippet=plain_text.substring(0,128);
document.getElementById("hidden_snippet").value=snippet;
//return true, ok to submit the form
return true;
}
</script>
In your HTML, add createTextSnippet as the onsubmit handler of the form, e.g.
<form action="xxx" method="xxx" onsubmit="return createTextSnippet();">
Inside the form, between <form> and </form>, insert
<input type="hidden" name="hidden_snippet" id="hidden_snippet" value="" />
When the form is submitted, you can serverside access hidden_snippet along with the rest of the fields in the form.
I personally use this method, which keeps the code compact and also removes double spaces and line feeds:
var TextGrab = CKEDITOR.instances['editor1'].getData();
TextGrab = $(TextGrab).text(); // html to text
TextGrab = TextGrab.replace(/\r?\n|\r/gm," "); // remove line breaks
TextGrab = TextGrab.replace(/\s\s+/g, " ").trim(); // remove double spaces
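Once the plain text is available, the summary itself can be cut a little more carefully than with a bare substring. The following is a sketch and not part of CKEditor: `makeSummary` and its parameters are hypothetical. It collapses whitespace as in the snippet above and then cuts at the last space before the limit, so that words are not split.

```javascript
// Hypothetical helper: normalize whitespace and cut at a word boundary.
function makeSummary(plainText, maxLength) {
  const text = plainText.replace(/\s+/g, " ").trim();
  if (text.length <= maxLength) return text;
  const cut = text.slice(0, maxLength);
  const lastSpace = cut.lastIndexOf(" ");
  // fall back to a hard cut when the first word alone exceeds the limit
  return (lastSpace > 0 ? cut.slice(0, lastSpace) : cut) + "...";
}

console.log(makeSummary("Breaking  news:\n the quick brown fox jumps", 20));
// → Breaking news: the...
```

The result of any of the plain-text approaches in this thread can be passed in as `plainText`.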
I used this function:
function getPlainText( strSrc ) {
var resultStr = "";
// Ignore the <p> tag if it is in very start of the text
if(strSrc.indexOf('<p>') == 0)
resultStr = strSrc.substring(3);
else
resultStr = strSrc;
// Replace <p> with two newlines
resultStr = resultStr.replace(/<p>/gi, "\r\n\r\n");
// Replace <br /> with one newline
resultStr = resultStr.replace(/<br \/>/gi, "\r\n");
resultStr = resultStr.replace(/<br>/gi, "\r\n");
//-+-+-+-+-+-+-+-+-+-+-+
// Strip off other HTML tags.
//-+-+-+-+-+-+-+-+-+-+-+
return resultStr.replace( /<[^<|>]+?>/gi,'' );
}
Function call:
var plain_text = getPlainText(FCKeditorAPI.GetInstance("FCKeditor1").GetXHTML());
I created this fiddle for testing: http://jsfiddle.net/4etVv/3/
I use this method (need jQuery):
var objEditor =CKEDITOR.instances["textarea_id"];
var msg = objEditor.getData();
var txt = jQuery(msg).text().replaceAll("\n\n","\n");
hope it helps!
Assuming that editor is your CKEditor instance (CKEDITOR.instances.editor1 from the example above, or event.editor if you are using events), you can use the following code to get the plain text content:
editor.ui.contentsElement.getChild(0).getText()
Apparently CKEditor adds a "voice label" element to the actual editable content. Hence getChild(0).

string search in body.html() not working

Hi, here is my work so far to search for a string in HTML and highlight it if it is found in the document:
The problem is here
var SearchItems = text.split(/\r\n|\r|\n/);
var replaced = body.html();
for(var i=0;i<SearchItems.length;i++)
{
var tempRep= '<span class="highlight" style="background-color: yellow">';
tempRep = tempRep + SearchItems[i];
tempRep = tempRep + '</span>';
replaced = replaced.replace(SearchItems[i],tempRep); // It is trying to match along with html tags...
// As the <b> tags will not be there in search text, it is not matching...
}
$("body").html(replaced);
The HTML I'm using is as follows;
<div>
The clipboardData object is reserved for editing actions performed through the Edit menu, shortcut menus, and shortcut keys. It transfers information using the system clipboard, and retains it until data from the next editing operation replace s it. This form of data transfer is particularly suited to multiple pastes of the same data.
<br><br>
This object is available in script as of <b>Microsoft Internet Explorer 5.</b>
</div>
<div class='b'></div>
If I search a page that is plain text, without any HTML tags, the search matches. However, if the HTML contains any tags, it does not work, because I am taking the body's html() as the target text and the search then tries to match against the HTML tags as well.
In the fiddle, the second paragraph will not match.
First of all, to ignore the HTML tags of the element to look within, use the .text() method.
Secondly, in your fiddle, it wasn't working because you weren't calling the SearchQueue function on load.
Try this amended fiddle
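Switching to .text() makes the match succeed, but writing that text back with .html() would drop the page's existing markup. One possible alternative, sketched below under assumptions, is to leave the tags in place and search only in the text between them. `highlightOutsideTags` and `escapeRegExp` are hypothetical helper names, and a search term that spans a tag boundary is still not matched.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wrap every occurrence of the terms in a highlight
// span, touching only the text between tags.
function highlightOutsideTags(html, terms) {
  const wanted = terms.filter(Boolean).map(escapeRegExp);
  if (wanted.length === 0) return html;
  // One combined pattern, so inserted markup is never searched again.
  const pattern = new RegExp(wanted.join("|"), "g");
  // Splitting on a capturing group keeps the tags at the odd indexes.
  return html
    .split(/(<[^>]*>)/)
    .map((part, i) =>
      i % 2 === 1
        ? part
        : part.replace(pattern, m => '<span class="highlight">' + m + "</span>")
    )
    .join("");
}

console.log(
  highlightOutsideTags("as of <b>Microsoft Internet Explorer 5.</b>", [
    "Microsoft Internet Explorer 5.",
  ])
);
// → as of <b><span class="highlight">Microsoft Internet Explorer 5.</span></b>
```

With jQuery, the result could be written back with `$("body").html(...)` in the same way as in the question.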
