Clean Microsoft Word Pasted Text using JavaScript

I am using a 'contenteditable' <div/> and enabling PASTE.
It is amazing how much markup gets pasted in from a clipboard copy from Microsoft Word. I am battling this, and have gotten about halfway there using Prototype's stripTags() function (which unfortunately does not seem to let me keep some tags).
However, even after that, I wind up with a mind-blowing amount of unneeded markup code.
So my question is: is there some function (using JavaScript) or approach I can use that will clean up the majority of this unneeded markup?

Here is the function I wound up writing that does the job fairly well (as far as I can tell anyway).
I am certainly open to suggestions for improvement if anyone has any. Thanks.
function cleanWordPaste( in_word_text ) {
var tmp = document.createElement("DIV");
tmp.innerHTML = in_word_text;
var newString = tmp.textContent||tmp.innerText;
// this next piece converts line breaks into break tags
// and removes the seemingly endless crap code
newString = newString.replace(/\n\n/g, "<br />").replace(/.*<!--.*-->/g,"");
// this next piece removes any break tags (up to 10) at beginning
for ( var i=0; i<10; i++ ) {
if ( newString.substr(0,6)=="<br />" ) {
newString = newString.replace("<br />", "");
}
}
return newString;
}
Hope this is helpful to some of you.

You can either use the full CKEditor which cleans on paste, or look at the source.

I am using this:
$(body_doc).find('body').bind('paste',function(e){
var rte = $(this);
_activeRTEData = $(rte).html();
beginLen = $.trim($(rte).html()).length;
setTimeout(function(){
var text = $(rte).html();
var newLen = $.trim(text).length;
//identify the first char that changed to determine caret location
caret = 0;
for(i=0;i < newLen; i++){
if(_activeRTEData[i] != text[i]){
caret = i-1;
break;
}
}
var origText = text.slice(0,caret);
var newText = text.slice(caret, newLen - beginLen + caret + 4);
var tailText = text.slice(newLen - beginLen + caret + 4, newLen);
var newText = newText.replace(/(.*(?:endif-->))|([ ]?<[^>]*>[ ]?)|( )|([^}]*})/g,'');
newText = newText.replace(/[·]/g,'');
$(rte).html(origText + newText + tailText);
$(rte).contents().last().focus();
},100);
});
body_doc is the editable iframe; if you are using an editable div, you can drop the .find('body') part. Basically, it detects a paste event, works out where the paste landed, cleans the new text, and then places the cleaned text back where it was pasted. (It sounds confusing, but it's not really as bad as it sounds.)
The setTimeout is needed because you can't grab the text until it has actually been pasted into the element; paste events fire as soon as the paste begins.
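A possible alternative to waiting with setTimeout, sketched here under the assumption that the browser exposes the pasted data on the event itself (event.clipboardData): read the clipboard content directly, cancel the default paste, and insert a cleaned version. The names handlePaste, cleanFn, insertFn and stripAllTags are made up for this illustration and are not part of the answer above.

```javascript
// Hypothetical helper: reads the pasted content from the event itself,
// so there is no need to wait for the DOM to change.
function handlePaste(event, cleanFn, insertFn) {
  var clipboard = event.clipboardData;
  if (!clipboard) {
    return false; // no clipboard access: let the browser paste normally
  }
  // Prefer the HTML flavor of the clipboard; fall back to plain text.
  var raw = clipboard.getData('text/html') || clipboard.getData('text/plain');
  event.preventDefault(); // stop the browser from inserting the raw markup
  insertFn(cleanFn(raw)); // hand the cleaned content to the caller
  return true;
}

// Example cleaner (an assumption, not the regex used above): drop every tag.
function stripAllTags(html) {
  return html.replace(/<[^>]*>/g, '');
}
```

With jQuery the native event is available as e.originalEvent, so the handler would be called with that. How the cleaned text gets inserted at the caret is left to insertFn, since that part depends on the editor.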

How about having a "paste as plain text" button which displays a <textarea>, allowing the user to paste the text in there? That way, all tags will be stripped for you. That's what I do with my CMS; I gave up trying to clean up Word's mess.

You can do it with regular expressions, removing the head, script, and style elements:
let clipboardData = event.clipboardData || window.clipboardData;
let pastedText = clipboardData.getData('text/html');
// [\s\S]*? matches lazily across lines, so each replace stops at the first closing tag
pastedText = pastedText.replace(/<head[^>]*>[\s\S]*?<\/head>/gi, '');
pastedText = pastedText.replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '');
pastedText = pastedText.replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '');
// Optionally strip every tag except b, i and u:
// pastedText = pastedText.replace(/<(?!(\/\s*)?(b|i|u)[>,\s])([^>])*>/g, '');
Here is a sample: https://stackblitz.com/edit/angular-u9vprc
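Since the question asks about keeping some tags, here is a rough sketch of how the regex approach could be extended into a small whitelist cleaner. The function name cleanWordHtml and the list of allowed tags are assumptions for illustration. It is regex-based, so it can be fooled by unusual markup (for example a ">" inside an attribute value).

```javascript
// Tags to keep; everything else is removed (this list is an assumption).
var ALLOWED_TAGS = ['b', 'i', 'u', 'p', 'br'];

function cleanWordHtml(html) {
  return html
    // 1. Comments (including Word's <!--[if ...]> blocks) and bare
    //    conditional markers such as <![endif]>.
    .replace(/<!--[\s\S]*?-->|<!\[[^\]]*\]>/g, '')
    // 2. Elements whose content has to go as well.
    .replace(/<(head|script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, '')
    // 3. Remaining tags: keep allowed ones without attributes, drop the rest
    //    (namespaced tags such as <o:p> are dropped too).
    .replace(/<(\/?)([a-z][a-z0-9:]*)\b[^>]*>/gi, function (match, slash, name) {
      name = name.toLowerCase();
      return ALLOWED_TAGS.indexOf(name) !== -1 ? '<' + slash + name + '>' : '';
    });
}
```

Stripping the attributes in step 3 is what gets rid of the class="Mso..." and style="..." noise while the kept tags survive.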

I did something like that long ago, where I totally cleaned up the content of a rich text editor and converted font tags to styles, brs to p's, etc., to keep it consistent between browsers and prevent certain ugly things from getting in via paste. I took my recursive function and ripped out most of it except for the core logic. This might be a good starting point, if that is what you need ("result" is an object that accumulates the result, which probably takes a second pass to convert to a string):
var cleanDom = function(result, n) {
var nn = n.nodeName;
if(nn=="#text") {
var text = n.nodeValue;
}
else {
if(nn=="A" && n.href)
...;
else if(nn=="IMG" && n.src) {
....
}
else if(nn=="DIV") {
if(n.className=="indent")
...
}
else if(nn=="FONT") {
}
else if(nn=="BR") {
}
if(!UNSUPPORTED_ELEMENTS[nn]) {
if(n.childNodes.length > 0)
for(var i=0; i<n.childNodes.length; i++)
cleanDom(result, n.childNodes[i]);
}
}
}

This works great to remove any comments from HTML text, including those from Word:
function CleanWordPastedHTML(sTextHTML) {
var sStartComment = "<!--", sEndComment = "-->";
while (true) {
var iStart = sTextHTML.indexOf(sStartComment);
if (iStart == -1) break;
var iEnd = sTextHTML.indexOf(sEndComment, iStart);
if (iEnd == -1) break;
sTextHTML = sTextHTML.substring(0, iStart) + sTextHTML.substring(iEnd + sEndComment.length);
}
return sTextHTML;
}

Had a similar issue with line-breaks being counted as characters and I had to remove them.
$(document).ready(function(){
$(".section-overview textarea").bind({
paste : function(){
setTimeout(function(){
//textarea
var text = $(".section-overview textarea").val();
// look for any "\n" occurrences and remove them
var newString = text.replace(/\n/g, '');
// print new string
$(".section-overview textarea").val(newString);
},100);
}
});
});

Could you paste to a hidden textarea, copy from same textarea, and paste to your target?

Hate to say it, but I eventually gave up making TinyMCE handle Word crap the way I want. Now I just have an email sent to me every time a user's input contains certain HTML (look for <span lang="en-US"> for example) and I correct it manually.

Related

How to wrap word into span on user click in javascript

I have: Simple block of html text:
<p>
The future of manned space exploration and development of space depends critically on the
creation of a dramatically more proficient propulsion architecture for in-space transportation.
A very persuasive reason for investigating the applicability of nuclear power in rockets is the
vast energy density gain of nuclear fuel when compared to chemical combustion energy...
</p>
I want: to wrap a word in a span when the user clicks on it.
I.e., if the user clicks on the word manned, then I should get
<p>
The future of <span class="touched">manned</span> space exploration and development of space depends critically on the
creation of a ....
Question: How can I do that? Is there a way more efficient than just wrapping all words in spans at the loading stage?
P.S. I'm not interested in window.getSelection() because I want to apply some specific styling to touched words and also keep a collection of touched words.
Especially for @DavidThomas: an example where I get the selected text, but do not know how to wrap it in a span.

If I were you, I'd wrap all words with <span> tags beforehand and just change the class on click. This might look like
$( 'p' ).html(function( _, html ) {
// map + join wraps every word; reduce without an initial value skips the first one
return html.trim().split( /\s+/ ).map(function( word ) {
return '<span>' + word + '</span>';
}).join( ' ' );
});
and then we could have a global handler, which listens for click events on <span> nodes
$( document.body ).on('click', 'span', function( event ) {
$( event.target ).addClass( 'touched' );
});
Example: http://jsfiddle.net/z54kehzp/
I modified @Jonast92's solution slightly; I like his approach as well. It might even be better for huge amounts of data. The only caveat is that you have to live with a double-click to select a word.
Example: http://jsfiddle.net/5D4d3/106/

I modified a previous answer to almost get what you're looking for, as demonstrated in this demo.
It finds the currently clicked word, wraps it in a span with that specific class, and replaces the content of the paragraph with new content in which the clicked word has been replaced by the wrapped string.
It's a bit limited, though: if you click on a word that also occurs as a substring of an earlier word, let's say 'is', it will replace the first instance of that string within the paragraph.
You can probably play around with it to achieve what you're looking for, but the main thing is to look around.
The modified code:
$(document).ready(function()
{
var p = $('p');
p.css({ cursor: 'pointer' });
p.dblclick(function(e) {
var org = p.html();
var range = window.getSelection() || document.getSelection() || document.selection.createRange();
var word = $.trim(range.toString());
if(word != '')
{
var newWord = "<span class='touched'>"+word+"</span>";
var replaced = org.replace(word, newWord);
$('p').html(replaced);
}
range.collapse();
e.stopPropagation();
});
});
Then again, @jAndy's answer looks very promising.
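One way around the "first instance" limitation might be to work from a character position instead of the word's text. The sketch below assumes you have the plain text of the paragraph and an offset into it (for instance taken from the selection inside a single text node); wrapWordAt is a hypothetical helper, not something from the answers here.

```javascript
// Hypothetical helper: wraps the word that contains character index
// `offset` of a plain-text string, leaving other occurrences alone.
function wrapWordAt(text, offset, className) {
  var wordPattern = /\S+/g; // a "word" is any run of non-whitespace
  var match;
  while ((match = wordPattern.exec(text)) !== null) {
    var start = match.index;
    var end = start + match[0].length;
    if (offset >= start && offset < end) {
      return text.slice(0, start) +
        '<span class="' + className + '">' + match[0] + '</span>' +
        text.slice(end);
    }
  }
  return text; // offset was on whitespace or out of range
}
```

For example, wrapWordAt('this is it', 5, 'touched') wraps the standalone "is" and leaves the "is" inside "this" untouched. Note that it works on text, not HTML, so it does not account for tags already present in the paragraph.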

Your answers inspired the following solution:
$(document).ready(function()
{
var p = $('p');
p.css({ cursor: 'pointer' });
p.dblclick(function(e) {
var html = p.html();
var range = window.getSelection() || document.getSelection() || document.selection.createRange();
var startPos = range.focusOffset; //Prob: isn't precise +- few symbols
var selectedWord = $.trim(range.toString());
var newHtml = html.substring(0, startPos) + '<span class=\"touched\">' + selectedWord + '</span>' + html.substring(startPos + selectedWord.length);
p.html(newHtml);
range.collapse(p);
e.stopPropagation();
});
});
Here we don't wrap every word in a span up front. Instead, we wrap a word only when it is clicked.

Use range.surroundContents(node):
$('.your-div').unbind("dblclick").dblclick(function(e) {
e.preventDefault();
// unwrap .touched spans for each dblclick.
$(this).find('.touched').contents().unwrap();
var t = getWord();
if (t.startContainer.nodeName == '#text' && t.endContainer.nodeName == '#text') {
var newNode = document.createElement("span");
newNode.setAttribute('class', 'touched');
t.surroundContents(newNode);
}
e.stopPropagation();
});
function getWord() {
var txt = document.getSelection();
var txtRange = txt.getRangeAt(0);
return txtRange;
}

Rangy: word under caret (again)

I'm trying to create a typeahead code to add to a wysihtml5 rich text editor.
Basically, I need to be able to insert People/hashtag references like Twitter/Github/Facebook... do.
I found some code of people trying to achieve the same kind of thing.
http://jsfiddle.net/A9z3D/
This works pretty well, except that it only makes suggestions for the last word and has some bugs. Also, I want a select box like Twitter's, not simple "selection switching" using the tab key.
For that I tried to detect the currently typed word.
getCurrentlyTypedWord: function(e) {
var iframe = this.$("iframe.wysihtml5-sandbox").get(0);
var sel = rangy.getSelection(iframe);
var word;
if (sel.rangeCount > 0 && sel.isCollapsed) {
console.debug("Rangy: ",sel);
var initialCaretPositionRange = sel.getRangeAt(0);
var rangeToExpand = initialCaretPositionRange.cloneRange();
var newStartOffset = rangeToExpand.startOffset > 0 ? rangeToExpand.startOffset - 1 : 0;
rangeToExpand.setStart(rangeToExpand.startContainer,newStartOffset);
sel.setSingleRange(rangeToExpand);
sel.expand("word", {
trim: true,
wordOptions: {
includeTrailingSpace: true,
//wordRegex: /([a-z0-9]+)*/gi
wordRegex: /[a-z0-9]+('[a-z0-9]+)*/gi
// wordRegex: /([a-z0-9]+)*/gi
}
});
word = sel.text();
sel.removeAllRanges();
sel.setSingleRange(initialCaretPositionRange);
} else {
word = "noRange";
}
console.debug("WORD=",word);
return word;
}
This is only triggered when the selection is collapsed.
Notice I had to move the start offset backward, because if the caret is at the end of the word (as is the case most of the time when a user is typing), the expand function doesn't expand around the currently typed word.
This has worked pretty nicely so far. The problem is that it uses the alpha release of Rangy 1.3, which has the TextRangeModule, and I noticed wysihtml5 also uses Rangy, in a different and incompatible version (1.2.2) (a problem with rangy.dom, which has probably been removed).
As Rangy uses a global window.rangy variable, I think I'll have to use version 1.2.2 anyway.
How can I do an equivalent of the expand function, using only rangy 1.2.2?
Edit: by the way, is there any solution other than using the expand function? I think it is a bit strange and hackish to modify the current selection and revert it just to know which word is currently being typed. Isn't there a solution that doesn't involve selecting the currently typed word? I mean one based only on ranges, once we know the initial collapsed caret range?

As Rangy uses a global window.rangy variable, I think I'll have to use version 1.2.2 anyway.
Having read Rangy's code, I had the intuition that probably it would be feasible to load two versions of Rangy in the same page. I did a google search and found I was right. Tim Down (creator of Rangy) explained it in an issue report. He gave this example:
<script type="text/javascript" src="/rangy-1.0.1/rangy-core.js"></script>
<script type="text/javascript" src="/rangy-1.0.1/rangy-cssclassapplier.js"></script>
<script type="text/javascript">
var rangy1 = rangy;
</script>
<script type="text/javascript" src="/rangy-1.1.2/rangy-core.js"></script>
<script type="text/javascript" src="/rangy-1.1.2/rangy-cssclassapplier.js"></script>
So you could load the version of Rangy that your code wants. Rename it and use this name in your code, and then load what wysihtml5 wants and leave this version as rangy.
Otherwise, having to implement expand yourself in a way that faithfully replicates what Rangy 1.3 does is not a simple matter.
Here's an extremely primitive implementation of code that would expand selections to word boundaries. This code is going to be tripped by elements starting or ending within words.
var word_sep = " ";
function expand() {
var sel = rangy.getSelection();
var range = sel.getRangeAt(0);
var start_node = range.startContainer;
if (start_node.nodeType === Node.TEXT_NODE) {
var sep_at = start_node.nodeValue.lastIndexOf(word_sep, range.startOffset);
range.setStart(start_node, (sep_at !== -1) ? sep_at + 1 : 0);
}
var end_node = range.endContainer;
if (end_node.nodeType === Node.TEXT_NODE) {
var sep_at = end_node.nodeValue.indexOf(word_sep, range.endOffset);
range.setEnd(end_node, (sep_at !== -1) ? sep_at : range.endContainer.nodeValue.length);
}
sel.setSingleRange(range);
}
Here's a fiddle for it. This should work in rangy 1.2.2. (It would even work without rangy.)
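One detail worth watching in the lastIndexOf version: lastIndexOf(sep, fromIndex) also matches a separator sitting exactly at fromIndex, so with the caret directly before a space the computed start can end up past the caret. A possible way to sidestep that, sketched on a plain string, is to walk outward from the caret one character at a time. findWordBounds is a hypothetical name, and treating any whitespace as a separator is an assumption.

```javascript
// Hypothetical helper: returns the [start, end) indexes of the word touching
// `caret` in `text`, treating any whitespace character as a separator.
function findWordBounds(text, caret) {
  var start = caret;
  var end = caret;
  // Walk left while the character before `start` is not whitespace.
  while (start > 0 && !/\s/.test(text.charAt(start - 1))) {
    start--;
  }
  // Walk right while the character at `end` is not whitespace.
  while (end < text.length && !/\s/.test(text.charAt(end))) {
    end++;
  }
  return { start: start, end: end, word: text.slice(start, end) };
}
```

The returned start and end could then be passed to range.setStart and range.setEnd on the text node, in the same way as in the expand function above.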

For those interested, based on @Louis's suggestions, I made this JsFiddle that shows a wysihtml5 integration that detects the currently typed word.
It doesn't need the use of the expand function that is in rangy 1.3 which is still an alpha release.
http://jsfiddle.net/zPxSL/2/
$(function () {
$('#txt').wysihtml5();
var editor = $('#txt').data("wysihtml5").editor;
$(".wysihtml5-sandbox").contents().find("body").click(function(e) {
getCurrentlyTypedWord();
});
$(".wysihtml5-sandbox").contents().find("body").keydown(function(e) {
getCurrentlyTypedWord();
});
function getCurrentlyTypedWord() {
var iframe = this.$("iframe.wysihtml5-sandbox").get(0);
var sel = rangy.getIframeSelection(iframe);
var wordSeparator = " ";
if (sel.rangeCount > 0) {
var selectedRange = sel.getRangeAt(0);
var isCollapsed = selectedRange.collapsed;
var isTextNode = (selectedRange.startContainer.nodeType === Node.TEXT_NODE);
var isSimpleCaret = (selectedRange.startOffset === selectedRange.endOffset);
var isSimpleCaretOnTextNode = (isCollapsed && isTextNode && isSimpleCaret);
// only trigger this behavior when the selection is collapsed on a text node container,
// and there is an empty selection (this means just a caret)
// this is definitely the case when an user is typing
if (isSimpleCaretOnTextNode) {
var textNode = selectedRange.startContainer;
var text = textNode.nodeValue;
var caretIndex = selectedRange.startOffset;
// Get word begin boundary
var startSeparatorIndex = text.lastIndexOf(wordSeparator, caretIndex);
var startWordIndex = (startSeparatorIndex !== -1) ? startSeparatorIndex + 1 : 0;
// Get word end boundary
var endSeparatorIndex = text.indexOf(wordSeparator, caretIndex);
var endWordIndex = (endSeparatorIndex !== -1) ? endSeparatorIndex : text.length
// Create word range
var wordRange = selectedRange.cloneRange();
wordRange.setStart(textNode, startWordIndex);
wordRange.setEnd(textNode, endWordIndex);
console.debug("Word range:", wordRange.toString());
return wordRange;
}
}
}
});

Getting the last entered word from a contentEditable div

I have a div tag with contenteditable set to true.
I am trying to find out the last entered word in the div.
For example, if I type in This is a test and I hit space, I want to be able to get the word test.
I want to use this logic so that I can test each word as it is typed (after space is pressed).
It would be great if someone could help me with this.

An easy solution would be the following
var str = "This is a test "; // Content of the div
var trimmed = str.trim();
var lastWord = trimmed.substring(trimmed.lastIndexOf(" ") + 1); // "test"
trim might need a shim for older browsers (for example .replace(/^\s+|\s+$/g, "")).
To strip punctuation like " Test!!! " you could additionally do a replace like the following:
lastWord = lastWord.replace(/\W/g, "");
You might want to do a more specific definition of the characters to omit than \W, depending on your needs.
If you want to trigger your eventhandler also on punctuation characters and not only on space, the last replace is not needed.
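As a rough alternative that combines the trimming and the punctuation stripping in one step, you could match the words directly instead of searching for the last space. This is only a sketch: lastTypedWord is a made-up name, and the definition of a "word" (ASCII letters, digits, underscores and apostrophes) is an assumption you may need to widen.

```javascript
// Hypothetical helper: returns the last word of `text`, or '' if none.
function lastTypedWord(text) {
  // A word is a run of \w characters or apostrophes; \w is ASCII-only,
  // so accented letters would split words with this pattern.
  var words = text.match(/[\w']+/g);
  return words ? words[words.length - 1] : '';
}
```

Because match returns null when nothing matches, the empty-string fallback covers input that is empty or contains only punctuation.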

You first have to know when the content is edited. Using jQuery, that can be done with
$("div").on("keyup", function(){ /* code */ });
Then, you'll have to get the whole text and split it into words
var words = $(this).text().trim().split(' ');
And getting the last word is as simple as getting the last element of the words array.
Here's the whole code
HTML
<div contenteditable="true">Add text here</div>
JavaScript (using jQuery)

$("div").on("keyup", function(){
var words = $(this).text().trim().split(' '),
lastWord = words[words.length - 1];
console.log(lastWord);
});
Demo

Another way is to listen for the input event and strip punctuation before taking the last word:
// listen to changes (do it any way you want...)
document.querySelectorAll('div')[0].addEventListener('input', function(e) {
console.log( getLastWord(this.textContent) );
}, false);
function getLastWord(str){
// strip punctuations
str = str.replace(/[\.,-\/#!$%\^&\*;:{}=\_`~()]/g,' ');
// get the last word
return str.trim().split(' ').reverse()[0];
}
DEMO PAGE

You can try this to get the last word from an editable div.
HTML
<div id='edit' contenteditable='true' onkeypress="getLastWord(event,this)">
</div>
JS
function getLastWord(event,element){
var keyPressed = event.which;
if(keyPressed == 32){ //Hits Space
var val = element.innerText.trim();
val = val.replace(/(\r\n|\n|\r)/gm," ");
var idx = val.lastIndexOf(' ');
var lastWord = val.substring(idx+1);
console.log("Last Word " + lastWord);
}
}
Try this link http://jsfiddle.net/vV2mN/18/

jQuery Plugin "readmore", trim text without cutting words

I'm using http://rockycode.com/blog/jquery-plugin-readmore/ to trim long text and add a "See more" link that reveals all the text.
I would love to avoid cutting words. How could I do that?
If the limit is 35, don't cut the w...
but
If the limit is 35, don't cut the word... (and in this case, trim it at 38 and then show the hidden text from the 39th character till the end).

Instead of doing this:
$elem.readmore({
substr_len: 35
});
You could do this
$elem.readmore({
substr_len: $elem.text().substr(0, 35).lastIndexOf(" ")
});
What we're doing is going back to the last possible space before index 35.
Of course, 35 can be a variable. You could also put this into a function to reuse it.
Hope this helps.
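Wrapped into a reusable function, that idea might look like the sketch below. It also covers two cases the one-liner does not: text that is already shorter than the limit, and text with no space before the limit (where lastIndexOf returns -1). The name wordSafeLength is hypothetical.

```javascript
// Hypothetical helper: returns a cut position at or before `limit`
// that does not fall in the middle of a word.
function wordSafeLength(text, limit) {
  if (text.length <= limit) {
    return text.length; // nothing to cut
  }
  // Look one character past the limit so a space exactly at `limit` counts.
  var cut = text.slice(0, limit + 1).lastIndexOf(' ');
  return cut > 0 ? cut : limit; // no usable space: fall back to a hard cut
}
```

It could then be used as $elem.readmore({ substr_len: wordSafeLength($elem.text(), 35) }). Note that this cuts before the word that crosses the limit rather than after it.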

You can change the abridge function within that plugin as follows:
function abridge(elem) {
var opts = elem.data("opts");
var txt = elem.html();
var len = opts.substr_len;
var dots = "<span>" + opts.ellipses + "</span>";
var charAtLen = txt.substr(len, 1);
while (len < txt.length && !/\s/.test(charAtLen)) {
len++;
charAtLen = txt.substr(len, 1);
}
var shown = txt.substring(0, len) + dots;
var hidden = '<span class="hidden" style="display:none;">' + txt.substring(len, txt.length) + '</span>';
elem.html(shown + hidden);
}
...and it will behave as you desire. You might want to add an option to turn this feature off and on, but I'll leave that up to you.
See working example →

I was just gathering information about this subject, with your help and the help from other related posts I wrote this:
http://jsfiddle.net/KHd6J/526/

JavaScript to add HTML tags around content

I was wondering if it is possible to use JavaScript to add a <div> tag around a word in an HTML page.
I have a JS search that searches a set of HTML files and returns a list of files that contain the keyword. I'd like to be able to dynamically add a <div class="highlight"> around the keyword so it stands out.
If an alternate search is performed, the original <div>'s will need to be removed and new ones added. Does anyone know if this is even possible?
Any tips or suggestions would be really appreciated.
Cheers,
Laurie.

In general you will need to parse the html code in order to ensure that you are only highlighting keywords and not invisible text or code (such as alt text attributes for images or actual markup). If you do as Jesse Hallett suggested:
$('body').html($('body').html().replace(/(pretzel)/gi, '<b>$1</b>'));
You will run into problems with certain keywords and documents. For example:
<html>
<head><title>A history of tables and tableware</title></head>
<body>
<p>The table has a fantastic history. Consider the following:</p>
<table><tr><td>Year</td><td>Number of tables made</td></tr>
<tr><td>1999</td><td>12</td></tr>
<tr><td>2009</td><td>14</td></tr>
</table>
<img src="/images/a_grand_table.jpg" alt="A grand table from designer John Tableius">
</body>
</html>
This relatively simple document might be found by searching for the word "table", but if you just replace text with wrapped text you could end up with this:
<<span class="highlight">table</span>><tr><td>Year</td><td>Number of <span class="highlight">table</span>s made</td></tr>
and this:
<img src="/images/a_grand_<span class="highlight">table</span>.jpg" alt="A grand <span class="highlight">table</span> from designer John <span class="highlight">Table</span>ius">
This means you need parsed HTML. And parsing HTML is tricky. But if you can assume a certain quality control over the html documents (i.e. no open-angle-brackets without closing angle brackets, etc) then you should be able to scan the text looking for non-tag, non-attribute data that can be further-marked-up.
Here is some Javascript which can do that:
function highlight(word, text) {
var result = '';
//char currentChar;
var csc; // current search char
var wordPos = 0;
var textPos = 0;
var partialMatch = ''; // container for partial match
var inTag = false;
// iterate over the characters in the array
// if we find an HTML element, ignore the element and its attributes.
// otherwise try to match the characters to the characters in the word
// if we find a match append the highlight text, then the word, then the close-highlight
// otherwise, just append whatever we find.
for (textPos = 0; textPos < text.length; textPos++) {
csc = text.charAt(textPos);
if (csc == '<') {
inTag = true;
result += partialMatch;
partialMatch = '';
wordPos = 0;
}
if (inTag) {
result += csc ;
} else {
var currentChar = word.charAt(wordPos);
if (csc == currentChar && textPos + (word.length - wordPos) <= text.length) {
// we are matching the current word
partialMatch += csc;
wordPos++;
if (wordPos == word.length) {
// we've matched the whole word
result += '<span class="highlight">';
result += partialMatch;
result += '</span>';
wordPos = 0;
partialMatch = '';
}
} else if (wordPos > 0) {
// we thought we had a match, but we don't, so append the partial match and move on
result += partialMatch;
result += csc;
partialMatch = '';
wordPos = 0;
} else {
result += csc;
}
}
if (inTag && csc == '>') {
inTag = false;
}
}
return result;
}
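A shorter variant of the same idea, sketched under the same assumption that the HTML is well-formed enough for every "<" to start a tag and every tag to end at the next ">": split the string on tags and only touch the pieces in between. The function names are made up for this example, and unlike the function above it matches case-insensitively.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wraps `word` in a highlight span, but only in text
// that sits outside of tags.
function highlightOutsideTags(word, html) {
  if (!word) {
    return html; // an empty search term would match everywhere
  }
  var wordPattern = new RegExp(escapeRegExp(word), 'gi');
  // Splitting on a capturing group keeps the tags in the result:
  // even indexes are text, odd indexes are tags.
  return html.split(/(<[^>]*>)/).map(function (part, index) {
    if (index % 2 === 1) {
      return part; // a tag: leave it untouched
    }
    return part.replace(wordPattern, '<span class="highlight">$&</span>');
  }).join('');
}

// Undo the highlighting before running a new search.
function removeHighlights(html) {
  return html.replace(/<span class="highlight">([\s\S]*?)<\/span>/g, '$1');
}
```

The removal function assumes the highlight spans are exactly the ones produced above and contain no nested spans.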

Wrapping is pretty easy with jQuery:
$('span').wrap('<div class="highlight"></div>'); // wraps each span in a div.highlight
Then, to remove, something like this:
$('div.highlight').each(function(){ $(this).after( $(this).text() ); }).remove();
Sounds like you will have to do some string splitting, though, so wrap may not work unless you want to pre-wrap all your words with some tag (ie. span).

The DOM API does not provide a super easy way to do this. As far as I know the best solution is to read text into JavaScript, use replace to make the changes that you want, and write the entire content back. You can do this either one HTML node at a time, or modify the whole <body> at once.
Here is how that might work in jQuery:
$('body').html($('body').html().replace(/(pretzel)/gi, '<b>$1</b>'));

Couldn't you just write a selector such as this to wrap it all?
$("* :contains('foo')").wrap("<div class='bar'></div>");
Adam wrote the code above to do the removal:
$('div.bar').each(function(){ $(this).after( $(this).text() ); }).remove();
Edit: on second thought, the first statement returns an element, so it would wrap the whole element with the div tag and not just the word. Maybe a regex replace would be a better solution here.
