String replace with regex for copied text - javascript

While copying text from a Word file into a text editor, I am getting HTML code like this:
<p><br></p>
<p> <br></p>
<p> <br></p>
<p> <br></p>
I want to replace the code above with empty text, like this:
var updated = copyieddata.replace('<p><br></p>', '');
updated = updated.replace('<p> <br></p>', '');
updated = updated.replace('<p> <br></p>', '');
updated = updated.replace('<p> <br></p>', '');
How can I implement this with a regex to avoid the repetition?

pedram's answer is probably the easiest way to achieve what you want.
However, if you want to only remove the <p> <br></p> tags and keep all other tags intact, then you need a regular expression that gets all parts of your string that:
Start with <p> and end with </p>
Have only <br> or whitespace in between
The regular expression you need would look like this: /<p>(\s|<br>)*<\/p>/g
This expression looks for any substring that starts with <p>, has zero or more occurrences of either whitespace (\s) or the <br> tag, and ends with </p>.
The /g at the end ensures that if there are multiple occurrences of the pattern in the string, every one of them is matched. Omitting /g would match only the first occurrence of the pattern in your string.
So, your code would look something like this:
var pattern = /<p>(\s|<br>)*<\/p>/g;
var updated = copyieddata.replace(pattern, '');
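As a quick check of the pattern above, here is a minimal sketch run against a made-up string. The sample markup and the variable names copiedData, emptyParagraph and firstOnly are assumptions for illustration; one paragraph contains a non-breaking space, which \s also matches in JavaScript.

```javascript
// Hypothetical pasted markup; the third paragraph holds a non-breaking space (\u00a0).
const copiedData =
  'some text<p><br></p><p> <br></p><p>\u00a0<br></p><p>kept</p>some text';

// Same pattern as in the answer above.
const emptyParagraph = /<p>(\s|<br>)*<\/p>/g;

// With the g flag every empty paragraph is removed; <p>kept</p> is left alone.
const updated = copiedData.replace(emptyParagraph, '');
console.log(updated); // some text<p>kept</p>some text

// Without the g flag only the first empty paragraph is removed.
const firstOnly = copiedData.replace(/<p>(\s|<br>)*<\/p>/, '');
console.log(firstOnly);
```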

The simplest way is to convert the HTML to text (this removes all the additional HTML tags and leaves you with clean text). You can also use these topics to learn how to clean up text pasted from MS Word:
Jquery Remove MS word format from text area
Clean Microsoft Word Pasted Text using JavaScript
var text = $('#stack');
text.html(text.text());
console.log(text.html());
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="stack">
some text
<p><br></p>
<p> <br></p>
<p> <br></p>
<p> <br></p>
some text
</div>
Or you can use this to replace all <br> and <p> tags.
$("#stack").html(
$("#stack").html()
.replace(/\<br\>/g, "\n")
.replace(/\<br \/\>/g, "\n")
.replace(/\<p>/g, "\n")
.replace(/\<\/p>/g, "\n")
);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="stack">
some text
<p><br></p>
<p> <br></p>
<p> <br></p>
<p> <br></p>
some text
</div>
Instead of "\n" you can use an empty string: ""

Related

How can I match a bunch of p tags at the end of an html document with javascript regex?

Here is a sample content:
<p> so so so </p>
<div> whatever</div>
<p> another paragraph </p>
<div> forever </div>
<p> first of last </p>
<p> second of last </p>
How can I match the last two paragraphs (or any number of consecutive paragraphs) at the end of the above document?
The match output I want is:
<p> first of last </p>
<p> second of last </p>
I tried /(<p>[\s\S]*?<\/p>[\s]*)$/g, but the lazy matching is not working as expected, it sucks all the p tags in between, and matches from the first opening p tag it encounters up to the end of the document.
Note: there might not be paragraphs at the end at all, the regex should not match if there are no paragraphs at the end.
Here we use regex to match all paragraphs and then take the last two elements of the result array.
let str = `<p> so so so </p>
<div>
whatever
</div>
<p> another paragraph </p>
<div> forever </div>
<p> first of last </p>
<p> second of last </p>`
let reg = /<p>[\w\s]*<\/p>/g;
let res = str.match(reg);
console.log(res[res.length-2]);
console.log(res[res.length-1]);
Adding a negative look ahead to make sure nested paragraphs are not matched seems to do the trick:
/(<p>((?!<p>)[\s\S])*?<\/p>[\s]*)+$/g
Would appreciate better suggestions though!
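To see how the negative-lookahead pattern behaves, here is a small sketch run against the sample document from the question and against a document with no trailing paragraphs. The helper name matchTrailingParagraphs and the second test string are assumptions made for this illustration.

```javascript
// Pattern from the answer above: one or more <p>...</p> blocks anchored at the end,
// where no block is allowed to run across another opening <p>.
const trailingParagraphs = /(<p>((?!<p>)[\s\S])*?<\/p>[\s]*)+$/g;

// Hypothetical helper: returns the trailing run of paragraphs, or null if there is none.
function matchTrailingParagraphs(html) {
  const matches = html.match(trailingParagraphs);
  return matches ? matches[0] : null;
}

const doc = `<p> so so so </p>
<div> whatever</div>
<p> another paragraph </p>
<div> forever </div>
<p> first of last </p>
<p> second of last </p>`;

const trailing = matchTrailingParagraphs(doc);
console.log(trailing);

// No paragraphs at the end of the document: nothing is matched.
const none = matchTrailingParagraphs('<p> a </p>\n<div> b </div>');
console.log(none); // null
```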

How to replace multiple instances of a text using JavaScript?

I want to prevent users from entering multiple empty paragraphs in the text editor. With the following approach I can remove a single <p><br></p> from the message text in the text editor.
var str = content.replace('<p><br></p>', '');
However, I need to remove all of the <p><br></p> parts like <p><br></p><p><br></p>Lorem Ipsum is simply dummy text<p><br></p><p><br></p><p><br></p>. Is there a smarter way e.g. regex or method to perform this in a single operation?
Your replace call only removes the first occurrence of the exact string '<p><br></p>'.
Removing elements without content (or only whitespace content) using a proper DOM-method may be more successful. The snippet demonstrates that for some hypothetical elements in a mockup document body.
document.body.innerHTML = `
<p><br></p>
<p>
<br>
</p>
<p>Lorem Ipsum is simply dummy text</p>
<p>
<br>
</p>
<p><br> </p>
<p><br></p>
<p> <br>
<p><br></p>
</p>`;
document.querySelectorAll("p").forEach(el => {
  if (!el.textContent.trim()) {
    el.parentNode.removeChild(el);
  }
});
console.log(document.body.innerHTML.trim());
Use a regex with the g flag to replace every occurrence, and assign the result, because replace() returns a new string:
var str = content.replace(/<p><br><\/p>/g, "");
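The difference between a string pattern and a global regex can be sketched as follows, using the sample text from the question. The variable names once, all and allLiteral are assumptions; the last line relies on String.prototype.replaceAll, which is only available in ES2021-capable environments.

```javascript
const content =
  '<p><br></p><p><br></p>Lorem Ipsum is simply dummy text<p><br></p><p><br></p><p><br></p>';

// A string pattern removes only the first occurrence.
const once = content.replace('<p><br></p>', '');

// A regex with the g flag removes every occurrence in one call.
const all = content.replace(/<p><br><\/p>/g, '');

// Where replaceAll exists (ES2021+), a plain string works for all occurrences too.
const allLiteral = content.replaceAll('<p><br></p>', '');

console.log(once);
console.log(all); // Lorem Ipsum is simply dummy text
console.log(allLiteral === all); // true
```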

.text() concatenates strings without spaces when removing html tags

I have a rich text editor on my site and I am trying to create a reliable word counter for it.
Because it's a rich text editor it (potentially) contains html.
This html might be, for example:
<div class="textEditor">
<h1><strong>this is a sample heading</strong></h1>
<p><br></p>
<p>and this is a sample paragraph</p>
</div>
To get a reliable word count I am trying to first convert the html to text using:
var value = $('.textEditor').text()
The problem I am facing is the string that is returned seems to concatenate where it removes the html tags and what I am left with is:
this is a sample headingand this is a sample paragraph
as you can see, the words 'heading' 'and' are joined to become 'headingand' which would give me a word count of 10 instead of 11.
any thoughts on how to properly achieve this would be much appreciated :)
You can use innerText:
var value = document.querySelector('.textEditor').innerText
or
var value = $('.textEditor')[0].innerText
console.log(document.body.innerText)
<div class="textEditor">
<h1><strong>this is a sample heading</strong></h1>
<p><br></p>
<p>and this is a sample paragraph</p>
</div>
I had a bit of playing around with it and came up with the following:
let value = $('.textEditor').text();
function read(){
alert(value.trim().split(/\s+/).length);
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div class="textEditor">
<h1><strong>this is a sample heading</strong></h1>
<p><br></p>
<p>and this is a sample paragraph</p>
</div>
<button onclick="read()">Read</button>
https://codepen.io/anon/pen/YOazqG?editors=1010
Just trim it and split and you should be fine.
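Trimming and splitting relies on whitespace already being present between the elements. If the text comes back run together, as in the question, one possible workaround is to turn every tag into a space before splitting. The sketch below works on an HTML string rather than the DOM; countWords is a hypothetical helper, and it assumes tags never occur in the middle of a word and that HTML entities do not need decoding.

```javascript
// Hypothetical helper: count words in an HTML string without touching the DOM.
function countWords(html) {
  const text = html
    .replace(/<[^>]*>/g, ' ') // every tag becomes a space, so neighbouring words stay apart
    .trim();
  return text === '' ? 0 : text.split(/\s+/).length;
}

// Markup from the question, written without any whitespace between the elements.
const html =
  '<h1><strong>this is a sample heading</strong></h1><p><br></p><p>and this is a sample paragraph</p>';

console.log(countWords(html)); // 11
console.log(countWords('<p><br></p>')); // 0
```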

JavaScript: can I search the regular expression not from the start of string?

I learned that indexOf() cannot be used to search for a regular expression in a string; however, search() does not take a start position and an end position as optional parameters. How can I find and replace every match of a certain regular expression in the same string? I have added an example problem where replace() alone does not seem to be enough.
Problem example
Replace all consecutive two <br/><br/> with </p><p>, if after second <br/> some letters or digits (\w) are following.
Leave single <br/> and runs of three or more consecutive <br/> as they are.
If there are no letters or digits after two consecutive <br/><br/>, leave them as they are.
If we use replace() to solve this problem, not only <br/><br/> but also the symbols that follow will be replaced. To avoid that, we need to:
Find the start of matching with regular expression. It will be /(?:<br\s*[\/]?>){2}\s*\w+/.
From the start of matching position, find the start position of \w part.
Replace the /(?:<br\s*[\/]?>){2}\s*/ part with </p><p>.
Repeat steps 1-3 in a loop, starting from the end of the previous match, until no further matches exist.
As I said above, I don't know how to search for the next match from a certain position. Is there a way other than slicing the string and joining it again?
var testString = $('#container').html();
console.log(testString.search(/(?:<br\s*[\/]?>){2}\s*\w+/));
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="container">
<p>
<!-- Only one br: leave such as -->
Brick quiz whangs jumpy veldt fox! <br/>
<!-- Two br and letters then: replace by </p><p> -->
Sphinx of black quartz judge my vow! <br/><br />
<!-- No symbols after 2nd br: leave such as -->
Pack my box with five dozen liquor jugs. <br/><br /><br/>
<!-- Two br and symbols then: replace by </p><p> -->
The vixen jumped quickly on her foe barking with zeal. <br/><br />
<!-- No letters after <br/><br/>: leave such as -->
Brawny gods just flocked up to quiz and vex him.<br/><br />
</p>
</div>
As commented by @epascarello and @torazaburo, it's NOT recommended to use RegExp for parsing HTML; you are better off using an HTML parser to be on the safe side.
But if the HTML string you want to parse uses a fixed template / format, you can still use RegExp for parsing it.
Assuming the RegExp you have posted returns the expected search results for you, you can try the following code to replace the string and insert </p><p> as required.
var testString = $('#container').html();
console.log(testString.replace(/(?:<br\s*[\/]?>){2}(\s*\w+)/gi, '</p><p>$1'));
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="container">
<p>
Brick quiz whangs jumpy veldt fox! <br/>
Sphinx of black quartz judge my vow! <br/><br />
Pack my box with five dozen liquor jugs. <br/><br /><br/>
The vixen jumped quickly on her foe barking with zeal. <br/><br />
Brawny gods just flocked up to quiz and vex him.<br/><br />
</p>
</div>
Note:
I've kept your RegExp as is assuming it finds the <br> tags as per your requirement, and just added the () around \s*\w+ because we want to remember (keep) that string in the output
I've used the gi flags in the RegExp: g matches every occurrence instead of only the first, and i makes the match case-insensitive
$1 in replace string will use the remembered string which was matched by \s*\w+
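On the title question of searching from a given position: a regex with the g flag starts exec() at its lastIndex property, so setting lastIndex before calling exec() acts like a start offset. The sketch below is one way to wrap that; searchFrom and the sample string are hypothetical, and the pattern is the one from the question.

```javascript
// Hypothetical helper: index of the first match of `pattern` at or after `start`, or -1.
function searchFrom(str, pattern, start) {
  // Copy the regex and make sure it has the g flag, because exec() only
  // honours lastIndex for global (or sticky) regexes.
  const flags = pattern.flags.includes('g') ? pattern.flags : pattern.flags + 'g';
  const re = new RegExp(pattern.source, flags);
  re.lastIndex = start;
  const match = re.exec(str);
  return match ? match.index : -1;
}

const text = 'one<br/><br/>two<br/><br/>three';
const doubleBreak = /(?:<br\s*[\/]?>){2}\s*\w+/;

console.log(searchFrom(text, doubleBreak, 0));  // 3
console.log(searchFrom(text, doubleBreak, 4));  // 16
console.log(searchFrom(text, doubleBreak, 17)); // -1
```

The copy is made so that the caller's regex object keeps its own lastIndex untouched.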

Javascript regexp replace of multiline content between two tags (including the tags)

In the string
some text <p id='item_1' class='item'>multiline content\r\n\r\n for <br/>remove</p><br clear='all' id='end_of_item_1'/><p id='item_2' class='item'>another multiline content\r\n\r\n</p><br clear='all' id='end_of_item_2'/>
I need to remove
<p id='item_1' class='item'>multiline content\r\n\r\n for <br/>remove</p><br clear='all' id='end_of_item_1'/>
Can't find a way how to do it.
var id = 'item_1';
var patt=new RegExp("<p id='"+id+"'(.)*|([\S\s]*?)end_of_"+id+"'\/>","g");
var str="some text <p id='item_1' class='item'>multiline content\r\n\r\n for <br/>remove</p><br clear='all' id='end_of_item_1'/><p id='item_2' class='item'>another multiline content\r\n\r\n</p><br clear='all' id='end_of_item_2'/>";
document.write(str.replace(patt,""));
The result is
some text for
<br>
remove
<p></p>
<br id="<p id=" class="item" clear="all" item_2'="">
another multiline content
<p></p>
<br id="end_of_item_2" clear="all">
Please help to solve this.
Here's the regex for the current scenario. When the regex approach eventually breaks, remember that we warned that parsing HTML with regex was a fool's errand. ;)
This:
var s = "some text <p id='item_1' class='item'>multiline content\r\n\r\n for <br/>remove</p><br clear='all' id='end_of_item_1'/><p id='item_2' class='item'>another multiline content\r\n\r\n</p><br clear='all' id='end_of_item_2'/><ul><li>";
var id = 'item_1';
var patt = new RegExp ("<p[^<>]*\\sid=['\"]" + id + "['\"](?:.|\\n|\\r)*<br[^<>]*\\sid=['\"]end_of_" + id + "['\"][^<>]*>", "ig");
var stripped = s.replace (patt, "");
Produces this:
"some text <p id='item_2' class='item'>another multiline content
</p><br clear='all' id='end_of_item_2'/><ul><li>"
Why can't you use the DOM API to remove it? (add everything to the document, and then remove what you don't need)
var item1 = document.getElementById('item_1'),
endOfItem1 = document.getElementById('end_of_item_1');
item1.parentNode.removeChild(item1);
endOfItem1.parentNode.removeChild(endOfItem1);
I need to assume a bit of unspoken constraints from your question, to get this to work:
Am I right in guessing, that you want a regex, that can find (and then replace) any 'p' tag with a specific id, up to a certain tag (like e.g. a 'br' tag) with an id of 'end_of_[firstid]'?
If that is correct, then the following regex might work for you. You may need to modify it a bit to get JS to accept it:
<p\s+id='([a-zA-Z0-9_]+)'.*?id='end_of_\1'\s*\/>
This will give you any constellation matching the criteria described above, with the name of the id as group 1. It should then be a simple task to check whether group 1 contains the id you want to remove and to replace the whole match with an empty string.
If I understand your example correctly (I am not that good with JavaScript, and my RegEx was based on the general Perl regex style), you could maybe do something like the following:
// Backslashes are doubled because the pattern is built from a string literal;
// [\s\S]*? is used instead of .*? so the match can span line breaks.
var patt=new RegExp("<p\\s+id='"+id+"'[\\s\\S]*?id='end_of_"+id+"'\\s*\\/>","g");
That way, you don't have to worry about group matching, although I find it more elegant to match the id via a group than to insert it into the RegEx.
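When a pattern is built from a string, two things are easy to get wrong: backslashes have to be doubled inside the string literal, and the id should be escaped in case it contains regex metacharacters. A minimal sketch, assuming the markup always has the shape shown in the question; escapeRegExp and removeItem are hypothetical helper names.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: remove <p id='ID' ...> up to and including <br ... id='end_of_ID'/>.
function removeItem(html, id) {
  const safeId = escapeRegExp(id);
  // [\s\S]*? is lazy and also matches \r and \n, so the match can span lines.
  const pattern = new RegExp(
    "<p\\s+id='" + safeId + "'[\\s\\S]*?id='end_of_" + safeId + "'\\s*/>",
    'g'
  );
  return html.replace(pattern, '');
}

const str =
  "some text <p id='item_1' class='item'>multiline content\r\n\r\n for <br/>remove</p>" +
  "<br clear='all' id='end_of_item_1'/>" +
  "<p id='item_2' class='item'>another multiline content\r\n\r\n</p>" +
  "<br clear='all' id='end_of_item_2'/>";

const result = removeItem(str, 'item_1');
console.log(result);
```

As the other answers point out, this only holds while the markup keeps this fixed shape; for arbitrary HTML the DOM approach is the safer choice.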
