Calculating word count after stripping HTML


I have encountered a simple yet peculiar problem while calculating the word count of a string that contains HTML. The simple method is to strip the HTML first and then count the whitespace-separated words. The problem I've found is that once the HTML tags are stripped away, some words are incorrectly concatenated.
See the example below, which illustrates the issue using JavaScript's textContent to strip the HTML.
<p>One</p><p>Two</p><p>Three</p> becomes OneTwoThree and is counted as a single word.
How would you go about counting words (simply)?
var text = document.getElementById("test").textContent;
var words = text.match(/\S+/g).length;
document.getElementById("words").textContent = words;
<div class="box" id="test">
<p>One</p><p>Two</p><p>Three</p>
</div>
<div><span id="words">???</span> word(s)</div>

Maybe this could work for you:
Replace all tags with spaces, so <p>One</p><p>Two</p> would become " One  Two " (a space where each tag was).
Collapse each run of whitespace into a single space, so the string only has one extra space on the left and one on the right.
Remove those extra spaces.
let html = "your html";
let tmp = html.replace(/<[^>]+>/g, " "); // tags become spaces
tmp = tmp.replace(/\s+/g, " "); // collapse runs of whitespace
tmp = tmp.trim(); // drop the leading and trailing space
console.log(tmp);
// The words are now separated by single spaces, so count the pieces.
let count = tmp === "" ? 0 : tmp.split(" ").length;
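The three steps above can be folded into one helper. This is a minimal sketch, assuming the input is a plain HTML string and that a simple regex is good enough to find the tags (it is not a real HTML parser). The name countWords is made up for illustration.

```javascript
// Hypothetical helper: counts whitespace-separated words in an HTML string.
function countWords(html) {
  const text = html
    .replace(/<[^>]+>/g, " ") // every tag becomes a space so neighbours stay apart
    .trim();                   // drop the leading and trailing whitespace
  // An empty string would split into [""], so handle it explicitly.
  return text === "" ? 0 : text.split(/\s+/).length;
}

console.log(countWords("<p>One</p><p>Two</p><p>Three</p>")); // 3
```

Splitting on /\s+/ makes the separate "collapse whitespace" step unnecessary.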

Use innerText instead. Unlike textContent, it takes rendering into account and puts line breaks between block-level elements such as <p>, so the words stay separated.
var textWithoutWhiteSpaces = document.getElementById("test").textContent;
var wordsWithoutWhiteSpaces = textWithoutWhiteSpaces.match(/\S+/g).length;
var textWithWhiteSpaces = document.getElementById("test").innerText;
var wordsWithWhiteSpaces = textWithWhiteSpaces.match(/\S+/g).length;
console.log(wordsWithoutWhiteSpaces)
console.log(wordsWithWhiteSpaces)
document.getElementById("words").textContent = wordsWithWhiteSpaces;
<div class="box" id="test">
<p>One</p><p>Two</p><p>Three</p>
</div>
<div><span id="words">???</span> word(s)</div>
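innerText only works on an element that is actually rendered in a browser. Where no DOM is available, its effect can be roughly approximated on the string itself: put a space where a block-level tag sits and simply delete inline tags, so that <b>Thr</b>ee still counts as one word. The sketch below assumes a short, hand-picked list of block tags (BLOCK_TAGS) and a hypothetical function name; it is not what the browser does internally.

```javascript
// Assumption: a hand-picked, incomplete list of tags treated as block-level.
const BLOCK_TAGS = ["p", "div", "br", "li", "ul", "ol", "h1", "h2", "h3", "h4", "h5", "h6", "tr", "td"];

// Hypothetical helper: counts words, separating text only at block-level tags.
function countWordsBlockAware(html) {
  const blockTag = new RegExp("</?(?:" + BLOCK_TAGS.join("|") + ")\\b[^>]*>", "gi");
  const text = html
    .replace(blockTag, " ")   // block boundaries separate words
    .replace(/<[^>]+>/g, ""); // remaining (inline) tags are dropped without a gap
  const words = text.match(/\S+/g); // null when there is no text at all
  return words ? words.length : 0;
}

console.log(countWordsBlockAware("<p>One</p><p>Two</p><p><b>Thr</b>ee</p>")); // 3
```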

Here is another option that counts the words between the tags. Each tag is replaced with a space and the result is split on whitespace, so a tag holding several words is counted correctly:
const str = '<p><p><p>One<br></p><p>Two</p><p>Three</p><p></p></p><h1>four</h1><b>five</b><H1>123</H1>';
const result = str
.replace(/<.*?>/g, ' ')
.split(/\s+/)
.filter((el) => el !== '').length;
console.log(result); // 6

Related

How can I cut the word after certain symbols?

I have a string such as "\"item:Test:3:Facebook\"" and I need to extract the word Facebook from it.
The words can be dynamic, so I need to get the word that comes after the third : and before the closing \".
I tried var arr = str.split(":").map(item => item.trim()) but it doesn't do what I need. How can I cut out the word that comes after the third : ?

A little extra code to remove the trailing " as well.
var str = "\":Test:3:Facebook\"";
var arr = str.split(":").map(item => item.trim());
var thirdItem = arr[3].replace(/[^a-zA-Z]/g, "");
console.log(thirdItem);

If the number of colons (:) doesn't vary, you can simply index into the resulting array like this:
var foo = str.split(":")[3];

The word after the 3rd : will be the fourth word returned, so it will be at index 3 in the array returned by split() (arrays being zero-indexed, of course). You might also want to get rid of the trailing quote mark.
Demo:
str = "\"item:Test:3:Facebook\"";
var word = str.split(":")[3].replace("\"", "");
console.log(word);

This should do the trick, plus remove all symbols
var foo = str.split(":")[3].replace(/[^a-zA-Z ]/g, "")
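The answers above can be combined into a small function that also copes with a missing field. This is only a sketch: the name fieldAfterColon and its index parameter are invented here, and it assumes the wanted value contains no colons itself.

```javascript
// Hypothetical helper: returns the field at the given zero-based position
// in a colon-separated string, with double quotes and outer spaces removed.
function fieldAfterColon(str, index) {
  const parts = str.split(":");
  if (index < 0 || index >= parts.length) {
    return null; // fewer fields than expected
  }
  return parts[index].replace(/"/g, "").trim();
}

const str = "\"item:Test:3:Facebook\"";
console.log(fieldAfterColon(str, 3)); // Facebook
```

Returning null instead of throwing lets the caller decide what to do when the input has fewer colons than expected.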

How to count every char except of word?

I need to count the length of a string without spaces and tags.
My JS pattern doesn't work because it also skips the 'b' and 'r' characters.
My code is here:
content.match(/[^\s^<br />]/g).length
How to fix it?

Instead of match, just use .replace(). match returns an array (or null when nothing matches), whereas replace() returns a new string with those characters removed; strings in JavaScript are immutable, so the original is left untouched.
let newString = oldString.replace(/\s/g, '') // remove all whitespace
newString = newString.replace(/<br\s*\/?>/g, '') // remove <br> and <br />
Then just read newString.length.
In the future, try using https://regexr.com to test your regex matching
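The pattern from the question fails because [^\s^<br />] is a character class: it excludes the individual characters ^, <, b, r, / and > (plus whitespace) rather than the sequence <br />, which is why every b and r goes uncounted. A minimal sketch of the two-step replacement wrapped in a function follows. The name countVisibleChars is an assumption, and the tag is removed before the whitespace here so that <br /> is matched in its original form.

```javascript
// Hypothetical helper: length of a string ignoring <br> tags and whitespace.
function countVisibleChars(content) {
  return content
    .replace(/<br\s*\/?>/gi, "") // remove <br>, <br/> and <br />
    .replace(/\s/g, "")          // remove spaces, tabs and line breaks
    .length;
}

console.log(countVisibleChars("bar<br />rib <br>b")); // 7
```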

If you wanted to remove all the HTML tags (not just <br/>), you could add your string as the HTML to a new element, grab the textContent, and then run a regex match on that.
let str = '<div>Hallo this is a string.</div><br/>';
let el = document.createElement('div');
el.innerHTML = str;
let txt = el.textContent;
let count = txt.match(/[^\s]/g).join('').length; // 19

How to find symbols on a page and get a values between the two symbols in JS?

A page has some secret text inserted into it, and a JS script on that page should get the value between two markers, for example between ## characters. For ##569076## the script should get 569076.
Here is what I've tried:
<div id="textDiv"></div>
<script>
var markup = document.documentElement.outerHTML;
var div = document.getElementById("textDiv");
div.textContent = markup.match(/##([^#]*)##/);
var text = div.textContent;
</script>
But nothing is displayed.

The problem is that match returns an array (the full match followed by the captured groups), or null when nothing matches, whereas textContent expects a string.
So you have to select which array item to use for the textContent assignment:
div.textContent = markup.match(/##(\d+)##/)[1];
Note that [1] selects the first captured group; [0] would be the whole match, including the ## markers.
If the value can contain spaces, just add a space to the character class like so:
var markup = '##56 90 76##';
alert( markup.match(/##([\d ]*)##/)[1]);
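Because match returns null when the marker is absent, indexing the result directly throws a TypeError on pages without it. A guarded version might look like the sketch below. The function name extractMarked is hypothetical, and it assumes the value between the ## pairs never contains a # itself.

```javascript
// Hypothetical helper: returns the text between the first pair of ## markers,
// or null when no such pair exists.
function extractMarked(text) {
  const match = text.match(/##([^#]+)##/);
  return match ? match[1] : null;
}

console.log(extractMarked("<p>secret ##569076## here</p>")); // 569076
console.log(extractMarked("<p>no marker</p>"));              // null
```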

You should put your text/code into a variable.
You can replace the body selector with your own.
Plain JS:
var html = document.getElementsByTagName('body')[0].innerHTML,
match = html.match(/##(\d+)##/),
digits;
if (match) {
digits = match[1];
}
If you are using jQuery and like shorthands:
var digits, match;
if (match = $('body').html().match(/##(\d+)##/)) {
digits = match[1];
}

Split in to Sentences and Wrap With Tags

I'm trying to build a text fixing page for normalising text written in all capital letters, all lower case or an ungrammatical mixture of both.
What I'm currently trying to do is write a regular expression to find all full stops, question marks and line breaks, then split the string in to various strings containing all of the words up to and including each full stop.
Then I'm going to wrap them with <span> tags and use CSS :first-letter and text-transform:capitalize; to capitalise the first letter of each sentence.
The last stage will be writing a dictionary function to find user-specified words for capitalisation.
This question only concerns the part about writing a regex and splitting in to strings.
I've tried too many methods to post here, with varying results, but here's my current attempt:
for(var i=0; i < DoIt.length; i++){
DoIt[i].onclick = function(){
var offendingtext = input.value.toString();
var keeplinebreaks = offendingtext.replace(/\r?\n/g, '<br />');
var smalltext = keeplinebreaks.toLowerCase();
//split at each character I specify
var breakitup = smalltext.split(/[/.?\r\n]/g);
breakitup.forEach(function(i){
var i;
console.log(i);
var packagedtogo = document.createElement('span');
packagedtogo.className = 'sentence';
packagedtogo.innerHTML = breakitup[i];
output.appendChild(packagedtogo);
i++;
});
}
}
It was splitting at the right places before, but it was printing undefined in the output area between the tags. I've been at this for days, please could someone give me a hand.
How can I split a string in to multiple string sentences, and then wrap each string with html tags?

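Your regex for the split is nearly fine. Inside a character class, . and ? need no escaping (escaping them is harmless), but the stray / in your class also splits on slashes, such as the one in <br />: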
var str = "SDFDSFDSF?sdf dsf sdfdsf. sdfdsfsdfdsfdsfdsfdsfsdfdsf sdf."
str.split( (/[\.\?\r\n]/g))
//["SDFDSFDSF", "sdf dsf sdfdsf", " sdfdsfsdfdsfdsfdsfdsfsdfdsf sdf", ""]
forEach passes each element to the callback as its first argument, so use that directly instead of indexing:
breakitup.forEach(function(element){
var packagedtogo = document.createElement('span');
packagedtogo.className = 'sentence';
packagedtogo.innerHTML = element; // breakitup[i] was undefined because i held the element, not an index
output.appendChild(packagedtogo);
// No need to increment an index
});
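split discards the punctuation it splits on, while the question asks for sentences up to and including each full stop. One alternative, sketched here under the assumption that sentences end with ., ? or a line break, is to match the sentences instead of splitting between them. The function name wrapSentences is made up, and the input is assumed to be plain text that needs no HTML escaping.

```javascript
// Hypothetical helper: wraps each sentence in <span class="sentence">.
function wrapSentences(text) {
  // A sentence is a run of non-terminator characters plus an optional . or ?
  const sentences = text.match(/[^.?\r\n]+[.?]?/g) || [];
  return sentences
    .map((s) => s.trim())
    .filter((s) => s !== "") // skip whitespace-only fragments
    .map((s) => '<span class="sentence">' + s + "</span>")
    .join("");
}

console.log(wrapSentences("hello there. how are you?\nfine"));
// <span class="sentence">hello there.</span><span class="sentence">how are you?</span><span class="sentence">fine</span>
```

In the page from the question, the returned string could be assigned to output.innerHTML.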

Javascript regex with quotes

I need a way to replace all appearances of <br class=""> with just <br>
I'm a complete novice with regex, but I tried:
str = str.replace(/<br\sclass=\"\"\s>/g, "<br>");
and it didn't work.
What's a proper regex to do this?

I would not use a regex to do this, but rather actually parse the html and remove the classes.
This is untested, but probably works.
// Dummy <div> to hold the HTML string contents
var d = document.createElement("div");
d.innerHTML = yourHTMLString;
// Find all the <br> tags inside the dummy <div>
var brs = d.getElementsByTagName("br");
// Loop over the <br> tags and remove the class
for (var i=0; i<brs.length; i++) {
if (brs[i].hasAttribute("class")) {
brs[i].removeAttribute("class");
}
}
// Return it to a string
var yourNewHTMLString = d.innerHTML;

One way is with the following
var s = '<br class="">';
var n = s.replace(/(.*)(\s.*)(>)/,"$1$3");
console.log(n)

\s matches exactly one whitespace character. Your pattern requires a whitespace character right before the >, which <br class=""> does not have. You probably want \s*, which matches any number (including zero) of whitespace characters, or \s+, which matches at least one.
str = str.replace(/<br\s+class=""\s*>/g, "<br>");
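A pattern along these lines can be wrapped in a function and run against a few variants. This is a sketch only: stripEmptyBrClass is a hypothetical name, and the pattern additionally tolerates single quotes and a self-closing slash, which the question did not ask for.

```javascript
// Hypothetical helper: turns <br class=""> (and close variants) into <br>.
function stripEmptyBrClass(html) {
  return html.replace(/<br\s+class=(?:""|'')\s*\/?>/gi, "<br>");
}

console.log(stripEmptyBrClass('one<br class="">two<br class="" >three'));
// one<br>two<br>three
```

Tags whose class attribute is not empty are left untouched.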
