Javascript substr(); limit by word not char - javascript

I would like to limit the substr by words and not chars. I am thinking regular expression and spaces but don't know how to pull it off.
Scenario: Limit a paragraph of words to 200 words using javascript/jQuery.
var $postBody = $postBody.substr(' ',200);
This is great but splits words in half :) Thanks ahead of time!

function trim_words(theString, numWords) {
    // declare locals with var so they do not leak into the global scope
    var expString = theString.split(/\s+/, numWords);
    var theNewString = expString.join(" ");
    return theNewString;
}

If you're satisfied with a not-quite-accurate solution, you could simply keep a running count of the space characters in the text and assume that it equals the number of words.
Otherwise, I would call split() on the string with " " as the delimiter and then count the length of the array that split() returns.
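A minimal sketch of both ideas, assuming words are separated by plain space characters. The helper names cutAtNthSpace and countWordsBySplit are made up for illustration, and maxWords is assumed to be at least 1.

```javascript
// Approximation: treat every space as the end of a word and cut at the Nth space.
function cutAtNthSpace(text, maxWords) {
  var spaces = 0;
  for (var i = 0; i < text.length; i++) {
    if (text[i] === " " && ++spaces === maxWords) {
      return text.slice(0, i);
    }
  }
  return text; // fewer than maxWords spaces, so there is nothing to cut
}

// Split-based count; empty strings produced by repeated spaces are ignored.
function countWordsBySplit(text) {
  return text.split(" ").filter(function (w) { return w !== ""; }).length;
}

console.log(cutAtNthSpace("one two three four", 2));
console.log(countWordsBySplit("one  two three"));
```

The space-counting version is thrown off by repeated spaces (each one counts as a word), which is why it is only an approximation.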

very quick and dirty
$("#textArea").val().split(/\s/).length

I suppose you need to consider punctuation and other non-word, non-whitespace characters as well. You want 200 words, not counting whitespace and non-letter characters.
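var word_count = 0;
var in_word = false;
var cut_at = text.length; // `text` is assumed to hold the input string
for (var x = 0; x < text.length; x++) {
    if (/[A-Za-z]/.test(text[x])) { // ASCII letters only; widen the class if needed
        if (!in_word) word_count++;
        in_word = true;
    } else {
        in_word = false;
    }
    if (!in_word && word_count >= 200) {
        cut_at = x; // first non-letter after the 200th word
        break;
    }
}
text = text.substring(0, cut_at);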
You should also decide whether you treat digits as a word, and whether you treat single letters as a word.

Related

Remove leading and trailing characters from a

I have a text file which has strings separated by whitespace. The text file contains some special characters (Latin, currency, punctuation, etc.) which need to be discarded from the final output. Please note that legal characters are all characters in Unicode except these special characters.
We need to split the text by whitespace and then remove only the leading and trailing special characters. If special characters are in between two legal characters, then we won't remove them.
I can easily do it in two phases: split the text by whitespace and then remove only the leading and trailing special characters from each string. However, I need to process the string only once. Is there any way this could be achieved in one pass? Note: We can't use RegEx.
For this question assume that these characters are special:
[: , ! . < ; ' " > [ ] { } ` ~ = + - ? / ]
Example:
:!/,.<;:.?;,BBM!/,.<;:.?;,` IS TALKING TO `B!?AM!/,.<;:.?;,
Here output would be an array of valid strings: ["BBM", "IS", "TALKING", "TO", "B!?AM"]
Make a simple state machine (finite automaton):
Walk in a loop through all chars.
At every step, check whether the current char is a letter, a space, or a special char.
Execute some operation (perhaps empty) depending on the state and the char kind.
Change the state if needed.
For example, you may stay in the "special" state until a letter is met. Then remember the starting index of the word and switch the state to "inside word". Continue until a special char or a space is met (it is still not clear from your question).
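A rough sketch of such a state machine, assuming the special characters are exactly the ones listed in the question and that whitespace means space, tab, CR or LF. The function name splitAndTrim and both sets are illustrative, not part of the original answer. It tracks the index of the first and the last legal character of the current word, so leading and trailing specials fall outside the slice while specials in between stay in.

```javascript
// Assumption: special characters as listed in the question.
const SPECIAL = new Set(":,!.<;'\">[]{}`~=+-?/");
// Assumption: these four characters count as whitespace.
const WHITESPACE = new Set([" ", "\t", "\n", "\r"]);

function splitAndTrim(text) {
  const words = [];
  let start = -1; // index of the first legal char of the current word, -1 while outside a word
  let end = -1;   // index of the last legal char seen in the current word
  for (let i = 0; i <= text.length; i++) {
    // The end of the text is handled like whitespace so the final word is flushed.
    const atBoundary = i === text.length || WHITESPACE.has(text[i]);
    if (atBoundary) {
      if (start !== -1) words.push(text.slice(start, end + 1));
      start = -1;
    } else if (!SPECIAL.has(text[i])) {
      if (start === -1) start = i; // state change: now inside a word
      end = i;
    }
    // Special chars need no action: they are kept only if a legal char follows.
  }
  return words;
}

console.log(splitAndTrim(":!/,.<;:.?;,BBM!/,.<;:.?;,` IS TALKING TO `B!?AM!/,.<;:.?;,"));
```

No regular expression is used and every character is visited once.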
I have used TypeScript and have done it in a single pass.
Please note that the isSpecialCharacterCode(charCode) function simply checks whether the character code of the text character equals the code of one of the provided special characters. The same is true for the isWhitespaceCode(charCode) function.
parseText(text: string): string[]{
let words : string[] = [];
let word = "";
let charCode = 1;
let haveSeenLegalChar = false; //set it if we have encountered a legal character in the text
let seenSpecialCharsToInclude = false; //set it if we have encountered a special character in the text
let inBetweenSpecialChars = ""; //string containing special chars which may be included in between a legal word
for(let index = 0; index < text.length; index++){
charCode = text.charCodeAt(index);
let isSpecialChar = isSpecialCharacterCode(charCode);
let isWhitespace = isWhitespaceCode(charCode);
if(isSpecialChar && !isWhitespace){
//if this is a special character then there are two cases
//first: it can be part of a word (only possible if we have already seen at least one legal character)
//since we are not yet sure whether it will be part of the word, store it for now
//second: it is a leading or trailing special character, which we should not include in the word
if(haveSeenLegalChar){
inBetweenSpecialChars += text[index];
seenSpecialCharsToInclude = true;
}else{
//since we have not seen any legal character till now so it must be either leading or trailing special chars
seenSpecialCharsToInclude = false;
inBetweenSpecialChars = "";
}
}else if(isWhitespace){
//we have encountered a whitespace. This is either the beginning or the end of a word.
//if we have encountered any legal char, push the word into the array
if(haveSeenLegalChar){
words.push(word);
word = "";
inBetweenSpecialChars = "";
}
haveSeenLegalChar = false;
}else if(!isSpecialChar){
//legal character case
haveSeenLegalChar = true;
if(seenSpecialCharsToInclude){
word += inBetweenSpecialChars;
seenSpecialCharsToInclude = false;
inBetweenSpecialChars = "";
}
word += text[index];
}
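}
//push the final word when the text does not end with whitespace
if(haveSeenLegalChar){
words.push(word);
}
return words;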
}

Javascript regex: match first 50 characters, respecting words

I'm trying to keep some nav bar lines short by matching the first 50 chars then concatenating '...', but using substr sometimes creates some awkward word chops.
So I want to figure out a way to respect words.
I could write a function to do this, but I'm just seeing if there's an easier/cleaner way.
I've used this successfully in perl:
^(.{50,50}[^ ]*)
Nice and elegant! But it doesn't work in Javascript :(
let catName = "A string that is longer than 50 chars that I want to abbreviate";
let regex = /^(.{50,50}[^ ]*)/;
let match = regex.exec(catName);
match is undefined
Use the String#match method with a regex that ends at a word boundary, so the last word is included.
str.match(/^.{1,50}.*?\b/)[0]
var str="I'm trying to keep some nav bar lines short by matching the first 50 chars then concatenating '...', but using substr sometimes creates some awkward word chops. So I want to figure out a way to respect words.";
console.log('With your code:', str.substr(0,50));
console.log('Using match:',str.match(/^.{1,50}.*?\b/)[0]);
Probably the most foolproof solution with a regular expression would be to use the replace method instead. It won't fail on strings shorter than 50 characters:
str.replace(/^(.{50}[^ ]*).*/, '$1...');
var str = 'A string that is longer than 50 chars that I want to abbreviate';
console.log( str.replace(/^(.{50}[^ ]*).*/, '$1...') );
Tinkering with Pranov's answer, I think this works and is most succinct:
// abbreviate strings longer than 50 char, respecting words
if (catName.length > 50) {
catName = catName.match(/^(.{50,50}[^ ]*)/)[0] + '...';
}
The regex in my OP did work, but it was used in a loop and was choking on strings that already had fewer than 50 chars.
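For use in a loop, the same idea can be wrapped in a small function. This is only a sketch: the name abbreviate and the limit parameter are made up here, limit is assumed to be a non-negative integer, and since . does not match newlines, text containing a line break within the first limit characters comes back unchanged.

```javascript
// Abbreviate text to roughly `limit` characters without chopping the last word.
function abbreviate(text, limit) {
  if (text.length <= limit) return text; // short strings pass through untouched
  // Same pattern as in the question, with the length made configurable.
  var re = new RegExp("^.{" + limit + "}[^ ]*");
  var m = text.match(re);
  // Append the ellipsis only if something was actually cut off.
  return m && m[0].length < text.length ? m[0] + "..." : text;
}

console.log(abbreviate("A string that is longer than 50 chars that I want to abbreviate", 50));
console.log(abbreviate("short", 50));
```

The length check up front is what keeps match() from returning null on the short strings mentioned above.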
You can .split() the string on \s and loop over the resulting array of words, accumulating the .length of each word in a variable. When the total reaches 50 or more, .slice() the array up to the current iteration, .join() the words with space characters " ", .concat() the ellipsis, and break out of the loop.
let catName = "A string that is longer than 50 chars that I want to abbreviate";
let [stop, res] = [50, ""];
if (catName.length > stop) {
let arr = catName.split(/\s/);
for (let i = 0, n = 0; i < arr.length; i++) {
n += arr[i].length;
if (n >= stop) {
res = arr.slice(0, i).join(" ").concat("...");
break;
};
}
} else {
res = catName; // already short enough, so no ellipsis is needed
}
document.querySelector("pre").textContent = res;
<pre></pre>

Javascript all words are 3 characters or longer

Let say I have a string = 'all these words are three characters orr longer'
I want to check it
if (string.someWayToCheckAllWordsAre3CharactersOrLonger) {
alert("it's valid!");
}
How can I do that?
Split the string into an array, then use every to check that each word is at least 3 characters long.
var string = 'all these words are three characters orr longer';
// Split on the regex \s+ so that only words end up in the array
string.trim().split(/\s+/).every(e => e.length >= 3);
You can use every.
Before you can use every, the string needs to be pre-processed:
trim the string to remove the leading and trailing spaces;
split the string on one or more whitespace characters.
Then use every to check that the length of every element of the array is greater than or equal to three.
Demo
var string = 'all these words are three characters orr longer';
string.trim().split(/\s+/).every(function(e) { return e.length >= 3; });
How about something like this:
var string = "all these words are three characters orr longer";
var words = string.split(' ');
var allWordsAreLongerThanThreeChars = true;
for (var i = 0; i < words.length; i++) {
    if (words[i].length < 3) {
        allWordsAreLongerThanThreeChars = false;
        break; // a bare return is only valid inside a function
    }
}
Two simple steps: split the string into an array using .split(), then loop through the array and check the length of each word using the .length property. Hope this helps.
var string = 'all these words are three characters orr longer';
var stringArray = string.split(" ");
for (var i = 0; i < stringArray.length; i++){
if(stringArray[i].length >= 3) {
alert(stringArray[i]);
}
};

Regex won't match words as expected

I am trying to use XRegExp to test if a string is a valid word according to these criteria:
The string begins with one or more Unicode letters, followed by
an apostrophe (') followed by one or more Unicode letters, repeated 0 or more times.
The string ends immediately after the matched pattern.
That is, it will match these terms
Hello can't Alah'u'u'v'oo O'reilly
but not these
eatin' 'sup 'til
I am trying this pattern,
^(\\p{L})+('(\\p{L})+)*$
but it won't match any words that contain apostrophes. What am I doing wrong?
EDIT: The code using the regex
var separateWords = function(text) {
var word = XRegExp("(\\p{L})+('(\\p{L})+)*$");
var splits = [];
for (var i = 0; i < text.length; i++) {
var item = text[i];
while (i + 1 < text.length && word.test(item + text[i + 1])) {
item += text[i + 1];
i++;
}
splits.push(item);
}
return splits;
};
I think you will need to omit the string start/end anchors to match single words:
"(\\p{L})+('(\\p{L})+)*"
Also I'm not sure what those capturing groups are needed for (that may depend on your application), but you could shorten them to
"\\p{L}+('\\p{L}+)*"
Try this regex:
^[^'](?:[\w']*[^'])?$
First it checks to ensure the first character is not an apostrophe. Then it either gets any number of word characters or apostrophes followed by anything other than an apostrophe, or it gets nothing (one-letter word).
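For comparison, current JavaScript engines support Unicode property escapes natively when the u flag is set, so the same pattern can be sketched without XRegExp. The names wordPattern, isWord and extractWords are made up for this example. The anchored form validates a whole string, and the unanchored global form pulls words out of running text.

```javascript
// Anchored: the whole string must be letters with optional 'letters groups.
const wordPattern = /^\p{L}+(?:'\p{L}+)*$/u;

function isWord(s) {
  return wordPattern.test(s);
}

// Unanchored and global: collect every word-shaped run in a longer text.
function extractWords(text) {
  return text.match(/\p{L}+(?:'\p{L}+)*/gu) || [];
}

console.log(["Hello", "can't", "Alah'u'u'v'oo", "O'reilly"].map(isWord));
console.log(["eatin'", "'sup", "'til"].map(isWord));
console.log(extractWords("I can't stop eatin' now"));
```

A trailing apostrophe is simply left out of the extracted word, because the optional group requires at least one letter after it.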

Count number of words in string using JavaScript

I am trying to count the number of words in a given string using the following code:
var t = document.getElementById('MSO_ContentTable').textContent;
if (t == undefined) {
var total = document.getElementById('MSO_ContentTable').innerText;
} else {
var total = document.getElementById('MSO_ContentTable').textContent;
}
countTotal = cword(total);
function cword(w) {
var count = 0;
var words = w.split(" ");
for (i = 0; i < words.length; i++) {
// inner loop -- do the count
if (words[i] != "") {
count += 1;
}
}
return (count);
}
In that code I am getting data from a div tag and sending it to the cword() function for counting. However, the return value is different in IE and Firefox. Is there any change required in the regular expression? One thing I saw is that both browsers send the same string, so the problem is inside the cword() function.
[Edit 2022, based on a comment] Nowadays, one would not extend the native prototype this way. A way to extend the native prototype without the danger of naming conflicts is to use an ES2015 Symbol as the property key.
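A minimal sketch of what a Symbol-keyed word counter might look like. The key name countWords and the choice of \S+ as the definition of a word are assumptions made for this example, not the code the edit note refers to.

```javascript
// Assumption: the key name and the \S+ word definition are illustrative choices.
const countWords = Symbol("countWords");

// A Symbol key cannot collide with any string-named property on String.prototype.
String.prototype[countWords] = function () {
  const matches = this.match(/\S+/g); // runs of non-whitespace characters
  return matches ? matches.length : 0;
};

console.log("this string has five words"[countWords]());
console.log(""[countWords]());
```

Only code that holds a reference to the countWords symbol can call the method, which is what avoids the naming conflicts.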
Old answer: you can use split and add a wordcounter to the String prototype:
if (!String.prototype.countWords) {
String.prototype.countWords = function() {
return this.length && this.split(/\s+\b/).length || 0;
};
}
console.log(`'this string has five words'.countWords() => ${
'this string has five words'.countWords()}`);
console.log(`'this string has five words ... and counting'.countWords() => ${
'this string has five words ... and counting'.countWords()}`);
console.log(`''.countWords() => ${''.countWords()}`);
I would prefer a RegEx only solution:
var str = "your long string with many words.";
var wordCount = str.match(/(\w+)/g).length;
alert(wordCount); //6
The regex is
\w+ one or more word characters
/g global - don't stop after the first match
The brackets create a capturing group, which is not needed here: with the g flag, match() returns an array of all matches, so its length is the word count. Note that match() returns null when nothing matches.
This is the best solution I've found:
function wordCount(str) {
var m = str.match(/[^\s]+/g)
return m ? m.length : 0;
}
This inverts the whitespace selection, which is better than \w+ because \w only matches Latin letters, digits and _ (see http://www.ecma-international.org/ecma-262/5.1/#sec-15.10.2.6)
If you're not careful with whitespace matching, you'll count empty strings, strings with leading and trailing whitespace, and all-whitespace strings as matches, while this solution handles strings like ' ' and ' a\t\t!\r\n#$%() d ' correctly (if you define 'correct' as 0 and 4).
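To make the difference concrete, here is a small comparison sketch. The function names countByWordChars and countByNonSpace are invented for this example, and both guard against match() returning null.

```javascript
// Counts runs of [A-Za-z0-9_]; accented letters split a word into pieces.
function countByWordChars(str) {
  var m = str.match(/\w+/g);
  return m ? m.length : 0;
}

// Counts runs of non-whitespace characters, whatever the alphabet.
function countByNonSpace(str) {
  var m = str.match(/\S+/g);
  return m ? m.length : 0;
}

var sample = "na\u00efve caf\u00e9"; // two words containing non-ASCII letters
console.log(countByWordChars(sample)); // 3: "na", "ve", "caf"
console.log(countByNonSpace(sample));  // 2
console.log(countByNonSpace(" a\t\t!\r\n#$%() d ")); // 4
```

Whether punctuation-only tokens such as ! should count as words is a separate decision that neither pattern makes for you.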
You can make clever use of the replace() method, although you are not replacing anything.
var str = "the very long text you have...";
var counter = 0;
// let's loop through the string and count the words
// (/\b+/ is not a valid pattern in JavaScript, so match runs of word characters instead)
str.replace(/\w+/g, function (a) {
// for each word found increase the counter value by 1
counter++;
})
alert(counter);
The regex can be improved, for example to exclude HTML tags.
//Count words in a string or what appears as words :-)
function countWordsString(string){
    // Trim, then collapse runs of whitespace into a single space
    string = string.trim().replace(/\s+/g, ' ');
    // An empty string contains no words
    if (string === '') return 0;
    var counter = 1;
    // Each remaining space separates two words
    string.replace(/ /g, function () {
        counter++;
    });
    return counter;
}
var numberWords = countWordsString("count the words in this string");
