split string into array of n words per index - javascript

I have a string that I'd like to split into an array that has (for example) 3 words per index.
I'd also like a newline character in that string to cut the current group short: the words collected so far go into their own index, and the next index starts filling up again until it reaches 3 words. Example:
var text = "this is some text that I'm typing here \n yes I really am"
var array = text.split(magic)
array == ["this is some", "text that I'm", "typing here", "yes I really", "am"]
I've tried looking into regular expressions, but so far I can't really make sense of the syntax that is used in regex.
I have written a way too complicated function that splits my string into groups of 3 by first splitting it into an array of separate words using .split(" "); and then using a loop to add them, 3 at a time, to another array. But that way I can't take the newline character into account.

You can try with this pattern:
var result = text.match(/\b[\w']+(?:[^\w\n]+[\w']+){0,2}\b/g);
Since the quantifier {0,2} is greedy by default, it will take a value less than 2 (N-1) only if a newline is found (since newlines are not allowed here: [^\w\n]+) or if you are at the end of the string.

If you're interested in a regexp solution, it goes like this:
text.match(/(\S+ \S+ \S+)|(\S+ \S+)(?= *\n|$)|\S+/g)
// result ["this is some", "text that I'm", "typing here", "yes I really", "am"]
Explanation: match either three space separated words, or two words followed by spaces + newline, or just one word (a "word" being simply a sequence of non-spaces).
For any number of words, try this:
text.match(/((\S+ ){N-1}\S+)|(\S+( \S+)*)(?= *\n|$)|\S+/g)
(replace N-1 with a number).
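A minimal sketch of how that general pattern could be built for a given N, assuming words are separated by single spaces as in the question. The helper name chunkWords is hypothetical and not part of the answer above.

```javascript
// Hypothetical helper: builds the pattern above for any n >= 1.
function chunkWords(text, n) {
  // Exactly n space-separated words.
  const full = "(?:\\S+ ){" + (n - 1) + "}\\S+";
  // Fewer than n words, allowed only right before a newline or the end of the string.
  const partial = "\\S+(?: \\S+)*(?= *\\n|$)";
  const pattern = new RegExp(full + "|" + partial + "|\\S+", "g");
  return text.match(pattern) || []; // match() returns null when nothing matches
}

var sample = "this is some text that I'm typing here \n yes I really am";
console.log(chunkWords(sample, 3));
// ["this is some", "text that I'm", "typing here", "yes I really", "am"]
```

Building the pattern with the RegExp constructor means the backslashes have to be doubled inside the string literals.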

Try something like this:
var words = "this is some text that I'm typing here \n yes I really am".split(" ");
var result = [];
var temp = "";
for (var i = 0; i < words.length; i++) {
  if ((i + 1) % 3 == 0) {
    // every third word completes a group
    result.push(temp + words[i]);
    temp = "";
  } else if (i == words.length - 1) {
    // last word: flush whatever is left
    result.push(temp + words[i]);
  } else {
    temp += words[i] + " ";
  }
}
console.log(result);
Basically what this does is split the string into words, then loop through each word. Every third word it gets to, it adds that word along with what is stored in temp to the array; otherwise it adds the word to temp. Note that it does not treat the newline specially: "\n" is counted as a word like any other.
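If the newline should start a new group, one possible variation of the loop approach is to split on newlines first and then group the words of each line. This is only a sketch; the function name groupWords is made up for illustration.

```javascript
// Hypothetical helper: groups words n at a time, restarting at every newline.
function groupWords(text, n) {
  const result = [];
  // Handle each line separately so a newline always starts a new group.
  for (const line of text.split("\n")) {
    // Split on any whitespace and drop empty strings left by leading/trailing spaces.
    const words = line.split(/\s+/).filter(Boolean);
    for (let i = 0; i < words.length; i += n) {
      result.push(words.slice(i, i + n).join(" "));
    }
  }
  return result;
}

console.log(groupWords("this is some text that I'm typing here \n yes I really am", 3));
// ["this is some", "text that I'm", "typing here", "yes I really", "am"]
```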

Only if you know there are no words 'left', so the number of words is always a multiple of 3:
"this is some text that I'm typing here \n yes I really am".match(/\S+\s+\S+\s+\S+/g)
=> ["this is some", "text that I'm", "typing here \n yes", "I really am"]
but if you add a word:
"this is some text that I'm typing here \n yes I really am FOO".match(/\S+\s+\S+\s+\S+/g)
the result will be exactly the same, so "FOO" is missing.

Here's one more way:
Use this pattern: ((?:(?:\S+\s){3})|(?:.+)(?=\n|$))

Related

Why is there such a difference between "" and " " in .split()?

So I have this code
function upperCase (text) {
let arr = text.split(" ");
let arr2 = [];
for (let i = 0; i < arr.length; i++) {
arr2.push(arr[i].charAt(0).toUpperCase()+arr[i].slice(1));
}
return arr2.join(" ");
}
console.log(upperCase("something something"));
The current output is Something Something. But if I change the separators in both .split() and .join() from " " to "", the output is all capitalized (SOMETHING SOMETHING). I don't understand why this happens. How does one space between the quotes make all characters capitalized?
split(" ") splits it into "something","something"
split("") splits it into "s","o","m","e","t","h","i","n","g", "s","o","m","e","t","h","i","n","g"
The uppercasing happens because in the second case you operate on lots of one-character strings, and every one of them gets its first (and only) character uppercased.
The parameter of split() is the separator on which the string is split. So if you provide a blank space " ", your string will be split into "words".
But if you provide an empty string "", the string will get split at every position, like Patrick Artner pointed out.
You could also split on comma "," or semicolon ";" or anything else.
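To see the difference side by side, here is a small sketch. The capitalize helper is only for illustration and mirrors what the loop in the question does to each element.

```javascript
const text = "something something";

const words = text.split(" "); // ["something", "something"]
const chars = text.split("");  // one entry per character, the space included

// Same operation as in the question: uppercase the first character of each element.
const capitalize = s => s.charAt(0).toUpperCase() + s.slice(1);

console.log(words.map(capitalize).join(" ")); // "Something Something"
console.log(chars.map(capitalize).join(""));  // "SOMETHING SOMETHING"
```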

Regex Match Punctuation Space but Retain Punctuation

I have a large paragraph string which I'm trying to split into sentences using JavaScript's .split() method. I need a regex that will match a period or a question-mark [?.] followed by a space. However, I need to retain the period/question-mark in the resulting array. How can I do this without positive lookbehinds in JS?
Edit: Example input:
"This is sentence 1. This is sentence 2? This is sentence 3."
Example output:
["This is sentence 1.", "This is sentence 2?", "This is sentence 3."]
This regex will work
([^?.]+[?.])(?:\s|$)
var str = 'This is sentence 1. This is sentence 2? This is sentence 3.';
var regex = /([^?.]+[?.])(?:\s|$)/gm;
var m;
while ((m = regex.exec(str)) !== null) {
document.writeln(m[1] + '<br>');
}
Forget about split(). You want match()
var text = "This is an example paragragh. Oh and it has a question? Ok it's followed by some other random stuff. Bye.";
var matches = text.match(/[\w\s'\";\(\)\,]+(\.|\?)(\s|$)/g);
alert(matches);
The generated matches array contains each sentence:
Array[4]
0:"This is an example paragragh. "
1:"Oh and it has a question? "
2:"Ok it's followed by some other random stuff. "
3:"Bye."
Here is the fiddle with it for further testing: https://jsfiddle.net/uds4cww3/
Edited to match end of line too.
Maybe this one matches your array items:
\b.*?[?\.](?=\s|$)
This is tacky, but it works:
var breakIntoSentences = function(s) {
var l = [];
s.replace(/[^.?]+.?/g, a => l.push(a));
return l;
}
breakIntoSentences("how? who cares.")
["how?", " who cares."]
(Really how it works: the RE matches a string of not-punctuation, followed by something. Since the match is greedy, that something is either punctuation or the end-of-string.)
This will only capture the first in a series of punctuation, so breakIntoSentences("how???? who cares...") also returns ["how?", " who cares."]. If you want to capture all the punctuation, use /[^.?]+[.?]*/g as the RE instead.
Edit: Hahaha: Wavvves teaches me about match(), which is what the replace/push does. You learn something new every goddamn day.
In its minimal form, supporting three punctuation marks, and using ES6 syntax, we get:
const breakIntoSentences = s => s.match(/[^.?,]+[.?,]*/g)
I guess .match will do it:
(?:\s?)(.*?[.?])
I.e.:
sentence = "This is sentence 1. This is sentence 2? This is sentence 3.";
result = sentence.match(/(?:\s?)(.*?[.?])/ig);
for (var i = 0; i < result.length; i++) {
document.write(result[i]+"<br>");
}
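Several of the match() answers above keep the whitespace between sentences. A minimal sketch that trims it, assuming sentences end only in "." or "?" (the function name splitSentences is made up here):

```javascript
// Hypothetical helper: returns the sentences with their terminating punctuation.
function splitSentences(paragraph) {
  // One or more non-terminators followed by one or more terminators.
  const matches = paragraph.match(/[^.?]+[.?]+/g) || [];
  // Drop the space that separated each sentence from the previous one.
  return matches.map(s => s.trim());
}

console.log(splitSentences("This is sentence 1. This is sentence 2? This is sentence 3."));
// ["This is sentence 1.", "This is sentence 2?", "This is sentence 3."]
```

This simple pattern also splits on periods inside abbreviations or decimal numbers, and it drops trailing text that has no terminating punctuation.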

Remove (n)th space from string in JavaScript

I am trying to remove some spaces from a few dynamically generated strings. Which space I remove depends on the length of the string. The strings change all the time so in order to know how many spaces there are, I iterate over the string and increment a variable every time the iteration encounters a space. I can already remove all of a specific type of character with str.replace(' ',''); where 'str' is the name of my string, but I only need to remove a specific occurrence of a space, not all the spaces. So let's say my string is
var str = "Hello, this is a test.";
How can I remove ONLY the space after the word "is"? (Assuming that the next string will be different so I can't just write str.replace('is ','is'); because the word "is" might not be in the next string).
I checked documentation on .replace, but there are no other parameters that it accepts so I can't tell it just to replace the nth instance of a space.
If you want to go by indexes of the spaces:
var str = 'Hello, this is a test.';
function replace(str, indexes){
return str.split(' ').reduce(function(prev, curr, i){
var separator = ~indexes.indexOf(i) ? '' : ' ';
return prev + separator + curr;
});
}
console.log(replace(str, [2,3]));
http://jsfiddle.net/96Lvpcew/1/
As it is easy for you to get the index of the space (as you are iterating over the string) , you can create a new string without the space by doing:
str = str.substr(0, index) + str.substr(index + 1);
where index is the index of the space you want to remove.
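If only the ordinal of the space is known (for example "the third space"), the index can be found with repeated indexOf calls first. A sketch under that assumption, with a hypothetical function name:

```javascript
// Hypothetical helper: removes the n-th space (1-based) from str.
function removeSpaceAt(str, n) {
  if (n < 1) return str;
  let index = -1;
  for (let i = 0; i < n; i++) {
    // Look for the next space after the previous one.
    index = str.indexOf(" ", index + 1);
    if (index === -1) return str; // fewer than n spaces: nothing to remove
  }
  // Keep everything before and after the space, skipping the space itself.
  return str.slice(0, index) + str.slice(index + 1);
}

console.log(removeSpaceAt("Hello, this is a test.", 3)); // "Hello, this isa test."
```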
I came up with this for unknown indices
function removeNthSpace(str, n) {
var spacelessArray = str.split(' ');
return spacelessArray
.slice(0, n - 1) // left prefix part may be '', saves spaces
.concat([spacelessArray.slice(n - 1, n + 1).join('')]) // middle part: the one without the space
.concat(spacelessArray.slice(n + 1)).join(' '); // right part, saves spaces
}
Do you know which space you want to remove because of word count or chars count?
If char count, you can use Rafaels Cardoso's answer.
If word count you can split them with space and join however you want:
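var wordArray = str.split(" ");
var newStr = "";
var wordIndex = 3; // or whatever you want
for (var i = 0; i < wordArray.length; i++) {
  newStr += wordArray[i];
  // add a space after every word except the chosen one and the last one
  if (i != wordIndex && i != wordArray.length - 1) {
    newStr += ' ';
  }
}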
I think your best bet is to split the string into an array based on placement of spaces in the string, splice off the space you don't want, and rejoin the array into a string.
Check this out:
var x = "Hello, this is a test.";
var n = 3; // we want to remove the third space
var arr = x.split(/([ ])/); // copy to an array based on space placement
// arr: ["Hello,"," ","this"," ","is"," ","a"," ","test."]
arr.splice(n*2-1,1); // Remove the third space
x = arr.join("");
alert(x); // "Hello, this isa test."
Further Notes
The first thing to note is that str.replace(' ',''); will actually only replace the first instance of a space character. String.replace() also accepts a regular expression as the first parameter, which you'll want to use for more complex replacements.
To actually replace all spaces in the string, you could do str.replace(/ /g,""); and to replace all whitespace (including spaces, tabs, and newlines), you could do str.replace(/\s/g,"");
To fiddle around with different regular expressions and see what they mean, I recommend using http://www.regexr.com
A lot of the functions on the JavaScript String object that seem to take strings as parameters can also take regular expressions, including .split() and .search().
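String.replace() also accepts a function as its second parameter, which gives another way to target the n-th occurrence: count the matches and only replace the one you want. A rough sketch (the name removeNthMatch is hypothetical, and the pattern is assumed to carry the g flag):

```javascript
// Hypothetical helper: removes only the n-th match (1-based) of a global pattern.
function removeNthMatch(str, pattern, n) {
  let count = 0;
  // The callback runs once per match; every match except the n-th is put back unchanged.
  return str.replace(pattern, match => (++count === n ? "" : match));
}

console.log(removeNthMatch("Hello, this is a test.", / /g, 3)); // "Hello, this isa test."
```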

Javascript regex parsing dots and whitespaces

In JavaScript I have several words separated by either a dot or one or more whitespaces (or the end of the string).
I'd like to replace certain parts of it to insert custom information at the appropriate places.
Example:
var x = "test1.test test2 test3.xyz test4";
If there's a dot it should be replaced with ".X_"
If there's one or more space(s) and the word before does not contain a dot, replace with ".X "
So the desired output for the above example would be:
"test1.X_test test2.X test3.X_xyz test4.X"
Can I do this in one regex replace? If so, how?
If I need two or more what would they be?
Thanks a bunch.
Try this:
var str = 'test1.test test2 test3.xyz test4';
str = str.replace(/(\w+)\.(\w+)/g, '$1.X_$2');
str = str.replace(/( |^)(\w+)( |$)/g, '$1$2.X$3');
console.log(str);
In the first replace it replaces the dot in the dotted words with a .X_, where a dotted word is two words with a dot between them.
In the second replace it adds .X to words that have no dot, where words that have no dot are words that are preceded by a space OR the start of the string and are followed by a space OR the end of the string.
To answer this:
If there's a dot it should be replaced with ".X_"
If there's one or more spaces it should be replaced with ".X"
Do this:
x.replace(/\./g, '.X_').replace(/\s+/g, '.X');
Edit: To get your desired output (rather than your rules), you can do this:
var words = x.replace(/\s+/g, ' ').split(' ');
for (var i = 0, l = words.length; i < l; i++) {
if (words[i].indexOf('.') === -1) {
words[i] += ".X";
}
else {
words[i] = words[i].replace(/\./g, '.X_');
}
}
x = words.join(' ');
Basically...
Strip all multiple spaces and create an array of "words"
Loop through each word.
If it doesn't have a period in it, then add ".X" to the end of the word
Else, replace the periods with ".X_"
Join the "words" back into a string and separate it by spaces.
Edit 2:
Here's a solution using only javascript's replace function:
x.replace(/\s+/g, ' ') // replace multiple spaces with one space
.replace(/\./g, '.X_') // replace dots with .X_
// find words without dots and add a ".X" to the end
.replace(/(^|\s)([^\s\.]+)($|\s)/g, "$1$2.X$3");
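One caveat with patterns that end in ( |$) or ($|\s): the match consumes the space that separates it from the next word, so of two adjacent dot-less words the second can be skipped. A possible variation, shown here only as a sketch with a made-up function name, checks the following whitespace with a lookahead instead of consuming it:

```javascript
// Hypothetical helper combining the replace steps, with a lookahead at the end.
function tagWords(input) {
  return input
    .trim()
    .replace(/\s+/g, " ")   // collapse runs of whitespace into one space
    .replace(/\./g, ".X_")  // replace dots with .X_
    // Words without dots: the lookahead leaves the following space unconsumed,
    // so the next word can still match on it.
    .replace(/(^|\s)([^\s.]+)(?=\s|$)/g, "$1$2.X");
}

console.log(tagWords("test1.test test2 test3.xyz test4"));
// "test1.X_test test2.X test3.X_xyz test4.X"
console.log(tagWords("a b c"));
// "a.X b.X c.X"
```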

Regex won't match words as expected

I am trying to use XRegExp to test if a string is a valid word according to these criteria:
The string begins with one or more Unicode letters, followed by
an apostrophe (') followed by one or more Unicode letters, repeated 0 or more times.
The string ends immediately after the matched pattern.
That is, it will match these terms
Hello can't Alah'u'u'v'oo O'reilly
but not these
eatin' 'sup 'til
I am trying this pattern,
^(\\p{L})+('(\\p{L})+)*$
but it won't match any words that contain apostrophes. What am I doing wrong?
EDIT: The code using the regex
var separateWords = function(text) {
var word = XRegExp("(\\p{L})+('(\\p{L})+)*$");
var splits = [];
for (var i = 0; i < text.length; i++) {
var item = text[i];
while (i + 1 < text.length && word.test(item + text[i + 1])) {
item += text[i + 1];
i++;
}
splits.push(item);
}
return splits;
};
I think you will need to omit the string start/end anchors to match single words:
"(\\p{L})+('(\\p{L})+)*"
Also I'm not sure what those capturing groups are needed for (that may depend on your application), but you could shorten them to
"\\p{L}+('\\p{L}+)*"
Try this regex:
^[^'](?:[\w']*[^'])?$
First it checks to ensure the first character is not an apostrophe. Then it either gets any number of word characters or apostrophes followed by anything other than an apostrophe, or it gets nothing (one-letter word).
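For validating a whole string against the criteria in the question, a sketch using native Unicode property escapes instead of XRegExp is shown below. It assumes a JavaScript engine that supports \p{L} with the u flag (ES2018); the name isWord is only illustrative.

```javascript
// Letters, then zero or more groups of an apostrophe followed by letters,
// anchored at both ends so the whole string has to fit the pattern.
const isWord = s => /^\p{L}+(?:'\p{L}+)*$/u.test(s);

console.log(["Hello", "can't", "Alah'u'u'v'oo", "O'reilly"].map(isWord)); // [true, true, true, true]
console.log(["eatin'", "'sup", "'til"].map(isWord));                      // [false, false, false]
```

A non-capturing group (?:...) is used because the captured text is not needed for a plain test.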
