Splitting a string into words and keeping delimiter - javascript

I want to split up a string (sentence) in an array of words and keep the delimiters.
I have found and I am currently using this regex for this:
[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)
An explanation can be found here: http://regex101.com/
This works exactly as I want it to and effectively makes a string like
This is a sentence.
To an array of
["This", "is", "a", "sentence."]
The problem here is that it does not include spaces or newlines. I want the string to be parsed into words as it already is, but I also want the corresponding space and/or newline character to belong to the previous word.
I have read that a positive lookahead can look for upcoming characters (a space and/or a newline) and still take them into account when extracting the word. Although this might be the solution, I have failed to implement it.
If it makes any difference I am using JavaScript and the following code:
// save the regex -- g modifier to get all matches
var reg = /[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)/g;
// define variable for holding matches
var matches;
// loop through each match
while (matches = reg.exec(STRING_HERE)) {
    // the word without spaces or newlines
    console.log(matches[0]);
}
The code works but as I said, it does not include spaces and newline characters.

You can try something simpler:
str.split(/\b(?!\s)/);
However, note non word characters (e.g. full stop) will be considered another word:
"This is a sentence.".split(/\b(?!\s)/);
// [ "This ", "is ", "a ", "sentence", "." ]
To fix that, you can use a character class with the characters that shouldn't begin another word:
str.split(/\b(?![\s.])/);
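As a quick check of the two patterns above, the following sketch runs both against a string that contains a space and a newline. The sample text and the variable names are made up for illustration.

```javascript
// Hypothetical sample input: words separated by spaces and a newline.
const sample = "Hello world\nfoo bar.";

// Split at every word boundary that is not followed by whitespace.
// The full stop ends up as a chunk of its own.
const withPunctuationSplit = sample.split(/\b(?!\s)/);

// Same idea, but a full stop may not start a new chunk either,
// so it stays attached to the preceding word.
const wordsWithDelimiters = sample.split(/\b(?![\s.])/);

console.log(withPunctuationSplit); // [ 'Hello ', 'world\n', 'foo ', 'bar', '.' ]
console.log(wordsWithDelimiters);  // [ 'Hello ', 'world\n', 'foo ', 'bar.' ]
```

Because the pattern is zero-width, nothing is consumed by the split, so joining the pieces gives back the original string.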

function split_string(str) {
    var arr = str.split(" ");
    var last_i = arr.length - 1;
    for (var i = 0; i < last_i; i++) {
        arr[i] += " ";
    }
    return arr;
}

It may be as simple as this:
var sentence = 'This is a sentence.';
sentence = sentence.split(' ').join(' ||');
sentence = sentence.split('\n').join('\n||');
var matches = sentence.split('||');
Note that I use 2 pipes as a delimiter, but of course you can use anything as long as it's unique.
Also note that I only split \n as a newline, but you may add \r\n or whatever you want to split as well.

General Solution
To keep the delimiters conjoined in the results, the regex needs to be a zero-width match. In other words, the regex can be thought of as matching the point between a delimiter and non-delimiter, rather than matching the delimiters themselves. This can be achieved with zero-width matching expressions, matching before, at, or after the split point (at most one each); let's call these A, B, and C. Sometimes a single sub-expression will do it, others you'll need two; offhand, I can't think of a case where you'd need three.
Not only look-aheads but lookarounds in general are the perfect candidates for this purpose: lookbehinds ((?<=...)) to match before the split point, and lookaheads ((?=...)) after. That's the essence of this approach. Positive or negative lookarounds can be used. The one pitfall is that lookbehinds are relatively new to JS regexes, so not all browsers or other JS engines will support them (at the time of writing, current versions of Firefox, Chrome, Opera, Edge, and node.js do; Safari does not). If you need to support a JS engine that doesn't support lookbehinds, you might still be able to write & use a regex that matches at-and-after (BC).
To have the delimiters appear at the end of each match, put them in A. To have them at the start, in C. Fortunately, JS regexes do not place restrictions on lookbehinds, so simply wrapping the delimiter regex in the positive lookaround markers should be all that's required for delimiters. If the delimiters aren't so simple (i.e. context-sensitive), it might take a little more work to write the regex, which doesn't need to match the entire delimiter.
Paired with the delimiter pattern, you'll need to write a pattern that matches the start (for C) or end (for A) of the non-delimiter. This step is likely the one that will require the most additional work.
The at-split-point match, B, will often (always?) be a simple boundary, such as \b.
Specific Solution
If spaces are the only delimiters, and they're to appear at the end of each match, the delimiter pattern would be (?<=\s), in A. However, there are some cases not covered in the problem description. For example, should words separated by only punctuation (e.g. "x.y") be split? On which side of a split point should quotation marks and hyphens appear, if any? Should they count as punctuation? Another option for the delimiter is to match (after) all non-word characters, in which case A would be (?<=\W).
Since the split-point is at a word boundary, B could be \b.
Since the start of a match is a word character, (?=\w) will suffice for C.
Any two of those three should suffice. One that is perhaps clearest in meaning (and splits at the most points) is /(?<=\W)(?=\w)/, which can be translated as "split at the start of each word". \b could be added, if you find it more understandable, though it has no functional effect: /(?<=\W)\b(?=\w)/.
Note Oriol's excellent solutions are given by B=\b and (C=(?!\s) or C=(?![\s.])).
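The combinations discussed above can be compared side by side. This is only a sketch: the sample text is invented, and the lookbehind variants assume an engine with lookbehind support (for instance a recent Node.js).

```javascript
// Hypothetical sample: includes a newline and an "x.y" case.
const text = "This is a sentence.\nx.y z";

// A + C: split after any non-word character and before a word character.
const atWordStart = text.split(/(?<=\W)(?=\w)/);

// A + C with whitespace as the only delimiter: "x.y" stays in one piece.
const afterWhitespace = text.split(/(?<=\s)(?=\S)/);

// B + C: a word boundary followed by a word character; no lookbehind needed.
const boundaryThenWord = text.split(/\b(?=\w)/);

console.log(atWordStart);      // [ 'This ', 'is ', 'a ', 'sentence.\n', 'x.', 'y ', 'z' ]
console.log(afterWhitespace);  // [ 'This ', 'is ', 'a ', 'sentence.\n', 'x.y ', 'z' ]
console.log(boundaryThenWord); // same pieces as atWordStart
```

The B + C form gives the same pieces as the first one here, which makes it a possible fallback where lookbehinds are unavailable.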
Additional
As a point of interest, there would be a simpler solution for this particular case if JS regexes supported TCL word boundaries: \m matches only at the start of a word, so str.split(/\m/) would split exactly at the start of each word. (\m is equivalent to (?<=\W)(?=\w).)

If you want to include the whitespace after the word, the regex \S+\s* should work.
const s = `This is a sentence.
This is another sentence.`;
console.log(s.match(/\S+\s*/g))

Related

Remove all non-latin passages from a string with regex

I need to remove all passages that contain non-latin characters from a string however unlike a lot of answers I have seen, I want to also remove the punctuation in those passages while leaving the same punctuation in English passages.
To say it in another way, when a non-latin character such as "ָהּ" is encountered, the regex will start skipping everything including ascii punctuation until an [a-zA-Z] character is found.
I have tried the following example, but it's incorrectly removing the quote after "halves", leading me to believe I don't have a good definition of non-latin characters.
[\u0250-\ue007][^a-zA-Z]*
Here is an example of input text (updated):
or perhaps, a - אוֹ דִילְמָא אֵין אִשָּׁה מִתְקַדְּשֶׁת לַחֲצָאִין כְּלָל (12);time
תֵּיקוּ
person cannot be in separate halves at all, even
though both "halves” would come together simultaneously?(13)
The speaker replies:(14)
and the resulting string is:
or perhaps, a - time
person cannot be in separate halves at all, even
though both "halveswould come together simultaneously?(13)
The speaker replies:(14)
As you can see, it messes up on the third line. Obviously, I could just exclude that particular character but I'm worried it will mess up on other edge cases.
Any other ideas? (I'm working with Javascript btw)
I understand that by "a non-latin character such as הּ" you mean any non-ASCII letter.
To match any letter other than an ASCII letter, you can use [^\P{L}a-zA-Z]. This is a negated character class that matches any chars other than a non-letter char (\P{L}) and ASCII letters (a-zA-Z). So, it is basically the \p{L} pattern with the exception of ASCII letters.
This pattern is based on Unicode property escapes, which require the u flag. They are supported by Node.js and other JavaScript environments that implement ES2018.
The solution will look like
text = text.replace(/[^\P{L}a-z][^a-z]*/gui, '')
Note the g flag makes replace replace all occurrences in the string and i is used to shorten the ASCII letter pattern (since it makes the pattern matching case insensitive).
See the JavaScript demo:
const text = `or perhaps, a - אוֹ דִילְמָא אֵין אִשָּׁה מִתְקַדְּשֶׁת לַחֲצָאִין כְּלָל (12);time
תֵּיקוּ
person cannot be in separate halves at all, even
though both "halves” would come together simultaneously?(13)
The speaker replies:(14)`;
console.log(
text.replace(/[^\P{L}a-z][^a-z]*/gui, '')
)
Output:
or perhaps, a - time
person cannot be in separate halves at all, even
though both "halves” would come together simultaneously?(13)
The speaker replies:(14)
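To see what the character class described above accepts, it can be tested against single characters. This is a small sketch with sample characters chosen for illustration; it also checks the range from the question, which happens to contain the right double quotation mark (U+201D) and so explains the lost quote.

```javascript
// Letters that are not ASCII letters: \p{L} minus a-z and A-Z.
const nonAsciiLetter = /[^\P{L}a-zA-Z]/u;

// Sample characters: ASCII letters, e-acute, Hebrew alef, a digit,
// a question mark and the right double quotation mark.
const samples = ["a", "Z", "\u00E9", "\u05D0", "1", "?", "\u201D"];
const results = samples.map((ch) => nonAsciiLetter.test(ch));
console.log(results); // [ false, false, true, true, false, false, false ]

// The range used in the question, for comparison.
const originalRange = /[\u0250-\ue007]/;
console.log(originalRange.test("\u201D")); // true: the curly quote is inside the range
```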

How to match all words starting with dollar sign but not slash dollar

I want to match all words which are starting with dollar sign but not slash and dollar sign.
I have already tried a few regexes.
(?:(?!\\)\$\w+)
\\(\\?\$\w+)\b
String
$10<i class="">$i01d</i>\$id
Expected result
*$10*
*$i01d*
but not this
*$id*
After finding all the expected matching words, I want to replace them using my object.
One option is to eliminate escape sequences first, and then match the cleaned-up string:
s = String.raw`$10<i class="">$i01d</i>\$id`
found = s.replace(/\\./g, '').match(/\$\w+/g)
console.log(found)
The big problem here is that you need a negative lookbehind, which JavaScript engines that predate ES2018 do not support. It's possible to emulate it crudely, but I will offer an alternative which, while not great, will work:
var input = '$10<i class="">$i01d</i>\\$id';
var regex = /\b\w+\b\$(?!\\)/g;
// sample implementation of a string reversal function. There are better implementations out there
function reverseString(string) {
    return string.split("").reverse().join("");
}
var reverseInput = reverseString(input);
var matches = reverseInput
    .match(regex)
    .map(reverseString);
console.log(matches);
It is not elegant but it will do the job. Here is how it works:
JavaScript does support a lookahead expression ((?=)) and a negative lookahead ((?!)). Since this is the reverse of a negative lookbehind, you can reverse the string and reverse the regex, which will match exactly what you want. Since all the matches are going to be in reverse, you need to also reverse them back to the original.
It is not elegant, as I said, since it does a lot of string manipulations but it does produce exactly what you want.
See this in action on Regex101
Regex explanation
Normally, "match x as long as it's not preceded by y" is expressed as (?<!y)x, so in your case, the regex will be
/(?<!\\)\$\b\w+\b/g
demonstration (not JavaScript)
where
(?<!\\) //do not match a preceding "\"
\$ //match literal "$"
\b //word boundary
\w+ //one or more word characters
\b //second word boundary, hence making the match a word
When the input is reversed, all the tokens have to be reversed as well in order to match. Furthermore, the negative lookbehind gets inverted into a negative lookahead of the form x(?!y), so the new regular expression is
/\b\w+\b\$(?!\\)/g;
This is more difficult than it appears at first blush. How like Regular Expressions!
If you have look-behind available, you can try:
/(?<!\\)\$\w+/g
This is NOT available in JS engines that predate ES2018. Alternatively, you could specify a boundary that you know exists and use a capture group like:
/\s(\$\w+)/g
Unfortunately, you cannot rely on word boundaries via \b because there's no such boundary before '\'.
Also, this is a cool site for testing your regex expressions. And this explains the word boundary anchor.
If you're using a language that supports negative lookback assertions you can use something like this.
(?<!\\)\$\w+
I think this is the cleanest approach, but unfortunately it's not supported by all languages.
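In a JavaScript engine that implements ES2018 lookbehind assertions (recent Node.js versions, for example), the pattern above can be tried directly. The sketch below uses the string from the question; the values object is purely hypothetical and only stands in for "my object" from the question.

```javascript
// The question's input; String.raw keeps the backslash literal.
const input = String.raw`$10<i class="">$i01d</i>\$id`;

// A "$word" that is not preceded by a backslash.
const unescapedDollarWord = /(?<!\\)\$\w+/g;

const found = input.match(unescapedDollarWord);
console.log(found); // [ '$10', '$i01d' ]

// Hypothetical lookup table used for the replacement step.
const values = { $10: "ten", $i01d: "id-1" };
const replaced = input.replace(unescapedDollarWord, (m) =>
  m in values ? values[m] : m
);
console.log(replaced); // ten<i class="">id-1</i>\$id
```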
This is a hackier implementation that may work as well.
(?:(^\$\w+)|[^\\](\$\w+))
This matches either
A literal $ at the beginning of a line followed by one or more word characters. Or...
A literal $ that is preceded by any character except a backslash, followed by one or more word characters.
Here is a working example.
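Since this pattern puts the wanted text in one of two capture groups, the groups have to be read back out of each match. A minimal sketch of that in JavaScript, using the string from the question (the variable names are illustrative):

```javascript
// The question's input; String.raw keeps the backslash literal.
const text = String.raw`$10<i class="">$i01d</i>\$id`;

// Group 1: "$word" at the very start. Group 2: "$word" after a non-backslash.
const pattern = /(?:(^\$\w+)|[^\\](\$\w+))/g;

// Take whichever group took part in each match.
const words = Array.from(text.matchAll(pattern), (m) => m[1] || m[2]);
console.log(words); // [ '$10', '$i01d' ]

// Caveat: the second alternative consumes the preceding character,
// so a "$word" directly after another match is missed.
const adjacent = Array.from("$a$b".matchAll(pattern), (m) => m[1] || m[2]);
console.log(adjacent); // [ '$a' ]
```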

RegEx: Understanding Syllable Counter Code

I have used Dylan's question on here regarding JavaScript syllable counting, and more specifically artfulhacker's answer, in my own code and, regardless of which single or multi word string I feed it, the function is always able to correctly count the number of syllables.
I have limited experience with RegEx and not enough prior knowledge to decipher what exactly is happening in the following code without some help. I'm not someone who is ever happy with having some code I pulled from somewhere just work without me knowing how it works. Is someone able to please articulate what is happening in the new_count(word) function below and help me decipher the use of RegEx and how it is that the function is able to correctly count syllables?
function new_count(word) {
    word = word.toLowerCase(); //word.downcase!
    if (word.length <= 3) { return 1; } //return 1 if word.length <= 3
    word = word.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, ''); //word.sub!(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '')
    word = word.replace(/^y/, ''); //word.sub!(/^y/, '')
    return word.match(/[aeiouy]{1,2}/g).length; //word.scan(/[aeiouy]{1,2}/).size
}
As far as I see it, we basically want to count the vowels, or vowel pairs, with some special cases. Let's start by the last line, which does that, i.e. count vowels and pairs:
return word.match(/[aeiouy]{1,2}/g).length;
This will match any vowel, or vowel pair. [...] means a character class, i.e. that if we go through the string character-by-character, we have a match if the actual character is one of those. {1,2} is the number of repetitions, i.e. it means that we should match one or two such characters (two if possible, since the quantifier is greedy).
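To make the vowel-group matching concrete, here is a small sketch that lists the groups found in a few sample words. The helper name groupsIn and the fallback to an empty array are my additions; match returns null when nothing matches, which the original one-liner does not guard against.

```javascript
// Returns the runs of one or two vowels found in a word (hypothetical helper).
const groupsIn = (word) => word.match(/[aeiouy]{1,2}/g) || [];

console.log(groupsIn("poem"));      // [ 'oe' ]
console.log(groupsIn("beautiful")); // [ 'ea', 'u', 'i', 'u' ]
console.log(groupsIn("rhythm"));    // [ 'y' ]
console.log(groupsIn("zzz"));       // []
```

Note how "beautiful" yields four groups because the run "eau" is cut after two characters, so this step alone can overcount.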
The other two lines are for special cases.
word = word.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '');
This line will remove 'syllables' from the end of the word, which are either:
Xes (where X is anything but any of 'laeiouy', e.g. 'zes')
ed
Xe (where X is anything but any of 'laeiouy', e.g. 'xe')
(I'm not really sure what the grammatical meaning behind this is, but I guess, that 'syllables' at the end of the word, like '-ed', '-ded', '-xed' etc. don't really count as such.)
As for the regexp part: (?:...) is a non-capturing group. I guess it's not really important in this case that this group be non-capturing; this just means that we would like to group the whole expression, but then we do not need to refer back to it. However, we could have used a capturing group too (i.e. (...) )
The [^...] is a negated character class. It means, match any character, which is none of those listed here. (Compare to the (non-negated) character-class mentioned above.)
The pipe symbol, i.e. |, is the alternation operator, which means, that any of the expressions can match.
Finally, the $ anchor matches the end of the line, or string (depending on the context).
word = word.replace(/^y/, '');
This line removes 'y'-s from the beginning of words (probably 'y' at the beginning does not count as a syllable -- which makes sense in my opinion).
^ is the anchor for matching the beginning of the line, or string (c.f. $ mentioned above).
Note: the algorithm only works if word really contains one single word.
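The three steps described above can be traced with a variant of the function that returns its intermediate results. This is a sketch for illustration only: syllableSteps is a made-up name, and unlike the original it returns a count of 0 rather than throwing when a word has no vowels.

```javascript
// Same steps as new_count, but exposing the intermediate values.
function syllableSteps(word) {
  const lower = word.toLowerCase();
  if (lower.length <= 3) {
    // Short words are counted as one syllable without further analysis.
    return { stripped: lower, groups: [], count: 1 };
  }
  // Step 1: drop a silent-looking ending (Xes, ed, Xe).
  const withoutEnding = lower.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, "");
  // Step 2: drop a leading y.
  const stripped = withoutEnding.replace(/^y/, "");
  // Step 3: count runs of one or two vowels.
  const groups = stripped.match(/[aeiouy]{1,2}/g) || [];
  return { stripped, groups, count: groups.length };
}

console.log(syllableSteps("jumped")); // stripped 'jump',  groups [ 'u' ],      count 1
console.log(syllableSteps("table"));  // stripped 'table', groups [ 'a', 'e' ], count 2
console.log(syllableSteps("yellow")); // stripped 'ellow', groups [ 'e', 'o' ], count 2
```

The "table" case shows why 'l' is excluded in the first pattern: the final "le" is kept and counted as a syllable.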
/(?:[^laeiouy]es|ed|[^laeiouy]e)$/
That matches three possible substrings: a letter other than 'l' or a vowel followed by 'es' (like "res" or "tes"); 'ed'; or a non-vowel, non-'l' followed by just an 'e'. Those patterns must appear at the end of the word to match because of the $ at the end of the pattern. The grouping (?: ) is just a grouping; the leading ?: makes that distinction. The pattern could have been a little shorter:
/(?:[^laeiouy]es?|ed)$/
would do the same thing. In any case, if the pattern matches the characters involved are removed from the word.
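A quick way to convince yourself that the shortened pattern behaves the same is to run both over a handful of words and look for differences. The word list below is arbitrary and only a spot check, not a proof.

```javascript
const original = /(?:[^laeiouy]es|ed|[^laeiouy]e)$/;
const shortened = /(?:[^laeiouy]es?|ed)$/;

// Arbitrary sample words covering the three endings and some non-matches.
const words = ["makes", "jumped", "table", "tables", "house", "goes", "syllable"];

// Collect every word on which the two patterns disagree.
const mismatches = words.filter(
  (w) => w.replace(original, "") !== w.replace(shortened, "")
);

console.log(mismatches); // []
```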
Then,
/^y/
matches a 'y' at the beginning of a word. If a 'y' is found, it's removed.
Finally,
/[aeiouy]{1,2}/g
matches any one- or two-character stretch of vowels (including 'y'). The g suffix makes it a global match, so that the return value is an array consisting of all such spans of vowels. The length of that returned array is the number of syllables (according to this technique).
Note that the words "poem" and "lion" would be reported as one-syllable words, which may be correct for some English variants but not all.
Here is a pretty good reference for JavaScript regular expression operators.

using regular expression to find polysyllabic words

I'm trying to use a regexp to find the number of polysyllabic words in a piece of text. My code works most of the time but doesn't pick up on some of the poly words:
polySyllableCount = lWords2.replace(/(?:[^laeiouy\s]es|ed|[^laeiouy\s]e)$/, '');
is what I use to count the syllables, and
polySyllableCount = lWords2.replace(/^y/, '');
to replace the leading Y's so they are not counted,
and finally:
try
{
    polySyllables = polySyllableCount.match(/[aeiouy]\S[aeiouy]\S[aeiouy]/g).length;
}
catch(err)
{
    console.log("No Poly Words")
}
to count the number of polysyllabic words.
My thought process is that it will find any 3 vowels in a (modified) word, separated by anything except a whitespace, to give me the number of polysyllabic words
Please note that \S also matches punctuation marks like . and , and that can be the cause of some mis-detection. For example:
'ame.na mana miu' //'ame.na' will be treated like one word with your regexp
You can replace \S with \w for better results. Of course \w will include numbers too and if you want to be really accurate, you may use [a-z]. Also you are using the /g switch. You need to add /i to it so that it searches for AEIOUY too so it will be
/...regexp.../gi
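The effect of both suggestions can be seen on a small sample. The sentence below is invented so that a full stop sits between two words; the fallback to an empty array is added because match returns null when nothing matches.

```javascript
// Hypothetical sample: "ate.a" is two words joined by a full stop.
const sampleText = "we ate.a banana";

// \S lets the full stop count as a separator inside a "word".
const withNonSpace = sampleText.match(/[aeiouy]\S[aeiouy]\S[aeiouy]/g) || [];

// \w only accepts letters, digits and underscore between the vowels.
const withWordChars = sampleText.match(/[aeiouy]\w[aeiouy]\w[aeiouy]/g) || [];

console.log(withNonSpace);  // [ 'ate.a', 'anana' ]
console.log(withWordChars); // [ 'anana' ]

// Without the i flag, upper-case vowels are not matched at all.
const caseSensitive = "ABILITY".match(/[aeiouy]\w[aeiouy]\w[aeiouy]/g) || [];
const caseInsensitive = "ABILITY".match(/[aeiouy]\w[aeiouy]\w[aeiouy]/gi) || [];
console.log(caseSensitive);   // []
console.log(caseInsensitive); // [ 'ABILI' ]
```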
You can learn more here: javascriptkit.com/javatutors/redev2.shtml

Regular Expression to find words (using word boundary) where words include a dash character

Given the following Regular Expression:
\b(MyString|MyString-Dash)\b
And the text:
AString
MyString
MyString-Dash
Running a match against the text never finds a match for the second thing (MyString-Dash) because the '-' (dash) character isn't a word boundary character. The following javascript always outputs "MyString,MyString" to the "matches" div (I would like to find MyString and MyString-Dash as distinct matches). How can I define a pattern that will match both MyString and MyString-Dash ?
<html>
<body>
<h1>Content</h1>
<div id="content">
AString
MyString
MyString-Dash
</div>
<br>
<h1>Matches (expecting MyString,MyString-Dash)</h1>
<div id="matches"></div>
</body>
<script>
var content = document.getElementById('content');
var matchesDiv = document.getElementById('matches');
var pattern = '\\b(MyString|MyString-Dash)\\b';
var matches = content.innerHTML.match(pattern);
matchesDiv.innerHTML = matches;
</script>
</html>
Swap the order of your matching so that the longest possible is first:
content.innerHTML.match(/\b(MyString-Dash|MyString)\b/)
I believe regular expressions match from left to right. Just tested this in Firebug, it works.
I would also change that pattern var to a regular expression literal, from '\\b(MyString-Dash|MyString)\\b' to /\b(MyString-Dash|MyString)\b/g
You want the /g in there because that will make the regular expression return all matches, rather than just the first one.
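The difference between the variants can be checked outside the browser. In this sketch the page content is replaced by a plain string (an assumption made to keep it self-contained). It also suggests where the "MyString,MyString" output comes from: without /g, match returns the first match followed by its capture group, which here are the same text.

```javascript
// Stand-in for content.innerHTML in the question.
const content = "AString\nMyString\nMyString-Dash";

// No g flag: the result is [full match, capture group 1].
const firstOnly = Array.from(content.match("\\b(MyString|MyString-Dash)\\b"));

// g flag, shorter alternative first: the dash variant is never reached.
const shortFirst = content.match(/\b(MyString|MyString-Dash)\b/g);

// g flag, longer alternative first: both distinct strings are found.
const longFirst = content.match(/\b(MyString-Dash|MyString)\b/g);

console.log(firstOnly);  // [ 'MyString', 'MyString' ]
console.log(shortFirst); // [ 'MyString', 'MyString' ]
console.log(longFirst);  // [ 'MyString', 'MyString-Dash' ]
```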
Please see this answer for how to deal with words with dashes in them and the issues related to boundaries when you have those kinds of words.
There are a couple problems with your assumptions.
Running a match against the text never finds a match for the second thing (MyString-Dash) because the '-' (dash) character isn't a word boundary character.
There's no such thing as a word boundary character. Word boundaries are the space between characters that match \w and don't match \w. - does not match '\w', so on either side of it is a "word boundary", but that won't break your match: the - is a literal dash in your regex and the \b's are far outside of it.
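One way to see where the word boundaries fall is to insert a visible marker at each of them. A minimal sketch, with the pipe character chosen arbitrarily as the marker:

```javascript
// Insert "|" at every position where \b matches.
const marked = "MyString-Dash".replace(/\b/g, "|");
console.log(marked); // |MyString|-|Dash|

// The dash creates a boundary after "MyString", so the short pattern matches...
const matchesBeforeDash = /\bMyString\b/.test("MyString-Dash");
// ...whereas a directly following word character does not.
const matchesBeforeLetter = /\bMyString\b/.test("MyStringDash");
console.log(matchesBeforeDash, matchesBeforeLetter); // true false
```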
Second, regexen will always try to match the first thing they can in the string that matches your regex. As long as that first string in there matches, it will keep returning the first thing in there. You're asking for the first match when you ask for a match. That's the design. If you didn't want it to match MyString, don't ask for it.
Third, most regex engines prioritize 'completing a match' over length of a match. Thus, 'MyString', if it matches, will always be the first thing it returns. You'll have to wait until Perl 6 grammars for a regex engine that prioritizes length. :)
The only way for you to really do this is with two checks, one for the longer one, first, and then one for the shorter one. It will always match the first thing it finds that works. If you have a priority other than that, it's up to you to code it in as separate checks.
