Remove all non-latin passages from a string with regex - javascript

I need to remove all passages that contain non-Latin characters from a string. However, unlike a lot of answers I have seen, I also want to remove the punctuation in those passages while leaving the same punctuation in English passages.
To say it in another way, when a non-latin character such as "ָהּ" is encountered, the regex will start skipping everything including ascii punctuation until an [a-zA-Z] character is found.
I have tried the following pattern, but it's incorrectly removing the quote after "halves", leading me to believe I don't have a good definition of non-Latin characters.
[\u0250-\ue007][^a-zA-Z]*
Here is an example of input text (updated):
or perhaps, a - אוֹ דִילְמָא אֵין אִשָּׁה מִתְקַדְּשֶׁת לַחֲצָאִין כְּלָל (12);time
תֵּיקוּ
person cannot be in separate halves at all, even
though both "halves” would come together simultaneously?(13)
The speaker replies:(14)
and the resulting string is:
or perhaps, a - time
person cannot be in separate halves at all, even
though both "halveswould come together simultaneously?(13)
The speaker replies:(14)
As you can see, it messes up on the third line. Obviously, I could just exclude that particular character but I'm worried it will mess up on other edge cases.
Any other ideas? (I'm working with Javascript btw)

I understand that by "a non-latin character such as הּ" you mean any non-ASCII letter.
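Your range [\u0250-\ue007] is too broad: it also contains punctuation such as the right double quotation mark ” (U+201D), so the quote after "halves" starts a match and the space after it is consumed as well.
To match any letter other than an ASCII letter, you can use [^\P{L}a-zA-Z]. This is a negated character class that matches any char other than a non-letter char (\P{L}) and ASCII letters (a-zA-Z). So, it is basically the \p{L} pattern with the exception of ASCII letters.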
This pattern relies on Unicode property escapes, which require the u flag. They are part of ECMAScript 2018 and are supported by Node.js and current browsers.
The solution will look like
text = text.replace(/[^\P{L}a-z][^a-z]*/gui, '')
Note the g flag makes replace replace all occurrences in the string and i is used to shorten the ASCII letter pattern (since it makes the pattern matching case insensitive).
See the JavaScript demo:
const text = `or perhaps, a - אוֹ דִילְמָא אֵין אִשָּׁה מִתְקַדְּשֶׁת לַחֲצָאִין כְּלָל (12);time
תֵּיקוּ
person cannot be in separate halves at all, even
though both "halves” would come together simultaneously?(13)
The speaker replies:(14)`;
console.log(
text.replace(/[^\P{L}a-z][^a-z]*/gui, '')
)
Output:
or perhaps, a - time
person cannot be in separate halves at all, even
though both "halves” would come together simultaneously?(13)
The speaker replies:(14)
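If you want this as a reusable function, here is a minimal sketch. The name stripNonLatinPassages is made up for illustration, and I spell out a-zA-Z instead of relying on the i flag; it assumes an engine with ES2018 Unicode property escapes.

```javascript
// Hypothetical helper, not part of the original answer.
// A passage starts at any letter that is not an ASCII letter and runs
// until the next ASCII letter (or the end of the string).
function stripNonLatinPassages(text) {
  return text.replace(/[^\P{L}a-zA-Z][^a-zA-Z]*/gu, '');
}

// \u05D0 and \u05D5 are Hebrew letters, \u201D is the curly closing quote.
const sample = 'or perhaps, a - \u05D0\u05D5 (12);time\nboth "halves\u201D would come';
console.log(stripNonLatinPassages(sample));
```

The curly quote survives because it is punctuation, not a letter, so it can never start a match.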

Related

Regex match any word followed by a number (but the word can contain special characters like diacritics or accents) [duplicate]

I've looked on Stack Overflow (replacing characters.. eh, how JavaScript doesn't follow the Unicode standard concerning RegExp, etc.) and haven't really found a concrete answer to the question "How can JavaScript match accented characters (those with diacritical marks)?"
I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first), and I want to provide support for diacritics, but evidently in JavaScript it's a bit more difficult than other languages/platforms.
This was my original version, until I wanted to add diacritic support:
/^[a-zA-Z]+,\s[a-zA-Z]+$/
Currently I'm debating one of three methods to add support, all of which I have tested and work (at least to some extent, I don't really know what the "extent" is of the second approach). Here they are:
Explicitly listing all accented characters that I would want to accept as valid (lame and overly-complicated):
var accentedCharacters = "àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ";
// Build the full regex
var regex = "^[a-zA-Z" + accentedCharacters + "]+,\\s[a-zA-Z" + accentedCharacters + "]+$";
// Create a RegExp from the string version
regexCompiled = new RegExp(regex);
// regexCompiled = /^[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+,\s[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+$/
This correctly matches a last/first name with any of the supported accented characters in accentedCharacters.
My other approach was to use the . character class, to have a simpler expression:
var regex = /^.+,\s.+$/;
This would match for just about anything, at least in the form of: something, something. That's alright I suppose...
The last approach, which I just found might be simpler...
/^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/
It matches a range of Unicode characters - tested and working, though I didn't try anything crazy, just the normal stuff I see in our language department for faculty member names.
Here are my concerns:
The first solution is far too limiting, and sloppy and convoluted at that. It would need to be changed if I forgot a character or two, and that's just not very practical.
The second solution is better, concise, but it probably matches far more than it actually should. I couldn't find any real documentation on exactly what . matches, just the generalization of "any character except the newline character" (from a table on the MDN).
The third solution seems to be the most precise, but are there any gotchas? I'm not very familiar with Unicode, at least in practice, but looking at a code table/continuation of that table, \u00C0-\u017F seems to be pretty solid, at least for my expected input.
Faculty won't be submitting forms with their names in their native language (e.g., Arabic, Chinese, Japanese, etc.), so I don't have to worry about out-of-Latin-character-set characters
Which of these three approaches is most suited for the task? Or are there better solutions?
The easier way to accept all accents is this:
[A-zÀ-ú] // accepts lowercase and uppercase characters (but also [ \ ] ^ _ ` × ÷)
[A-zÀ-ÿ] // as above, but including letters with an umlaut (still includes [ \ ] ^ _ ` × ÷)
[A-Za-zÀ-ÿ] // as above, but not including [ \ ] ^ _ `
[A-Za-zÀ-ÖØ-öø-ÿ] // as above, but not including × ÷ either
See Unicode Character Table for characters listed in numeric order.
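To see concretely which extra characters each class lets through, here is a small check (a sketch; the variable names are mine). The range A-z spans the six ASCII characters that sit between Z and a, and À-ÿ spans the multiplication and division signs.

```javascript
const loose = /^[A-z]+$/;      // also covers [ \ ] ^ _ `
const strict = /^[A-Za-z]+$/;  // letters only
const between = '[\\]^_`';     // the six characters between 'Z' and 'a'

console.log(loose.test(between));   // true
console.log(strict.test(between));  // false

const signs = '\u00D7\u00F7';       // × and ÷
console.log(/^[\u00C0-\u00FF]+$/.test(signs));                          // true
console.log(/^[\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+$/.test(signs)); // false
```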
The accented Latin range \u00C0-\u017F was not quite enough for my database of names, so I extended the regex to
[a-zA-Z\u00C0-\u024F]
[a-zA-Z\u00C0-\u024F\u1E00-\u1EFF] // includes even more Latin chars
I added these code blocks (\u00C0-\u024F includes three adjacent blocks at once):
\u00C0-\u00FF Latin-1 Supplement
\u0100-\u017F Latin Extended-A
\u0180-\u024F Latin Extended-B
\u1E00-\u1EFF Latin Extended Additional
Note that \u00C0-\u00FF is actually only a part of Latin-1 Supplement. It skips unprintable control signals and all symbols except for the awkwardly-placed multiply × \u00D7 and divide ÷ \u00F7.
[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F] // exclude ×÷
If you need more code points, you can find more ranges on Wikipedia's List of Unicode characters. For example, you could also add Latin Extended-C, D, and E, but I left them out because only historians seem interested in them now, and the D and E sets don't even render correctly in my browser.
The original regex stopping at \u017F borked on the name "Șenol". According to FontSpace's Unicode Analyzer, that first character is \u0218, LATIN CAPITAL LETTER S WITH COMMA BELOW. (Yeah, it's usually spelled with a cedilla-S \u015E, "Şenol." But I'm not flying to Turkey to go tell him, "You're spelling your name wrong!")
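Here is a rough sketch of that difference, using the ranges above (the constant names are mine). It also shows one more gotcha I'd watch for: the same name typed with a combining comma below (S followed by U+0326) fails both classes until it is normalized to NFC.

```javascript
const narrow = /^[a-zA-Z\u00C0-\u017F]+$/;
const wide = /^[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F\u1E00-\u1EFF]+$/;

const composed = '\u0218enol';     // precomposed S with comma below
const decomposed = 'S\u0326enol';  // S + combining comma below

console.log(narrow.test(composed));                  // false
console.log(wide.test(composed));                    // true
console.log(wide.test(decomposed));                  // false
console.log(wide.test(decomposed.normalize('NFC'))); // true
console.log(wide.test('\u00D7'));                    // false (the × sign)
```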
Which of these three approaches is most suited for the task?
Depends on the task :-) To match exactly all Latin characters and their accented versions, the Unicode ranges probably provide the best solution. They might be extended to all non-whitespace characters, which could be done using the \S character class.
I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first)
The most basic problem I'm seeing here is not diacritics but whitespace. There are a few names that consist of multiple words, e.g. for titles. So you should go with the most generic pattern, that is, allowing everything but the comma that separates the last name from the first name:
/[^,]+,\s[^,]+/
But your second solution with the . character class is just as fine; you only might need to take care of multiple commas then.
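One thing to note is that the pattern above has no anchors, so as a validator it also accepts input with extra commas. A minimal sketch with anchors and capture groups (parseName is a hypothetical helper name, not from the answer):

```javascript
// Hypothetical helper: returns the two parts, or null if the format is wrong.
function parseName(input) {
  const match = /^([^,]+),\s([^,]+)$/.exec(input);
  return match ? { last: match[1], first: match[2] } : null;
}

console.log(parseName('van der Berg, Jan Peter'));
// { last: 'van der Berg', first: 'Jan Peter' }
console.log(parseName('a, b, c'));                  // null
console.log(/[^,]+,\s[^,]+/.test('a, b, c'));       // true (unanchored version)
```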
The XRegExp library has a plugin named Unicode that helps solve tasks like this.
<script src="xregexp.js"></script>
<script src="addons/unicode/unicode-base.js"></script>
<script>
var unicodeWord = XRegExp("^\\p{L}+$");
unicodeWord.test("Русский"); // true
unicodeWord.test("日本語"); // true
unicodeWord.test("العربية"); // true
</script>
/^[\p{L}\p{M}\p{Zs}.-]+$/u
Explanation:
\p{L} - matches any kind of letter from any language
\p{M} - matches a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.)
\p{Zs} - matches a whitespace character that is invisible, but does take up space
u - enables Unicode mode, which JavaScript requires for \p{...} escapes (in PCRE/PHP, where the short forms \pL and \pM are also accepted, it means that pattern and subject strings are treated as UTF-8)
Unlike classes limited to Latin-1 (such as [A-Za-zÀ-ÖØ-öø-ÿ]), this will work with all language-specific characters, e.g. Šš is matched by this rule, but not by that class.
JavaScript engines that predate ES2018 do not support these classes natively. There you can use xregexp, e.g.
const XRegExp = require('xregexp');
const isInputRealHumanName = (input: string): boolean => {
return XRegExp('^[\\pL\\pM-]+ [\\pL\\pM-]+$', 'u').test(input);
};
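In engines that implement ES2018 Unicode property escapes, the same check can be sketched without a library. This is my own plain-JavaScript variant of the function above, not the answerer's code, and like the original it assumes exactly two name parts separated by one space.

```javascript
// Letters (\p{L}), combining marks (\p{M}) and hyphens, twice, separated by a space.
const twoPartName = /^[\p{L}\p{M}-]+ [\p{L}\p{M}-]+$/u;

function isInputRealHumanName(input) {
  return twoPartName.test(input);
}

console.log(isInputRealHumanName('\u0160ime Vrsaljko')); // true
console.log(isInputRealHumanName('Jose\u0301 Garcia'));  // true (e + combining acute)
console.log(isInputRealHumanName('R2D2 Droid'));         // false
```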
You can use this:
/^[a-zA-ZÀ-ÖØ-öø-ÿ]+$/
You can use this:
^([a-zA-Z]|[à-ú]|[À-Ú])+$
It will match every word with accented characters or not.
You can remove the diacritics from alphabets by using:
var str = "résumé"
str.normalize('NFD').replace(/[\u0300-\u036f]/g, '') // returns resume
This removes all the diacritical marks; you can then run your regex on the result.
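Put together with the original ASCII-only pattern from the question, that could look like the sketch below (stripDiacritics and isValidName are names I made up). Keep in mind that letters without a canonical decomposition, such as ø, are not changed by this approach.

```javascript
// Decompose to NFD, then drop the combining diacritical marks block.
function stripDiacritics(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

const asciiName = /^[a-zA-Z]+,\s[a-zA-Z]+$/;

function isValidName(input) {
  return asciiName.test(stripDiacritics(input));
}

console.log(stripDiacritics('r\u00e9sum\u00e9'));    // 'resume'
console.log(isValidName('Nu\u00f1ez, Jos\u00e9'));   // true
console.log(stripDiacritics('S\u00f8ren'));          // unchanged, still contains ø
```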
Reference:
Searching and sorting text with diacritical marks in JavaScript
From Wikipedia: Basic Latin
For Latin letters, I use
/^[A-Za-zÀ-ÖØ-öø-ÿ]+$/
It rejects hyphens and other special characters. (Write A-Za-z rather than A-z: the range A-z also covers [ \ ] ^ _ and `.)
My context is slightly different and limited to French: I want to search text by allowing a mistake of accents.
For example, I want to find "maîtrisée", but the text to be searched is "... maitrisee ...". So, I used the regular expression /ma[iîï]tris[eéèêë]/ in JavaScript.
In the expression, '[' and ']' define a set of characters, any one of which may match. No '|' is needed inside a set; there it would be treated as a literal pipe character.
This page gives a list of accented characters: Diacritiques utilisés en français
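Writing those sets by hand gets tedious, so here is a rough sketch that builds such a pattern from a search word. The VARIANTS table is my own assumption and is not exhaustive, accentTolerantRegex is a made-up name, and the searched text is assumed to use precomposed (NFC) characters.

```javascript
// Assumed table of French accent variants per base letter (not exhaustive).
const VARIANTS = {
  a: 'aàâä', c: 'cç', e: 'eéèêë', i: 'iîï', o: 'oôö', u: 'uùûü', y: 'yÿ',
};

function accentTolerantRegex(word) {
  // Reduce the search word to its unaccented, lowercase base letters.
  const base = word.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase();
  // Replace each base letter by a set of its variants; escape anything else.
  const source = Array.from(base, (ch) =>
    VARIANTS[ch] ? `[${VARIANTS[ch]}]` : ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
  ).join('');
  return new RegExp(source, 'i');
}

const re = accentTolerantRegex('maîtrisée');
console.log(re.test('... maitrisee ...')); // true
console.log(re.test('matrise'));           // false
```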

Splitting a string into words and keeping delimiter

I want to split up a string (sentence) in an array of words and keep the delimiters.
I have found and I am currently using this regex for this:
[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)
An explanation can be found here: http://regex101.com/
This works exactly as I want it to and effectively makes a string like
This is a sentence.
To an array of
["This", "is", "a", "sentence."]
The problem here is that it does not include spaces nor newlines. I want the string to be parsed as words as it already does but I also want the corresponding space and or newline character to belong to the previous word.
I have read about positive lookahead that should look for future characters (space and or newline) but still take them into account when extracting the word. Although this might be the solution I have failed to implement it.
If it makes any difference I am using JavaScript and the following code:
//save the regex -- g modifier to get all matches
var reg = /[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)/g;
//define variable for holding matches
var matches;
//loop through each match
while(matches = reg.exec(STRING_HERE)){
//the word without spaces or newlines
console.log(matches[0]);
}
The code works but as I said, it does not include spaces and newline characters.
You can try something simpler:
str.split(/\b(?!\s)/);
However, note non word characters (e.g. full stop) will be considered another word:
"This is a sentence.".split(/\b(?!\s)/);
// [ "This ", "is ", "a ", "sentence", "." ]
To fix that, you can use a character class with the characters that shouldn't begin another word:
str.split(/\b(?![\s.])/);
function split_string(str){
var arr = str.split(" ");
var last_i = arr.length - 1;
for(var i=0; i<last_i; i++){
arr[i]+=" ";
}
return arr;
}
It may be as simple as this:
var sentence = 'This is a sentence.';
sentence = sentence.split(' ').join(' ||');
sentence = sentence.split('\n').join('\n||');
var matches = sentence.split('||');
Note that I use 2 pipes as a delimiter, but of course you can use anything as long as it's unique.
Also note that I only split \n as a newline, but you may add \r\n or whatever you want to split as well.
General Solution
To keep the delimiters conjoined in the results, the regex needs to be a zero-width match. In other words, the regex can be thought of as matching the point between a delimiter and non-delimiter, rather than matching the delimiters themselves. This can be achieved with zero-width matching expressions, matching before, at, or after the split point (at most one each); let's call these A, B, and C. Sometimes a single sub-expression will do it, others you'll need two; offhand, I can't think of a case where you'd need three.
Not only lookaheads but lookarounds in general are the perfect candidates for this purpose: lookbehinds ((?<=...)) to match before the split point, and lookaheads ((?=...)) after. That's the essence of this approach. Positive or negative lookarounds can be used. The one pitfall is that lookbehinds are relatively new to JS regexes, so not all browsers or other JS engines will support them (current versions of Firefox, Chrome, Opera, Edge, Safari, and Node.js do; Safari versions before 16.4 do not). If you need to support a JS engine that doesn't support lookbehinds, you might still be able to write and use a regex that matches at-and-after (BC).
To have the delimiters appear at the end of each match, put them in A. To have them at the start, in C. Fortunately, JS regexes do not place restrictions on lookbehinds, so simply wrapping the delimiter regex in the positive lookaround markers should be all that's required for delimiters. If the delimiters aren't so simple (i.e. context-sensitive), it might take a little more work to write the regex, which doesn't need to match the entire delimiter.
Paired with the delimiter pattern, you'll need to write a pattern that matches the start (for C) or end (for A) of the non-delimiter. This step is likely the one that will require the most additional work.
The at-split-point match, B, will often (always?) be a simple boundary, such as \b.
Specific Solution
If spaces are the only delimiters, and they're to appear at the end of each match, the delimiter pattern would be (?<=\s), in A. However, there are some cases not covered in the problem description. For example, should words separated by only punctuation (e.g. "x.y") be split? Which side of a split point should quotation marks and hyphens appear, if any? Should they count as punctuation? Another option for the delimiter is to match (after) all non-word characters, in which case A would be (?<=\W).
Since the split-point is at a word boundary, B could be \b.
Since the start of a match is a word character, (?=\w) will suffice for C.
Any two of those three should suffice. One that is perhaps clearest in meaning (and splits at the most points) is /(?<=\W)(?=\w)/, which can be translated as "split at the start of each word". \b could be added, if you find it more understandable, though it has no functional effect: /(?<=\W)\b(?=\w)/.
Note Oriol's excellent solutions are given by B=\b and (C=(?!\s) or C=(?![\s.])).
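As a quick illustration of the "split at the start of each word" pattern, written as (?<=\W)(?=\w), here is a small sketch. It assumes an engine with lookbehind support; the sample text is only an example.

```javascript
const text = 'This is a sentence.\nThis is another sentence.';

// Zero-width split: a non-word character behind, a word character ahead.
const words = text.split(/(?<=\W)(?=\w)/);

console.log(words);
// [ 'This ', 'is ', 'a ', 'sentence.\n', 'This ', 'is ', 'another ', 'sentence.' ]
console.log(words.join('') === text); // true, nothing is lost
```

Because the split points are zero-width, joining the pieces gives back the original string. Note that an apostrophe counts as a non-word character, so a word like "we'll" would be split after the apostrophe.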
Additional
As a point of interest, there would be a simpler solution for this particular case if JS regexes supported TCL word boundaries: \m matches only at the start of a word, so str.split(/\m/) would split exactly at the start of each word. (\m is equivalent to (?<=\W)(?=\w).)
If you want to include the whitespace after the word, the regex \S+\s* should work.
const s = `This is a sentence.
This is another sentence.`;
console.log(s.match(/\S+\s*/g))

Regular Expression to match all characters up to next match

I'm parsing text that is many repetitions of a simple pattern. The text is in the format of a script for a play, like this:
SAMPSON
I mean, an we be in choler, we'll draw.
GREGORY
Ay, while you live, draw your neck out o' the collar.
I'm currently using the pattern ([A-Z0-9\s]+)\s*\:?\s*[\r\n](.+)[\r\n]{2}, which works fine (explanation below) except for when the character's speech has line breaks in it. When that happens, the character's name is captured successfully but only the first line of the speech is captured.
Turning on Single-line mode (to include line breaks in .) just creates one giant match.
How can I tell the (.+) to stop when it finds the next character name and end the match?
I'm iterating over each match individually (JavaScript), so the name must be available to the next match.
Ideally, I would be able to match all characters until the entire pattern is repeated.
Pattern explained:
The first group matches a character's name (allowing capital letters, numbers, and whitespace), (with a trailing colon and whitespace optional).
The second group (character's speech) begins on a new line and captures any characters (except, problematically, line breaks and characters after them).
The pattern ends (and starts over) after a blank line.
Consider going a different direction with this. You really want to split a larger dialogue on any line that contains a name. You can do this with a regular expression still (replace the regex with whatever will match the "speaker" line):
results = "Insert script here".split(/^([A-Z]+)$/m)
The m flag is needed so that ^ and $ match at the start and end of each line rather than only at the ends of the whole string. On a standards-compliant implementation, your example text will end up in an array like so (line breaks shown as \n):
results[0] = ""
results[1] = "SAMPSON"
results[2] = "\nI mean, an we be in choler, we'll draw.\n\n"
results[3] = "GREGORY"
results[4] = "\nAy, while you live, draw your neck out o' the collar."
A caveat is that most browsers are spotty on the standard here. You can use the library XRegExp to get cross platform behaviour.
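To turn that flat array into speaker/speech pairs, something like the following sketch could work. It assumes the speaker lines consist only of capital letters, as in the split pattern above; the function name splitTurns is mine.

```javascript
function splitTurns(script) {
  // With a capture group, split() keeps the speaker names in the result:
  // [before, name1, speech1, name2, speech2, ...]
  const parts = script.split(/^([A-Z]+)$/m);
  const turns = [];
  for (let i = 1; i + 1 < parts.length; i += 2) {
    turns.push({ speaker: parts[i], speech: parts[i + 1].trim() });
  }
  return turns;
}

const script =
  "SAMPSON\nI mean, an we be in choler,\nwe'll draw.\n\n" +
  "GREGORY\nAy, while you live, draw your neck out o' the collar.\n";

console.log(splitTurns(script));
```

Multi-line speeches stay intact because the split only happens on lines that are entirely a speaker name.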
Okay, I did a little tinkering and found something that works. It isn't super elegant, but it does the job.
([A-Z0-9\s]+)\s*\:?\s*[\r\n]((.+[\r\n]?.*)+)[\r\n]{2}
I modified the last capture group to allow endless repetitions of arbitrary text, a new line, and more arbitrary text. Since two line breaks in a row aren't allowed, the pattern ends after the speech.
I finally managed to get it to match only what you wanted, i.e.
- the name of the character, allowing for whitespaces and the colon
- and, optionally multiline with linebreaks, the text associated with the person
You would need to do findAll using this regex - it is case sensitive:
((?:[A-Z]{2,}\s*:?\s*)+)\s+((?![A-Z]{2,}\s*:?\s*).+?[.?!]\s*)+
Explanation:
((?:[A-Z]{2,}\s*:?\s*)+) - the first group captures the upper case name of the person - it will match 'GREGOR' as well as 'MANFRED THE GREATEST:'
\s+ - at least one whitespace character
Then repeat at least once:
(?![A-Z]{2,}\s*:?\s*) - look ahead to check that the next text is not the upper case character name
.+?[.?!]\s* - match everything until you find a character that ends a sentence [.?!] and optionally whitespaces
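For comparison, a more direct take on "match everything up to the next match" is a lazy quantifier followed by a lookahead for the next speaker line. This is my own simplified sketch, not the regex above; it assumes a speaker line starts with a capital letter, contains only capitals, digits and spaces, and may end with a colon.

```javascript
// Group 1: speaker line. Group 2: everything (lazily) up to the point where
// either the next speaker line follows or only whitespace remains.
const turn = /^([A-Z][A-Z0-9 ]*):?[ \t]*\r?\n([\s\S]*?)(?=\r?\n[A-Z][A-Z0-9 ]*:?[ \t]*\r?\n|\s*(?![\s\S]))/gm;

function parseScript(script) {
  return Array.from(script.matchAll(turn), (m) => ({
    speaker: m[1].trim(),
    speech: m[2].trim(),
  }));
}

const play =
  "SAMPSON\nI mean, an we be in choler,\nwe'll draw.\n\n" +
  "GREGORY:\nAy, while you live, draw your neck out o' the collar.\n";

console.log(parseScript(play));
```

The lookahead does not consume the next name, so it is still available to the following match. A speech line that happens to be written entirely in capitals would be mistaken for a speaker line.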

Regular expression to find last word in sentence

How can I find last word in a sentence with a regular expression?
If you need to find the last word in a string, then do this:
m/
(\w+) (?# Match a word, store its value into pattern memory)
[.!?]? (?# Some strings might hold a sentence. If so, this)
(?# component will match zero or one punctuation)
(?# characters)
\s* (?# Match trailing whitespace using the * because there)
(?# might not be any)
$ (?# Anchor the match to the end of the string)
/x;
After this statement, $1 will hold the last word in the string. You may need to expand the character class, [.!?], by adding more punctuation.
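The same pattern carries over to JavaScript more or less unchanged. A small sketch (lastWord is a name I picked); note that \w only covers ASCII letters, digits and underscore here, so accented words need a wider class.

```javascript
function lastWord(text) {
  // A word, optional closing punctuation, optional trailing whitespace, end of string.
  const match = /(\w+)[.!?]?\s*$/.exec(text);
  return match ? match[1] : null;
}

console.log(lastWord('MiloCold is Neat'));   // 'Neat'
console.log(lastWord('Is this the end? '));  // 'end'
console.log(lastWord('...'));                // null
```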
In PHP:
<?php
$str = 'MiloCold is Neat';
$str_Pattern = '/[^ ]*$/';
preg_match($str_Pattern, $str, $results);
// Prints "Neat", but you can just assign it to a variable.
print $results[0];
?>
In general you can't correctly parse English text with regular expressions.
The best you can do is to look for some punctuation that usually terminates a sentence but unfortunately this is not a guarantee. For example the text Mr. Bloggs is here. Do you want to talk to him? contains two periods which have different meanings. There is no way for a regular expression to distinguish between the two uses of the period.
I'd suggest instead that you look at a natural language parsing library. For example the Stanford Parser has no trouble at all correctly parsing the above text into the two sentences:
Mr./NNP Bloggs/NNP is/VBZ here/RB ./.
Do/VBP you/PRP want/VB to/TO talk/VB to/TO him/PRP ?/.
There are lots of other freely available NLP libraries that you could use too, I'm not endorsing that one product in particular - it's just an example to demonstrate that it is possible to parse text into sentences with a fairly high reliability. Note though that even a natural language parsing library will still occasionally make a mistake - parsing human languages correctly is hard.
