Match full sentences skipping spurious dots


I need to match complete sentences ending at the full stop, but I'm stuck on how to skip false dots.
To keep it simple, I started with the pattern [^.]+[^ ], which works fine with normal sentences but breaks at every dot.
So, at the first sentence, the result should be:
Recent studies have described a pattern associated with specific object (e.g., face-related and building-related) in human occipito-temporal cortex.
and so on.

Just use a lookahead to set the condition: match up to a dot which must be followed by whitespace or by the end-of-line anchor $.
(.*?\.)(?=\s|$)
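As a minimal sketch of how this pattern might be applied in JavaScript (the sample text and variable names are made up for illustration), the global flag plus matchAll collects every sentence:

```javascript
// Illustrative sample; "e.g.," contains dots that do not end a sentence.
const text =
  "Recent studies (e.g., face-related work) show a pattern. A second sentence follows. The end.";

// Lazy match up to a dot that is followed by whitespace or the end of the string.
const sentenceRe = /(.*?\.)(?=\s|$)/g;

// Every match after the first starts at the space left over from the previous
// sentence, so trim() removes that leading whitespace.
const sentences = Array.from(text.matchAll(sentenceRe), (m) => m[1].trim());

console.log(sentences);
// [ 'Recent studies (e.g., face-related work) show a pattern.',
//   'A second sentence follows.',
//   'The end.' ]
```

Note that . does not match line breaks, so this sketch assumes each sentence sits on a single line.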

Expanding upon this, here is a regex that doesn't use reluctant matching and is potentially more efficient:
(?:[^.]+|\.\S)+\.
And if you would like to match only the sentences themselves, without the leading space that each match after the first gets with the regex from the accepted answer, you can use this:
\S(?:[^.]+|\.\S)+\.
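A short sketch of this variant in use, assuming a made-up sample string. Because the pattern begins with \S, each match starts on the first non-space character:

```javascript
// Illustrative sample with dots inside a version number and an abbreviation.
const sample = "Version 1.2 shipped (e.g., today). It works.";

// A sentence is a run of non-dot characters or "dot + non-space" pairs,
// closed by a dot that is not followed by a non-space character.
const noLeadingSpaceRe = /\S(?:[^.]+|\.\S)+\./g;

const found = sample.match(noLeadingSpaceRe) || [];

console.log(found);
// [ 'Version 1.2 shipped (e.g., today).', 'It works.' ]
```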

Related

Javascript -- Regex -- Blacklist of multiple words to END with a partial match

I've read many Questions on StackOverflow, including this one, this one, and even read Rexegg's Best Trick, which is also in a question here. I found this one, which works on entire lines, but not "everything up to the bad word". None of these have helped me, so here I go:
In Javascript, I have a long regex pattern. I'm trying to match a sequence in similar sentence structures, like follows:
1 UniquePrefixA [some-token] and [some-token] want to take [some-token] to see some monkeys.
2 UniqueC [some-token] wants to take [some-token] to the store. UniqueB, [some-token] is in the pattern once more.
3 UniquePrefixA [some-token] is using [some-token] to [some-token].
Notice that each pattern starts with a unique prefix. Encountering that prefix signals the start of a pattern. If I encounter a prefix again during capture, I should not capture the second occurrence, and STOP THERE. I'll have captured everything up to that prefix.
If I don't encounter the prefix later in the pattern, I need to continue matching that pattern.
I'm also using capture groups (not repeating, since Capture Groups only return the last matched of that group). The capture group contents need to be returned, so I'm using match, non-greedy.
Here's my pattern and a working example
/(?:UniquePrefixA|UniqueB|UniqueC)\s*(\[some-token\])(?:and|\s)*(\[some-token\])?(\s|[^\[\]])*(\[some-token\])? --->(\s|[^\[\]])*<--- (\[some-token\])?(\s|[^\[\]])*/i
It's basically 2 repeating patterns in a specific order:
(\s|[^\[\]])* // Basically .*, but excluding brackets
(\[some-token\]) // A token [some-token]
How can I prevent the match from continuing past a blacklist of words?
I want this to happen where I drew three arrows, for context. The equivalent of Any character, but not the contents of this list: (UniquePrefixA|UniqueB|UniqueC) (as seen in capture group 1).
It's possible I need a better understanding of negative lookahead, or of whether it can work with a group of things. Most importantly, I'd like to know whether a negative lookahead approach can support a list of options. Or is there a better way altogether? If the answer is "you can't do that," that's cool too.

I think an easier-to-maintain solution is to divide your task into 2 parts:
1. Find each chunk of text starting from any of your unique prefixes, up to the next one or to the end of the string.
2. Process each such chunk, looking for your [some-token]s and maybe also the content between them.
The regex performing the first task should include 3 parts:
(?:UniquePrefixA|UniqueB|UniqueC) - A non-capturing group looking for any unique prefix.
((?:.|\n)+?) - A capturing group - the fragment to catch for further processing (see the note below).
(?=UniquePrefixA|UniqueB|UniqueC|$) - A positive lookahead, looking for either any unique prefix or the end of the string (the stop criterion you are looking for).
To sum up, the whole regex looks like below:
/(?:UniquePrefixA|UniqueB|UniqueC)((?:.|\n)+?)(?=UniquePrefixA|UniqueB|UniqueC|$)/gi
Note: When this answer was written, the JavaScript flavour of regex did not implement the single-line (s) option; the s (dotAll) flag was only added in ES2018. Without that flag, instead of just . in the capturing group above, you must use (?:.|\n), meaning:
either any char other than \n (.),
or just \n.
Both variants are wrapped in a non-capturing group to delimit the alternatives (both sides of |), because the repetition marker (+?) applies to the whole group.
Note the ? after +, which makes the quantifier reluctant.
So this part of the regex (the capturing group) will match any sequence of chars, including \n, ending before the next unique prefix (if any), just as you expect.
The second task is to apply another regex to each captured chunk (group 1), looking for [some-token]s and possibly the content between them.
You didn't specify exactly what you want to do with each chunk, so I'm not sure what this second regex should include.
Maybe it will be enough just to match [some-token]?
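The two steps above could be sketched roughly as follows. The prefixes and the [some-token] placeholder come from the question; the sample input, the helper name splitChunks and the choice to simply collect the tokens in step 2 are assumptions made for illustration.

```javascript
// Illustrative input built from the question's sample sentences.
const input =
  "UniquePrefixA [some-token] and [some-token] want to take [some-token] to see some monkeys.\n" +
  "UniqueC [some-token] wants to take [some-token] to the store. " +
  "UniqueB, [some-token] is in the pattern once more.";

// Step 1: a prefix, then a reluctant body that stops before the next prefix
// or at the end of the string (no m flag, so $ is the end of the input).
const chunkRe =
  /(?:UniquePrefixA|UniqueB|UniqueC)((?:.|\n)+?)(?=UniquePrefixA|UniqueB|UniqueC|$)/gi;

// Step 2 (assumed): just collect every [some-token] inside a chunk.
const tokenRe = /\[some-token\]/g;

// Hypothetical helper: returns one { body, tokens } record per chunk.
function splitChunks(text) {
  return Array.from(text.matchAll(chunkRe), (m) => ({
    body: m[1],
    tokens: m[1].match(tokenRe) || [],
  }));
}

const chunks = splitChunks(input);
console.log(chunks.map((c) => c.tokens.length)); // [ 3, 2, 1 ]
```

The second chunk ends right before UniqueB, which is the stop criterion the question asks for.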

To ensure that a pattern does not occur inside a repeated character sequence such as (\s|[^\[\]])*, prepend a negative lookahead (a zero-length assertion, like ^) inside the repeated group, so that it is checked before every character. Note that \s is already included in [^\[\]], so the group could be just [^\[\]]*.
((?!UniquePrefixA)(\s|[^\[\]]))*
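A negative lookahead can hold an alternation, so the same idea extends to the whole blacklist. The sketch below uses a shortened, hypothetical version of the question's pattern (one prefix, one token, then the guarded filler) and a made-up input line:

```javascript
// Shortened, hypothetical version of the question's pattern. The filler group
// consumes non-bracket characters one at a time, and the negative lookahead
// refuses to consume a character at which any blacklisted prefix starts.
const stopAtPrefix =
  /(?:UniquePrefixA|UniqueB|UniqueC)\s*(\[some-token\])((?:(?!UniquePrefixA|UniqueB|UniqueC)[^\[\]])*)/i;

const line =
  "UniqueC [some-token] wants to go to the store. UniqueB, [some-token] is here once more.";

const result = line.match(stopAtPrefix);

console.log(result[1]); // "[some-token]"
console.log(JSON.stringify(result[2])); // " wants to go to the store. "
```

The filler stops right before UniqueB instead of running on to the next bracket.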

Matching multiple optional characters depending on each other

I want to match all valid prefixes of substitute followed by other characters, so that
sub/abc/def matches the sub part.
substitute/abc/def matches the substitute part.
subt/abc/def either doesn't match or only matches the sub part, not the t.
My current Regex is /^s(u(b(s(t(i(t(u(te?)?)?)?)?)?)?)?)?/, which works, however this seems a bit verbose.
Is there any better (as in, less verbose) way to do this?

This does the same as the regex in your question.
^s(?:ubstitute|ubstitut|ubstitu|ubstit|ubsti|ubst|ubs|ub|u)?
The above regex always tries the longest alternative first. It first checks for substitute; if that does not match, it moves on to the next alternative, i.e. substitut, and so on down to u.
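If typing the alternatives by hand feels error-prone, an equivalent alternation can be generated. This is only a sketch: buildPrefixRegex is a hypothetical helper, and it assumes the word contains no regex metacharacters.

```javascript
// Hypothetical helper: builds ^(?:substitute|substitut|...|su|s) from a word.
// Assumes `word` contains no characters that are special in a regex.
function buildPrefixRegex(word) {
  const alternatives = [];
  // Longest prefix first, so the alternation prefers the longest match.
  for (let len = word.length; len >= 1; len--) {
    alternatives.push(word.slice(0, len));
  }
  return new RegExp("^(?:" + alternatives.join("|") + ")");
}

const prefixRe = buildPrefixRegex("substitute");

console.log("sub/abc/def".match(prefixRe)[0]); // "sub"
console.log("substitute/abc/def".match(prefixRe)[0]); // "substitute"
console.log("subt/abc/def".match(prefixRe)[0]); // "sub"
```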

You could use a two-step approach:
1. Find the first word of the subject with this simple pattern: ^(\w+)
2. Use the word extracted in step 1 as your regex pattern, e.g. ^subs against the word substitute.
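One possible reading of these two steps, as a sketch (the function name matchAbbreviation is made up for illustration):

```javascript
// Hypothetical helper implementing the two steps described above.
function matchAbbreviation(input, fullWord) {
  // Step 1: extract the leading word.
  const first = input.match(/^(\w+)/);
  if (!first) return null;

  // Step 2: use the extracted word as a pattern against the full word.
  // \w+ can only contain letters, digits and underscores, so no escaping is needed.
  const re = new RegExp("^" + first[1]);
  return re.test(fullWord) ? first[1] : null;
}

console.log(matchAbbreviation("sub/abc/def", "substitute")); // "sub"
console.log(matchAbbreviation("substitute/abc/def", "substitute")); // "substitute"
console.log(matchAbbreviation("subt/abc/def", "substitute")); // null
```

Unlike the alternation approach, this rejects subt entirely rather than matching only its sub part; the question allows either behaviour.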

Capture everything between constants

I want to capture everything between every instance of User. and a space, including User.
So given a test string of
psdojfsdf User.sdoinwpoiev.spoinwelsdknonfsjfnw ldnkfwwdf sdf User.sdoinffon.ribwgg
I want it to capture User.sdoinwpoiev.spoinwelsdknonfsjfnw and User.sdoinffon.ribwgg
I've gotten this far: /(User\..*)\s/, but this captures everything until the last space.

The way I believe is best is to tell it to match everything but space rather than everything. That gives:
/(User\.\S*)/
Another alternative is to use a non-greedy match, but I think that's less clear:
/(User\..*?)\s/
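A quick sketch comparing the two patterns on the question's test string (variable names are illustrative). The non-greedy version requires a whitespace character after the value, so it misses a value that ends the string:

```javascript
const str =
  "psdojfsdf User.sdoinwpoiev.spoinwelsdknonfsjfnw ldnkfwwdf sdf User.sdoinffon.ribwgg";

// "Everything but whitespace": needs no terminator after the value.
const nonSpace = str.match(/User\.\S*/g) || [];

// Non-greedy: group 1 is the value, but a whitespace character must follow it.
const lazy = Array.from(str.matchAll(/(User\..*?)\s/g), (m) => m[1]);

console.log(nonSpace);
// [ 'User.sdoinwpoiev.spoinwelsdknonfsjfnw', 'User.sdoinffon.ribwgg' ]
console.log(lazy);
// [ 'User.sdoinwpoiev.spoinwelsdknonfsjfnw' ]
```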

Use a non-greedy quantifier:
/(User\..*?)\s/
See regular-expressions.info for details about greediness of repetition operators.
Note that this won't work if the word is at the very end of the input string with no space after it. Coenwulf's answer may be better, as it doesn't have this problem.

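Use *?, the non-greedy zero-or-more quantifier. A lazy quantifier at the very end of a pattern matches as little as possible (here, nothing), so follow it with a lookahead for whitespace or the end of the string:
/User\.[^\s]*?(?=\s|$)/g
Also, if you want to force it to have something between the dot and the space:
/User\.[^\s]+?(?=\s|$)/g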
Or if you want it to be alphanumeric
/User\.[a-zA-Z_$]+?[a-zA-Z_$0-9]*?( | |\s)/g
If you want to allow line breaks between the dot and the property identifier
/User\.[a-zA-Z_$]+?[a-zA-Z_$0-9]*?(\n| | |\s)/gm

Javascript regular expression (unbroken repetitions of a pattern)

Let's say that I have a given string in javascript - e.g., var s = "{{1}}SomeText{{2}}SomeText"; It may be very long (e.g., 25,000+ chars).
NOTE: I'm using "SomeText" here as a placeholder to refer to any number of characters of plain text. In other words, "SomeText" could be any plain text string which doesn't include {{1}} or {{2}}. So the above example could be var s = "{{1}}Hi there. This is a string with one { curly bracket{{2}}Oh, very nice to meet you. I also have one } curly bracket!"; And that would be perfectly valid.
The rules for it are simple:
It does not need to have any instances of {{2}}. However, if it does, then after that instance we cannot encounter another {{2}} unless we find a {{1}} first.
Valid examples:
"{{2}}SomeText"
"{{1}}SomeText{{2}}SomeText"
"{{1}}SomeText{{1}}SomeText{{2}}SomeText"
"{{1}}SomeText{{1}}SomeText{{2}}SomeText{{1}}SomeText"
"{{1}}SomeText{{1}}SomeText{{2}}SomeText{{1}}SomeText{{1}}SomeText"
"{{1}}SomeText{{1}}SomeText{{2}}SomeText{{1}}SomeText{{1}}SomeText{{2}}SomeText"
etc...
Invalid examples:
"{{2}}SomeText{{2}}SomeText"
"{{1}}SomeText{{2}}SomeText{{2}}SomeText"
"{{1}}SomeText{{2}}SomeText{{2}}SomeText{{1}}SomeText"
etc...
This seems like a relatively easy problem to solve - and indeed I could easily solve it without regular expressions, but I'm keen to learn how to do something like this with regular expressions. Unfortunately, I'm not even sure if "conditionals and lookaheads" is a correct description of the issue in this case.
NOTE: If a workable solution is presented that doesn't involve "conditionals and lookaheads" then I will edit the title.

It's probably easier to invert the condition. Try to match any text that contains two consecutive instances of {{2}}, and if it doesn't match that, it's good.
Using this strategy, your pattern can be as simple as:
/{\{2}}([^{]*){\{2}}/
This will match a literal {{2}}, followed by zero or more characters other than {, followed by a literal {{2}}.
Notice that the second { needs to be escaped; otherwise, the regex engine will treat {2} as a quantifier on the previous { (i.e. {{2} matches exactly two { characters).
Just in case you need to allow characters like { between the two {{2}}, you can use a pattern like this:
/{\{2}}((?!{\{1}}).)*{\{2}}/
This will match a literal {{2}}, followed by zero or more of any character, so long as those characters do not form the sequence {{1}}, followed by a literal {{2}}.
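Here is a sketch of the inverted check as a validator. The name isValid is made up, and every brace is escaped here, which should be equivalent to the pattern above and avoids any ambiguity with quantifiers. Since . does not match line breaks, the sketch assumes single-line input.

```javascript
// Matches two {{2}} markers with no {{1}} between them (the invalid case).
const doubleTwo = /\{\{2\}\}((?!\{\{1\}\}).)*\{\{2\}\}/;

// Hypothetical helper: a string is valid when the invalid pattern is absent.
function isValid(s) {
  return !doubleTwo.test(s);
}

console.log(isValid("{{2}}SomeText")); // true
console.log(isValid("{{1}}SomeText{{2}}SomeText{{1}}SomeText{{2}}SomeText")); // true
console.log(isValid("{{2}}SomeText{{2}}SomeText")); // false
console.log(isValid("{{1}}SomeText{{2}}SomeText{{2}}SomeText{{1}}SomeText")); // false
```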

(({{1}}SomeText)+({{2}}SomeText)?)*
Broken down:
({{1}}SomeText)+ - 1 to many {{1}} instances (greedy match)
({{2}}SomeText)? - followed by an optional {{2}} instance
Then the whole thing is wrapped in ()* such that the sequence can appear 0 to many times in a row.
No conditionals or lookaheads needed.

You said you can have one instance of {2} first, right?
^(.(?!{2}))(.{2})?(?!{2})((.(?!{2})){1}(.(?!{2}))({2})?)$
Note if {2} is one letter replace all dots with [^{2}]

regular expression for ends with some word

I want to build a regular expression for a series of names such as
cd1_inputchk,rd_inputchk,optinputchk where inputchk is common (the ending characters).
Please advise.

Very simply, it's:
/inputchk$/
That works on a per-word basis (for example, /inputchk$/.test(word) ? 'matches' : 'doesn\'t match';). The reason this works is that it matches "inputchk" only where it comes at the end of a string (hence the $).
As for a list of words, it starts becoming more complicated.
Are there spaces in the list?
Are they needed?
I'm going to assume no is the answer to both questions, and also assume that the list is comma-separated.
There are then a couple of ways you could proceed. You could use list.split(',') to get an array of the words and test each one to see if it ends in inputchk, or you could use a modified regular expression:
/[^,]*inputchk(?:,|$)/g
This one's much more complicated.
[^,] says to match non-, characters
* then says to match 0 or more of those non-, chars. (it will be greedy)
inputchk matches inputchk
(?:...) is a non-capturing group. It groups the alternatives without storing them as a separate capture; whatever it matches (here, the comma) is still part of the overall match.
, matches the , character
| says match one side or the other
$ says to match the end of the string
Hopefully all of this together will select the strings that you're looking for, but it's very easy to make a mistake, so I'd suggest doing some rigorous testing to make sure there aren't any edge-conditions that are being missed.
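Both approaches could be sketched like this, assuming a comma-separated list without spaces (the sample list and variable names are illustrative). The list-wide regex consumes the separating comma, so the sketch strips it afterwards:

```javascript
// Illustrative list; "other" is added to show a non-matching entry.
const list = "cd1_inputchk,other,rd_inputchk,optinputchk";

// Approach 1: split on commas, then test each word on its own.
const viaSplit = list.split(",").filter((word) => /inputchk$/.test(word));

// Approach 2: one regex over the whole list; each match may end with the
// comma consumed by (?:,|$), so remove it.
const viaRegex = (list.match(/[^,]*inputchk(?:,|$)/g) || []).map((m) =>
  m.replace(/,$/, "")
);

console.log(viaSplit); // [ 'cd1_inputchk', 'rd_inputchk', 'optinputchk' ]
console.log(viaRegex); // [ 'cd1_inputchk', 'rd_inputchk', 'optinputchk' ]
```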

This one should work (dollar sign basically means "end of string"):
/inputchk$/
