Matching multiple optional characters depending on each other - javascript

I want to match all valid prefixes of substitute followed by other characters, so that
sub/abc/def matches the sub part.
substitute/abc/def matches the substitute part.
subt/abc/def either doesn't match or only matches the sub part, not the t.
My current Regex is /^s(u(b(s(t(i(t(u(te?)?)?)?)?)?)?)?)?/, which works, however this seems a bit verbose.
Is there any better (as in, less verbose) way to do this?

This does the same thing as the regex in your question:
^s(?:ubstitute|ubstitut|ubstitu|ubstit|ubsti|ubst|ubs|ub|u)?
The regex above always tries to match the longest possible prefix first. It checks for substitute; if that fails, it moves on to the next alternative, substitut, and so on down to u.
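If writing the alternatives out by hand is the verbose part, they can be generated. This is only a sketch: prefixRegex is a made-up helper name, and it assumes the word contains no regex metacharacters (otherwise it would need escaping first).

```javascript
// Hypothetical helper: builds ^(?:substitute|substitut|...|su|s) from a word.
// Assumes `word` has no regex metacharacters.
function prefixRegex(word) {
  const alternatives = [];
  // Longest alternatives first, so the engine prefers the longest prefix.
  for (let len = word.length; len >= 1; len--) {
    alternatives.push(word.slice(0, len));
  }
  return new RegExp('^(?:' + alternatives.join('|') + ')');
}

const re = prefixRegex('substitute');
console.log(re.exec('sub/abc/def')[0]);        // "sub"
console.log(re.exec('substitute/abc/def')[0]); // "substitute"
console.log(re.exec('subt/abc/def')[0]);       // "sub" (the t is not matched)
```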

You could use a two-step regex:
Find the first word of the subject with the simple pattern ^(\w+).
Use the word extracted in step 1 as your regex pattern, e.g. ^subs, and test it against the word substitute.
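The two steps could look roughly like this in JavaScript. matchedPrefix is a hypothetical name; building a RegExp straight from the extracted word is safe here only because \w+ cannot contain metacharacters.

```javascript
// Hypothetical helper: returns the leading word of `input` if it is a
// prefix of `word`, otherwise null.
function matchedPrefix(word, input) {
  // Step 1: extract the first word of the subject.
  const first = /^(\w+)/.exec(input);
  if (first === null) return null;
  // Step 2: use the extracted word as a pattern against the full word.
  const candidate = new RegExp('^' + first[1]);
  return candidate.test(word) ? first[1] : null;
}

console.log(matchedPrefix('substitute', 'sub/abc/def'));  // "sub"
console.log(matchedPrefix('substitute', 'subt/abc/def')); // null
```

Note that with this approach subt/abc/def does not match at all, which the question allows.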

Regex to match phrases not containing a palindrome

Is there a way to match a word not containing a palindrome (be it as long as it may)?
For instance, for a 6-character-long palindrome, foo/bar would match but xbarrabzz/1xoxxoxa14 would not match.
Use a negative lookahead, for example for length 5/6 (3-letter with middle letter reused or doubled):
^(?:(.)(?!(.)(.)\3?\2\1))*$
But you would have to add another lookahead for each palindrome length (which I leave as an exercise for the reader).
You can use \b(?:(?!(\w)(\w)\2?\1)\w)+\b.
It's a simple negative lookahead that checks if the word contains a structure like xyx or xyyx.
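A quick way to try this in JavaScript. For testing single words I have swapped the \b anchors for ^ and $; that is my adaptation, not part of the answer, and hasNoPalindrome is a made-up name.

```javascript
// Anchored variant of the pattern above: true when the word contains
// no xyx or xyyx structure.
function hasNoPalindrome(word) {
  return /^(?:(?!(\w)(\w)\2?\1)\w)+$/.test(word);
}

console.log(hasNoPalindrome('foo'));        // true
console.log(hasNoPalindrome('bar'));        // true
console.log(hasNoPalindrome('xbarrabzz'));  // false (contains "arra")
console.log(hasNoPalindrome('1xoxxoxa14')); // false (contains "xox")
```

Since every palindrome of three or more characters has xyx or xyyx at its center, this check rejects longer palindromes as well. Two-character repeats such as the oo in foo are not rejected.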

Match full sentences skipping spurious dots

I need to match complete sentences ending at the full stop, but I'm stuck on trying to skip false dots.
To keep it simple, I've started with the pattern [^.]+[^ ], which works fine with normal sentences but, as you can see, breaks at every dot.
So, at the first sentence, the result should be:
Recent studies have described a pattern associated with specific object (e.g., face-related and building-related) in human occipito-temporal cortex.
and so on.
Just use a lookahead to require that the match runs up to a dot that is followed by whitespace or by the end-of-line anchor $.
(.*?\.)(?=\s|$)
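As a rough JavaScript sketch of how this behaves (splitSentences is a made-up name, and the second sentence in the sample text is invented for the example):

```javascript
// Sketch: collect every lazy match that ends at a dot followed by
// whitespace or the end of the string.
function splitSentences(text) {
  const matches = text.match(/.*?\.(?=\s|$)/g);
  // Matches after the first start with the separating space, so trim them.
  return matches ? matches.map((s) => s.trim()) : [];
}

const sample =
  'Recent studies have described a pattern associated with specific object ' +
  '(e.g., face-related and building-related) in human occipito-temporal cortex. ' +
  'A second sentence follows.';

console.log(splitSentences(sample).length); // 2
```

The dots inside e.g., are skipped because neither is followed by whitespace.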
Expanding upon this, here is a regex that doesn't use reluctant matching and is potentially more efficient:
(?:[^.]+|\.\S)+\.
If you would like to match only the sentences themselves, without the leading space that the regex from the accepted answer leaves on every sentence after the first, you can use this:
\S(?:[^.]+|\.\S)+\.
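A small check of this variant in JavaScript, assuming the input looks like the question's text (the second sentence below is invented for the example). No trimming is needed, because every match has to start at a non-space character.

```javascript
// Sketch: each match starts at a non-space character and ends at a dot
// that is not followed by a non-space character.
const sentencePattern = /\S(?:[^.]+|\.\S)+\./g;

const text =
  'Recent studies have described a pattern associated with specific object ' +
  '(e.g., face-related and building-related) in human occipito-temporal cortex. ' +
  'A second sentence follows.';

const sentences = text.match(sentencePattern);
console.log(sentences[1]); // "A second sentence follows."
```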

Regex - why empty parenthesis?

I have a regex to update, and there is an empty pair of parentheses in it. I'm wondering: what is its purpose? I can't find anything about it.
The regex:
(DE)()([0-9]{1,12})
If it is useless, I can remove it.
There is one possible application for empty parentheses that I'm aware of, and that is if you plan to use a regex to determine if a certain string matches a permutation of sub-regexes.
For example,
^(?:A()|B()|C()){3}\1\2\3$
will match ABC or CBA or BCA but not AAA or BCC etc.
But it doesn't look like that's what the author of your regex was going for.
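One caveat if you try this in JavaScript: as far as I know, the trick relies on a backreference to a group that did not participate failing to match, which is how engines such as PCRE behave. In JavaScript such a backreference matches the empty string instead, so the permutation check does not hold there. A quick sketch to see it:

```javascript
// In JavaScript, \1, \2 and \3 succeed even when their groups are unset,
// so the pattern only checks "three letters out of A, B, C".
const permutation = /^(?:A()|B()|C()){3}\1\2\3$/;

console.log(permutation.test('ABC')); // true
console.log(permutation.test('CBA')); // true
console.log(permutation.test('AAA')); // true in JavaScript, unlike PCRE
console.log(permutation.test('AB'));  // false (only two letters)
```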
Maybe (and only maybe) other code refers to the capturing groups by number.
It has happened to me: I changed the parentheses in a regex, the group numbers shifted as a result, and the rest of the code stopped working because it depended on those numbers.
I recommend you check whether that is the case here before removing the parentheses.
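To illustrate with the regex from the question (the input DE123 is just a made-up example): with the empty group in place the digits are group 3, and without it they become group 2.

```javascript
// With the empty group, the digits are capture group 3.
const withEmpty = 'DE123'.match(/(DE)()([0-9]{1,12})/);
console.log(withEmpty[2]); // "" (the empty group)
console.log(withEmpty[3]); // "123"

// Without it, the digits shift to capture group 2.
const withoutEmpty = 'DE123'.match(/(DE)([0-9]{1,12})/);
console.log(withoutEmpty[2]); // "123"
console.log(withoutEmpty[3]); // undefined
```

Any code that reads group 3 would silently get undefined after the change.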

Since "a+?" is Lazy, Why does "a+?b" Match "aaab"?

While learning regular expressions in javascript using JavaScript: The Definitive Guide, I was confused by this passage:
But /a+?/ matches one or more occurrences of the letter a, matching as
few characters as necessary. When applied to the same string, this
pattern matches only the first letter a.
…
Now let’s use the nongreedy version: /a+?b/. This should match the
letter b preceded by the fewest number of a’s possible. When applied
to the same string “aaab”, you might expect it to match only one a and
the last letter b. In fact, however, this pattern matches the entire
string, just like the greedy version of the pattern.
Why is this so?
This is the explanation from the book:
This is because regular-expression pattern matching is done by finding
the first position in the string at which a match is possible. Since a
match is possible starting at the first character of the
string, shorter matches starting at subsequent characters are never
even considered.
I don't understand. Can anyone give me a more detailed explanation?
Okay, so you have your search space, "aaabc", and your pattern, /a+?b/
Does /a+?b/ match "a"? No.
Does /a+?b/ match "aa"? No.
Does /a+?b/ match "aaa"? No.
Does /a+?b/ match "aaab"? Yes.
Since you're matching literal characters and not any sort of wildcard, the regular expression a+?b is effectively the same as a+b anyway. The only type of sequence either one will match is a string of one or more a characters followed by a single b character. The non-greedy modifier makes no difference here, as the only thing an a can possibly match is an a.
The non-greedy qualifier becomes interesting when it's applied to something that can take on lots of different values, like the . wildcard. (edit: or in cases where there's interesting stuff to the left of something like a+?)
edit — if you're expecting a+?b to match just the last a before the b in aaab, well that's not how it works. Searching for a pattern in a string implicitly means to search for the earliest occurrence of the pattern. Thus, though starting from the last a does give a substring that matches the pattern, it's not the first substring that matches.
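A small JavaScript comparison may make the difference concrete. The <a><b> string is just an invented example of input where the wildcard has a choice.

```javascript
// With literal characters, greedy and lazy give the same result.
console.log('aaab'.match(/a+b/)[0]);  // "aaab"
console.log('aaab'.match(/a+?b/)[0]); // "aaab"

// With a wildcard, laziness changes where the match ends.
console.log('<a><b>'.match(/<.+>/)[0]);  // "<a><b>"
console.log('<a><b>'.match(/<.+?>/)[0]); // "<a>"
```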
The Engine Attempts a Match at the Beginning of the String
Can anyone give me a more detailed explanation?
Yes.
In short: .+? does not look for a shortest match globally, at the level of the entire string, but locally, from the position in the string where the engine is currently positioned.
How the Engine Works
When you try a regex against the string aaab, the engine first tries to find a match starting at the very first position in the string. That position is the position before the first a. Only if the engine cannot find a match at the first position does it move on and try again starting from the second position (between the first and second a).
So can a match be found by the regex a+?b at the first position? Yes.
a matches the first a
The +? quantifier tells the engine to match the fewest a chars necessary. Since we are looking to return a match, necessary means that the following tokens have to be allowed to match. In this case, the fewest a chars needed to allow the b to match is all the remaining a chars.
b matches
In the details the second point is a bit more complex (the engine tries to match b against the second a, fails, backtracks...) but you don't need to worry about that.
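You can watch this happen in JavaScript. The sticky flag (y) forces an attempt at exactly one position, so the last lines show the shorter match the engine would find if it ever got as far as index 2; a plain search never does, because the attempt at index 0 already succeeds.

```javascript
// A plain search reports the first position where a match is possible.
const found = 'aaab'.match(/a+?b/);
console.log(found.index, found[0]); // 0 "aaab"

// On its own, the lazy quantifier is satisfied with a single a.
console.log('aaab'.match(/a+?/)[0]); // "a"

// Forcing the attempt to start at index 2 gives the shorter match.
const sticky = /a+?b/y;
sticky.lastIndex = 2;
const forced = sticky.exec('aaab');
console.log(forced.index, forced[0]); // 2 "ab"
```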
A '?' after a+ means: match the minimum number of characters needed to satisfy the expression. /a+/ means one 'a', or as many as you can encounter before some other character. To satisfy /a+?/ (since it's non-greedy), a single 'a' is enough.
To satisfy /a+?b/, the match has to reach the 'b' at the end, so it needs to match one or more 'a' until it hits that 'b'. /a+/ doesn't have to hit the 'b' because the regex doesn't ask for it. /a+?b/ does.
Just think about it. What other meaning could /a+?b/ have?
Hope this helps

regular expression for ends with some word

I want to build regular expression for series
cd1_inputchk,rd_inputchk,optinputchk where inputchk is common (ending characters)
please guide for the same
Very simply, it's:
/inputchk$/
That works on a per-word basis, e.g. /inputchk$/.test(word) ? 'matches' : 'doesn\'t match'. It works because it matches "inputchk" only where it comes at the end of the string (hence the $).
As for a list of words, it starts becoming more complicated.
Are there spaces in the list?
Are they needed?
I'm going to assume no is the answer to both questions, and also assume that the list is comma-separated.
There are then a couple of ways you could proceed. You could use list.split(',') to get an array of the words and test each one to see whether it ends in inputchk, or you could use a modified regular expression:
/[^,]*inputchk(?:,|$)/g
This one's much more complicated.
[^,] says to match non-, characters
* then says to match 0 or more of those non-, chars. (it will be greedy)
inputchk matches inputchk
(?:...) is a non-capturing parenthesis. It says to match the characters, but not store the match as part of the result.
, matches the , character
| says match one side or the other
$ says to match the end of the string
Hopefully all of this together will select the strings that you're looking for, but it's very easy to make a mistake, so I'd suggest doing some rigorous testing to make sure there aren't any edge-conditions that are being missed.
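Here is a sketch of both options in JavaScript, under the same assumptions (comma-separated, no spaces). The list value is made up for the example, with one entry that should not match.

```javascript
const list = 'cd1_inputchk,other,rd_inputchk,optinputchk';

// Option 1: split the list and test each word on its own.
const viaSplit = list.split(',').filter((word) => /inputchk$/.test(word));

// Option 2: one global regex over the whole list. Each match may include
// the trailing comma, so strip it afterwards.
const viaRegex = (list.match(/[^,]*inputchk(?:,|$)/g) || [])
  .map((match) => match.replace(/,$/, ''));

console.log(viaSplit.join(' ')); // "cd1_inputchk rd_inputchk optinputchk"
console.log(viaRegex.join(' ')); // "cd1_inputchk rd_inputchk optinputchk"
```

The split version is probably easier to keep correct, since each word is tested against the simple /inputchk$/ pattern.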
This one should work (dollar sign basically means "end of string"):
/inputchk$/
