Match Strings Composed of Substrings in a Given Set - javascript

By searching a dictionary of words, I am trying find strings made up of substrings.
First, finding string made of letters is straight forward:
1. [abcdefgjlmnqrsz]+
Above finds any word or phrase that contains the above letters. What I am trying to figure out how to find strings made up of substrings:
So for example dictionary = ["neon","none","dog","bear","bare"]
and regex is: [ar|be|o|ne|n]+
I would like to find: neon, bear
But, regex ex. 2 is incorrect, because it finds: neon, noen, bear, bare.
Any help appreciated

A Character Class Matches One Single Character
You are looking for
\b(?:ar|be|o|ne|n)+\b
Note that [things] is a character class that matches one single character. Therefore [ar|be|o|ne|n] does not mean what you thought: it means "one character that is either one of a,r,|,b,e,|,o,|,n,e,|,n
Explanation
\b is a word boundary that matches a position where one side is a letter, and the other side is not a letter (for instance a space character, or the beginning of the string)
(?: ... ) is a non-capture group
| is the alternation (OR) operator

Related

JS regex for proper names: How to force capital letter at the start of each word?

I want a JS regex that only matches names with capital letters at the beginning of each word and lowercase letters thereafter. (I don't care about technical accuracy as much as visual consistency — avoiding people using, say, all caps or all lower cases, for example.)
I have the following Regex from this answer as my starting point.
/^[a-z ,.'-]+$/gmi
Here is a link to the following Regex on regex101.com.
As you can see, it matches strings like jane doe which I want to prevent. And only want it to match Jane Doe instead.
How can I accomplish that?
Match [A-Z] initially, then use your original character set afterwards (sans space), and make sure not to use the case-insensitive flag:
/^[A-Z][a-z,.'-]+(?: [A-Z][a-z,.'-]+)*$/g
https://regex101.com/r/y172cv/1
You might want the non-word characters to only be permitted at word boundaries, to ensure there are alphabetical characters on each side of, eg, ,, ., ', and -:
^[A-Z](?:[a-z]|\b[,.'-]\b)+(?: [A-Z](?:[a-z]|\b[,.'-]\b)+)*$
https://regex101.com/r/nP8epM/2
If you want a capital letter at the beginning and lowercase letters following where the name can possibly end on one of ,.'- you might use:
^[A-Z][a-z]+[,.'-]?(?: [A-Z][a-z]+[,.'-]?)*$
^ Start of string
[A-Z][a-z]+ Match an uppercase char, then 1+ lowercase chars a-z
[,.'-]? Optionally match one of ,.'-
(?: Non capturing group
[A-Z][a-z]+[,.'-]? Match a space, then repeat the same pattern as before
)* Close group and repeat 0+ times to also match a single name
$ End of string
Regex demo
Here's my solution to this problem
const str = "jane dane"
console.log(str.replace(/(^\w{1})|(\s\w{1})/g, (v) => v.toUpperCase()));
So first find the first letter in the first word (^\w{1}), then use the PIPE | operator which serves as an OR in regex and look for the second block of the name ie last name where the it is preceded by space and capture the letter. (\s\w{1}). Then to close it off with the /g flag you continue to run through the string for any iterations of these conditions set.
Finally you have the function to uppercase them. This works for any name containing first, middle and lastname.

RegEx matching help: won't match on each appearence

I need to write a little RegEx matcher which will match any occurrence of strings in the form of
[a-zA-Z]+(_[a-zA-Z0-9]+)?
If I use the regex above it does match the sections needed but would also match onto the abc part of 4_abc which is not intended. I tried to exclude it with:
(?:[^a-zA-Z0-9_]|^)([a-zA-Z]+(_[a-zA-Z0-9]+)?)(?:[^a-zA-Z0-9_]|$)
The problem is that the 'not' matches at the beginning and end are not really working like I hoped they would. If I use them on the example
a_d Dd_da 4_d d_4
they would block matching the second Dd_da because the space was used in the first match.Sadly I can't use lookarounds because I am using JS.
So the input:
a_d Dd_da 4_d d_4
should match: a_d, Dd_da and d_4
but matches: a_d (there is a space at the end)
Is there another way to match the needed sections, or to not consume the 'anchor' matches?
I really appreciate your help.
You can make use of \b:
\b[a-zA-Z]+(_[a-zA-Z0-9]+)?\b
\b matches the (zero-width) point where either the preceding character or following character is a letter, digit or underscore, but not both. It also matches with the start/end of the string if the first/last character is a letter, digit or underscore.

Grab full regex word if pattern inside it matches

How do I retrieve an entire word that has a specific portion of it that matches a regex?
For example, I have the below text.
Using ^.[\.\?\!:;,]{2,} , I match the first 3, but not the last. The last should be matched as well, but $ doesn't seem to produce anything.
a!!!!!!
n.......
c..,;,;,,
huhuhu..
I want to get all strings that have an occurrence of certain characters equal to or more than twice. I produced the aforementioned regex, but on Rubular it only matches the characters themselves, not the entire string. Using ^ and $
I've read a few stackoverflow posts similar, but not quite what I'm looking for.
Change your regex to:
/^.*[.?!:;,]{2,}/gm
i.e. match 0 more character before 2 of those special characters.
RegEx Demo
If I understand well you are trying to match an entire string that contains at least the same punctuation character two times:
^.*?([.?!:;,])\1.*
Note: if your string has newline characters, change .* to [\s\S]*
The trick is here:
([.?!:;,]) # captures the punct character in group 1
\1 # refers to the character captured in group 1

/(\S)\1(\1)+/g matching all occurrences of three equal non-whitespace characters following each other

Its given: /(\S)\1(\1)+/g matches all occurrences of three equal non-whitespace characters following each other.
I don't understand why there is () around (\S) and 2nd (\1), but not around 1st (\1). Can anyone help in explaining how above regex works?
src: http://www.javascriptkit.com/javatutors/redev2.shtml
Thnx in advance.
The \S needs parentheses to capture its value, so you can refer back to the captured value with \1. \1 means "match the same text which capturing group #1 matched".
I believe there is a problem with this regex. You said you want to match "three equal non-whitespace characters". But the + will make this match 3 or more equal, consecutive non-whitespace characters.
The g on the end means "apply this regex over the entire input string, or globally".
The second set of parentheses is not necessary. It needlessly captures the repeated character a second time, while matching the same strings as this regex:
/(\S)\1\1+/g
Also, as #AlexD pointed out, the description should say that it matches at least three characters. If you replaced that regex with BONK in the string fooxxxxxxbar:
'fooxxxxxxbar'.replace(/(\S)\1\1+/g, 'BONK')
..you might expect the result to be fooBONKBONKbar from their description, because there are two sets of three 'x's. But in fact the result would be fooBONKbar; the first \1 matches the second 'x', and the \1+ matches the third 'x' and any 'x's that follow it. If they wanted to match just three characters, they should have left the + off.
I noticed several other sloppy descriptions like that, plus at least one outright error: \B is equivalent to (?!\b) (a position that's not a word boundary), not [^\b] (a character that's not a backspace). For that matter, their description of word boundaries--"the position between a word and a space"--is wrong, too. A word boundary isn't defined by any particular character, like a space--in fact, it can just as well be the absence of any character that creates one. The string:
Word
...starts with a word boundary because 'W' is a word character and, being first, it's not preceded by another word character. Similarly, the 'd' is not followed by another word character, so the end of the string is also a word boundary.
Also, a regex doesn't know from words, only word characters. The definition of a word character can vary depending on the regex flavor and Unicode or locale settings, but it always includes [A-Za-z0-9_] (ASCII letters and digits plus the underscore). A word boundary is simply a position that's between one of those characters and any other character (or no other character, as I explained earlier).
If you want to learn about regexes, I suggest you forget that site and start here instead: regular-expressions.info.

Find the first letter of the last word with jquery inside a string (string can have multiple words)

Hy, is there a way to find the first letter of the last word in a string? The strings are results in a XML parser function. Inside the each() loop i get all the nodes and put every name inside a variable like this: var person = xml.find("name").find().text()
Now person holds a string, it could be:
Anamaria Forrest Gump
John Lock
As you see, the first string holds 3 words, while the second holds 2 words.
What i need are the first letters from the last words: "G", "L",
How do i accomplish this? TY
This should do it:
var person = xml.find("name").find().text();
var names = person.split(' ');
var firstLetterOfSurname = names[names.length - 1].charAt(0);
This solution will work even if your string contains a single word. It returns the desired character:
myString.match(/(\w)\w*$/)[1];
Explanation: "Match a word character (and memorize it) (\w), then match any number of word characters \w*, then match the end of the string $". In other words : "Match a sequence of word characters at the end of the string (and memorize the first of these word characters)". match returns an array with the whole match in [0] and then the memorized strings in [1], [2], etc. Here we want [1].
Regexps are enclosed in / in javascript : http://www.w3schools.com/js/js_obj_regexp.asp
You can hack it with regex:
'Marry Jo Poppins'.replace(/^.*\s+(\w)\w+$/, "$1"); // P
'Anamaria Forrest Gump'.replace(/^.*\s+(\w)\w+$/, "$1"); // G
Otherwise Mark B's answer is fine, too :)
edit:
Alsciende's regex+javascript combo myString.match(/(\w)\w*$/)[1] is probably a little more versatile than mine.
regular expression explanation
/^.*\s+(\w)\w+$/
^ beginning of input string
.* followed by any character (.) 0 or more times (*)
\s+ followed by any whitespace (\s) 1 or more times (+)
( group and capture to $1
\w followed by any word character (\w)
) end capture
\w+ followed by any word character (\w) 1 or more times (+)
$ end of string (before newline (\n))
Alsciende's regex
/(\w)\w*$/
( group and capture to $1
\w any word character
) end capture
\w* any word character (\w) 0 or more times (*)
summary
Regular expressions are awesomely powerful, or as you might say, "Godlike!" Regular-Expressions.info is a great starting point if you'd like to learn more.
Hope this helps :)

Categories