Regex with special chars - javascript

Hi I have this regex.
/^[\w]|[åäöæøÅÄÖÆØ]$/
"tå" is ok but "åå" is not. Why is that? How can I make it accept words starting with åäöæøÅÄÖÆØ?

Note that the \w (and \W, \b, and \B) are English-centric. \w just means [A-Za-z0-9_], where A-Z means only the 26 English letters. Other letters are not considered part of a "word" by JavaScript's built-in character classes.
You'll need to build a character class including all of the letters you want to treat as word characters (then use the negated version of that wherever you need a "non-word character").
But that's not the only problem. Your regular expression says:
Match one English word character at the beginning of the string, or match one of this list of characters at the end of the string.
The | operator has the lowest precedence of any regex operator, so in this case it treats ^[\w] and [åäöæøÅÄÖÆØ]$ as the two alternatives. I don't get the impression that's what you wanted.
"tå" is ok but "åå" is not.
I guess it depends on what you mean by "ok". Both match the expression:
console.log("tå".match(/^[\w]|[åäöæøÅÄÖÆØ]$/)); // ["t", index: 0, input: "tå"]
console.log("åå".match(/^[\w]|[åäöæøÅÄÖÆØ]$/)); // ["å", index: 1, input: "åå"]
"tå" matches because it matches the ^[\w] alternative. "åå" matches because it matches the [åäöæøÅÄÖÆØ]$ alternative.
How can I make it accept words starting with åäöæøÅÄÖÆØ?
If the goal is to accept only strings containing exactly one word, where "word" includes digits and the underscore (since \w does), then:
/^[A-Za-z0-9_åäöæøÅÄÖÆØ]+$/
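Both points can be checked with a minimal sketch. The helper name isSingleWord is made up for illustration, and the character list is the one from the question. The \p{L} variant is an assumption on my part: it only applies if any letter from any script should count, and it needs an engine with Unicode property escapes (ES2018 or later).

```javascript
// Alternation splits the whole pattern: (^[\w]) OR ([åäöæøÅÄÖÆØ]$).
const original = /^[\w]|[åäöæøÅÄÖÆØ]$/;
console.log(original.exec("tå")[0]); // "t", from the first alternative

// Hypothetical helper: one character class, anchored on both sides.
const singleWord = /^[A-Za-z0-9_åäöæøÅÄÖÆØ]+$/;
function isSingleWord(text) {
  return singleWord.test(text);
}

// Assumption: a Unicode-aware class instead of a hand-written list.
// The u flag is required for \p{...}.
const anyScriptWord = /^[\p{L}\p{N}_]+$/u;

console.log(isSingleWord("åå"));            // true
console.log(isSingleWord("tå!"));           // false, "!" is not in the class
console.log(anyScriptWord.test("Straße"));  // true
```

The hand-written list is the stricter choice: it accepts exactly the letters you name and nothing else.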

Why do you think it fails? I would not put the \w in square brackets but various systems seem to allow that and both the following match the text being tested.
Javascript
var test = 'åå';
if (test.match(/^[\w]|[åäöæøÅÄÖÆØ]$/)) { alert("Match"); }
PHP
echo(preg_match("/^[\w]|[åäöæøÅÄÖÆØ]$/","åå")."<br>");
What are you trying to achieve here?

Related

Highlight specific word which is not a part of another word using regex in javascript [duplicate]

I'm trying to use regexes to match space-separated numbers.
I can't find a precise definition of \b ("word boundary").
I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b) but it appears that this does not work. I'd be grateful to know of ways of .
[I am using Java regexes in Java 1.6]
Example:
Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());
String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());
pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());
This returns:
true
false
true
A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).
So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.
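The same behavior can be reproduced in JavaScript, whose \b follows the same rule. The sketch below is not the asker's Java code. The whitespace-delimited pattern and the helper name findIntegers are assumptions chosen for illustration, and the lookbehind needs an ES2018 or later engine.

```javascript
// \b sits between "-" and "1", so the optional dash is never consumed.
const withBoundary = /\b-?\d+\b/;
console.log(" -12 ".match(withBoundary)[0]); // "12"

// Assumed alternative: require whitespace (or the string edge) on both
// sides instead of relying on \b.
const spaceDelimitedInt = /(?<!\S)-?\d+(?!\S)/g;

function findIntegers(text) {
  return text.match(spaceDelimitedInt) || [];
}

console.log(findIntegers(" 12 -12 x-7 3.5 ")); // ["12", "-12"]
```

Using "not preceded by a non-space" rather than "preceded by a space" lets the pattern also work at the very start and end of the input.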
While learning regular expressions, I was really stuck on the \b metacharacter. I could not grasp what it meant, however many times I asked myself "what is it?". After some experiments on the website, I noticed the pink vertical dashes at the beginning and at the end of every word, and at that point its meaning became clear: it is exactly a word (\w) boundary.
This answer is aimed purely at building intuition. For the logic behind it, see the other answers.
A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Word characters are alpha-numeric; a minus sign is not.
Taken from Regex Tutorial.
I would like to explain Alan Moore's answer
A word boundary is a position that is either preceded by a word character and not followed by one or followed by a word character and not preceded by one.
Suppose I have a string "This is a cat, and she's awesome", and I want to replace all occurrences of the letter 'a' only if this letter ('a') exists at the "Boundary of a word",
In other words: the letter a inside 'cat' should not be replaced.
So I'll perform regex (in Python) as
re.sub(r"\ba", "e", myString.strip())  # replace a with e
Therefore,
Input: This is a cat and she's awesome
Output: This is e cat end she's ewesome
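For comparison, the same replacement can be sketched in JavaScript, where \b behaves the same way for this ASCII input. The variable names are made up for illustration.

```javascript
const sentence = "This is a cat and she's awesome";

// \ba matches an "a" only where a word starts, so the "a" in "cat" is skipped.
const replaced = sentence.replace(/\ba/g, "e");

console.log(replaced); // "This is e cat end she's ewesome"
```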
A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.
I talk about what \b-style regex boundaries actually are here.
The short story is that they’re conditional. Their behavior depends on what they’re next to.
# same as using a \b before:
(?(?=\w) (?<!\w) | (?<!\W) )
# same as using a \b after:
(?(?<=\w) (?!\w) | (?!\W) )
Sometimes that isn’t what you want. See my other answer for elaboration.
I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.
Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): In most flavors of regex, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time).
The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.
Which is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and pluses) are screwed by the \b.
Note: I'm not sure what to do about mistakes in text, like when someone doesn't put a space after a period at the end of a sentence. I allowed for it, but I'm not sure that it's necessarily the right thing to do.
Anyway, in Java, if you're searching text for those weird-named languages, you need to replace the \b with before-and-after whitespace and punctuation designators. For example:
public static String grep(String regexp, String multiLineStringToSearch) {
String result = "";
String[] lines = multiLineStringToSearch.split("\\n");
Pattern pattern = Pattern.compile(regexp);
for (String line : lines) {
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
result = result + "\n" + line;
}
}
return result.trim();
}
Then in your test or main function:
String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";
String afterWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
String text = "Programming in C, (C++) C#, Java, and .NET.";
System.out.println("text="+text);
// Here is where Java word boundaries do not work correctly on "cutesy" computer language names.
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));
System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text)); // Works Ok for this example, but see below
// Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
System.out.println("text="+text);
System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
// Make sure the first and last cases work OK.
text = "C is a language that should have been named differently.";
System.out.println("text="+text);
System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
text = "One language that should have been named differently is C";
System.out.println("text="+text);
System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
//Make sure we don't get false positives
text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
System.out.println("text="+text);
System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
P.S. My thanks to http://regexpal.com/ without whom the regex world would be very miserable!
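The same idea carries over to JavaScript. The following is only a sketch under stated assumptions: containsTerm and escapeRegex are hypothetical helper names, and the delimiter set simply mirrors the beforeWord/afterWord lists above rather than being a complete list of punctuation.

```javascript
// Escape characters that are special in a regex so the term is matched literally.
function escapeRegex(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: the term must be delimited by whitespace, the listed
// punctuation, or the start/end of the text, instead of by \b.
function containsTerm(text, term) {
  const before = "(?:^|[\\s.,!?()'\"])";
  const after = "(?:$|[\\s.,!?()'\"])";
  return new RegExp(before + escapeRegex(term) + after).test(text);
}

const sample = "Programming in C, (C++) C#, Java, and .NET.";
const canal = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger.";

console.log(/\b\.NET\b/.test(sample));      // false: no boundary before "."
console.log(containsTerm(sample, ".NET"));  // true
console.log(containsTerm(sample, "C++"));   // true
console.log(/\bC\b/.test(canal));           // true: matches the C in "C&O"
console.log(containsTerm(canal, "C"));      // false
```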
Check out the documentation on boundary conditions:
http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html
Check out this sample:
public static void main(final String[] args)
{
String x = "I found the value -12 in my string.";
System.err.println(Arrays.toString(x.split("\\b-?\\d+\\b")));
}
When you print it out, notice that the output is this:
[I found the value -, in my string.]
This means that the "-" character is not being picked up as being on the boundary of a word because it's not considered a word character. Looks like @brianary kinda beat me to the punch, so he gets an up-vote.
Reference: Mastering Regular Expressions (Jeffrey E.F. Friedl) - O'Reilly
\b is equivalent to (?<!\w)(?=\w)|(?<=\w)(?!\w)
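That equivalence can be checked empirically by listing the zero-width positions each pattern matches. This is a sketch in JavaScript, not taken from the book. The helper name matchPositions is made up, and the lookbehind needs an ES2018 or later engine.

```javascript
// Collect the index of every match; matchAll requires the g flag and
// steps past zero-width matches on its own.
function matchPositions(pattern, text) {
  return [...text.matchAll(pattern)].map((m) => m.index);
}

const text = "ab -12";
const viaBoundary = matchPositions(/\b/g, text);
const viaLookaround = matchPositions(/(?<!\w)(?=\w)|(?<=\w)(?!\w)/g, text);

console.log(viaBoundary);   // [0, 2, 4, 6]
console.log(viaLookaround); // [0, 2, 4, 6]
```

Position 3, just before the dash, is absent from both lists: a space and a dash are both non-word characters, so there is no boundary between them.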
A word boundary \b is used where one side should be a word character and the other a non-word character.
Regular Expression for negative number should be
--?\b\d+\b
I believe that your problem is due to the fact that - is not a word character. Thus, the word boundary will match after the -, and so will not capture it. Word boundaries match before the first and after the last word characters in a string, as well as at any position that has a word character on one side and a non-word character on the other. Also note that a word boundary is a zero-width match.
One possible alternative is
(?:(?:^|\s)-?)\d+\b
This will match any numbers starting with a space character and an optional dash, and ending at a word boundary. It will also match a number starting at the beginning of the string.
When you use \\b(\\w+)+\\b, that means an exact match of a word containing only word characters ([a-zA-Z0-9_]).
In your case, for example, putting \\b at the beginning of the regex will accept -12 (with space) but again it won't accept -12 (without space).
for reference to support my words: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
I think it's the boundary (i.e. character following) of the last match or the beginning or end of the string.

JavaScript regular expressions to match no digits, whitespace and selected symbols

Thanks for taking a look.
My goal is to come up with a regexp that will match input that contains no digits, whitespace or the symbols !#£$%^&*()+= or any other symbol I may choose.
I am however struggling to grasp precisely how regular expressions work.
I started out with the simple pattern /\D/, which from my understanding will match the first non-digit character it can find. This would match the string 'James' which is correct but also 'James1' which I don't want.
So, my understanding is that if I want to ensure that a pattern is not found anywhere in a given string, I use the ^ and $ characters, as in /^\D$/. Now because this will only match a single character that is not a digit, I needed to use + to specify that 1 or more digits should not be found in the entire string, giving me the expression /^\D+$/. Brilliant, it no longer matches 'James1'.
Question 1
Is my reasoning up to this point correct?
The next requirement was to ensure no whitespace is in the given string. \s will match a single whitespace and [^\s] will match the first non-whitespace character. So, from my understanding I just had to add this to what I have already to match strings that contain no digits and no whitespace. Again, because [^\s] will only match a single non-whitespace character, I used + to match one or more non-whitespace characters, giving the new regexp of /^\D+[^\s]+$/.
This is where I got lost, as the expression now matches 'James1' or even 'James Smith25'. What? Massively confused at this point.
Question 2
Why is /^\D+[^\s]+$/ matching strings that contain spaces?
Question 3
How would I go about writing the regular expression I'm trying to solve?
While I am keen to solve the problem I am more interested in figuring where my understanding of regular expressions is lacking, so any explanations would be helpful.
Not quite; ^ and $ are actually "anchors" - they mean "start" and "end", it's actually a little more complicated, but you can consider them to mean the start and end of a line for now - look up the various modifiers on regular expressions if you're interested in learning more about this. Unfortunately ^ has an overloaded meaning; if used inside square brackets it means "not", which is the meaning you are already acquainted with. It's very important that you understand the difference between these two meanings and that the definition in your head actually applies only to character range matching!
Contributing further to your confusion is that \d means "a numerical digit" and \D means "not a numerical digit". Similarly \s means "a whitespace (space/tab/newline/etc.) character" and \S means "not a whitespace character."
It's worth noting that \d is effectively a shortcut for [0-9] (note that - has a special meaning inside square brackets), and \D is a shortcut for [^0-9].
The reason it's matching strings that contain spaces is that you've asked for "1+ non-numerical digits followed by 1+ non-space characters" - so it'll match lots of strings! I think that perhaps you don't understand that regular expressions match bits of strings. You're not adding constraints as you go, but rather building up bits of matchers that will match bits of corresponding strings.
/^[^\d\s!#£$%^&*()+=]+$/ is the answer you're looking for - I'd look at it like this:
i. [] - match a range of characters
ii. []+ - match one or more of that range of characters
iii. [^\d\s]+ - match one or more characters that do not match \d (numerical digit) or \s (whitespace)
iv. [^\d\s!#£$%^&*()+=]+ - here's a bunch of other characters I don't want you to match
v. ^[^\d\s!#£$%^&*()+=]+$ - now there are anchors applied, so this matcher has to apply to the whole line otherwise it fails to match
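The difference between chaining matchers and using one negated class is easy to confirm in a console. A small sketch follows. The variable names are made up, and the test strings are the ones from the question plus one with a forbidden symbol.

```javascript
// The asker's attempt: "1+ non-digits, then 1+ non-whitespace characters".
const attempt = /^\D+[^\s]+$/;
// \D+ swallows "James Smith" (spaces are non-digits), [^\s]+ takes "25".
console.log(attempt.test("James Smith25")); // true

// One negated class applied to every character of the string.
const noneOf = /^[^\d\s!#£$%^&*()+=]+$/;
console.log(noneOf.test("James"));        // true
console.log(noneOf.test("James1"));       // false (digit)
console.log(noneOf.test("James Smith"));  // false (space)
console.log(noneOf.test("Ja$mes"));       // false (forbidden symbol)
```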
A useful website to explore regexes is http://regexr.com/3b9h7 - which I supply with my suggested solution as an example. Edit: Pruthvi Raj's link to Debuggex is awesome!
Is my reasoning up to this point correct?
Almost. /\D/ matches any character other than a digit, but not just the first one (if you use g option).
and [^\s] will match the first non-whitespace character
Almost, [^\s] will match any non-whitespace character, not just the first one (if you use g option).
/^\D+[^\s]+$/ matching strings that contain spaces?
Yes, it does, because \D matches a space (space is not a digit).
Why is /^\D+[^\s]+$/ matching strings that contain spaces?
Because \D+ in /^\D+[^\s]+$/ can match spaces.
Conclusion:
Use
^[^\d\s!#£$%^&*()+=]+$
It will match strings that contain no digits, no whitespace, and none of the symbols you do not allow.
Mind that to match a literal - within a character class, you either need to escape it or place it at the start or end of the class. In JavaScript, a literal ] inside a class must always be escaped. To play it safe, escape -, ] and [.
Just insert every character you don't want to include in a negated character class as follows:
^[^\s\d!#£$%^&*()+=]*$
^ - start of the string
[^...] - matches one character that is not in `...`
\s - matches a whitespace (space, newline,tab)
\d - matches a digit from 0 to 9
* - a quantifier that repeats the immediately preceding element 0 or more times
So the regex matches any string that
1. has a beginning,
2. contains 0 or more characters that are not whitespace, a digit, or any of the symbols in the character class (in this example !#£$%^&*()+=), i.e. characters not listed in the character class `[...]`, and
3. has an ending.
NOTE:
If the symbols you want to exclude also include a hyphen (-), don't put it between other characters, because it is a metacharacter inside a character class. Put it at the end of the character class.
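The hyphen rule is quick to verify. The three patterns below are illustrative only and are not part of the answer's regex.

```javascript
const range = /^[a-z]+$/;       // "-" between two characters: a range, a through z
const hyphenLast = /^[az-]+$/;  // "-" at the end: the literal characters a, z and -
const escaped = /^[a\-z]+$/;    // escaped "-": same three literal characters

console.log(range.test("b"));        // true
console.log(hyphenLast.test("b"));   // false
console.log(hyphenLast.test("a-z")); // true
console.log(escaped.test("a-z"));    // true
console.log(escaped.test("b"));      // false
```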

How can I check if string has particular word in javascript

Now, I know I can check whether a string contains a particular substring, using this:
if(str.indexOf("substr") > -1){
}
Having my substring 'GRE' I want to match for my autocomplete list:
GRE:Math
GRE-Math
But I don't want to match:
CONGRES
and I particularly need to match:
NON-WORD-CHARGRENON-WORD-CHAR
and also
GRE
What should be the perfect regex in my case?
Maybe you want to use \b word boundaries:
(\bGRE\b)
Here is the explanation
Demo: http://regex101.com/r/hJ3vL6
MD, if I understood your spec, this simple regex should work for you:
\W?GRE(?!\w)(?:\W\w+)?
But I would prefer something like [:-]?GRE(?!\w)(?:[:-]\w+)? if you are able to specify which non-word characters you are willing to allow (see explanation below).
This will match
GRE
GRE:Math
GRE-Math
but not CONGRES
Ideally though, I would like to replace the \W (non-word character) with a list of allowable characters, for instance [-:] Why? Because \W would match non-word characters you do not want, such as spaces and carriage returns. So what goes in that list is for you to decide.
How does this work?
\W? optionally matches one single non-word character as you specified. Then we match the literal GRE. Then the lookahead (?!\w) asserts that the next character cannot be a word character. Then, optionally, we match a non-word character (as per your spec) followed by any number of word characters.
Depending on where you see this appearing, you could add boundaries.
You can use the regex /\bGRE\b/ (the g flag is not needed for a simple test):
if(/\bGRE\b/.test("string")){
// the string has GRE
}
else{
// it doesn't have GRE
}
\b means word boundary
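Applied to the autocomplete list from the question, a sketch might look like this. The helper name hasWord and the extra entries ":GRE:" and "GREAT" are assumptions added for illustration. The last lines show why a regex object with the g flag is risky to reuse with test: it remembers lastIndex between calls.

```javascript
const candidates = ["GRE:Math", "GRE-Math", "CONGRES", ":GRE:", "GRE", "GREAT"];

// Hypothetical helper. Assumes `word` contains only word characters,
// so it can be placed in the pattern without escaping.
function hasWord(text, word) {
  return new RegExp("\\b" + word + "\\b").test(text);
}

const matching = candidates.filter((item) => hasWord(item, "GRE"));
console.log(matching); // ["GRE:Math", "GRE-Math", ":GRE:", "GRE"]

// A reused /g regex keeps state between test() calls.
const stateful = /\bGRE\b/g;
console.log(stateful.test("GRE")); // true
console.log(stateful.test("GRE")); // false: the search resumed at lastIndex 3
```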

I want to ignore square brackets when using javascript regex [duplicate]

I am using javascript regex to do some data validation and specify the characters that i want to accept (I want to accept any alphanumeric characters, spaces and the following !&,'\- and maybe a few more that I'll add later if needed). My code is:
var value = userInput;
var pattern = /[^A-z0-9 "!&,'\-]/;
if (pattern.test(value)) { /* do something */ }
It works fine and excludes the letters that I don't want the user to enter except the square bracket and the caret symbols. From all the javascript regex tutorials that i have read they are special characters - the brackets meaning any character between them and the caret in this instance meaning any character not in between the square brackets. I have searched here and on google for an explanation as to why these characters are also accepted but can't find an explanation.
So can anyone help, why does my input accept the square brackets and the caret?
The reason is that you are using A-z rather than A-Za-z. The ascii range between Z (0x5a) and a (0x61) includes the square brackets, the caret, backquote, and underscore.
Your regex is not in line with what you said:
I want to accept any alphanumeric characters, spaces and the following !&,'\- and maybe a few more that I'll add later if needed
If you want to accept only those characters, you need to remove the caret:
var pattern = /^[A-Za-z0-9 "!&,'\\-]+$/;
Notes:
A-z also includes the characters: [\]^_`.
Use A-Za-z or use the i modifier to match only letters:
var pattern = /^[a-z0-9 "!&,'\\-]+$/i;
\- is only the character -, because the backslash will act as special character for escaping. Use \\ to allow a backslash.
^ and $ are anchors, used to match the beginning and end of the string. This ensures that the whole string is matched against the regex.
+ is used after the character class to match one or more characters.
If you mean that you want to match characters other than the ones you accept and are using this to prevent the user from entering 'forbidden' characters, then the first note above describes your issue. Use A-Za-z instead of A-z (the second note is also relevant).
I'm not sure what you want but I don't think your current regexp does what you think it does:
It tries to find one character that is not in A-z0-9 "!&,'\- (^ means not).
Also, A-z probably doesn't match what you expect: it is a valid range, but it covers more than letters. You most likely want a-z and A-Z.
So your current regexp matches strings like "." and "Hi." but not "Hi"
Try this: var pattern = /[^\w"!&,'\\-]/;
Note: \w also includes _, so if you want to avoid that then try
var pattern = /[^a-z0-9"!&,'\\-]/i;
I think the issue with your regex is that A-z is being understood as all characters between 0x41 (65) and 0x7A (122), which includes the characters [\]^_` that sit between A-Z and a-z. (Z is 0x5A (90) and a is 0x61 (97), which means those characters occupy 0x5B through 0x60.)
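The gap between Z and a can be listed directly from the character codes. This is a short sketch with made-up variable names. The two patterns at the end are the question's class and the same class with A-z split into A-Za-z.

```javascript
// Collect every character whose code lies strictly between "Z" and "a".
const between = [];
for (let code = "Z".charCodeAt(0) + 1; code < "a".charCodeAt(0); code++) {
  between.push(String.fromCharCode(code));
}
console.log(between.join("")); // [\]^_`

// The question's pattern flags forbidden characters; with A-z it misses these.
const original = /[^A-z0-9 "!&,'\-]/;
const corrected = /[^A-Za-z0-9 "!&,'\-]/;
console.log(original.test("[^]"));  // false: nothing is flagged
console.log(corrected.test("[^]")); // true: the brackets and caret are flagged
```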

Javascript match function for special characters

I am working on this code and using "match" function to detect strength of password. how can I detect if string has special characters in it?
if(password.match(/[a-z]+/)) score++;
if(password.match(/[A-Z]+/)) score++;
if(password.match(/[0-9]+/)) score++;
If you mean !@#$% and ë as special character you can use:
/[^a-zA-Z ]+/
The ^ inside the brackets negates the class: it matches anything that is not a-z, A-Z, or a space.
And if you mean only things like !#$&$ use:
/\W+/
\w matches word characters; \W matches non-word characters.
You'll have to whitelist them individually, like so:
if(password.match(/[`~!@#\$%\^&\*\(\)\-=_+\\\[\]{}/\?,\.\<\> ...
and so on. Note that you'll have to escape regex control characters with a \.
While less elegant than /[^A-Za-z0-9]+/, this will avoid internationalization issues (e.g., will not automatically whitelist Far Eastern Language characters such as Chinese or Japanese).
you can always negate the character class:
if(password.match(/[^a-z\d]+/i)) {
// password contains characters that are *not*
// a-z, A-Z or 0-9
}
However, I'd suggest using a ready-made script. With the code above, you could just type a bunch of spaces, and get a better score.
Just do what you did above, but create a group for !@#$%^&*() etc. Just be sure to escape characters that have meaning in regex, like ^ and ( etc....
EDIT -- I just found this which lists characters that have meaning in regex.
if(password.match(/[^\w\s]/)) score++;
This will match anything that is not a word character (alphanumeric or underscore) or whitespace. If whitespace should count as special too, just use /[^\w]/.
As it looks from your regex, you are calling everything except alphanumerics a special character. If that is the case, simply do:
if(password.match(/[\W]/)) {
// Contains special character.
}
Anyhow, why don't you combine those three regexes into one?
if(password.match(/[\w]+/gi)) {
// Do your stuff.
}
/[^a-zA-Z0-9 ]+/
This matches one or more consecutive characters that are not a-z, A-Z, 0-9, or a space, i.e. only special characters.
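Putting the question's three checks together with a fourth one for special characters could look like the sketch below. The function name passwordScore is made up, and treating "special" as "not an ASCII letter, digit, or whitespace" is an assumption. Non-ASCII letters such as ë would count as special under this definition.

```javascript
// Hypothetical scoring helper: one point per character category present.
function passwordScore(password) {
  let score = 0;
  if (/[a-z]/.test(password)) score++;           // lowercase letter
  if (/[A-Z]/.test(password)) score++;           // uppercase letter
  if (/[0-9]/.test(password)) score++;           // digit
  if (/[^A-Za-z0-9\s]/.test(password)) score++;  // assumed "special" category
  return score;
}

console.log(passwordScore("abc"));    // 1
console.log(passwordScore("Abc1"));   // 3
console.log(passwordScore("Abc1!"));  // 4
console.log(passwordScore("abc   ")); // 1, spaces do not earn a point
```

Excluding \s from the last class addresses the concern raised above that typing a bunch of spaces should not improve the score.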
