Matching Arabic and English letters only with a JavaScript regex

I'm trying to write a regex that matches Arabic and English letters only. Numbers and special characters are not allowed; spaces are allowed.
This regex worked fine, but it allows numbers in the middle of the string:
/[\u0620-\u064A\040a-zA-Z]+$/
For example, it matches (سم111111ر), which is not supposed to match.
Is there a way not to match numbers in the middle of the letters?

Note that in JavaScript you need ECMAScript 2018+ for Unicode property escapes (\p{...}), which also require the u flag:
const texts = ['أسبوع أسبوع','week week','hunāka','سم111111ر'];
const re = /^(?:(?=[\p{Script=Arabic}A-Za-z])\p{L}|\s)+$/u;
for (const text of texts) {
  console.log(text, '=>', re.test(text));
}
The ^(?:(?=[\p{Script=Arabic}A-Za-z])\p{L}|\s)+$ means
^ - start of string
(?: - start of a non-capturing group container:
(?=[\p{Script=Arabic}A-Za-z]) - a positive lookahead that requires a char from the Arabic script or an ASCII letter to occur immediately to the right of the current location
\p{L} - any Unicode letter (note \p{Alphabetic} includes a bit more "letter" chars, you may want to try it out)
| - or
\s - whitespace
)+ - repeat one or more times
$ - end of string.
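As for why the pattern in the question lets digits through: it has no ^ anchor, so the engine is free to match only the trailing letters. The following is a minimal sketch of that effect, assuming the original character range is otherwise what is wanted. The names unanchored, anchored and sample are illustrative, and the space is written as \u0020 rather than the legacy octal \040.

```javascript
// The question's pattern: only the end of the string is anchored.
const unanchored = /[\u0620-\u064A\u0020a-zA-Z]+$/;
// Hypothetical variant with both ends anchored.
const anchored = /^[\u0620-\u064A\u0020a-zA-Z]+$/;

const sample = '\u0633\u0645111111\u0631'; // سم111111ر

// Without ^, the match is just the last letter, not the whole string.
const partial = unanchored.exec(sample);
console.log(partial.index, partial[0]);

// With ^, the digits in the middle make the whole test fail.
console.log(anchored.test(sample));      // false
console.log(anchored.test('week week')); // true
```

The \p{Script=Arabic} approach above is still the more general one, since the \u0620-\u064A range covers only part of the Arabic block.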

Related

Javascript: Regex to exclude whitespace and special characters

I need a regex to validate the following:
The length should be 18
The first 5 characters should be either (xyz34|xyz12)
The remaining 13 characters should be alphanumeric (letters and numbers only); no whitespace or special characters are allowed.
I have this pattern: '/^(xyz34|xyz12)((?=.*[a-zA-Z])(?=.*[0-9])){13}/g'
But it allows whitespace and special characters (such as $ and %), which violates rule #3.
Any suggestions on how to exclude whitespace and special characters and strictly check that the characters are letters and numbers?
You should not quantify lookarounds. They are non-consuming patterns, i.e. the consecutive positive lookaheads check for the presence of their patterns but do not advance the regex index; they all check the text at the same position. It makes no sense to repeat them 13 times. ^(xyz34|xyz12)((?=.*[a-zA-Z])(?=.*[0-9])){13} is equivalent to ^(xyz34|xyz12)(?=.*[a-zA-Z])(?=.*[0-9]), and means the string can start with xyz34 or xyz12 and then should have at least 1 letter and at least 1 digit.
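A small sketch of that point: the quantified and unquantified forms consume exactly the same text (only the 5-character prefix), so neither says anything about what the other 13 characters are. The helper name consumed and the sample inputs are made up for illustration.

```javascript
const quantified = /^(xyz34|xyz12)((?=.*[a-zA-Z])(?=.*[0-9])){13}/;
const unquantified = /^(xyz34|xyz12)(?=.*[a-zA-Z])(?=.*[0-9])/;

// Returns the text actually consumed by the match, or null if there is no match.
function consumed(re, input) {
  const m = re.exec(input);
  return m ? m[0] : null;
}

const inputs = ['xyz34 $% a1', 'xyz34a1', 'xyz12abcdefghijklm'];
for (const input of inputs) {
  console.log(JSON.stringify(input), '=>',
    consumed(quantified, input), consumed(unquantified, input));
}
```

Both patterns accept 'xyz34 $% a1' despite the space and symbols, and both reject 'xyz12abcdefghijklm' only because it has no digit.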
You may consider fixing the issue by using a consuming pattern like this:
If you do not care if the last 13 chars contain only digits or only letters, use the patterns suggested by other users, like /^(?:xyz34|xyz12)[a-zA-Z\d]{13}$/ or /^xyz(?:34|12)[a-zA-Z0-9]{13}$/
If there must be at least 1 digit and at least 1 letter among those 13 alphanumeric chars, use /^xyz(?:34|12)(?=[a-zA-Z]*\d)(?=\d*[a-zA-Z])[a-zA-Z\d]{13}$/.
See the regex demo #1 and the regex demo #2.
NOTE: these are regex literals, do not use them inside single- or double quotes!
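To illustrate the note about quotes, here is a short sketch comparing a regex literal, the equivalent RegExp constructor call (where backslashes must be doubled and the slashes dropped), and what happens when the slashes are kept inside the string. The variable names and the sample ID are assumptions for the example.

```javascript
// A regex literal: no quotes, delimited by slashes.
const literal = /^xyz(?:34|12)[a-zA-Z\d]{13}$/;

// The same pattern built from a string: no slashes, and \d becomes \\d.
const fromString = new RegExp('^xyz(?:34|12)[a-zA-Z\\d]{13}$');

// Keeping the slashes inside the string makes them part of the pattern,
// so the pattern now demands a literal "/" before the start anchor.
const quotedLiteral = new RegExp('/^xyz(?:34|12)[a-zA-Z\\d]{13}$/');

const id = 'xyz34abcdefghijkl1';
console.log(literal.test(id));       // true
console.log(fromString.test(id));    // true
console.log(quotedLiteral.test(id)); // false
```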
Details
^ - start of string
xyz - a common prefix
(?:34|12) - a non-capturing group matching 34 or 12
(?=[a-zA-Z]*\d) - there must be at least 1 digit after any 0+ letters to the right of the current location
(?=\d*[a-zA-Z]) - there must be at least 1 letter after any 0+ digits to the right of the current location
[a-zA-Z\d]{13} - 13 letters or digits
$ - end of string.
JS demo:
var strs = ['xyz34abcdefghijkl1','xyz341bcdefghijklm','xyz34abcdefghijklm','xyz341234567890123','xyz14a234567890123'];
var rx = /^xyz(?:34|12)(?=[a-zA-Z]*\d)(?=\d*[a-zA-Z])[a-zA-Z\d]{13}$/;
for (var s of strs) {
  console.log(s, "=>", rx.test(s));
}
.* will match any string; for your requirement you can use this:
/^xyz(34|12)[a-zA-Z0-9]{13}$/g
regex fiddle
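One caveat if this pattern is used with test(): a regex carrying the g flag remembers lastIndex between calls, so validating the same string twice with one regex object can give different answers. The sketch below shows the effect; the names withG, withoutG and id are illustrative.

```javascript
const withG = /^xyz(34|12)[a-zA-Z0-9]{13}$/g;
const withoutG = /^xyz(34|12)[a-zA-Z0-9]{13}$/;
const id = 'xyz34abcdefghijkl1';

const first = withG.test(id);  // true, and lastIndex moves to 18
const second = withG.test(id); // false, the search starts at index 18
console.log(first, second);

// Without g, every call starts at index 0.
console.log(withoutG.test(id), withoutG.test(id)); // true true
```

For a fully anchored validation pattern the g flag adds nothing, so dropping it is the simpler choice.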
/^(xyz34|xyz12)[a-zA-Z0-9]{13}$/
This should work:
^ asserts position at the start of the string (there is no m flag, so it does not match at line starts)
1st Capturing Group (xyz34|xyz12)
1st Alternative xyz34 matches the characters xyz34 literally (case sensitive)
2nd Alternative xyz12 matches the characters xyz12 literally (case sensitive)
Match a single character present in the list below [a-zA-Z0-9]{13}
{13} Quantifier — Matches exactly 13 times

Regex remove all leading and trailing special characters?

Let's say I have the following string in javascript:
&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&
I want to remove all the leading and trailing special characters (anything which is not alphanumeric or alphabet in another language) from all the words.
So the string should look like
a.b.c a.b.c a.b.c a.b.c a.b&.c a.b.&&dc ê.b..c
Notice how the special characters in between the alphanumeric characters are left in place. The last ê is also kept.
This regex should do what you want. It looks for:
start of line, or some spaces (^| +) captured in group 1
some number of symbol characters [!-\/:-@\[-`\{-~]* (the four ASCII punctuation ranges !-/, :-@, [-` and {-~)
a minimal number of non-space characters ([^ ]*?) captured in group 2
some number of symbol characters [!-\/:-@\[-`\{-~]*
followed by a space or end-of-line (using a positive lookahead) (?=\s|$)
Matches are replaced with just groups 1 and 2 (the spacing and the characters between the symbols).
let str = '&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&';
str = str.replace(/(^| +)[!-\/:-@\[-`\{-~]*([^ ]*?)[!-\/:-@\[-`\{-~]*(?=\s|$)/gi, '$1$2');
console.log(str);
Note that if you want to preserve a string of punctuation characters on their own (e.g. as in Apple & Sauce), you should change the second capture group to require one or more non-space characters (([^ ]+?)) instead of zero or more, and add a lookahead after the initial run of punctuation characters to assert that the next character is not punctuation:
let str = 'Apple &&& Sauce; -This + !That!';
str = str.replace(/(^| +)[!-\/:-@\[-`\{-~]*(?![!-\/:-@\[-`\{-~])([^ ]+?)[!-\/:-@\[-`\{-~]*(?=\s|$)/gi, '$1$2');
console.log(str);
a-zA-Z\u00C0-\u017F is used to capture all valid characters, including diacritics.
The following is a single regular expression to capture each individual word. The logic is that it will look for the first valid character as the beginning of the capture group, and then the last sequence of invalid characters before a space character or string terminator as the end of the capture group.
const myRegEx = /[^a-zA-Z\u00C0-\u017F]*([a-zA-Z\u00C0-\u017F].*?[a-zA-Z\u00C0-\u017F]*)[^a-zA-Z\u00C0-\u017F]*?(\s|$)/g;
let myString = '&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&'.replace(myRegEx, '$1$2');
console.log(myString);
Something like this might help:
const string = '&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&';
const result = string.split(' ').map(s => /^[^a-zA-Z0-9ê]*([\w\W]*?)[^a-zA-Z0-9ê]*$/g.exec(s)[1]).join(' ');
console.log(result);
Note that this is not a single regex; it relies on some JavaScript helper code.
Rough explanation: We first split the string into an array of substrings, divided by spaces. We then transform each substring by stripping the leading and trailing special characters. We do this by matching the special characters with [^a-zA-Z0-9ê]*: because of the leading ^ inside the brackets, the class matches every character except those listed, that is, all special characters. Between these two classes we capture the relevant characters with ([\w\W]*?). \w matches word characters and \W matches non-word characters, so [\w\W] matches any character. Appending ? after * makes the quantifier lazy, so the group stops as early as the trailing class can take over and match the remaining special characters up to the end. We also start the regex with ^ and end it with $ so that it covers the entire substring (they anchor the match to the start and the end of the string, respectively). With .exec(s)[1] we then execute the regex on the substring and return the first capturing group from our transform function. Note that this group is an empty string if a substring contains no proper characters. At the end we join the substrings with spaces.
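Since the question asks to keep letters "in another language" as well, one possible variation on the split-and-strip idea uses Unicode property escapes instead of listing ê explicitly. This is only a sketch under that assumption: edgeJunk and trimWordEdges are hypothetical names, and it needs an ES2018+ engine with the u flag.

```javascript
// Runs of characters that are neither letters (\p{L}) nor numbers (\p{N}),
// anchored to the start or the end of a single word.
const edgeJunk = /^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu;

// Split on spaces, strip the edges of each word, join again.
function trimWordEdges(text) {
  return text
    .split(' ')
    .map((word) => word.replace(edgeJunk, ''))
    .join(' ');
}

const input = '&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&';
console.log(trimWordEdges(input));
// a.b.c a.b.c a.b.c a.b.c a.b&.c a.b.&&dc ê.b..c
```

A word made up only of punctuation becomes an empty string here, which leaves two consecutive spaces in the result.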

Javascript in regexp not matching something

I want to match everything except the one with the string '1AB' in it. How do I do that? When I tried it, it said nothing is matched.
var text = "match1ABmatch match2ABmatch match3ABmatch";
var matches = text.match(/match(?!1AB)match/g);
console.log(matches[0]+"..."+matches[1]);
Lookarounds do not consume the text, i.e. the regex index does not move when their patterns are matched. See Lookarounds Stand their Ground for more details. You still must match the text with a consuming pattern, here, the digits.
Add \w+ word matching pattern after the lookahead. NOTE: You may also use \S+ if there can be any one or more non-whitespace chars. If there can be any chars, use .+ (to match 1 or more chars other than line break chars) or [^]+ (matches even line breaks).
var text = "match100match match200match match300match";
var matches = text.match(/match(?!100(?!\d))\w+match/g);
console.log(matches);
Pattern details
match - a literal substring
(?!100(?!\d)) - a negative lookahead that fails the match if, immediately to the right of the current location, there is 100 substring not followed with a digit (if you want to fail the matches where the number starts with 100, remove the (?!\d) lookahead)
\w+ - 1 or more word chars (letters, digits or _)
match - a literal substring
See the regex demo online.
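Applied to the string from the question, the same idea could look like the sketch below. It assumes the part between the two match substrings consists of word characters, and it adds a fallback because String#match returns null when nothing matches with the g flag, which is why matches[0] failed in the question.

```javascript
const text = 'match1ABmatch match2ABmatch match3ABmatch';

// Original attempt: nothing consumes "2AB" or "3AB", so there are no matches.
const original = text.match(/match(?!1AB)match/g);
console.log(original); // null

// With a consuming \w+ after the lookahead; "|| []" guards against null.
const matches = text.match(/match(?!1AB)\w+match/g) || [];
console.log(matches); // [ 'match2ABmatch', 'match3ABmatch' ]
```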

How to split a string based of capital letters?

I have a string I need to split based on capital letters. My code is below:
let s = 'OzievRQ7O37SB5qG3eLB';
var res = s.split(/(?=[A-Z])/)
console.log(res);
But there is a twist: if the capital letters are contiguous, I need the regex to "eat" until the sequence ends. In the example above it returns
..R,Q7,O37,S,B5q,G3e,L,B
And the result should be
RQ7,O37,SB5q,G3e,LB
Thoughts? Thanks.
You need to match these chunks with /[A-Z]+[^A-Z]*|[^A-Z]+/g instead of splitting with a zero-width assertion pattern, because the latter (in your case, it is a positive lookahead only regex) will have to check each position inside the string and it is impossible to tell the regex to skip a position once the lookaround pattern is found.
const s = 'and some text hereOzievRQ7O37SB5qG3eLB';
console.log(s.match(/[A-Z]+[^A-Z]*|[^A-Z]+/g));
See the online regex demo at regex101.com.
Details:
[A-Z]+ - one or more uppercase ASCII letters
[^A-Z]* - zero or more (to allow matching uppercase only chunks) chars other than uppercase ASCII letters
| - or
[^A-Z]+ - one or more chars other than uppercase ASCII letters (to allow matching non-uppercase ASCII letters at the start of the string).
The g global modifier will let String#match() return all found non-overlapping matches.
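For a side-by-side view, this sketch runs both approaches on the exact string from the question; bySplit and byMatch are illustrative names.

```javascript
const s = 'OzievRQ7O37SB5qG3eLB';

// Splitting at a lookahead breaks before every single capital letter.
const bySplit = s.split(/(?=[A-Z])/);
console.log(bySplit.join(','));
// Oziev,R,Q7,O37,S,B5q,G3e,L,B

// Matching consumes a whole run of capitals plus whatever follows it.
const byMatch = s.match(/[A-Z]+[^A-Z]*|[^A-Z]+/g);
console.log(byMatch.join(','));
// Oziev,RQ7,O37,SB5q,G3e,LB
```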

RegEx start and finish with letter, allow commas and dashes

I've got this regex:
/^[\a-zøåæäöüß][\a-z0-9øåæäöüß]*(?:\-?[\a-z0-9øåæäöüß,]+)*$/i
It works fine for a crazy input like "K61-283øÅ,æk-ken,a-sd", but it fails on the cases "word," (so, when there's just one comma).
Also - how can I restrict it that it should start with a letter after every comma or dash (so basically - every word)?
The rule is: start with a letter and end with alphanumeric; allow alphanumeric, dashes and commas; after each dash or comma there should be a letter
You may use
/^[a-zøåæäöüß][a-z0-9øåæäöüß]*(?:[-,][a-zøåæäöüß][a-z0-9øåæäöüß]*)*$/i
See the regex demo
Details:
^ - start of string
[a-zøåæäöüß] - a letter from the defined set
[a-z0-9øåæäöüß]* - 0+ digits or letters from the defined set
(?:[-,][a-zøåæäöüß][a-z0-9øåæäöüß]*)* - zero or more sequences of:
  [-,] - a - or ,
  [a-zøåæäöüß] - a letter from the defined set
  [a-z0-9øåæäöüß]* - 0+ digits or letters from the defined set
$ - end of string.
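A quick check of this pattern against a few inputs, as a sketch; the sample strings are made up, apart from "word," which comes from the question.

```javascript
const wordList = /^[a-zøåæäöüß][a-z0-9øåæäöüß]*(?:[-,][a-zøåæäöüß][a-z0-9øåæäöüß]*)*$/i;

const samples = ['word', 'word,', 'word,a1-b2', 'æk-ken,a-sd', 'K61-283', '1word'];
const results = Object.fromEntries(samples.map((s) => [s, wordList.test(s)]));
console.log(results);
// 'word,' fails (nothing after the comma), 'K61-283' fails (digit after the dash),
// '1word' fails (does not start with a letter); the other three pass.
```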
Update 2:
There are two ways to look at your requirements.
The top-down view
We treat the input as a list of one or more words, separated by comma or dash:
INPUT = WORD (?: [,\-] WORD )*
Each word consists of a letter, followed by zero or more letters or digits:
WORD = LETTER [ LETTER DIGIT ]*
Translated into JavaScript regex syntax this gives us:
WORD = [a-zøåæäöüß][a-zøåæäöüß\d]*
And for the whole input (with anchors):
/^[a-zøåæäöüß][a-zøåæäöüß\d]*(?:[,\-][a-zøåæäöüß][a-zøåæäöüß\d]*)*$/i
(This is Wiktor Stribiżew's answer.)
The bottom-up view
We start by looking at the allowed characters. We know that the first character has to be a letter. After that, there can be zero or more input elements:
INPUT = LETTER ELEMENT*
Each element is either
a letter or digit, or
a comma or dash, followed by a letter:
ELEMENT = [ LETTER DIGIT ] | [ COMMA DASH ] LETTER
Translating this into JavaScript gives us:
/^[a-zøåæäöüß](?:[a-zøåæäöüß\d]|[,\-][a-zøåæäöüß])*$/i
These two regexes are equivalent. The bottom-up regex is shorter and contains less repetitive code. On the other hand, the top-down regex may run faster on some regex engines if the input strings are mostly alphanumeric, with relatively few dashes/commas. On the gripping hand, if your inputs are short, you probably don't care about minuscule performance differences.
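The equivalence claim can be spot-checked by brute force. The sketch below compares both regexes on every string up to length 5 over a small alphabet (one letter, one digit, comma, dash); it is a sanity check under that assumption, not a proof, and allStrings is a hypothetical helper.

```javascript
const topDown = /^[a-zøåæäöüß][a-zøåæäöüß\d]*(?:[,\-][a-zøåæäöüß][a-zøåæäöüß\d]*)*$/i;
const bottomUp = /^[a-zøåæäöüß](?:[a-zøåæäöüß\d]|[,\-][a-zøåæäöüß])*$/i;

// Every string over `alphabet` with length 0..maxLength.
function allStrings(alphabet, maxLength) {
  let current = [''];
  const result = [''];
  for (let len = 1; len <= maxLength; len++) {
    current = current.flatMap((prefix) => [...alphabet].map((ch) => prefix + ch));
    result.push(...current);
  }
  return result;
}

const candidates = allStrings('a1,-', 5);
const disagreements = candidates.filter((s) => topDown.test(s) !== bottomUp.test(s));
console.log(candidates.length, disagreements.length); // 1365 0
```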
Here's a direct encoding of your (revised) requirements:
/^[a-zøåæäöüß](?:(?:[a-zøåæäöüß\d]|[,\-][a-zøåæäöüß])*[,\-]?[a-zøåæäöüß])?$/i
The idea is to match a letter, followed by either
the end of the string (this handles input strings of length 1), or
a list of 0 or more intermediates, optionally followed by a comma or dash, followed by another letter
Each intermediate is either
a letter, or
a digit, or
a comma or a dash followed by a letter
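As a sketch of how this variant behaves: because the optional group ends in a letter class, any input longer than one character has to end with a letter, so "a1" is rejected while "a1b" passes. The sample inputs are made up for illustration.

```javascript
const direct = /^[a-zøåæäöüß](?:(?:[a-zøåæäöüß\d]|[,\-][a-zøåæäöüß])*[,\-]?[a-zøåæäöüß])?$/i;

const samples = ['a', 'ab', 'a1', 'a1b', 'a-b', 'a-', 'a,1b', 'word,next'];
for (const s of samples) {
  console.log(s, '=>', direct.test(s));
}
```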
Try this out: (allows letters and digits after comma or dash)
/^[a-zøåæäöüß]([a-z0-9øåæäöüß]|(,|-)[a-z0-9øåæäöüß])*[a-zøåæäöüß]$/i
or this: (allows letters after comma or dash)
/^[a-zøåæäöüß]([a-z0-9øåæäöüß]|(,|-)[a-zøåæäöüß])*[a-zøåæäöüß]$/i
