Match pattern not preceded by character - javascript

I want to make my regex match a pattern only if it is not preceded by a character, the ^ (circumflex) in my case.
My regex:
/[^\^]\w+/g
Text to test it on:
Test: ^Anotherword
This matches "Test" and " Anotherword", even though the latter is preceded by a circumflex, which is what I was trying to prevent by putting [^\^] at the start. I want to exclude not only the circumflex but also the word that comes after it: " Anotherword" should not be matched.
[^\^] - This is what should stop the regex from matching if a circumflex is in front of the word.
\w+ - Match any word that is not preceded by a circumflex.
I cannot use lookbehind because of JavaScript limitations.

Use ([^^\w]|^)\w+
(see http://regexr.com/3e85b)
It basically injects a word boundary while excluding the ^ as well.
[^\w] followed by \w behaves like \W\b\w.
Otherwise [^^] will match the 'T'
and \w+ will match 'est'.
You can see it if you put capture groups around it.
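To make the capture-group remark concrete, the snippet below wraps the word in its own group and keeps only that group. It is a minimal sketch, assuming an environment with String.prototype.matchAll (ES2020); the variable names are illustrative and not part of the answer.

```javascript
// Sample text from the question.
const text = "Test: ^Anotherword";

// Group 1: the character before the word (or the empty match at the start of the string).
// Group 2: the word itself.
const re = /([^^\w]|^)(\w+)/g;

// Keep only group 2, dropping the consumed leading character.
const words = Array.from(text.matchAll(re), (m) => m[2]);

console.log(words); // [ "Test" ]
```

Because the leading character is consumed by group 1, reading m[0] instead of m[2] would include that character in the result.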

If matching the ^-prefixed word is not strictly forbidden, you can match it without capturing it.
(?:\^\w+)|(\w+): matches both expressions, but no group is generated for ^Anotherword.
(?:\^\w+): matches a ^-prefixed word such as ^Anotherword, but no group is generated.
(\w+): everything else, captured in a group.
In case you want ^Anotherword to have a group, simply remove ?:.
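A short sketch of how this pattern could be used for extraction: every match is visited, but only those where group 1 participated are kept. The sample text and variable names are assumptions for illustration.

```javascript
const text = "One Test: ^Anotherword";
const re = /(?:\^\w+)|(\w+)/g;

const words = [];
let m;
while ((m = re.exec(text)) !== null) {
  // m[1] is undefined when the ^-prefixed alternative matched.
  if (m[1] !== undefined) {
    words.push(m[1]);
  }
}

console.log(words); // [ "One", "Test" ]
```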

With the growing adoption of the ECMAScript 2018 standard, it makes sense to also consider the lookbehind approach:
const text = "One Test: ^Anotherword";
// Extracting words not preceded with ^:
console.log(text.match(/\b(?<!\^)\w+/g)); // => [ "One", "Test" ]
// Replacing words not preceded with ^ with some other text:
console.log(text.replace(/\b(?<!\^)\w+/g, '<SPAN>$&</SPAN>'));
// => <SPAN>One</SPAN> <SPAN>Test</SPAN>: ^Anotherword
The \b(?<!\^)\w+ regex matches one or more word chars (\w+) that have no word char (letter, digit or _) immediately on the left (achieved with a word boundary, \b) and no ^ char immediately on the left (achieved with the negative lookbehind (?<!\^)). Note that ^ is a special regex metacharacter that needs to be escaped to match it as a literal caret char.
For older JavaScript environments, it is still necessary to use a workaround:
var text = "One Test: ^Anotherword";
// Extracting words not preceded with ^:
var regex = /(?:[^\w^]|^)(\w+)/g, result = [], m;
while (m = regex.exec(text)) {
  result.push(m[1]);
}
console.log(result); // => [ "One", "Test" ]
// Replacing words not preceded with ^ with some other text:
var regex = /([^\w^]|^)(\w+)/g;
console.log(text.replace(regex, '$1<SPAN>$2</SPAN>'));
// => <SPAN>One</SPAN> <SPAN>Test</SPAN>: ^Anotherword
The extraction and replacement regexps differ in the number of capturing groups: when extracting, we only need one group, and when replacing we need both groups. If you decide to use a regex with two capturing groups for extraction, you would need to collect m[2] values.
The extraction pattern means:
(?:[^\w^]|^) - a non-capturing group matching
[^\w^] - any char other than a word char and ^
| - or
^ - start of string
(\w+) - Group 1: one or more word chars.

Related

Matching Arabic and English letters only javascript regex

I'm trying to write a regex that matches Arabic and English letters only. Numbers and special characters are not allowed; spaces are allowed.
This regex works fine, but it allows numbers in the middle of the string:
/[\u0620-\u064A\040a-zA-Z]+$/
For example, it matches (سم111111ر), which is not supposed to match.
Is there a way not to match numbers in the middle of the letters?
Note that in JavaScript you will have to use ECMAScript 2018+, which supports Unicode property escapes:
const texts = ['أسبوع أسبوع','week week','hunāka','سم111111ر'];
const re = /^(?:(?=[\p{Script=Arabic}A-Za-z])\p{L}|\s)+$/u;
for (const text of texts) {
  console.log(text, '=>', re.test(text));
}
The ^(?:(?=[\p{Script=Arabic}A-Za-z])\p{L}|\s)+$ means
^ - start of string
(?: - start of a non-capturing group container:
(?=[\p{Script=Arabic}A-Za-z]) - a positive lookahead that requires a char from the Arabic script or an ASCII letter to occur immediately to the right of the current location
\p{L} - any Unicode letter (note \p{Alphabetic} includes a bit more "letter" chars, you may want to try it out)
| - or
\s - whitespace
)+ - repeat one or more times
$ - end of string.
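As a side note on why the pattern from the question accepted سم111111ر: it has no ^ anchor, so it only needs a run of allowed characters at the very end of the string. The sketch below illustrates this with the question's character class; writing the space as \x20 instead of the octal \040 is a choice made here for readability, and the variable names are illustrative.

```javascript
// Character class from the question; \x20 stands in for the octal \040 (space).
const unanchored = /[\u0620-\u064A\x20a-zA-Z]+$/;
const anchored = /^[\u0620-\u064A\x20a-zA-Z]+$/;

const sample = "سم111111ر";

// Without ^, only the trailing letter has to match.
console.log(unanchored.test(sample)); // true

// With ^, every character must belong to the class.
console.log(anchored.test(sample)); // false
console.log(anchored.test("week week")); // true
```

The anchored range version stays limited to U+0620 through U+064A and ASCII letters, whereas the \p{Script=Arabic} pattern above covers the whole Arabic script.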

regular expression to match hashtags in both left to right and right to left languages

I use the following code to find words that start with hashtags:
var regex = /(?:^|\W)#(\w+)(?!\w)/g;
but it only matches English words and cannot match hashtags in other languages such as Arabic. So, how can I find hashtags in a text like this:
this is a simple #text هذا #نص بسیط
If the value after the # should not contain a # itself, you could use a negated character class [^\s#], matching any character except whitespace and #, with the # on either side by using an alternation |.
The value is in capture group 1.
(?:^|\s)(#[^\s#]+|[^\s#]+#)(?=$|\s)
const pattern = /(?:^|\s)(#[^\s#]+|[^\s#]+#)(?=$|\s)/;
[
  "this is a simple #test1",
  "هذا #نص بسیط",
  "test #test2#",
  "test #test3#test3",
  "test ##test4",
  "test test5##",
].forEach(s => {
  const m = s.match(pattern);
  if (m) console.log(m[1]);
});
You may use the following regex alternation:
(?<!\S)#\S+|\S+#(?!\S)
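A quick way to try this alternation, as a sketch that assumes an engine with lookbehind support (ES2018). The ASCII sample string and the variable names are made up for illustration; the second string is the one from the question.

```javascript
const re = /(?<!\S)#\S+|\S+#(?!\S)/g;

// Illustrative sample with a leading-hash and a trailing-hash tag.
const sample = "this is a simple #text and a trailing text2# marker";
const tags = sample.match(re);
console.log(tags); // [ "#text", "text2#" ]

// The same regex applied to the text from the question.
const mixed = "this is a simple #text هذا #نص بسیط";
console.log(mixed.match(re));
```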
Bearing in mind that a Unicode aware \w can be represented with [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}] (see What's the correct regex range for javascript's regexes to match all the non word characters in any script?), the direct Unicode equivalent of your pattern is
const uw = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`; // uw = Unicode \w
const regex = new RegExp(`(?<!${uw})#(${uw}+)(?!${uw})`, "gu");
Now, to match both directions, you may use
const regex = new RegExp(`(?<!${uw})(?:#(${uw}+)|${uw}+#)(?!${uw})`, "gu");
^_________^_______^
That is, a non-capturing group with an alternation | char is used with two alternatives that match either # + Unicode word chars on the right, or Unicode word chars and then a # on the right. Details:
(?<!${uw}) - a negative lookbehind that fails the match if there is a Unicode word char immediately on the left
(?:#(${uw}+)|${uw}+#) - a non-capturing group that matches either
#(${uw}+) - a # char followed with one or more Unicode word chars
| - or
${uw}+# - one or more Unicode word chars followed with a # char
(?!${uw}) - a negative lookahead that fails the match if there is a Unicode word char immediately on the right.
The g flag ensures multiple matches and u enables the Unicode property classes support in the pattern.
A JavaScript demo:
const strings = ["this is a simple #text #text2", "هذا #نن*&ص بسیط","#نص2 هذا #نص بسیط"];
const uw = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`; // uw = Unicode \w
const regex = new RegExp(`(?<!${uw})(?:#(${uw}+)|${uw}+#)(?!${uw})`, "gu");
strings.forEach( string => console.log(string, '=>', string.match(regex)))
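With the g flag, string.match returns whole matches only, so the captured tag text is lost. If just the tag names are needed, one possible variation is to add a second capturing group for the trailing-hash alternative and read the groups via matchAll. This is a sketch: the extra group, the sample string and the names variable are assumptions that are not part of the answer, and it relies on matchAll and ?? (ES2020).

```javascript
const uw = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`; // uw = Unicode \w

// Same pattern as above, with a second capturing group (added for this sketch)
// around the word that precedes a trailing #.
const regex = new RegExp(`(?<!${uw})(?:#(${uw}+)|(${uw}+)#)(?!${uw})`, "gu");

const text = "first #text then text2# end";

// Exactly one of the two groups participates in each match.
const names = Array.from(text.matchAll(regex), (m) => m[1] ?? m[2]);

console.log(names); // [ "text", "text2" ]
```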

Javascript in regexp not matching something

I want to match everything except the one with the string '1AB' in it. How do I do that? When I tried it, it said nothing is matched.
var text = "match1ABmatch match2ABmatch match3ABmatch";
var matches = text.match(/match(?!1AB)match/g);
console.log(matches[0]+"..."+matches[1]);
Lookarounds do not consume the text, i.e. the regex index does not move when their patterns are matched. See Lookarounds Stand their Ground for more details. You still must match the text with a consuming pattern, here, the digits.
Add \w+ word matching pattern after the lookahead. NOTE: You may also use \S+ if there can be any one or more non-whitespace chars. If there can be any chars, use .+ (to match 1 or more chars other than line break chars) or [^]+ (matches even line breaks).
var text = "match100match match200match match300match";
var matches = text.match(/match(?!100(?!\d))\w+match/g);
console.log(matches);
Pattern details
match - a literal substring
(?!100(?!\d)) - a negative lookahead that fails the match if, immediately to the right of the current location, there is 100 substring not followed with a digit (if you want to fail the matches where the number starts with 100, remove the (?!\d) lookahead)
\w+ - 1 or more word chars (letters, digits or _)
match - a literal substring
See the regex demo online.
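Applied to the text from the question, the same idea can be sketched as follows. The original pattern finds nothing because, after the lookahead, it still demands the literal "match" immediately after the first "match"; adding \w+ lets the regex consume the text in between. The variable names are illustrative.

```javascript
const text = "match1ABmatch match2ABmatch match3ABmatch";

// Original attempt: the lookahead consumes nothing, so this requires "matchmatch".
const broken = text.match(/match(?!1AB)match/g);
console.log(broken); // null

// With a consuming \w+ after the lookahead.
const matches = text.match(/match(?!1AB)\w+match/g);
console.log(matches); // [ "match2ABmatch", "match3ABmatch" ]
```

Since String#match returns null when nothing matches, indexing matches[0] as in the question throws a TypeError in that case.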

Regular expression capture with optional trailing underscore and number

I'm trying to find a regular expression that will match the base string without the optional trailing number (_123). e.g.:
lorem_ipsum_test1_123 -> capture lorem_ipsum_test1
lorem_ipsum_test2 -> capture lorem_ipsum_test2
I tried using the following expression, but it would only work when there is a trailing _number.
/(.+)(?>_[0-9]+)/
/(.+)(?>_[0-9]+)?/
Similarly, adding the ? (zero or one) quantifier only worked when there is no trailing _number; otherwise, the trailing _number would just be part of the first capture.
Any suggestions?
You may use the following expression:
^(?:[^_]+_)+(?!\d+$)[^_]+
^ Anchor beginning of string.
(?:[^_]+_)+ Repeated non capturing group. Negated character set for anything other than a _, followed by a _.
(?!\d+$) Negative lookahead for digits at the end of the string.
[^_]+ Negated character set for anything other than a _.
Regex demo here.
Please note that the \n in the character sets in the Regex demo are only for demonstration purposes, and should by all means be removed when using as a pattern in Javascript.
Javascript demo:
var myString = "lorem_ipsum_test1_123";
var myRegexp = /^(?:[^_]+_)+(?!\d+$)[^_]+/g;
var match = myRegexp.exec(myString);
console.log(match[0]);
var myString = "lorem_ipsum_test2";
var myRegexp = /^(?:[^_]+_)+(?!\d+$)[^_]+/g;
var match = myRegexp.exec(myString);
console.log(match[0]);
You might match any character and use a negative lookahead that asserts that what follows is not an underscore, one or more digits and the end of the string:
^(?:(?!_\d+$).)*
Explanation
^ Assert start of the string
(?: Non capturing group
(?! Negative lookahead to assert that what is on the right side is not
_\d+$ an underscore, one or more digits and the end of the string
). Close the negative lookahead and match any character
)* Close the non-capturing group and repeat zero or more times
const strings = [
"lorem_ipsum_test1_123",
"lorem_ipsum_test2"
];
let pattern = /^(?:(?!_\d+$).)*/;
strings.forEach((s) => {
  console.log(s + " ==> " + s.match(pattern)[0]);
});
You are asking for
/^(.*?)(?:_\d+)?$/
See the regex demo. The point here is that the first dot pattern must be non-greedy and the _\d+ should be wrapped with an optional non-capturing group and the whole pattern (especially the end) must be enclosed with anchors.
Details
^ - start of string
(.*?) - Capturing group 1: any zero or more chars other than line break chars, as few as possible due to the non-greedy ("lazy") quantifier *?
(?:_\d+)? - an optional non-capturing group matching 1 or 0 occurrences of _ and then 1+ digits
$ - end of string.
However, it seems easier to use a mere replacing approach,
s = s.replace(/_\d+$/, '')
If the string ends with _ and 1+ digits, the substring will get removed, else, the string will not change.
See this regex demo.
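Both approaches from this answer can be compared side by side. The helper names stripTrailingNumber and captureBase are hypothetical, introduced only for this sketch.

```javascript
// Replacing approach: drop a trailing _digits suffix if there is one.
function stripTrailingNumber(s) {
  return s.replace(/_\d+$/, "");
}

// Capturing approach: lazy group 1 followed by an optional _digits suffix.
function captureBase(s) {
  const m = s.match(/^(.*?)(?:_\d+)?$/);
  // Fall back to the input if the pattern does not match (e.g. the string has line breaks).
  return m ? m[1] : s;
}

for (const s of ["lorem_ipsum_test1_123", "lorem_ipsum_test2"]) {
  console.log(s, "=>", stripTrailingNumber(s), "|", captureBase(s));
}
```

For the two sample strings both helpers return lorem_ipsum_test1 and lorem_ipsum_test2 respectively.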
Try to check if the string contains the trailing number. If it does you get only the other part. Otherwise you get the whole string.
var str = "lorem_ipsum_test1_123";
if (/_[0-9]+$/.test(str)) {
  console.log(str.match(/(.+)(?=_[0-9]+)/g));
} else {
  console.log(str);
}
Or, a lot more concise:
str = str.replace(/_[0-9]+$/g, "")

javascript regex to check if first and last character are similar?

Is there any simple way to check if first and last character of a string are the same or not, only with regex?
I know you can check with charAt
var firstChar = str.charAt(0);
var lastChar = str.charAt(str.length - 1);
console.log(firstChar === lastChar);
I'm not asking for this : Regular Expression to match first and last character
You can use a regex with a capturing group and a backreference to it to assert that the starting and ending characters are the same, by capturing the first character. To test the regex match, use the RegExp#test method.
var regex = /^(.).*\1$/;
console.log(
regex.test('abcdsa')
)
console.log(
regex.test('abcdsaasaw')
)
Regex explanation here :
^ asserts position at start of the string
1st Capturing Group (.)
.* matches any character (except newline) - between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\1 matches the same text as most recently matched by the 1st capturing group
$ asserts position at the end of the string
The . doesn't match newline characters; to include newlines, update the regex as follows.
var regex = /^([\s\S])[\s\S]*\1$/;
console.log(
regex.test(`abcd
sa`)
)
console.log(
regex.test(`ab
c
dsaasaw`)
)
Refer : How to use JavaScript regex over multiple lines?
Regex explanation here :
[.....] - matches a single character present in the list
\s - matches any whitespace character (roughly equal to [\r\n\t\f\v ])
\S - matches any non-whitespace character (roughly equal to [^\r\n\t\f\v ])
Finally, [\s\S] matches any character.
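Note that both patterns above return false for a one-character string, because the backreference needs a second occurrence of the character. If such a string should count as starting and ending with the same character, one possible variation is to make everything after the first character optional. This is a sketch rather than part of the answer; it assumes the s (dotAll) flag from ES2018 is available, and the name sameEnds is illustrative.

```javascript
// (.)          capture the first character
// (?:.*\1)?    optionally: anything, then the same character again
// s flag       lets . match line breaks as well
const sameEnds = /^(.)(?:.*\1)?$/s;

console.log(sameEnds.test("a"));      // true
console.log(sameEnds.test("abcdsa")); // true
console.log(sameEnds.test("ab"));     // false
console.log(sameEnds.test("a\nba"));  // true
console.log(sameEnds.test(""));       // false
```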
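You can try this. Note that, because of the + quantifier, the group can capture more than one character, so the pattern tests whether the string starts and ends with the same substring: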
const rg = /^([\w\W]+)[\w\W]*\1$/;
console.log(
rg.test(`abcda`)
)
console.log(
rg.test(`aebcdae`)
)
console.log(
rg.test(`aebcdac`)
)
var rg = /^([ab])[ab]*\1$|^[ab]$/;
console.log(rg.test('aabbaa'))
console.log(rg.test('a'))
console.log(rg.test('b'))
console.log(rg.test('bab'))
console.log(rg.test('baba'))
This makes sure that the string contains no characters other than a and b and that it starts and ends with the same character.
It also matches single characters, because they too start and end with the same character.
