Regex remove all leading and trailing special characters? - javascript

Let's say I have the following string in javascript:
&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&
I want to remove all the leading and trailing special characters (anything that is not alphanumeric or a letter in another language) from all the words.
So the string should look like
a.b.c a.b.c a.b.c a.b.c a.b&.c a.b.&&dc ê.b..c
Notice that the special characters between the alphanumeric characters are left in place. The ê in the last word is also kept.

This regex should do what you want. It looks for
start of line, or some spaces (^| +) captured in group 1
some number of symbol characters [!-\/:-@\[-`\{-~]*
a minimal number of non-space characters ([^ ]*?) captured in group 2
some number of symbol characters [!-\/:-@\[-`\{-~]*
followed by a space or end-of-line (using a positive lookahead) (?=\s|$)
Matches are replaced with just groups 1 and 2 (the spacing and the characters between the symbols).
let str = '&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&';
str = str.replace(/(^| +)[!-\/:-@\[-`\{-~]*([^ ]*?)[!-\/:-@\[-`\{-~]*(?=\s|$)/gi, '$1$2');
console.log(str);
Note that if you want to preserve a string of punctuation characters on their own (e.g. as in Apple & Sauce), you should change the second capture group to insist on there being one or more non-space characters (([^ ]+?)) instead of none and add a lookahead after the initial match of punctuation characters to assert that the next character is not punctuation:
let str = 'Apple &&& Sauce; -This + !That!';
str = str.replace(/(^| +)[!-\/:-@\[-`\{-~]*(?![!-\/:-@\[-`\{-~])([^ ]+?)[!-\/:-@\[-`\{-~]*(?=\s|$)/gi, '$1$2');
console.log(str);

a-zA-Z\u00C0-\u017F is used to capture all valid characters, including diacritics.
The following is a single regular expression to capture each individual word. The logic is that it will look for the first valid character as the beginning of the capture group, and then the last sequence of invalid characters before a space character or string terminator as the end of the capture group.
const myRegEx = /[^a-zA-Z\u00C0-\u017F]*([a-zA-Z\u00C0-\u017F].*?[a-zA-Z\u00C0-\u017F]*)[^a-zA-Z\u00C0-\u017F]*?(\s|$)/g;
let myString = '&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&'.replace(myRegEx, '$1$2');
console.log(myString);

Something like this might help:
const string = '&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&';
const result = string.split(' ').map(s => /^[^a-zA-Z0-9ê]*([\w\W]*?)[^a-zA-Z0-9ê]*$/g.exec(s)[1]).join(' ');
console.log(result);
Note that this is not a single regex; it relies on some JavaScript helper code.
Rough explanation: We first split the string into an array of substrings at the spaces. We then transform each substring by stripping the leading and trailing special characters. We match the special characters with [^a-zA-Z0-9ê]*: because of the leading ^ inside the brackets, the class matches every character except those listed, i.e. every special character. Between these two classes we capture the relevant characters with ([\w\W]*?). \w matches word characters and \W matches non-word characters, so [\w\W] matches any character. Appending ? to * makes the quantifier lazy, so the group stops as soon as the trailing class can match the rest of the substring. We also start the regex with ^ and end it with $ so that it has to match the entire substring (they anchor the match to the start and the end of the string). With .exec(s)[1] we then run the regex on the substring and return the first capturing group from our transform function. Note that this is an empty string if a substring contains no alphanumeric characters. At the end we join the substrings with spaces.
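As a hedged variation on the split-and-strip steps above, the hard-coded ê can be swapped for Unicode property escapes so that letters from other alphabets are kept as well. This is only a sketch: it assumes an ES2018+ engine, and stripEdges is a hypothetical helper name, not part of the answer above.

```javascript
const input = '&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&';

// Hypothetical helper: removes leading and trailing characters that are
// neither Unicode letters (\p{L}) nor Unicode numbers (\p{N}).
const stripEdges = (word) => word.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, '');

// Split on spaces, strip each word, and join the words again.
const cleaned = input.split(' ').map(stripEdges).join(' ');
console.log(cleaned);
```

A word made up only of special characters becomes an empty string here, so it leaves two adjacent spaces in the joined result.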

Related

Matching Arabic and English letters only javascript regex

I'm trying to write a regex that matches Arabic and English letters only (numbers and special characters are not allowed); spaces are allowed.
This regex worked fine but allows numbers in the middle of the string
/[\u0620-\u064A\040a-zA-Z]+$/
for example, it matches (سم111111ر), which it is not supposed to match.
The question: is there a way not to match numbers in the middle of the letters?
Note that in JavaScript this requires ECMAScript 2018+, which supports Unicode property escapes (together with the u flag):
const texts = ['أسبوع أسبوع','week week','hunāka','سم111111ر'];
const re = /^(?:(?=[\p{Script=Arabic}A-Za-z])\p{L}|\s)+$/u;
for (const text of texts) {
console.log(text, '=>', re.test(text))
}
The ^(?:(?=[\p{Script=Arabic}A-Za-z])\p{L}|\s)+$ means
^ - start of string
(?: - start of a non-capturing group container:
(?=[\p{Script=Arabic}A-Za-z]) - a positive lookahead that requires a char from the Arabic script or an ASCII letter to occur immediately to the right of the current location
\p{L} - any Unicode letter (note \p{Alphabetic} includes a bit more "letter" chars, you may want to try it out)
| - or
\s - whitespace
)+ - repeat one or more times
$ - end of string.
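To see what the lookahead contributes, the sketch below compares the pattern above with a looser variant that accepts any Unicode letter. The looser pattern and the names strict and anyLetter are assumptions made for illustration; they are not part of the answer.

```javascript
// Pattern from the answer: each letter must be Arabic-script or ASCII.
const strict = /^(?:(?=[\p{Script=Arabic}A-Za-z])\p{L}|\s)+$/u;
// Hypothetical looser variant: any Unicode letter or whitespace.
const anyLetter = /^(?:\p{L}|\s)+$/u;

const samples = [
  'week week',
  '\u0633\u0645 \u0631',       // Arabic letters with a space
  'hun\u0101ka',               // "hunāka": ā is a letter, but not ASCII or Arabic
  '\u0633\u0645111111\u0631',  // Arabic letters with digits in the middle
];

for (const text of samples) {
  console.log(text, '=>', { strict: strict.test(text), anyLetter: anyLetter.test(text) });
}
```

Only the hunāka sample should differ between the two patterns, because the lookahead narrows \p{L} down to Arabic-script and ASCII letters.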

Javascript in regexp not matching something

I want to match everything except the one with the string '1AB' in it. How do I do that? When I tried it, it said nothing is matched.
var text = "match1ABmatch match2ABmatch match3ABmatch";
var matches = text.match(/match(?!1AB)match/g);
console.log(matches[0]+"..."+matches[1]);
Lookarounds do not consume the text, i.e. the regex index does not move when their patterns are matched. See Lookarounds Stand their Ground for more details. You still must match the text with a consuming pattern, here, the digits.
Add \w+ word matching pattern after the lookahead. NOTE: You may also use \S+ if there can be any one or more non-whitespace chars. If there can be any chars, use .+ (to match 1 or more chars other than line break chars) or [^]+ (matches even line breaks).
var text = "match100match match200match match300match";
var matches = text.match(/match(?!100(?!\d))\w+match/g);
console.log(matches);
Pattern details
match - a literal substring
(?!100(?!\d)) - a negative lookahead that fails the match if, immediately to the right of the current location, there is 100 substring not followed with a digit (if you want to fail the matches where the number starts with 100, remove the (?!\d) lookahead)
\w+ - 1 or more word chars (letters, digits or _)
match - a literal substring
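A small sketch of the difference the inner (?!\d) makes. The input is made up for illustration (it adds match1000match to the strings from the answer), and the variable names are arbitrary.

```javascript
const sample = "match100match match1000match match200match";

// With the inner (?!\d): only the exact number 100 is rejected.
const exactOnly = sample.match(/match(?!100(?!\d))\w+match/g);
console.log(exactOnly); // [ 'match1000match', 'match200match' ]

// Without it: every number that starts with 100 is rejected.
const prefixToo = sample.match(/match(?!100)\w+match/g);
console.log(prefixToo); // [ 'match200match' ]
```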

Capture field names in a multiline representation of Django model class

Hi, I need to write a regular expression for a VS Code extension that matches the fields of a class in its string representation:
const str = `
title = models.CharField(
blank=True
)
text = models.TextField()
author=ForeignKey(
User,
on_delete=models.CASCADE
)
test_num_10 = models.TextFIeld()
`
From the multiline string above I need to capture the strings title, text, author, test_num_10.
Each field follows this pattern:
white space
capturing group, any character
optional white space
= sign
optional white space
any character
( sign
optional any character
) sign.
So far my regexp looks like this /\s+(.+)(\s+)?\=(\s+)?.+\((.+)?\).+/. But it doesn't match what I expect, please help me figure it out.
For your example data, you could match the values using:
\s*(.+?)\s*=\s*.+\(([\s\S]*?)\)
That will match:
\s* Match zero or more times a whitespace character
(.+?) Capture in a group any character one or more times, non-greedy
\s* Match zero or more times a whitespace character
= Match literally
\s*.+ Match zero or more times a whitespace character followed by any character one or more times
\(([\s\S]*?)\) Between parentheses, capture in a group any character (including newlines), non-greedy
Your regex suffers from greediness (all those dot-stars consume up to the end of the line and then cause backtracking). You'd better use a more restrictive pattern wherever you can:
(\S+)\s*=\s*[^(]*\(([^)]*)\)
\S+ Matches one or more non-whitespace characters
\s* Matches zero or more whitespace characters (\s is the opposite of \S)
[^(]* Matches anything but an opening parenthesis (optional)
[^)]* Matches anything but a closing parenthesis (optional)
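A minimal sketch of how the restrictive pattern could be used to collect the field names. It assumes the g flag and String.prototype.matchAll (ES2020); the variable names are made up for the example.

```javascript
const source = `
title = models.CharField(
blank=True
)
text = models.TextField()
author=ForeignKey(
User,
on_delete=models.CASCADE
)
test_num_10 = models.TextFIeld()
`;

// Group 1 is the field name, group 2 is whatever sits between the parentheses.
const fieldPattern = /(\S+)\s*=\s*[^(]*\(([^)]*)\)/g;

// matchAll needs the g flag; keep only the first capture group of each match.
const fieldNames = Array.from(source.matchAll(fieldPattern), (m) => m[1]);
console.log(fieldNames); // [ 'title', 'text', 'author', 'test_num_10' ]
```

Because [^)]* stops at the first closing parenthesis, this sketch would break on field definitions that contain nested calls.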

JS RegEx for C# "lambda syntax"

/^[a-z][ ][=][>][ ][a-z?][.?][a-z0-9]+[ ][=][ ]['?][a-z0-9]+['?]]/i
I'm trying to figure out how to get a regex pattern that would recognize a string of lambda syntax (used in C#)
In the case of strings
"p => p = 'some random string'" //Must allow for single quotes
In the case of number or boolean values
"p => p = true" /*or*/ "p => p = 25" //Must allow for a string without single quotes with no whitespace at all in the event there are no single quotes
Also it must allow for a single '.' in the letter chosen to the left of the '=' sign
"p => p.firstName = 'Jack'"
How can I modify my regex to fulfill the following requirements:
start off with any letter
followed with a mandatory space
followed by a mandatory string '=>' (without single quotes)
followed by a mandatory space
followed by the same letter in the step 1 (or at least a single character)
followed by a period character (optional)
followed by any set of alphanumeric characters (required if there is a period from step 6)
followed by a space
followed by an equals sign
followed by a space
followed by any alphanumeric set of characters along with single quotes (but only if the single quotes encompass the set of alphanumeric characters)
First off, just the general point that you don't need [] around everything, only around character classes (e.g. [a-zA-Z] or [_\$0-9]).
So let's go through your steps in order:
Match any letter - you don't specify case, let's do both:
Lowercase only: ([a-z])
Uppercase only: ([A-Z])
Both: ([a-zA-Z]).
We wrap it in () so we can use it in a backreference later.
The mandatory string => with its surrounding spaces (merging steps 2-4) is just that, literally: " => ". As none of these are special characters, there is no need for escaping.
To get the same letter as step 1, we insert a backref to the first group (set of ()): \1
For step 6 & 7, we take the period along with one alphanumeric character to be optional: (\.\w)? and then zero or more alphanumeric characters: \w*
Next comes the literal string " = " (an equals sign with a space on each side); again none of these chars need to be escaped, so we include it directly.
For the last step we have several options:
Some numeric characters without whitespace: \d+
true or false (the i flag also accepts True and False): true|false
Or, single quote, any characters but the single quote and then single quote again: '[^']*' (we use a negated character class to get everything but ')
Now we join these together as alternatives using |
Putting all this together, we get the final regex:
/([a-zA-Z]) => \1(\.\w)?\w* = (\d+|true|false|'[^']*')/i
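A quick, hedged check of the final regex against the sample strings from the question. The last sample and the anchored variant are assumptions added for illustration: the regex above has no ^ or $, so it also matches when a lambda appears inside a longer string.

```javascript
const lambdaPattern = /([a-zA-Z]) => \1(\.\w)?\w* = (\d+|true|false|'[^']*')/i;

const samples = [
  "p => p = 'some random string'",
  "p => p = true",
  "p => p = 25",
  "p => p.firstName = 'Jack'",
  "p => q = 25", // different letter after =>, so the backreference fails
];
for (const s of samples) {
  console.log(s, '=>', lambdaPattern.test(s));
}

// Assumption: wrap the same pattern in ^...$ to reject surrounding text.
const anchored = new RegExp('^' + lambdaPattern.source + '$', 'i');
console.log(lambdaPattern.test("x = p => p = 25;")); // true (matches inside)
console.log(anchored.test("x = p => p = 25;"));      // false
```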

Regex to match # followed by square brackets containing a number

I want to parse a pattern similar to this using javascript:
#[10] or #[15]
With all my efforts, I came up with this:
#\\[(.*?)\\]
This pattern works fine, but the problem is that it matches anything between those square brackets. I want it to match only numbers. I tried these too:
#\\[(0-9)+\\]
and
#\\[([(0-9)+])\\]
But these match nothing.
Also, I want to match the pattern only when it is a complete word and not part of a word in the string, i.e. it should have spaces on both sides unless it starts or ends the string. That means it should not match a phrase like this:
abxdcs#[13]fsfs
Thanks in advance.
Use the regex:
/(?:^|\s)#\[([0-9]+)\](?=$|\s)/g
It matches only if the pattern (#[number]) is not part of a word: it must have whitespace on both sides unless it is at the start or end of the string.
The digits are captured in group 1, so use that group if you need them.
Testing code:
console.log(/(?:^|\s)#\[([0-9]+)\](?=$|\s)/g.test("#[10]")); // true
console.log(/(?:^|\s)#\[([0-9]+)\](?=$|\s)/g.test("#[15]")); // true
console.log(/(?:^|\s)#\[([0-9]+)\](?=$|\s)/g.test("abxdcs#[13]fsfs")); // false
console.log(/(?:^|\s)#\[([0-9]+)\](?=$|\s)/g.test("abxdcs #[13] fsfs")); // true
var r1 = /(?:^|\s)#\[([0-9]+)\](?=$|\s)/g
var match = r1.exec("#[10]");
console.log(match[1]); // 10
var r2 = /(?:^|\s)#\[([0-9]+)\](?=$|\s)/g
var match2 = r2.exec("abxdcs #[13] fsfs");
console.log(match2[1]); // 13
var r3 = /(?:^|\s)#\[([0-9]+)\](?=$|\s)/g
var match3;
while (match3 = r3.exec("#[111] #[222]")) {
console.log(match3[1]);
}
// while's output:
// 111
// 222
You were close, but you need to use square brackets:
#\[[0-9]+\]
Or, a shorter version:
#\[\d+\]
The reason you need those backslashes is to "escape" the square brackets, which are normally used to denote a "character class".
[0-9] creates a character class which matches exactly one digit in the range of 0 to 9. Adding the + changes the meaning to "one or more". \d is just shorthand for [0-9].
Of course, the backslash character is also used to escape characters inside of a javascript string, which is why you must escape them. So:
javascript
"#\\[\\d+\\]"
turns into:
regex
#\[\d+\]
which is used to match:
# a literal "#" symbol
\[ a literal "[" symbol
\d+ one or more digits (nearly identical to [0-9]+)
\] a literal "]" symbol
I say that \d is nearly identical to [0-9] because, in some regex flavors (including .NET), \d will actually match numeric digits from other cultures in addition to 0-9. In JavaScript, \d matches only 0-9.
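The string-versus-regex escaping described above can be checked with a short sketch. It builds the same pattern once from a string via the RegExp constructor and once as a regex literal; the variable names are arbitrary.

```javascript
// Inside a string literal every backslash has to be doubled.
const fromString = new RegExp("#\\[\\d+\\]");
// Inside a regex literal a single backslash is enough.
const fromLiteral = /#\[\d+\]/;

console.log(fromString.source);                        // #\[\d+\]
console.log(fromString.source === fromLiteral.source); // true
console.log(fromString.test("#[10]"));                 // true
console.log(fromString.test("#[abc]"));                // false

// JavaScript's \d is ASCII-only: the Arabic-Indic digit three is not matched.
console.log(/\d/.test("\u0663"));                      // false
```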
You don't need so many characters inside the character class. More importantly, you put the + in the wrong place. Try this: #\\[([0-9]+)\\].
