I have a string which will be used to create a filename. The original string pattern may include a dash. Recently, the pattern has changed and I need to handle the regular expression to remove the dashes near the end or middle of the string but not those at the beginning of the string.
Regular Expression Pattern Rules/Requirements:
Replace all special characters with an underscore with some exceptions
Remove dashes not located at the beginning of the string
The dashes which need to be kept are typically between numeric values [0-9] and can appear any number of times in the string (i.e. "23-564-8 Testing - The - String" -> "23-564-8_testing_the_string")
The dashes which should be converted to underscores are typically between [a-zA-Z] characters (i.e. "Testing - The - String" -> "testing_the_string")
Examples of Potential Strings:
23-564-8 Testing the String -> Expected Output: 23-564-8_testing_the_string
Testing - The String -> Expected Output: testing_the_string
23-564-8 Testing - The - String -> Expected Output: 23-564-8_testing_the_string
Opinion: Personally, I'm not a fan of including dashes in filename but it is a requirement
Current Regexp Solution:
var str = "23-564-8 Testing the String";
str.replace(/[^a-zA-Z0-9-]/g, '_').replace(/__/g, '_');
Question: What is the best way to handle this case? My current solution leaves all dashes in the string.
You may use this regex with a negative lookahead:
/[^a-zA-Z0-9-]+|-(?!\d)/g
RegEx Details:
[^a-zA-Z0-9-]+: Match 1 or more of any character that is not hyphen or alphanumeric
|: OR
-(?!\d): Match hyphen if it is NOT immediately followed by a digit
Code:
const arr = [
'23-564-8 Testing the String',
'Testing - The String',
'-23-564-8 Testing - The - String'
]
const re = /[^a-zA-Z0-9-]+|-(?!\d)/g
var result = []
arr.forEach(el => {
result.push( el.replace(re, '_').replace(/_{2,}/g, '_') )
})
console.log( result )
The following Regex pattern can be used with a replacement string $1_ (see demo):
(\d+(?:-+\d+)+)?[\W\-_]+
The pattern consists of two parts:
(\d+(?:-+\d+)+)? captures numbers with allowed dashes into the Group1
[\W\-_]+ captures special characters to be replaced
The Group1 is required to prevent allowed dashes from being replaced. The $1 token in the replacement string ensures that this content of Group1 will be kept in the result.
This Regex pattern also handles the scenario of duplicate _ characters, so .replace(/__/g, '_') is no longer required. The code can be transformed to:
var str = "23-564-8 Testing the String";
var res = str.replace(/(\d+(?:-+\d+)+)?[\W\-_]+/g, "$1_");
console.log(res);
I have string [FBWS-1] comes first than [FBWS-2]
In this string, I want to find all occurance of [FBWS-NUMBER]
I tried this :
var term = "[FBWS-1] comes first than [FBWS-2]";
alert(/^([[A-Z]-[0-9]])$/.test(term));
I want to get all the NUMBERS where [FBWS-NUMBER] string is matched.
But no success. I m new to regular expressions.
Can anyone help me please.
Note that ^([[A-Z]-[0-9]])$ matches start of a string (^), a [ or an uppercase ASCII letter (with [[A-Z]), -, an ASCII digit and a ] char at the end of the string. So,basically, strings like [-2] or Z-3].
You may use
/\[[A-Z]+-[0-9]+]/g
See the regex demo.
NOTE If you need to "hardcode" FBWS (to only match values like FBWS-123 and not ABC-3456), use it instead of [A-Z]+ in the pattern, /\[FBWS-[0-9]+]/g.
Details
\[ - a [ char
[A-Z]+ - one or more (due to + quantifier) uppercase ASCII letters
- - a hyphen
[0-9]+ - one or more (due to + quantifier) ASCII digits
] - a ] char.
The /g modifier used with String#match() returns all found matches.
JS demo:
var term = "[FBWS-1] comes first than [FBWS-2]";
console.log(term.match(/\[[A-Z]+-[0-9]+]/g));
You can use:
[\w+-\d]
var term = "[FBWS-1] comes first than [FBWS-2]";
alert(/[\w+-\d]/.test(term));
There are several reasons why your existing regex doesn't work.
You trying to match the beginning and ending of your string when you
actually want everything in between, don't use ^$
Your only trying to match one alpha character [A-Z] you need to make this greedy using the +
You can shorten [A-Z] and [0-9] by using the shorthands \w and \d. The brackets are generally unnecessary.
Note your code only returns a true false value (your using test) ATM it's unclear if this is what you want. You may want to use match with a global modifier (//g) instead of test to get a collection.
Here is an example using string.match(reg) to get all matches strings:
var term = "[FBWS-1] comes first than [FBWS-2]";
var reg1 = /\[[A-Z]+-[0-9]\]/g;
var reg2 = /\[FBWS-[0-9]\]/g;
var arr1 = term.match(reg1);
var arr2 = term.match(reg2)
console.log(arr1);
console.log(arr2);
Your regular expression /^([[A-Z]-[0-9]])$/ is wrong.
Give this regex a try, /\[FBWS-\d\]/g
remove the g if you only want to find 1 match, as g will find all similar matches
Edit: Someone mentioned that you want ["any combination"-"number"], hence if that's what you're looking for then this should work /\[[A-Z]+-\d\]/
I am building search and I am going to use javascript autocomplete with it. I am from Finland (finnish language) so I have to deal with some special characters like ä, ö and å
When user types text in to the search input field I try to match the text to data.
Here is simple example that is not working correctly if user types for example "ää". Same thing with "äl"
var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Does not work
var searchterm = "äl";
// does not work
//var searchterm = "ää";
// Works
//var searchterm = "wi";
if ( new RegExp("\\b"+searchterm, "gi").test(title) ) {
$("#result").html("Match: ("+searchterm+"): "+title);
} else {
$("#result").html("nothing found with term: "+searchterm);
}
http://jsfiddle.net/7TsxB/
So how can I get those ä,ö and å characters to work with javascript regex?
I think I should use unicode codes but how should I do that? Codes for those characters are:
[\u00C4,\u00E4,\u00C5,\u00E5,\u00D6,\u00F6]
=> äÄåÅöÖ
There appears to be a problem with Regex and the word boundary \b matching the beginning of a string with a starting character out of the normal 256 byte range.
Instead of using \b, try using (?:^|\\s)
var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Does not work
var searchterm = "äl";
// does not work
//var searchterm = "ää";
// Works
//var searchterm = "wi";
if ( new RegExp("(?:^|\\s)"+searchterm, "gi").test(title) ) {
$("#result").html("Match: ("+searchterm+"): "+title);
} else {
$("#result").html("nothing found with term: "+searchterm);
}
Breakdown:
(?: parenthesis () form a capture group in Regex. Parenthesis started with a question mark and colon ?: form a non-capturing group. They just group the terms together
^ the caret symbol matches the beginning of a string
| the bar is the "or" operator.
\s matches whitespace (appears as \\s in the string because we have to escape the backslash)
) closes the group
So instead of using \b, which matches word boundaries and doesn't work for unicode characters, we use a non-capturing group which matches the beginning of a string OR whitespace.
The \b character class in JavaScript RegEx is really only useful with simple ASCII encoding. \b is a shortcut code for the boundary between \w and \W sets or \w and the beginning or end of the string. These character sets only take into account ASCII "word" characters, where \w is equal to [a-zA-Z0-9_] and \W is the negation of that class.
This makes the RegEx character classes largely useless for dealing with any real language.
\s should work for what you want to do, provided that search terms are only delimited by whitespace.
this question is old, but I think I found a better solution for boundary in regular expressions with unicode letters.
Using XRegExp library you can implement a valid \b boundary expanding this
XRegExp('(?=^|$|[^\\p{L}])')
the result is a 4000+ char long, but it seems to work quite performing.
Some explanation: (?= ) is a zero-length lookahead that looks for a begin or end boundary or a non-letter unicode character. The most important think is the lookahead, because the \b doesn't capture anything: it is simply true or false.
\b is a shortcut for the transition between a letter and a non-letter character, or vice-versa.
Updating and improving on max_masseti's answer:
With the introduction of the /u modifier for RegExs in ES2018, you can now use \p{L} to represent any unicode letter, and \P{L} (notice the uppercase P) to represent anything but.
EDIT: Previous version was incomplete.
As such:
const text = 'A Fé, o Império, e as terras viciosas';
text.split(/(?<=\p{L})(?=\P{L})|(?<=\P{L})(?=\p{L})/);
// ['A', ' Fé', ',', ' o', ' Império', ',', ' e', ' as', ' terras', ' viciosas']
We're using a lookbehind (?<=...) to find a letter and a lookahead (?=...) to find a non-letter, or vice versa.
I would recommend you to use XRegExp when you have to work with a specific set of characters from Unicode, the author of this library mapped all kind of regional sets of characters making the work with different languages easier.
Despite the fact the issue seems to be 8 years old, I run into a similar problem (I had to match Cyrillic letters) not so far ago. I spend a whole day on this and could not find any appropriate answer here on StackOverflow. So, to avoid others making lots of effort, I'd like to share my solution.
Yes, \b word boundary works only with Latin letters (Word boundary: \b):
Word boundary \b doesn’t work for non-Latin alphabets
The word boundary test \b checks that there should be \w on the one side from the position and "not \w" – on the other side.
But \w means a Latin letter a-z (or a digit or an underscore), so the test doesn’t work for other characters, e.g. Cyrillic letters or hieroglyphs.
Yes, JavaScript RegExp implementation hardly supports UTF-8 encoding.
So, I tried implementing own word boundary feature with the support of non-Latin characters. To make word boundary work just with Cyrillic characters I created such regular expression:
new RegExp(`(?<![\u0400-\u04ff])${cyrillicSearchValue}(?![\u0400-\u04ff])`,'gi')
Where \u0400-\u04ff is a range of Cyrillic characters provided in the table of codes. It is not an ideal solution, however, it works properly in most cases.
To make it work in your case, you just have to pick up an appropriate range of codes from the list of Unicode characters.
To try out my example run the code snippet below.
function getMatchExpression(cyrillicSearchValue) {
return new RegExp(
`(?<![\u0400-\u04ff])${cyrillicSearchValue}(?![\u0400-\u04ff])`,
'gi',
);
}
const sentence = 'Будь-який текст кирилицею, де необхідно знайти слово з контексту';
console.log(sentence.match(getMatchExpression('текст')));
// expected output: ["текст"]
console.log(sentence.match(getMatchExpression('но')));
// expected output: null
I noticed something really weird with \b when using Unicode:
/\bo/.test("pop"); // false (obviously)
/\bä/.test("päp"); // true (what..?)
/\Bo/.test("pop"); // true
/\Bä/.test("päp"); // false (what..?)
It appears that meaning of \b and \B are reversed, but only when used with non-ASCII Unicode? There might be something deeper going on here, but I'm not sure what it is.
In any case, it seems that the word boundary is the issue, not the Unicode characters themselves. Perhaps you should just replace \b with (^|[\s\\/-_&]), as that seems to work correctly. (Make your list of symbols more comprehensive than mine, though.)
My idea is to search with codes representing the Finnish letters
new RegExp("\\b"+asciiOnly(searchterm), "gi").test(asciiOnly(title))
My original idea was to use plain encodeURI but the % sign seemed to interfere with the regexp.
http://jsfiddle.net/7TsxB/5/
I wrote a crude function using encodeURI to encode every character with code over 128 but removing its % and adding 'QQ' in the beginning. It is not the best marker but I couldn't get non alphanumeric to work.
What you are looking for is the Unicode word boundaries standard:
http://unicode.org/reports/tr29/tr29-9.html#Word_Boundaries
There is a JavaScript implementation here (unciodejs.wordbreak.js)
https://github.com/wikimedia/unicodejs
I had a similar problem, where I was trying to replace all of a particular unicode word with a different unicode word, and I cannot use lookbehind because it's not supported in the JS engine this code will be used in. I ultimately resolved it like this:
const needle = "КАРТОПЛЯ";
const replace = "БАРАБОЛЯ";
const regex = new RegExp(
String.raw`(^|[^\n\p{L}])`
+ needle
+ String.raw`(?=$|\P{L})`,
"gimu",
);
const result = (
'КАРТОПЛЯ сдффКАРТОПЛЯдадф КАРТОПЛЯ КАРТОПЛЯ КАРТОПЛЯ??? !!!КАРТОПЛЯ ;!;!КАРТОПЛЯ/#?#?'
+ '\n\nКАРТОПЛЯ КАРТОПЛЯ - - -КАРТОПЛЯ--'
)
.replace(regex, function (match, ...args) {
return args[0] + replace;
});
console.log(result)
output:
БАРАБОЛЯ сдффКАРТОПЛЯдадф БАРАБОЛЯ БАРАБОЛЯ БАРАБОЛЯ??? !!!БАРАБОЛЯ ;!;!БАРАБОЛЯ/#?#?
БАРАБОЛЯ БАРАБОЛЯ - - -БАРАБОЛЯ--
Breaking it apart
The first regex: (^|[^\n\p{L}])
^| = Start of the line or
[^\n\p{L}] = Any character which is not a letter or a newline
The second regex: (?=$|\P{L})
?= = Lookahead
$| = End of the line or
\P{L} = Any character which is not a letter
The first regex captures the group and is then used via args[0] to put it back into the string during replacement, thereby avoiding a lookbehind. The second regex utilized lookahead.
Note that the second one MUST be a lookahead because if we capture it then overlapping regex matches will not trigger (e.g. КАРТОПЛЯ КАРТОПЛЯ КАРТОПЛЯ would only match on the 1st and 3rd ones).
Trying to find text "myTest":
/(?<![\p{L}\p{N}_])myTest(?![\p{L}\p{N}_])/gu
Similar to NetBeans or Notepad++ form. Trying to find the expression without any letter or number or underscore (like \w characters of word boundary \b) in any unicode characters of letter and number before or after the expression.
I have had a similar problem, but I had to replace an array of terms. All solutions, which I have found did not worked, if two terms were in the text next to each other (because their boundaries overlaped). So I had to use a little modified approach:
var text = "Ještě. že; \"už\" à. Fürs, 'anlässlich' že že že.";
var terms = ["à","anlässlich","Fürs","už","Ještě", "že"];
var replaced = [];
var order = 0;
for (i = 0; i < terms.length; i++) {
terms[i] = "(^\|[ \n\r\t.,;'\"\+!?-])(" + terms[i] + ")([ \n\r\t.,;'\"\+!?-]+\|$)";
}
var re = new RegExp(terms.join("|"), "");
while (true) {
var replacedString = "";
text = text.replace(re, function replacer(match){
var beginning = match.match("^[ \n\r\t.,;'\"\+!?-]+");
if (beginning == null) beginning = "";
var ending = match.match("[ \n\r\t.,;'\"\+!?-]+$");
if (ending == null) ending = "";
replacedString = match.replace(beginning,"");
replacedString = replacedString.replace(ending,"");
replaced.push(replacedString);
return beginning+"{{"+order+"}}"+ending;
});
if (replacedString == "") break;
order += 1;
}
See the code in a fiddle: http://jsfiddle.net/antoninslejska/bvbLpdos/1/
The regular expression is inspired by: http://breakthebit.org/post/3446894238/word-boundaries-in-javascripts-regular
I can't say, that I find the solution elegant...
The correct answer to the question is given by andrefs.
I will only rewrite it more clearly, after putting all required things together.
For ASCII text, you can use \b for matching a word boundary both at the start and the end of a pattern. When using Unicode text, you need to use 2 different patterns for doing the same:
Use (?<=^|\P{L}) for matching the start or a word boundary before the main pattern.
Use (?=\P{L}|$) for matching the end or a word boundary after the main pattern.
Additionally, use (?i) in the beginning of everything, to make all those matchings case-insensitive.
So the resulting answer is: (?i)(?<=^|\P{L})xxx(?=\P{L}|$), where xxx is your main pattern. This would be the equivalent of (?i)\bxxx\b for ASCII text.
For your code to work, you now need to do the following:
Assign to your variable "searchterm", the pattern or words you want to find.
Escape the variable's contents. For example, replace '\' with '\\' and also do the same for any reserved special character of regex, like '\^', '\$', '\/', etc. Check here for a question on how to do this.
Insert the variable's contents to the pattern above, in the place of "xxx", by simply using the string.replace() method.
bad but working:
var text = " аб аб АБ абвг ";
var ttt = "(аб)"
var p = "(^|$|[^A-Za-zА-Я-а-я0-9()])"; // add other word boundary symbols here
var exp = new RegExp(p+ttt+p,"gi");
text = text.replace(exp, "$1($2)$3").replace(exp, "$1($2)$3");
const t1 = performance.now();
console.log(text);
result (without qutes):
" (аб) (аб) (АБ) абвг "
I struggled hard on this. Working with French accented characters, and I managed to find this solution :
const myString = "MyString";
const regex = new RegExp(
"(?:[^À-ú]|^)\\b(" + myString + ")\\b(?:[^À-ú]|$)",
"ig"
);
What id does :
It keeps checking word-boundaries with \b before and after "MyString".
In addition to that, (?:[^À-ú]|^) and (?:[^À-ú]|$) will check if MyString is not surrounded by any accented characters
It will not work with cyrillic but it may be possible to find the range of cirillic charactes and edit [^À-ú] in consequence.
Warning, it captures only the group (MyString) but the total match contains previous and next characters
See example : https://regex101.com/r/5P0ZIe/1
Match examples :
MyString
match : "MyString"
group 1 : "MyString"
Lorem ipsum. MyString dolor sit amet
match : " MyString "
group 1 : "MyString"
(MyString)
match : "(MyString)"
group 1 : "MyString"
BetweenCharactersMyStringIsNotFound
match : Nothing
group 1 : Nothing
éMyStringé
match : Nothing
group 1 : Nothing
ùMyString
match : Nothing
group 1 : Nothing
MyStringÖ
match : Nothing
group 1 : Nothing
I am stuck with creating regex such that if the word is preceded or ended by special character more than one regex on each side regex 'exec' method should throw null. Only if word is wrap with exactly one bracket on each side 'exec' method should give result Below is the regular expression I have come up with.
If the string is like "(test)" or then only regex.exec should have values for other combination such as "((test))" OR "((test)" OR "(test))" it should be null. Below code is not throwing null which it should. Please suggest.
var w1 = "\(test\)";
alert(new RegExp('(^|[' + '\(\)' + '])(' + w1 + ')(?=[' + '\(\)' + ']|$)', 'g').exec("this is ((test))"))
If you have a list of words and want to filter them, you can do the following.
string.split(' ').filter(function(word) {
return !(/^[!##$%^&*()]{2,}.+/).test(word) || !(/[!##$%^&*()]{2,}$).test(word)
});
The split() function splits a string at a space character and returns an array of words, which we can then filter.
To keep the valid words, we will test two regex expressions to see if the word starts or ends with 2 or more special characters respectively.
RegEx Breakdown
^ - Expression starts with the following
[] - A single character in the block
!##$%^&*() - These are the special characters I used. Replace them with the ones you want.
{2,} - Matches 2 or more of the preceeding characters
.+ - Matches 1 or more of any character
$ - Expression ends with the following
To use the exec function this way do this
!(/^[!##$%^&*()]{2,}.+/).exec(string) || !(/[!##$%^&*()]{2,}$).exec(string)
If I understand correctly, you are looking for any string which contains (test), anywhere in it, and exactly that, right?
In that case, what you probably need is the following:
var regExp = /.*[^)]\(test\)[^)].*/;
alert(regExp.exec("this is ((test))")); // → null
alert(regExp.exec("this is (test))" )); // → null
alert(regExp.exec("this is ((test)" )); // → null
alert(regExp.exec("this is (test) ...")); // → ["this is (test) ..."]
Explanation:
.* matches any character (except newline) between zero and unlimited times, as many times as possible.
[^)] match a single character but not the literal character )
This makes sure there's your test string in the given string, but it is only ever wrapped with one brace in every side!
You can use the following regex:
(^|[^(])(\(test\))(?!\))
See regex demo here, replace with $1<span style="new">$2</span>.
The regex features an alternation group (^|[^(]) that matches either start of string ^ or any character other than (. This alternation is a kind of a workaround since JS regex engine does not support look-behinds.
Then, (\(test\)) matches and captures (test). Note the round brackets are escaped. If they were not, they would be treated as a capturing group delimiters.
The (?!\)) is a look-ahead that makes sure there is no literal ) right after test). Look-aheads are supported fully by JS regex engine.
A JS snippet:
var re = /(^|[^(])(\(test\))(?!\))/gi;
var str = 'this is (test)\nthis is ((test))\nthis is ((test)\nthis is (test))\nthis is ((test\nthis is test))';
var subst = '$1<span style="new">$2</span>';
var result = str.replace(re, subst);
alert(result);
Here is a string str = '.js("aaa").js("bbb").js("ccc")', I want to write a regular expression to return an Array like this:
[aaa, bbb, ccc];
My regular expression is:
var jsReg = /.js\(['"](.*)['"]\)/g;
var jsAssets = [];
var js;
while ((js = jsReg.exec(find)) !== null) {
jsAssets.push(js[1]);
}
But the jsAssets result is
[""aaa").js("bbb").js("ccc""]
What's wrong with this regular expression?
Use the lazy version of .*:
/\.js\(['"](.*?)['"]\)/g
^
And it would be better if you escape the first dot.
This will match the least number of characters until the next quote.
jsfiddle demo
If you want to allow escaped quotes, use something like this:
/\.js\(['"]((?:\\['"]|[^"])+)['"]\)/g
regex101 demo
I believe it can be done in one-liner with replace and match method calls:
var str = '.js("aaa").js("bbb").js("ccc")';
str.replace(/[^(]*\("([^"]*)"\)[^(]*/g, '$1,').match(/[^,]+/g);
//=> ["aaa", "bbb", "ccc"]
The problem is that you are using .*. That will match any character. You'll have to be a bit more specific with what you are trying to capture.
If it will only ever be word characters you could use \w which matches any word character. This includes [a-zA-Z0-9_]: uppercase, lowercase, numbers and an underscore.
So your regex would look something like this :
var jsReg = /js\(['"](\w*)['"]\)/g;
In
/.js\(['"](.*)['"]\)/g
matches as much as possible, and does not capture group 1, so it matches
"aaa").js("bbb").js("ccc"
but given your example input.
Try
/\.js\(('(?:[^\\']|\\.)*'|"(?:[\\"]|\\.)*"))\)/
To break this down,
\. matches a literal dot
\.js\( matches the literal string ".js("
( starts to capture the string.
[^\\']|\\. matches a character other than quote or backslash or an escaped non-line terminator.
(?:[\\']|\\.)* matches the body of a string
'(?:[\\']|\\.)*' matches a single quoted string
(...|...) captures a single quoted or double quoted string
)\) closes the capturing group and matches a literal close parenthesis
The second major problem is your loop.
You're doing a global match repeatedly which makes no sense.
Get rid of the g modifier, and then things should work better.
Try this one - http://jsfiddle.net/UDYAq/
var str = new String('.js("aaa").js("bbb").js("ccc")');
var regex = /\.js\(\"(.*?)\"\){1,}/gi;
var result = [];
result = str.match (regex);
for (i in result) {
result[i] = result[i].match(/\"(.*?)\"/i)[1];
}
console.log (result);
To be sure that matched characters are surrounded by the same quotes:
/\.js\((['"])(.*?)\1\)/g