Regex to replace word in comma separated list - javascript

I am using JavaScript's .replace() to remove a word from a comma-separated list of words. I have a solution that works, but it is clumsy and I think there must be a better one.
The regex looks like:
/,word1|word1,?/
This handles the leading, embedded and trailing cases and removes the correct number of commas, e.g.:
word1,word2,word3
word2,word1,word3
word2,word3,word1
All result in word2,word3
word1 on its own is removed completely
Is there a way to do it without that repeated word1 in the regex pattern?

There is no really better or shorter pattern for this. All you can do is avoid false positives by adding word boundaries where the comma may be missing:
,word1\b|\bword1\b,?
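To illustrate what the word boundaries buy you, here is a minimal sketch comparing the two patterns. The input "word2,word10" is a made-up example of a false positive, not something from the question.

```javascript
// Original pattern from the question, and the word-boundary variant.
const naive = /,word1|word1,?/;
const bounded = /,word1\b|\bword1\b,?/;

// The bounded pattern still handles the leading, embedded, trailing and lone cases.
const lists = ["word1,word2,word3", "word2,word1,word3", "word2,word3,word1", "word1"];
const results = lists.map((list) => list.replace(bounded, ""));
console.log(results);

// Hypothetical false positive: "word10" merely starts with "word1".
const naiveResult = "word2,word10".replace(naive, "");     // eats ",word1" out of ",word10"
const boundedResult = "word2,word10".replace(bounded, ""); // leaves the list untouched
console.log(naiveResult, boundedResult);
```

Without the boundaries, the list comes back as "word20"; with them it is returned unchanged.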
If you have to deal with a large string, and only if the first letter isn't too frequent, you can try the first-character discrimination technique to reduce the number of positions where the pattern will be tested:
(?=[,w])(?:,word1\b|\bword1\b,?)
(Note that you need to test this on real data to see whether it is actually beneficial.)
Another way, without regex
If your target string is only a comma-separated list, split/join remains the best solution.
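A minimal sketch of the split/join approach, assuming the list contains no spaces around the commas. The function name removeWord is only illustrative.

```javascript
// Remove every exact occurrence of `word` from a comma-separated list.
function removeWord(list, word) {
  return list
    .split(",")                      // "a,b,c" -> ["a", "b", "c"]
    .filter((item) => item !== word) // drop exact matches only
    .join(",");                      // rebuild the list with single commas
}

console.log(removeWord("word1,word2,word3", "word1"));
console.log(removeWord("word2,word1,word3", "word1"));
console.log(removeWord("word1", "word1"));
```

Unlike a .replace() call without the g flag, this removes all occurrences of the word, and it never matches a longer word such as "word10".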


How to reduce complexity in regex?

I have a regex which finds all kinds of money amounts denoted in dollars, like $290, USD240, $234.45, 234.5$, 234.6usd
(\$)[0-9]+\.?([0-9]*)|usd+[0-9]+\.?([0-9]*)|[0-9]+\.?[0-9]*usd|[0-9]+\.?[0-9]*(\$)
This seems to work, but how can I reduce the complexity of my regex?
It is possible to make the regex a bit shorter by collapsing the currency indicators:
You can say USD OR $ amount instead of USD amount OR $ amount. This results in the following regex:
((\$|usd)[0-9]+\.?([0-9]*))|([0-9]+\.?[0-9]*(\$|usd))
I'm not sure if you'll find this less complex, but at least it's easier to read because it's shorter.
The character set [0-9] can also be replaced by \d -- the character class which matches any digit -- making the regex even shorter.
Doing this, the regex will look as follows:
((\$|usd)\d+\.?\d*)|(\d+\.?\d*(\$|usd))
Update:
According to @Toto this regex would be more performant using non-capturing groups (I also removed the unnecessary capture group, as pointed out by @Simon MᶜKenzie):
(?:\$|usd)\d+\.?\d*|\d+\.?\d*(?:\$|usd)
Amounts like $.0 are not matched by the regex, as @Gangnus pointed out. I updated the regex to fix this:
((\$|usd)((\d+\.?\d*)|(\.\d+)))|(((\d+\.?\d*)|(\.\d+))(\$|usd))
Note that I changed \d+\.?\d* into ((\d+\.?\d*)|(\.\d+)): It now either matches one or more digits, optionally followed by a dot, followed by zero or more digits; OR a dot followed by one or more digits.
Without unnecessary capturing groups and using non-capturing groups:
(?:\$|usd)(?:\d+\.?\d*|\.\d+)|(?:\d+\.?\d*|\.\d+)(?:\$|usd)
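As a quick check, the final pattern can be run against the sample amounts from the question. This is only a sketch: the g and i flags are my additions (the question lists both USD and usd), and the sample string is made up from the question's examples plus the $.5 case.

```javascript
// Final non-capturing pattern, with g (find all) and i (case-insensitive) added.
const money = /(?:\$|usd)(?:\d+\.?\d*|\.\d+)|(?:\d+\.?\d*|\.\d+)(?:\$|usd)/gi;

const sample = "$290, USD240, $234.45, 234.5$, 234.6usd, $.5";
const found = sample.match(money);
console.log(found);
```

Each of the six amounts in the sample string is returned as its own match.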
Try this
^(?:\$|usd)?(?:\d+\.?\d*)(?:\$|usd)?$
By reducing the complexity you are reducing the correctness. The following regex works correctly, although it is case-sensitive (that could be handled with a flag). None of the other current answers here has a correct subpattern for the decimal number.
^\s*(?:(?:(?:-?(?:usd|\$)|(?:usd|\$)-)(?:(?:0|[1-9]\d*)?(?:\.\d+)?(?<=\d)))|(?:-?(?:(?:0|[1-9]\d*)?(?:\.\d+)?(?<=\d))(?:usd|\$)))\s*$
Look here at the test results.
Get the pattern correct first, and only then try to shorten it.

Parsing units with javascript regex

Say I have a string which contains some units (which may or may not have prefixes) that I want to break into the individual units. For example the string may contain "Btu(th)" or "Btu(th).ft" or even "mBtu(th).ft" where mBtu(th) is the bastardised unit milli thermochemical BTU's (this is purely an example).
I currently have the following (simplified) regex; however, it fails for the case "mBtu(th).ft":
/(m|k)??(Btu\(th\)|ft|m)(?:\b|\s|$)/g
Currently this does not correctly detect the boundary between the end of 'Btu(th)' and the start of 'ft'. I understand JavaScript regex does not support lookbehind, so how do I accurately parse the string?
Additional notes
The regex presented above is greatly simplified around the prefixes and units groups. The prefixes could span multiple characters like 'Ki' and therefore character sets are not suitable.
The goal is for each match to capture the prefix as group 1 and the unit as group 2, i.e. for 'mBtu(th).ft' match one would be ['m','Btu(th)'] and match two would be ['','ft'].
The prefix match needs to be lazy so that the string 'm' would be matched as the unit metres rather than the prefix milli. Likewise the match for 'mm' would need to be the prefix milli and the unit metres.
I would try with:
/((m)|(k)|(Btu(\(th\))?)|(ft)|(m)|(?:\.))+/g
At least with the example above, it matches all units merged into one string.
EDIT
Another try:
/(?:(m)|(k)|(Btu)|(th)|(ft)|[\.\(\)])/g
This one again matches only one part at a time, but if you use $1, $2, $3, $4, etc., you can extract the other fragments. It ignores the ., ( and ) characters. The difficulty is keeping track of which groups matched, but it works to some degree.
Or, if you accept multiple separate matches, I think a simple alternative is:
/(m|k|Btu|th|ft)/g
A word boundary will not separate two non-word characters. So, you don't actually want a word boundary since the parentheses and period are not valid word characters. Instead, you want the string to not be followed by a word character, so you can use this instead:
[mk]??(Btu\(th\)|ft|m)(?!\w)
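A minimal sketch of how the negative lookahead behaves on the strings from the question. It keeps the question's capturing (m|k) prefix group instead of [mk] so the prefix is available as group 1. The helper name parseUnits is hypothetical.

```javascript
// Lazy optional prefix, then a unit that must not be followed by a word character.
const unitPattern = /(m|k)??(Btu\(th\)|ft|m)(?!\w)/g;

// Return [prefix, unit] pairs; a missing prefix becomes "".
function parseUnits(text) {
  return [...text.matchAll(unitPattern)].map((m) => [m[1] ?? "", m[2]]);
}

console.log(parseUnits("mBtu(th).ft")); // milli + Btu(th), then ft
console.log(parseUnits("m"));           // metres, no prefix
console.log(parseUnits("mm"));          // milli + metres
```

Because the prefix is lazy, the engine first tries to read "m" as a unit and only falls back to treating it as a prefix when the lookahead rejects that reading.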
I believe you're after something like this, if I understood correctly that you want to match any kind of element, possibly preceded by the m or k character and separated by parentheses or dots.
/[\s\.\(]*(m|k?)(\w+)[\s\.\)]*/g
https://regex101.com/r/eQ5nR4/2
If you don't care about being able to match the parentheses but just want to return the elements, you can just do:
/(m|k?)(\w+)/g
https://regex101.com/r/oC1eP5/1

Regex to find last space character

I am looking for a regex that will give me the index of the last space in a string using javascript.
I was using Google to find a suitable regex, but had no success.
Even the SO question "Regex to match last space character" does not offer a solution, because the goal there was to remove more than one character at the end.
What is the correct regex?
As I commented, I would just use lastIndexOf(), but here is a regex solution:
The regex / [^ ]*$/ finds the last space character in a string. Use it like this:
// Alerts 9
alert("this is a str".search(/ [^ ]*$/));
The correct solution is not to use a regex at all but the built-in lastIndexOf method that strings have. Regexes are meant to match strings, not give you indexes (even though grouped matches may be returned as index+length instead of a string; C-based regex libraries usually do so to avoid unnecessary copying).
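For comparison, a short sketch showing that both approaches return the same index on the example string. Both also return -1 when the string contains no space.

```javascript
const text = "this is a str";

// Regex approach: a space followed only by non-space characters up to the end.
const viaRegex = text.search(/ [^ ]*$/);

// Built-in approach: no regex needed.
const viaLastIndexOf = text.lastIndexOf(" ");

console.log(viaRegex, viaLastIndexOf);      // both 9
console.log("nospace".search(/ [^ ]*$/));   // -1
console.log("nospace".lastIndexOf(" "));    // -1
```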

JavaScript RegEx to white list chars, how bad is my approach?

I'm using JavaScript RegEx to filter input (white list only acceptable chars). As .match() returns an array, the best way I found to 'glue' back together the string is as follows, which seems ugly, as then I have to remove the comma.
myString.match(/[A-Za-z-_0-9]/g).toString().replace(/,/g,'')
Is there a better RegEx approach in JS, or a better way to handle the array (e.g. like .join in Ruby)?
Thanks
Brian
There is a join in JavaScript as well. For instance:
myString.match(/[A-Za-z-_0-9]/g).join("")
The "" is the separator between each element of the array, so [1, 2, 3].join("") gives "123". However, you could also simply replace all characters not in your whitelist:
myString.replace(/[^A-Za-z-_0-9]/g, "")
This simply removes any character that isn't alphanumeric, a dash, or an underscore.
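One practical difference between the two approaches: .match() with the g flag returns null when nothing matches, so chaining .join("") onto it throws for input with no whitelisted characters, while .replace() always returns a string. A minimal sketch, where the function name whitelist is illustrative and the dash is moved to the end of the character class purely for readability:

```javascript
// Keep only letters, digits, underscore and dash.
function whitelist(input) {
  return input.replace(/[^A-Za-z0-9_-]/g, "");
}

console.log(whitelist("a b!c-d_e9")); // "abc-d_e9"
console.log(whitelist("!!!"));        // "" (empty string, no error)

// The match-based version has nothing to join in that case.
console.log("!!!".match(/[A-Za-z0-9_-]/g)); // null
```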

Struggling with regex to match only two of a character, not three

I need to match all occurrences of // in a string with a JavaScript regex.
It can't match /// or /
So far I have (.*[^\/])\/{2}([^\/].*)
which is basically "something that isn't /, followed by // followed by something that isn't /"
The approach seems to work apart from when the string I want to match starts with //
This doesn't work:
//example
This does
stuff // example
How do I solve this problem?
Edit: A bit more context - I am trying to replace // with !, so I am then using:
result = result.replace(myRegex, "$1 ! $2");
Replace two slashes that either begin the string or do not follow a slash,
and are followed by anything not a slash or the end of the string.
s=s.replace(/(^|[^/])\/{2}([^/]|$)/g,'$1!$2');
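A small sketch exercising this replacement on a few inputs. The function name replaceDoubleSlash and the test strings other than those in the question are my own.

```javascript
// Replace "//" with "!" unless it is part of a longer run of slashes.
function replaceDoubleSlash(s) {
  return s.replace(/(^|[^/])\/{2}([^/]|$)/g, "$1!$2");
}

console.log(replaceDoubleSlash("//example"));        // "!example"
console.log(replaceDoubleSlash("stuff // example")); // "stuff ! example"
console.log(replaceDoubleSlash("example//"));        // "example!"
console.log(replaceDoubleSlash("a///b"));            // "a///b" (left alone)

// Caveat: the character after one "//" is consumed by that match, so a second
// "//" separated from the first by a single character is missed.
console.log(replaceDoubleSlash("a//b//c"));          // "a!b//c"
```

The last case only arises when exactly one character separates two pairs of slashes, because the groups consume the neighbouring characters rather than merely inspecting them.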
It looks like it wouldn't work for example// either.
The problem is that you're matching // preceded and followed by at least one non-slash character. This can be solved by anchoring the regex, and then you can make the preceding/following text optional:
^(.*[^\/])?\/{2}([^\/].*)?$
Use negative lookahead/lookbehind assertions:
(.*)(?<!/)//(?!/)(.*)
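Lookbehind assertions were added to JavaScript in ES2018, so this only works in engines that support them. A minimal sketch under that assumption, with the slashes escaped for a regex literal and the surrounding (.*) groups dropped, since lookarounds do not consume the neighbouring characters:

```javascript
// "//" not preceded by "/" and not followed by "/".
const doubleSlash = /(?<!\/)\/\/(?!\/)/g;

console.log("//example".replace(doubleSlash, "!"));  // "!example"
console.log("a//b//c".replace(doubleSlash, "!"));    // "a!b!c"
console.log("///x//y".replace(doubleSlash, "!"));    // "///x!y"
```

Because nothing around the slashes is consumed, adjacent occurrences separated by a single character are both replaced.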
Use this:
/([^/]*)(\/{2})([^/]*)/g
e.g.
alert("///exam//ple".replace(/([^/]*)(\/{2})([^/]*)/g, "$1$3"));
EDIT: Updated the expression as per the comment.
/[/]{2}/
e.g:
alert("//example".replace(/[/]{2}/, ""));
This does not answer the OP's question about using regex. However, some of the original comments suggested .replaceAll, not everyone who reads this question in the future will want to use regex, and people might mistakenly assume that regex is the only option. Since these details do not fit in a comment, here is a poor man's non-regex approach:
Temporarily replace the three contiguous characters with something that would never naturally occur — really important when dealing with user-entered values.
Replace the remaining two contiguous characters using .replaceAll().
Return the original three contiguous characters.
For instance, let's say you wanted to remove all instances of ".." without affecting occurrences of "...".
var cleansedText = $(this).text().toString()
.replaceAll("...", "☰☸☧")
.replaceAll("..", "")
.replaceAll("☰☸☧", "...")
;
$(this).text(cleansedText);
Perhaps not as fast as regex for longer strings, but works great for short ones.
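The same three steps can be sketched in plain JavaScript without jQuery. This version swaps in split/join for .replaceAll so it does not depend on that method being available. The function name and the default placeholder are assumptions; the placeholder must be something that never occurs in the input.

```javascript
// Remove every ".." while leaving "..." intact.
function removeDoubleDots(text, placeholder = "\u0000") {
  return text
    .split("...").join(placeholder) // 1. hide the three-character runs
    .split("..").join("")           // 2. remove the remaining two-character runs
    .split(placeholder).join("..."); // 3. restore the three-character runs
}

console.log(removeDoubleDots("a..b...c")); // "ab...c"
```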
