I'm trying to construct a regular expression to treat delimited speech marks (\") as a single character.
The following code compiles fine, but terminates on trying to initialise rgx, throwing the error Abort trap: 6 using libc++.
std::regex rgx("[[.\\\\\".]]");
std::smatch results;
std::string test_str("\\\"");
std::regex_search(test_str, results, rgx);
If I remove the [[. .]], it runs fine, results[0] returning \" as intended, but as said, I'd like for this sequence to be usable as a character class.
Edit: Ok, I realise now that my previous understanding of collated sequences was incorrect, and the reason it wouldn't work is that \\\\\" is not defined as a sequence. So my new question: is it possible to define collated sequences?
So I figured out where I was going wrong and thought I'd leave this here in case anyone stumbles across it.
You can specify a passive group of characters with (?:sequence), allowing quantifiers to be applied as with a character class. Perhaps not exactly what I'd originally asked, but fulfils the same purpose, in my case at least.
To match a string beginning and ending with double quotation marks (including these characters in the results), but allowing delimited (backslash-escaped) quotation marks within the string, I used the expression
\"(?:[^\"\\\\]+|(?:\\\\\\\\)+|\\\\\")*\"
which says to grab as many characters as possible, provided they are not quotation marks or backslashes; if this does not match, it first attempts to match an even number of backslashes (to allow delimiting of this character), and then a delimited quotation mark. This non-capturing group is matched as many times as possible, stopping only when it reaches an undelimited \".
I couldn't comment on the efficiency of this, but it definitely works.
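Since std::regex uses the ECMAScript grammar by default, the same idea can be tried out quickly as a JavaScript regex literal, which avoids the doubled C++ string escaping. The sketch below is only an illustration: the names quoted, input and match are hypothetical, and the pattern is written without a second ^ inside the character class, on the assumption that a literal caret should be allowed inside the string.

```javascript
// Opening quote, then any mix of: plain characters, pairs of backslashes,
// or a backslash-escaped quote; then the closing quote.
const quoted = /"(?:[^"\\]+|(?:\\\\)+|\\")*"/;

// String.raw keeps the backslashes exactly as typed.
const input = String.raw`say "a \"quoted\" word" now`;

const match = input.match(quoted);
console.log(match[0]); // the whole quoted string, outer quotes included
```

The non-capturing group (?:...) is what lets the * quantifier apply to the whole alternation, which is the role the original question hoped a character class could play.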
I want to build a localization application for my javascript (pebble.js) application. It should find all Strings between l._(" and ") or ') or ", or ',.
So for example I have the line
console.log(l._("This is a Test") + l._('%# times %# equals %#', 2, 4, (2*4)));
With the Swift application I should get an Array like this:
["This is a Test", "%# times %# equals %#"]
Right now I have no clue how I should manage it. Should I use a Regex, NSScanner or should I split the strings?
Thanks!
I'm not familiar with NSScanner, but with a regexp the solution is quite easy, assuming that a string has the same delimiter at both ends. (I.e., you don't have stuff like l._("Hello world').) I think that wouldn't be valid syntax in JavaScript, so let's assume that is the case.
Also, let's assume that the strings don't contain any escaped quotes (of the same kind that is used as the delimiter), i.e. there are no strings like l._("Hello \" world").
Now you could use the following two regexps to find strings delimited by double quotes and those delimited by single quotes:
l\._\("((?:[^"])*)" -- for double quotes
l\._\('((?:[^'])*)' -- for single quotes
Then, you have to run these two regexps on your input, and get the result of the first capturing group for each match. (I'm not sure exactly how Swift uses regexps, but note that in many implementations, capturing group #0 is the whole match, and capturing group #1 is what is located between the first pair of parentheses; you need the latter.)
Also, note that you don't have to care what comes after the closing quote: whether it's a parenthesis, or a comma, it's the same for you, as you only have to look up to the quote.
EDIT: Corrected the regexp (repetition should be inside capturing group).
EDIT2: If we allow escaped quotes, then you can use this regexp for the single-quoted case:
l\._\('((?:\\'|[^'])*)'. And a similar one for the double-quoted case.
You can play around with the regex on this link:
https://regex101.com/r/tY1jB5/1
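As a rough illustration of the approach above, here is a sketch in JavaScript (not Swift, so the calls differ from what the Swift application would use). It folds the two quote styles into one pattern with two capturing groups; the names source, pattern and found are hypothetical, and combining the regexps this way is an assumption rather than something the answer prescribes.

```javascript
// The example line from the question, kept as plain text.
const source = `console.log(l._("This is a Test") + l._('%# times %# equals %#', 2, 4, (2*4)));`;

// Group 1 captures a double-quoted string, group 2 a single-quoted one.
// Escaped quotes of the same kind are allowed inside each.
const pattern = /l\._\((?:"((?:\\"|[^"])*)"|'((?:\\'|[^'])*)')/g;

// Only one of the two groups is set for any given match.
const found = Array.from(source.matchAll(pattern), (m) =>
  m[1] !== undefined ? m[1] : m[2]
);

console.log(found);
```

Group #0 (the whole match) is ignored here; only the text between the quotes is collected.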
I've seen the various posts regarding this topic, but I'm getting a strange result when I do the following:
var dirtyString = '<>I\really|\re\ad?"the/wh\ole*:da|\y?.'
var cleanString = dirtyString.replace(/[\/:*?"<>|.]/g, "");
console.log(cleanString);
It removes all the illegal characters, but the "r" letters are also removed. In the console log I'm getting "Ieallyeadthewholeday". It seems that "\" before "r" erases the "r", while "\" doesn't erase other letters it comes before. Am I missing something?
If you try console.log(dirtyString), you will see that your "r" letters are "missing" there too.
This is because '\r' is an escape sequence for the carriage return character (code 13). Your replace() does nothing to it: it is still there, it just isn't displayed. Try playing with String.charAt() and String.charCodeAt() and you will see that the character is still there.
As a side note, you are trying to remove "blacklisted" characters, and blacklisting is almost never the right approach. As you can see in your own case, you forgot to blacklist the '\r' character (and many others). Whitelisting is much safer. For example, you may decide that you accept only Latin letters and digits, then remove everything not whitelisted: var cleanString = dirtyString.replace(/[^a-z0-9]/gi, "");.
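A short sketch of that check, using the string from the question. The names codeAfterI, visible and whitelisted are hypothetical; JSON.stringify is used here only as a convenient way to make the invisible character show up.

```javascript
const dirtyString = '<>I\really|\re\ad?"the/wh\ole*:da|\y?.';
const cleanString = dirtyString.replace(/[\/:*?"<>|.]/g, "");

// The character right after "I" is a carriage return, not the letter r.
const codeAfterI = cleanString.charCodeAt(1);
console.log(codeAfterI); // 13

// JSON.stringify renders control characters as visible escapes.
const visible = JSON.stringify(cleanString);
console.log(visible);

// Whitelisting keeps only Latin letters and digits, so the \r goes too.
const whitelisted = dirtyString.replace(/[^a-z0-9]/gi, "");
console.log(whitelisted);
```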
\r is the Carriage Return character. If you want a backslash followed by an r then you need to escape the backslash: \\r.
\y is not a recognised escape sequence, so JavaScript simply drops the backslash and interprets it as y (the same happens to \a and \o in your string). Other programming languages, like C#, will instead raise a compiler error about an unrecognised escape sequence.
Further confounding things: most regular-expression syntaxes have their own backslash escape sequences that are distinct from the hosting language's, such as the character classes \W, \d etc. In a regex literal (/\W/, /\d/) they work as written, because the literal is not parsed as a string. In a string passed to the RegExp constructor they do not, because the string parser drops the backslash first, so there the backslashes must be doubled (\\W, \\d). In this author's opinion it makes sense to keep track of which layer consumes each backslash, both to make things really clear to the reader and to keep your regexes portable between languages.
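The difference between the string layer and the regex layer can be seen in a few lines. This is a small sketch with hypothetical variable names, comparing string literals with and without doubled backslashes and a regex literal with the RegExp constructor.

```javascript
// String layer: \r is a real escape, \\r is backslash + r, \y is just y.
const carriageReturn = "\r";
const backslashR = "\\r";
const unknownEscape = "\y";

// Regex layer: a literal keeps \d as the digit class.
const fromLiteral = /\d+/;

// Via a string, the backslash must be doubled to survive string parsing.
const fromString = new RegExp("\\d+");

// Without doubling, the string is just "d+", so the regex matches the letter d.
const undoubled = new RegExp("\d+");

console.log(fromString.source, undoubled.source);
```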
Tried to search for /\,$/ online, but couldn't find anything.
I have:
coords = coords.replace(/\,$/, "");
I'm guessing it returns the index of the coords string. What should I search for online so I can learn more?
/\,$/ finds the comma character (,) at the end of a string (denoted by the $) and replaces it with empty (""). You sometimes see this in regex code aiming to clean up excerpts of text.
It's a regular expression to remove a trailing comma.
That thing is a Regular Expression, also known as regex or regexp. It is a way to "match" strings using some rules. If you want to learn how to use it in JavaScript, read the Mozilla Developer Network page about RegExp.
By the way, regular expressions are also available on most languages and in some tools. It is a very useful thing to learn.
That's a regular expression that finds a comma at the end of a string. That code removes the comma.
The surrounding slashes (/ ... /) delimit a JavaScript regular expression literal, which is used to match a pattern within a string.
\,$ is the pattern
In this case \, translates to ,. A backslash is used to escape special characters, but in this case it's not necessary. An example where it would be necessary is removing trailing periods. If you tried to do that with /.$/, the period would have a different meaning: it is a wildcard that matches [almost] any character (except line terminators). So to match a literal "." (period character) you have to escape it: /\.$/.
When $ is placed at the end of the pattern, it anchors the match to the end of the string. This means that you can't mistakenly match a comma in the middle of the string (e.g., not the one after help in help, me,), only at the end (trailing). It can also speed up the regular expression search. If you wanted to match characters only at the beginning of the string, you would start the pattern with a caret (^); for instance, /^,/ would find a comma at the start of a string if one existed.
It's also important to note that you're only removing one comma, whereas if you use the plus (+) after the comma, you'd be replacing one or more: /,+$/.
Without the +; trailing commas,, becomes trailing commas,
With the +; no trailing comma,, becomes no trailing comma
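The patterns discussed above can be compared side by side. This is a minimal sketch; the helper names removeOne, removeAll and removeDot are hypothetical and exist only for the illustration.

```javascript
// Remove a single trailing comma (the backslash in /\,$/ is optional).
const removeOne = (s) => s.replace(/,$/, "");

// Remove every trailing comma.
const removeAll = (s) => s.replace(/,+$/, "");

// Remove a trailing period; here the backslash is required.
const removeDot = (s) => s.replace(/\.$/, "");

console.log(removeOne("trailing commas,,"));   // trailing commas,
console.log(removeAll("no trailing comma,,")); // no trailing comma
console.log(removeOne("help, me,"));           // help, me
console.log(removeDot("done."));               // done

// Unescaped, the dot is a wildcard and eats the last character, whatever it is.
console.log("done".replace(/.$/, ""));         // don
```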
The title might seem a bit recursive, and indeed it is.
I am working on a JavaScript script which can highlight/color JavaScript code displayed in HTML. Thus, in the browser, comments will be turned green, keywords (for, if, while, etc.) will be turned dark blue and italic, numbers will be red, and so on for other elements. However, the coloring is not all that important.
I am trying to figure out two different regular expressions which have started to cause a minor headache.
1. Finding a regular expression using a regular expression
I want to find regular expressions within the script tags of HTML using JavaScript, such as:
match(/findthis/i);
, where the regex part of course is "/findthis/i".
The rules are as follows:
Finding multiple occurrences (/g) is not important.
It must be on the same line (not /m).
Case-insensitive (/i).
If a backslash (the escape character) is followed directly by a forward slash, "/", the forward slash is part of the expression, not the terminating character. E.g.: /itdoesntstop\/untilnow:/
Two forward slashes right next to each other (//) is: (A) At the beginning: Not a regex; it's a comment. (B) Later on: First slash is the end of the regex and the second slash is nothing but a character.
Regex continues until the line breaks or end of input (\n|$), or the escape character (second forward slash which complies with rule 4) is encountered. However, also as long as only alphabetic characters are encountered, following the second forward slash, they are considered part of the regex. E.g.: /aregex/allthisispartoftheregex
So far what I've got is this:
'\\/(?:[^\\/\\\\]|\\/\\*)*\\/([a-zA-Z]*)?'
However, it isn't consistent. Any suggestions?
2. Find digits (alphanumeric, floating) using a regular expression
Finding digits on their own is simple. However, finding floating numbers (with multiple periods) and letters including underscore is more of a challenge.
All of the below are considered numbers (a new number starts after each space):
3 3.1 3.1.4 3a 3.A 3.a1 3_.1
The rules:
Finding multiple occurrences (/g) is not important.
It must be on the same line (not /m).
Case-insensitive (/i).
A number must begin with a digit. However, the number can be preceded or followed by a non-word (\W) character. E.g.: "=9.9;" where "9.9" is the actual number. "a9" is not a number. A period before the number, ".9", is not considered part of the number and thus the actual number is "9".
Allowed characters: [a-zA-Z0-9_.]
What I've got:
'(^|\\W)\\d([a-zA-Z0-9_.]*?)(?=([^a-zA-Z0-9_.]|$))'
It doesn't work quite the way I want it.
For the first part, I think you are quite close. Here is what I would use (as a regex literal, to avoid all the double escapes):
/\/(?:[^\/\\\n\r]|\\.)+\/([a-z]*)/i
I don't know what you intended with your second alternative after the character class. But here the second alternative is used to consume backslashes and anything that follows them. The last part is important, so that you can recognize the regex ending in something like this: /backslash\\/. And the ? at the end of your regex was redundant. Otherwise this should be fine.
Test it here.
Your second regex is just fine for your specification. There are a few redundant elements though. The main thing you might want to do is capture everything but the possible first character:
/(?:^|\W)(\d[\w.]*)/i
Now the actual number (without the first character) will be in capturing group 1. Note that I removed the ungreediness and the lookahead, because greediness alone does exactly the same.
Test it here.
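A quick sketch of both suggested patterns run against the examples from the question. The variable names and the findNumbers helper are hypothetical, and the g flag on the number pattern is an addition made here so that all numbers on a line can be collected in one pass.

```javascript
// 1. Regex-literal finder: a slash, then escaped pairs or ordinary
//    characters, a closing slash, and optional alphabetic flags.
const regexFinder = /\/(?:[^\/\\\n\r]|\\.)+\/([a-z]*)/i;
const line = String.raw`var m = s.match(/itdoesntstop\/untilnow:/i);`;
const regexMatch = line.match(regexFinder);
console.log(regexMatch[0], regexMatch[1]);

// 2. Number finder: the number itself lands in capturing group 1.
const numberFinder = /(?:^|\W)(\d[\w.]*)/gi;
function findNumbers(text) {
  return Array.from(text.matchAll(numberFinder), (m) => m[1]);
}

console.log(findNumbers("3 3.1 3.1.4 3a 3.A 3.a1 3_.1"));
console.log(findNumbers("=9.9;")); // the surrounding = and ; are not captured
console.log(findNumbers("a9"));    // no match: the digit follows a word character
```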
I have an array of keywords, and I want to know whether at least one of the keywords is found within some string that has been submitted. I further want to be absolutely sure that it is the keyword that has been matched, and not something that is very similar to the word.
Say, for example, that our keywords are [English, Eng, En] because we are looking for some variation of English.
Now, say that the input from a user is i h8 eng class, or something equally provocative and illiterate - then the eng should be matched. It should also fail to match a word like england or some odd thing chen, even though it's got the en bit.
So, in my infinite lack of wisdom I believed I could do something along the lines of this in order to match one of my array items with the input:
.match(RegExp('\b('+array.join('|')+')\b','i'))
With the thinking that the regular expression would look for matches from the array, now presented like (English|Eng|En) and then look to see whether there were zero-width word bounds on either side.
You need to double the backslashes.
When you create a regex with the RegExp() constructor, you're passing in a string. JavaScript string constant syntax also treats the backslash as a meta-character, for quoting quotes etc. Thus, the backslashes will be effectively stripped out before the RegExp() code even runs!
By doubling them, the step of parsing the string will leave one backslash behind. Then the RegExp() parser will see the single backslash before the "b" and do the right thing.
You need to double the backslashes in a JavaScript string or you'll encode a Backspace character:
.match(RegExp('\\b('+array.join('|')+')\\b','i'))
You need to double-escape \b, because it has a special meaning in strings:
.match(RegExp('\\b('+array.join('|')+')\\b','i'))
\b is an escape sequence inside string literals (see table 2.1 on this page). You should escape it by adding one extra backslash:
.match(RegExp('\\b('+array.join('|')+')\\b','i'))
You do not need to escape \b when used inside a regular expression literal:
/\b(english|eng|en)\b/i
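The answers above can be checked with a small sketch. The names keywords, pattern and broken are hypothetical; the inputs are the ones from the question. Note that the keywords are joined into the pattern unescaped, which is fine for plain words but is an assumption that would not hold for keywords containing regex metacharacters.

```javascript
const keywords = ["English", "Eng", "En"];

// Doubled backslashes: the regex engine receives \b, a word boundary.
const pattern = new RegExp("\\b(" + keywords.join("|") + ")\\b", "i");

// Single backslashes: "\b" in a string is a backspace character (code 8),
// so this pattern looks for literal backspaces around the keyword.
const broken = new RegExp("\b(" + keywords.join("|") + ")\b", "i");

console.log(pattern.test("i h8 eng class")); // true
console.log(pattern.test("england"));        // false
console.log(pattern.test("chen"));           // false
console.log(broken.test("i h8 eng class"));  // false
```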