Parse all strings between strings of a file in Swift

I want to build a localization application for my JavaScript (pebble.js) application. It should find all strings between l._(" and ") or ') or ", or ',.
So for example I have the line
console.log(l._("This is a Test") + l._('%# times %# equals %#', 2, 4, (2*4)));
With the Swift application I should get an Array like this:
["This is a Test", "%# times %# equals %#"]
Right now I have no clue how I should manage it. Should I use a Regex, NSScanner or should I split the strings?
Thanks!

I'm not familiar with NSScanner, but with a regexp the solution is quite easy, assuming that a string has the same delimiter at both ends (i.e., you don't have stuff like l._("Hello world')). I think that wouldn't be valid syntax in JavaScript, so let's assume that is the case.
Also, let's assume that the strings don't contain any escaped quotes (of the same kind that is used as the delimiter), i.e. there are no strings like l._("Hello \" world").
Now you could use the following two regexps to find strings delimited by double quotes and those delimited by single quotes (the dot in l._ is escaped so that it only matches a literal dot):
l\._\("((?:[^"])*)" -- for double quotes
l\._\('((?:[^'])*)' -- for single quotes
Then you have to run these two regexps on your input and take the first capturing group of each match. (I'm not sure exactly how Swift uses regexps, but note that in many implementations capturing group #0 is the whole match, and capturing group #1 is what is located between the first pair of parentheses. You need the latter.)
Also, note that you don't have to care what comes after the closing quote: whether it's a parenthesis, or a comma, it's the same for you, as you only have to look up to the quote.
EDIT: Corrected the regexp (repetition should be inside capturing group).
EDIT2: If we allow escaped quotes, then you can use this regexp for the single-quoted case:
l\._\('((?:\\'|[^'])*)'. And a similar one for the double-quoted case.
You can play around with the regex on this link:
https://regex101.com/r/tY1jB5/1
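
As a rough illustration of the approach above, here is a minimal sketch in JavaScript rather than Swift (the pattern should carry over to a Swift regex API, but that is not shown here). The function name extractLocalizedStrings is made up for this example. The pattern merges the single- and double-quote cases into one alternation and, as a variant of EDIT2, accepts any backslash escape inside the string.

```javascript
// Hypothetical helper: collect the first string argument of every l._( ... ) call.
function extractLocalizedStrings(source) {
  // Group 1 holds a double-quoted string, group 2 a single-quoted one.
  // (?:\\.|[^"\\])* accepts escaped characters such as \" inside the string.
  const pattern = /l\._\(\s*(?:"((?:\\.|[^"\\])*)"|'((?:\\.|[^'\\])*)')/g;
  const found = [];
  for (const match of source.matchAll(pattern)) {
    // Only one of the two groups is defined for any given match.
    found.push(match[1] !== undefined ? match[1] : match[2]);
  }
  return found;
}

const line = `console.log(l._("This is a Test") + l._('%# times %# equals %#', 2, 4, (2*4)));`;
console.log(extractLocalizedStrings(line));
```

For the line from the question this should log the two strings "This is a Test" and "%# times %# equals %#". Escape sequences are returned as written in the source, not unescaped.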

Related

RegExp lookbehind assertion alternative

Given the string below, how would you split this into an array containing only the double quoted strings (ignoring nested quoted strings) without using a lookbehind assertion?
source string: 1|2|3|"A"|"B|C"|"\"D\"|\"E\""
target array:
[
'"A"',
'"B|C"',
'"\"D\"|\"E\""'
]
Basically, I'm trying to find an alternative to /(?<!\\)".*?(?<!\\)"/g since Firefox currently doesn't support lookbehinds. The solution doesn't have to use regular expressions, but it should be reasonably efficient.

Just find all the quoted text /"[^"\\]*(?:\\[\S\s][^"\\]*)*"/g
Don't need split for this.
https://regex101.com/r/r5SJsR/1
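
A minimal sketch of running this pattern in JavaScript, assuming the source string from the question is available as a raw string (String.raw keeps the backslashes as typed). The variable names are illustrative only.

```javascript
// The source string from the question, backslashes preserved.
const source = String.raw`1|2|3|"A"|"B|C"|"\"D\"|\"E\""`;

// Opening quote, a run of ordinary characters, then any number of
// (escaped character + run of ordinary characters), then the closing quote.
const quotedFields = source.match(/"[^"\\]*(?:\\[\S\s][^"\\]*)*"/g);

console.log(quotedFields);
```

This should log three entries: "A", "B|C" and the field containing the escaped quotes, each including its surrounding double quotes.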
Formatted
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"

How about the simple regex /"[^\\"]+"|"\S*"/g?
The first two sets ("A" and "B|C") are covered by "[^\\"]+": one or more characters that are neither a backslash nor a quotation mark, wrapped in a pair of quotation marks.
A pipe (|) separates the two alternatives.
The third set ("\"D\"|\"E\"") is covered by "\S*": any run of non-whitespace characters wrapped in a pair of quotation marks.
For the sample string this returns the same results as your initial regex and uses no lookbehinds; it can be checked on Regex101.
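
A short sketch of how this simpler pattern behaves, assuming JavaScript and illustrative variable names. The second input is a made-up case, not from the question: because "\S*" is greedy, a field with an escaped quote that is followed by another field without whitespace in between appears to be swallowed together with its neighbour, whereas the unrolled pattern from the other answer keeps them apart.

```javascript
const simple = /"[^\\"]+"|"\S*"/g;
const unrolled = /"[^"\\]*(?:\\[\S\s][^"\\]*)*"/g;

// Sample string from the question: both patterns find three fields.
const sample = String.raw`1|2|3|"A"|"B|C"|"\"D\"|\"E\""`;
console.log(sample.match(simple).length);   // 3
console.log(sample.match(unrolled).length); // 3

// Hypothetical input: an escaped quote in the first of two adjacent fields.
const tricky = String.raw`"a\"b"|"c"`;
console.log(tricky.match(simple).length);   // 1 (one match spanning both fields)
console.log(tricky.match(unrolled).length); // 2
```

So the simple pattern is fine for data shaped like the sample, but it relies on whitespace or plain fields to stop the greedy "\S*" branch.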

Match simple regex pattern using JS (key: value)

I have a simple scenario where I want to match the following and capture the value:
stuff_in_string,
env: 'local', // want to match this and capture the content in quotes
more_stuff_in_string
I have never written a regex pattern before so excuse my attempt, I am well aware it is totally wrong.
This is what I am trying to say:
Match "env:"
Followed by zero or more spaces
Followed by a single or double quote
Capture all until..
The next single or double quote
/env:*?\s+('|")+(.*?)+('|")/g
Thanks
PS here is a #failed fiddle: http://jsfiddle.net/DfHge/
Note: this is the regex I ended up using (not the answer below as it was overkill for my needs): /env:\s+(?:"|')(\w+)(?:"|')/

You can use this:
/\benv: (["'])([^"']*)\1/g
where \1 is a backreference to the first capturing group, thus your content is in the second. This is the simple way for simple cases.
Now, other cases like:
env: "abc\"def"
env: "abc\\"
env: "abc\\\def"
env: "abc'def"
You must use a more constraining pattern:
first: avoid the different quotes problem:
/\benv: (["'])((?:[^"']+|(?!\1)["'])*)\1/g
I put all the possible content in a non-capturing group that I can repeat at will, and I use a negative lookahead (?!\1) to check that a quote inside the content is not the same as the captured (opening) quote.
second: the backslash problem:
If a quote is escaped, it can't be the closing quote! Thus you must check if the quote is escaped or not and allow escaped quotes in the string.
I remove the backslashes from allowed content:
/\benv: (["'])((?:[^"'\\]+|(?!\1)["'])*)\1/g
I allow escaped characters:
/\benv: (["'])((?:[^"'\\]+|(?!\1)["']|\\[\s\S])*)\1/g
To allow a variable number of spaces before the quoted part, you can replace : by :\s*
/\benv:\s*(["'])((?:[^"'\\]+|(?!\1)["']|\\[\s\S])*)\1/g
You now have a working pattern.
third: pattern optimization
a simple alternation:
Using a capture group and a backreference can be tempting for dealing with the different types of quotes, since it allows the pattern to be written concisely. However, this approach has to create a capture group and test a lookahead in this part: (?!\1)["'], so it is not very efficient. Writing a simple alternation increases the pattern length and needs two capture groups for the two cases, but is more efficient:
/\benv:\s*(?:"((?:[^"\\]+|\\[\s\S])*)"|'((?:[^'\\]+|\\[\s\S])*)')/g
(note: if you decide to do that, you must check which one of the two capture groups is defined.)
unrolling the loop:
To match the content inside quotes we use (?:[^"\\]+|\\[\s\S])* (for double quotes here). That works, but it can be improved to reduce the number of steps needed. To do that we unroll the loop, which here means removing the alternation:
[^"\\]*(?:\\[\s\S][^"\\]*)*
finally the whole pattern can be written like this:
/\benv:\s*(?:"([^"\\]*(?:\\[\s\S][^"\\]*)*)"|'([^'\\]*(?:\\[\s\S][^'\\]*)*)')/g
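
A minimal sketch of using the final pattern, including the check for which capture group is defined. The helper name findEnvValues and the sample text are made up for this example; the pattern itself is copied unchanged from the line above.

```javascript
// Final pattern from the answer: group 1 for double quotes, group 2 for single quotes.
const envPattern = /\benv:\s*(?:"([^"\\]*(?:\\[\s\S][^"\\]*)*)"|'([^'\\]*(?:\\[\s\S][^'\\]*)*)')/g;

// Hypothetical helper: return the quoted value of every env: entry in the text.
function findEnvValues(text) {
  return Array.from(text.matchAll(envPattern), (m) =>
    // Exactly one of the two groups is defined for each match.
    m[1] !== undefined ? m[1] : m[2]
  );
}

// String.raw keeps the backslash in the second entry as typed.
const sample = String.raw`stuff_in_string,
env: 'local', // want to match this
env: "abc\"def",
env: "abc'def",
more_stuff_in_string`;

console.log(findEnvValues(sample));
```

This should log local, abc\"def and abc'def. The captured values still contain their backslashes; unescaping them would be a separate step.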

env: *('|").*?\1 is what you're looking for
the * means zero or more (here, of the preceding space)
('|") matches either a single or double quote, and also saves it into a group for backreferencing
.*? is a reluctant (lazy) match of any characters
\1 references the first group, which was either a single or double quote

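regex=/env: ?['"]([^'"]+)['"]/
answer=str.match(regex)[1]
even better, require the closing quote to be the same as the opening one (the value is then in group 2):
regex=/env: ?(['"])((?:(?!\1).)*)\1/
answer=str.match(regex)[2]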

How to collate sequence \" using ECMAScript regular expressions?

I'm trying to construct a regular expression to treat delimited speech marks (\") as a single character.
The following code compiles fine, but terminates on trying to initialise rgx, throwing the error Abort trap: 6 using libc++.
std::regex rgx("[[.\\\\\".]]");
std::smatch results;
std::string test_str("\\\"");
std::regex_search(test_str, results, rgx);
If I remove the [[. .]], it runs fine, results[0] returning \" as intended, but as said, I'd like for this sequence to be usable as a character class.
Edit: Ok, I realise now that my previous understanding of collated sequences was incorrect, and the reason it wouldn't work is that \\\\\" is not defined as a sequence. So my new question: is it possible to define collated sequences?

So I figured out where I was going wrong and thought I'd leave this here in case anyone stumbles across it.
You can specify a passive group of characters with (?:sequence), allowing quantifiers to be applied as with a character class. Perhaps not exactly what I'd originally asked, but fulfils the same purpose, in my case at least.
To match a string beginning and ending with double quotation marks (including these characters in the results), but allowing delimited (escaped) quotation marks within the string, I used the expression
\"(?:[^\"\\\\]+|(?:\\\\\\\\)+|\\\\\")*\"
which says to grab as many characters as possible, provided they are not quotation marks or backslashes; if this does not match, it first attempts to match an even number of backslashes (to allow escaping of this character), and otherwise an escaped quotation mark. This non-capturing group is matched as many times as possible, stopping only when it reaches an unescaped ".
I couldn't comment on the efficiency of this, but it definitely works.
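
For comparison, here is a sketch of a close equivalent written as a JavaScript regex literal, where only one level of backslash escaping is needed instead of the two required inside a C++ string literal. The test inputs are made up for illustration.

```javascript
// Opening quote, then any mix of: ordinary characters, pairs of backslashes,
// or an escaped quote; then the closing quote.
const quotedString = /"(?:[^"\\]+|(?:\\\\)+|\\")*"/;

// String.raw keeps backslashes as typed.
const withEscapedQuotes = String.raw`say "a \"quoted\" word" now`;
console.log(withEscapedQuotes.match(quotedString)[0]);

// An escaped backslash just before the closing quote does not hide the quote.
const withEscapedBackslash = String.raw`"ends with a backslash \\" "next"`;
console.log(withEscapedBackslash.match(quotedString)[0]);
```

The first call should log the whole quoted phrase including the escaped inner quotes; the second should stop at the quote that follows the doubled backslash.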

Alternation operator inside square brackets does not work

I'm creating a JavaScript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.

Replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes a character set; () denotes a logical grouping.

Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except newlines... to match a literal dot, you have to escape it with a backslash (like \.)
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?
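
A small sketch of the difference between the character class and the group, and of the escaped dot. The URLs are made-up examples, and the leading .* is dropped because test() already searches anywhere in the string.

```javascript
// Character class: matches ONE character out of w, d, |, o, r, q.
const charClass = /[wd|word|qw]=/;
// Group: matches one of the whole words wd, word or qw.
const group = /(?:wd|word|qw)=/;

console.log(charClass.test('o='));  // true: "o" is in the set
console.log(group.test('o='));      // false
console.log(group.test('word='));   // true

// Corrected expression with a literal dot in baidu\.com.
const baiduQuery = /baidu\.com.*[\/?].*(?:wd|word|qw)=/;
console.log(baiduQuery.test('http://www.baidu.com/s?wd=test'));    // true
console.log(baiduQuery.test('http://www.baidu.com/s?word=test'));  // true
console.log(baiduQuery.test('http://www.baiduXcom/s?wd=test'));    // false
```

Note that this still accepts parameter names that merely end in wd, word or qw; tightening that would need a more specific pattern or a URL parser.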

Trouble with word-boundary (\b)

I have an array of keywords, and I want to know whether at least one of the keywords is found within some string that has been submitted. I further want to be absolutely sure that it is the keyword that has been matched, and not something that is very similar to the word.
Say, for example, that our keywords are [English, Eng, En] because we are looking for some variation of English.
Now, say that the input from a user is i h8 eng class, or something equally provocative and illiterate - then the eng should be matched. It should also fail to match a word like england or some odd thing chen, even though it's got the en bit.
So, in my infinite lack of wisdom I believed I could do something along the lines of this in order to match one of my array items with the input:
.match(RegExp('\b('+array.join('|')+')\b','i'))
With the thinking that the regular expression would look for matches from the array, now presented like (English|Eng|En) and then look to see whether there were zero-width word bounds on either side.

You need to double the backslashes.
When you create a regex with the RegExp() constructor, you're passing in a string. JavaScript string constant syntax also treats the backslash as a meta-character, for quoting quotes etc. Thus, the backslashes will be effectively stripped out before the RegExp() code even runs!
By doubling them, the step of parsing the string will leave one backslash behind. Then the RegExp() parser will see the single backslash before the "b" and do the right thing.

You need to double the backslashes in a JavaScript string or you'll encode a Backspace character:
.match(RegExp('\\b('+array.join('|')+')\\b','i'))

You need to double-escape \b, because it has a special meaning in string literals:
.match(RegExp('\\b('+array.join('|')+')\\b','i'))

\b is an escape sequence (backspace) inside string literals. You should escape it by adding one extra backslash:
.match(RegExp('\\b('+array.join('|')+')\\b','i'))
You do not need to escape \b when used inside a regular expression literal:
/\b(english|eng|en)\b/i
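
The difference can be checked directly. This is a minimal sketch using the keywords and inputs from the question; the variable names broken and fixed are only illustrative.

```javascript
const keywords = ['English', 'Eng', 'En'];

// '\b' in a string literal is a single backspace character (code 8),
// so this regex looks for literal backspaces around the keyword.
const broken = new RegExp('\b(' + keywords.join('|') + ')\b', 'i');

// '\\b' leaves one backslash in the string, which RegExp reads as a word boundary.
const fixed = new RegExp('\\b(' + keywords.join('|') + ')\\b', 'i');

console.log('\b'.charCodeAt(0));               // 8
console.log(broken.test('i h8 eng class'));    // false
console.log(fixed.exec('i h8 eng class')[1]);  // eng
console.log(fixed.test('england'));            // false
console.log(fixed.test('chen'));               // false
```

If the keywords could contain regex metacharacters, they would need escaping before being joined; that is outside the scope of this sketch.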
