How is RegEx handled differently in VBA and JavaScript?

I'm using a regular expression in Excel VBA to parse the results of a swim meet. The code reads a row of text that was copied from a PDF and outputs the important data into individual cells. Since the format of the string varies throughout the source PDF, the regular expression is quite complicated. Still, I'm able to parse 95% of the data at this point.
Some of the rows that are not being parsed are confusing me, though. VBA is clearly not able to find a match with the regular expression, but when I copy the exact same regex and string into this website, JavaScript is able to find a match without a problem. Is there something different in the way VBA and JavaScript handle regular expressions that might account for this?
Here's the string that VBA refuses to match:
12. NUNEZ CHENG, Walter 74 Club Tennis Las Terr 3:44.57 123
Here's the function I'm using in Excel (mostly successfully):
Function singleLineResults(SourceString As String) As Variant
Dim cSubmatches As Variant
Dim collectionArray(11) As String
Dim cnt As Integer
Dim oMatches As MatchCollection
With New RegExp
.MultiLine = MultiLine
.IgnoreCase = IgnoreCase
.Global = False
'1. JAROSOVA, Lenka 27 Swimmpower Prague 2:26.65 605 34.45 37.70 37.79 36.71
.Pattern = "(\d*)\.?\s?([^,]+),\s([^\d]+)\s?(\d+)\s((?:[A-Z]{3})?)\s?((?:(?!\d\:\d).)*)\s?((?:\d+:)?\d+\.\d+)(?:\s(\d+))?(?:\s((?:\d+:)?\d+.\d+))?(?:\s((?:\d+:)?\d+.\d+))?(?:\s((?:\d+:)?\d+.\d+))?(?:\s((?:\d+:)?\d+.\d+))?(?:Splash Meet Manager 11, Build \d{5} Registered to [\w\s]+ 2014-08-\d+ \d+:\d+ - Page \d+)?$"
Set oMatches = .Execute(SourceString)
If oMatches.Count > 0 Then
For Each submatch In oMatches(0).SubMatches
collectionArray(cnt) = submatch '.Value
cnt = cnt + 1
Next
Else
singleLineResults = Null
End If
End With
singleLineResults = collectionArray()
End Function

Could you add more examples of what actually matches? For instance, the surrounding lines that do match and, better yet, examples that are not supposed to match, if there are any.
I've tried "cleaning" up a bit in the regex, removing groups that are not used to match that particular line, to make the error more obvious, and changed how one of the groups works, which might actually fix the issue:
(\d*)
\.?\s?
([^,]+)
,\s
([^\d]+)
\s?
(\d+)
\s
(
(?:[A-Z]{3})?
)
\s?
(
# OLD SOLUTION
# (?:
# (?!\d\:\d)
# .
# )*
# NEW SOLUTION
.*?
)
\s?
(
(?:\d+:)?
\d+\.\d+
)
(?:
\s
(\d+)
)?
$
See example on regex101.
The group that puzzles me the most, however, is this one:
(?:[A-Z]{3})?
Why the 3 character limit, when it only matches the first 3 letters in the street name?
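As a rough check of the simplified pattern above, here is a minimal JavaScript sketch that applies it to the line from the question. The function name `parseResultLine` and the field names are guesses made for illustration (the question does not name the columns), and the sketch only covers the JavaScript side; it does not show what the VBA engine does with the same pattern.

```javascript
// Simplified pattern from the answer above (split-time groups removed,
// lazy .*? used for the club name).
const resultLine =
  /(\d*)\.?\s?([^,]+),\s([^\d]+)\s?(\d+)\s((?:[A-Z]{3})?)\s?(.*?)\s?((?:\d+:)?\d+\.\d+)(?:\s(\d+))?$/;

// Hypothetical helper: field names are assumptions, not taken from the question.
function parseResultLine(line) {
  const m = resultLine.exec(line);
  if (m === null) return null;
  const [, place, surname, firstName, year, nation, club, time, points] = m;
  // [^\d]+ is greedy and swallows the space before the number, so trim it.
  return { place, surname, firstName: firstName.trim(), year, nation, club, time, points };
}

const parsed = parseResultLine(
  "12. NUNEZ CHENG, Walter 74 Club Tennis Las Terr 3:44.57 123"
);
console.log(parsed);
```

Because `[A-Z]{3}` is case-sensitive here, the optional three-letter group stays empty for this line and the whole club name lands in the lazy group.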

Related

matching regex for custom params in URI

I am building a web framework and REGEX is really hostile today.
I do not like the django way of formatting custom params with angle brackets
url/<param>/... or
<str:token>/
I would prefer the way js and other programs handle this
name/:token/:another_param
After trying for 45 minutes I am giving up. I would like to allow only the characters a-zA-Z0-9_:/. The main issue is that I do not want to allow consecutive colons like this
:::id
These are strings I would like to match
empty string (although I could check prior matching)
/
:id
/:id
/:id/
person/:name/:id/:token_person2/image
person////
////
Could someone help me?
You don't want to match :: and apparently also not :/
What you can do is use a single negative lookahead to assert that those 2 strings do not occur.
^(?![/\w:]*:[:/])[/\w:]*$
Explanation
^ Start of string
(?![/\w:]*:[:/]) Negative lookahead, assert not :: or :/ to the right
[/\w:]* Optionally repeat matching one of /, \w (word character) or :
$ End of string
const regex = /^(?![/\w:]*:[:/])[/\w:]*$/;
[
"",
"/",
":id",
"/:id",
"/:id/",
"person/:name/:id/:token_person2/image",
"person////",
"////",
":::id",
"person/:/name",
].forEach(s =>
console.log((regex.test(s) ? "" : "No ") + `Match --> '${s}'`)
)
See a regex demo

split line via regex in javascript?

I have this structure of text :
1.6.1 Members................................................................ 12
1.6.2 Accessibility.......................................................... 13
1.6.3 Type parameters........................................................ 13
1.6.4 The T generic type aka <T>............................................. 13
I need to create JS objects :
{
num:"1.6.1",
txt:"Members"
},
{
num:"1.6.2",
txt:"Accessibility"
} ...
That's not a problem.
The problem is that I want to extract the values with a regex split that uses a positive lookahead:
Split at the first position where the next character is a letter
What I have tried:
'1.6.1 Members........... 12'.split(/\s(?=(?:[\w\. ])+$)/i)
This is working fine :
["1.6.1", "Members...........", "12"] // I don't care about the 12.
But If I have 2 words or more :
'1.6.3 Type parameters................ 13'.split(/\s(?=(?:[\w\. ])+$)/i)
The result is :
["1.6.3", "Type", "parameters................", "13"] //again I don't care about 13.
Of course I can join them , but I want the words to be together.
Question :
How can I enhance my regex NOT to split words ?
Desired result :
["1.6.3", "Type parameters"]
or
["1.6.3", "Type parameters........"] // I will remove extras later
or
["1.6.3", "Type parameters........13"]// I will remove extras later
NB
I know I can do split via " " or by other simpler solution but I'm seeking ( for pure knowledge) for an enhancement for my solution which uses positive lookahead split.
nb2 :
The text can contain capital letter in the middle also.
You can use this regex:
/^(\d+(?:\.\d+)*) (\w+(?: \w+)*)/gm
And get your desired matches using matched group #1 and matched group #2.
Online Regex Demo
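A minimal sketch of how groups #1 and #2 from that regex could be turned into the objects the question asks for. The variable names (`tocLine`, `toc`, `entries`) and the shortened sample lines are assumptions made for this example.

```javascript
// Regex from the answer: group 1 is the section number, group 2 the title words.
const tocLine = /^(\d+(?:\.\d+)*) (\w+(?: \w+)*)/gm;

// Shortened sample input (fewer dots than in the question).
const toc = [
  "1.6.1 Members........ 12",
  "1.6.2 Accessibility........ 13",
  "1.6.3 Type parameters........ 13",
].join("\n");

// matchAll needs the g flag; each match is [whole, num, txt].
const entries = Array.from(toc.matchAll(tocLine), ([, num, txt]) => ({ num, txt }));

console.log(entries);
```

Since `\w` does not match `<` or `.`, a title such as `The T generic type aka <T>` would be cut off before `<T>` with this particular pattern.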
Update: For String#split you can use this regex:
/ +(?=[A-Z\d])/g
Regex Demo
Update 2: With the possibility of having capital letters also in chapter names following more complex regex is needed:
var re = /(\D +(?=[a-z]))| +(?=[a-z\d])/gmi;
var str = '1.6.3 Type Foo Bar........................................................ 13';
var m = str.split( re );
console.log(m[0], ',', m.slice(1, -1).join(''), ',', m.pop() );
//=> 1.6.3 , Type Foo Bar........................................................ , 13
EDIT: Since you added 1.6.1 The .net 4.5 framework.... to the requirements, we can tweak the answer to this:
^([\d.]+) ((?:[^.]|\.(?!\.))+)
And if you want to allow sequences of up to three dots in the title, as in 1.6.1 She said... Boo!..........., it's an easy tweak from there ({3} quantifier):
^([\d.]+) ((?:[^.]|\.(?!\.{3}))+)
Original:
^([\d.]+) ([^.]+)
In the regex demo, see the Groups in the right pane.
To retrieve Groups 1 and 2, something like:
var myregex = /^([\d.]+) ((?:[^.]|\.(?!\.))+)/mg;
var theMatchObject = myregex.exec(yourString);
while (theMatchObject != null) {
// the numbers: theMatchObject[1]
// the title: theMatchObject[2]
theMatchObject = myregex.exec(yourString);
}
OUTPUT
Group 1 Group 2
1.6.1 Members
1.6.2 Accessibility
1.6.3 Type parameters
1.6.4 The T generic type aka <T>
1.6.1 The .net 4.5 framework
Explanation
^ asserts that we are at the beginning of the line
The parentheses in ([\d.]+) capture digits and dots to Group 1
The parentheses in ((?:[^.]|\.(?!\.))+) capture to Group 2...
[^.] one char that is not a dot, | OR...
\.(?!\.) a dot that is not followed by a dot...
+ one or more times
You can use this pattern too:
var myStr = "1.6.1 Members................................................................ 12\n1.6.2 Accessibility.......................................................... 13\n1.6.3 Type parameters........................................................ 13\n1.6.4 The T generic type aka <T>............................................. 13";
console.log(myStr.split(/ (.+?)\.{2,} ?\d+$\n?/m));
About a way with a lookahead :
I don't think it is possible, because the only way to skip a character (here, a space between two words) is to consume it as part of the previous match, the one that starts at the space between the number and the first word. In other words, you rely on the fact that a character cannot be matched more than once.
But if everything except the splitting space is enclosed in a lookahead, the substring checked by the lookahead is not part of the match result (it is only a check, and the regex engine does not consume those characters). The later spaces therefore cannot be skipped, and the regex engine will carry on to the next space character and split there as well.
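For comparison, a split on a single lookaround does become possible if a lookbehind is used instead of a lookahead. This is a different technique from the one discussed above, and it assumes an engine with ES2018 lookbehind support (for example a current Node.js or Chromium); the variable names are made up for the example.

```javascript
const line = "1.6.3 Type parameters................ 13";

// Split only on the space that is preceded by the leading section number:
// the lookbehind requires digits/dots reaching back to the start of the string.
const parts = line.split(/(?<=^[\d.]+) /);

console.log(parts);
```

Only the first space satisfies the lookbehind, so the later spaces are left alone without having to be consumed by an earlier match.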

Javascript regular exp: referring to a nested group

So I've got this RegExp to validate some input like this:
1 12919840 T C
1 35332717 C A
1 55148456 G T
1 70504789 C T
1 167059520 A T
1 182496864 A T
1 197073351 C T
1 216373211 G T
The exp i came up with is:
/^([0-9]\s+[0-9]+\s+[ATCG]\s+[ATCG][\s|\n]+)*[0-9]\s+[0-9]+\s+[ATCG]\s+[ATCG][\s|\n]*$/g
This worked in something like
/^([0-9]\s+[0-9]+\s+[ATCG]\s+[ATCG][\s|\n]+)*[0-9]\s+[0-9]+\s+[ATCG]\s+[ATCG][\s|\n]*$/g.test("1 12919840 T C\n1 35332717 C A"); //this returns true
But when trying use group reference to make it shorter it doesn't work anymore
/^(([0-9]\s+[0-9]+\s+[ATCG]\s+[ATCG])[\s|\n]+)*\2[\s|\n]*$/g.test("1 12919840 T C\n1 35332717 C A"); //this returns false
I'm using \2 here since, from my research, group numbering starts from the leftmost parenthesis. What did I miss? Thanks!
My answer addresses your "what did I miss?" question. For a workaround, see the answer by @Jack.
A Capture Group is Not a Subroutine
What you're missing is that a capture group is not a subroutine.
When you say \1, you are referring to the exact characters that were captured by the parentheses of Group 1. For instance, (\d)\1 would match 11 or 22, but not 12.
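A short JavaScript sketch of that difference; the two regex names are invented for the example.

```javascript
// \1 must repeat the exact text captured by group 1.
const sameDigitTwice = /^(\d)\1$/;
// Writing the subpattern again accepts any second digit.
const anyTwoDigits = /^(\d)\d$/;

console.log(sameDigitTwice.test("11")); // true
console.log(sameDigitTwice.test("12")); // false
console.log(anyTwoDigits.test("12"));   // true
```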
Regex Subroutines
In Perl and PCRE, you can refer to a subexpression by using (?1). For instance, (\d)(?1) would match 11 as well as 12.
This is also available in the regex module for Python. Sadly, this is not available in JavaScript, which you seem to be using.
Since you're working with DNA, if you have a chance, I'd suggest working in a life-embracing language such as Python (JS has poor regex abilities, though the XRegExp library fills some holes).
The problem with your expression is that back references match whatever was matched by a previous memory capture; the expression that generated the memory capture itself can't be referenced in this manner.
That said, you can still shorten the expression by using a quantifier on your expression:
var re = /((?:^\d+\s+\d+\s+[TCGA]\s+[TCGA][\s\r\n]*){2})/gm,
m;
The expression matches the same thing twice, with an optional set of spaces in between. To iterate over all matches:
while ((m = re.exec(str)) !== null) {
console.log('match' + m[1]);
}
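Another way to avoid writing the record pattern twice, since JavaScript has no subroutine call, is to keep the subpattern in a string and build the regex from it. This is an alternative sketch, not the approach of either answer above; `record`, `validator` and `isValidInput` are names invented here.

```javascript
// One record, written once: digit, position, two bases.
const record = String.raw`[0-9]\s+[0-9]+\s+[ATCG]\s+[ATCG]`;

// Zero or more "record + whitespace", then a final record and optional whitespace.
const validator = new RegExp(`^(?:${record}\\s+)*${record}\\s*$`);

function isValidInput(text) {
  return validator.test(text);
}

console.log(isValidInput("1 12919840 T C\n1 35332717 C A")); // true
console.log(isValidInput("1 12919840 T C\n1 35332717 C X")); // false
```

The string is interpolated as pattern source, so each use matches independently, which is the behaviour a backreference cannot provide.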

Javascript Regular Expression: alternation and nesting

Here is what i've got so far:
/(netscape)|(navigator)\/(\d+)(\.(\d+))?/.test(UserAgentString.toLowerCase()) ? ' netscape'+RegExp.$3+RegExp.$4 : ''
I'm trying to do several different things here.
(1). I want to match either netscape or navigator, and it must be followed by a single slash and one or more digits.
(2). It can optionally follow those digits with up to one of: one period and one or more digits.
The expression should evaluate to an empty string if (1) is not true.
The expression should return ' netscape8' if UserAgentString is Netscape/8 or Navigator/8.
The expression should return ' netscape8.4' if UserAgentString is Navigator/8.4.2.
The regex is not working. In particular (this is an edited down version for my testing, and it still doesn't work):
// in Chrome this produces ["netscape", "netscape", undefined, undefined]
(/(netscape)|(navigator)\/(\d+)/.exec("Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.7.5) Gecko/20060912 Netscape/8.1.2".toLowerCase()))
Why does the 8 not get matched? Is it supposed to show up in the third entry or the fourth?
There are a couple things that I want to figure out if they are supported. Notice how I have 5 sets of capture paren groups. group #5 \d+ is contained within group #4: \.(\d+). Is it possible to retrieve the matched groups?
Also, what happens if I specify a group like this? /(\.\d+)*/ This matches any number of "dot-number" strings concatenated together (like in a version number). What's RegExp.$1 supposed to match here?
Your "or" expression is not doing what you think.
Simplified, you're doing this:
(a)|(b)cde
Which matches either a or bcde.
Put parentheses around your "or" expression: ((a)|(b))cde and that will match either acde or bcde.
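Applied to the user agent string from the question, the effect of the missing parentheses can be seen in a small sketch (variable names are made up for the example):

```javascript
const ua =
  "mozilla/5.0 (windows; u; windows nt 6.0; en-us; rv:1.7.5) gecko/20060912 netscape/8.1.2";

// Alternation has the lowest precedence: this is "netscape" OR "navigator/<digits>".
const ungrouped = /(netscape)|(navigator)\/(\d+)/.exec(ua);

// With the alternatives grouped, the slash and digits apply to both names.
const grouped = /(netscape|navigator)\/(\d+)/.exec(ua);

console.log(ungrouped[0], ungrouped[3]); // "netscape" undefined
console.log(grouped[1], grouped[2]);     // "netscape" "8"
```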
I find http://regexpal.com/ to be a very useful tool for quickly checking my regex syntax.
Regex (netscape|navigator)\/(\d+(?:\.\d+)?) will return 2 groups (if match found):
netscape or navigator
number behind the name
var m = /(netscape|navigator)\/(\d+(?:\.\d+)?)/.exec(text);
if (m != null) {
var r = m[1] + m[2];
}
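Wrapped into a function that produces the strings the question asks for, this could look roughly like the sketch below. The function name `netscapeToken` is hypothetical. The last lines also illustrate the side question about `/(\.\d+)*/`: in JavaScript a repeated capture group keeps only the text of its last iteration.

```javascript
// Returns " netscape<major>[.<minor>]" or "" when neither name is present.
function netscapeToken(userAgent) {
  const m = /(?:netscape|navigator)\/(\d+(?:\.\d+)?)/.exec(userAgent.toLowerCase());
  return m === null ? "" : " netscape" + m[1];
}

console.log(JSON.stringify(netscapeToken("Netscape/8")));      // " netscape8"
console.log(JSON.stringify(netscapeToken("Navigator/8.4.2"))); // " netscape8.4"
console.log(JSON.stringify(netscapeToken("Mozilla/5.0")));     // ""

// A quantified capture group reports only its final repetition.
const lastPart = /^\d+(\.\d+)*$/.exec("8.4.2")[1];
console.log(lastPart); // ".2"
```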
(....) Creates a group. Everything inside that group is returned with that group's variable.
The following will match netscape or navigator and the first two numbers of the version separated by a period.
$1 $2
|------------------| |------------|
/(netscape|navigator)[^\/]*\/((\d+)\.(\d+))/
The final code looks like this:
/(netscape|navigator)[^\/]*\/((\d+)\.(\d+))/.test(
navigator.userAgent.toLowerCase()
) ? 'netscape'+RegExp.$2 : ''
Which will give you
netscape5.0
Check out these great tuts (there are many more):
http://perldoc.perl.org/perlrequick.html
http://perldoc.perl.org/perlre.html
http://perldoc.perl.org/perlretut.html

Sort lines on webpage using javascript/ regex

I'd like to write a Greasemonkey script that requires finding lines ending with a string ("copies.") & sorting those lines based on the number preceding that string.
The page I'm looking to modify does not use tables unfortunately, just the br/ tag, so I assume that this will involve Regex:
http://www.publishersweekly.com/article/CA6591208.html
(Lines without the matching string will just be ignored.)
Would be grateful for any tips to get me started.
Most times, HTML and RegEx do not go together, and when parsing HTML your first thought should not be RegEx.
However, in this situation, the markup looks simple enough that it should be okay, at least until Publishers Weekly changes how that page is built.
Here's a function that will extract the data, grab the appropriate lines, sort them, and put them back again:
($j is jQuery)
function reorderPwList()
{
var Container = $j('#article span.table');
var TargetLines = /^.+?(\d+(?:,\d{3})*) copies\.<br ?\/?>$/gmi
var Lines = Container.html().match( TargetLines );
Lines.sort( sortPwCopies );
Container.html( Lines.join('\n') );
function sortPwCopies()
{
function getCopyNum()
{ return arguments[0].replace(TargetLines,'$1').replace(/\D/g,'') }
return getCopyNum(arguments[0]) - getCopyNum(arguments[1]);
}
}
And an explanation of the regex used there:
^ # start of line
.+? # lazy match one or more non-newline characters
( # start capture group $1
\d+ # match one or more digits (0-9)
(?: # non-capture group
,\d{3} # comma, then three digits
)* # end group, repeat zero or more times
) # end group $1
copies\. # literal text, with . escaped
<br ?\/?> # match a br tag, with optional space or slash just in case
$ # end of line
(For readability, I've indented the groups - only the spaces before 'copies' and after 'br' are valid ones.)
The regex flags gmi are used, for global, multi-line mode, case-insensitive matching.
<OLD ANSWER>
Once you've extracted just the text you want to look at (using DOM/jQuery), you can then pass it to the following function, which will put the relevant information into a format that can then be sorted:
function makeSortable(Text)
{
// Mark sortable lines and put number before main content.
Text = Text.replace
( /^(.*)([\d,]+) copies\.<br \/>/gm
, "SORT ME$2 $1"
);
// Remove anything not marked for sorting.
Text = Text.replace( /^(?!SORT ME).*$/gm , '' );
// Remove blank lines.
Text = Text.replace( /\n{2,}/g , '\n' );
// Remove sort token.
Text = Text.replace( /SORT ME/g , '' );
return Text;
}
You'll then need a sort function to ensure that the numbers are sorted correctly (the standard JS array.sort method will sort on text, and put 100,000 before 20,000).
Oh, and here's a quick explanation of the regexes used here:
/^(.*)([\d,]+) copies\.<br \/>/gm
/.../gm a regex with global-match and multi-line modes
^ matches start of line
(.*) capture to $1, any char (except newline), zero or more times
([\d,]+) capture to $2, any digit or comma, one or more times
copies literal text
\.<br \/> literal text, with . and / escaped (they would be special otherwise)
/^(?!SORT ME).*$/gm
/.../gm again, enable global and multi-line
^ match start of line
(?!SORT ME) a negative lookahead, fails the match if text 'SORT ME' is after it
.* any char (except newline), zero or more times
$ end of line
/\n{2,}/g
\n{2,} a newline character, two or more times
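The sort function mentioned above is not shown in this older answer. A minimal sketch of one, assuming each line still contains the "N copies." text, might look like this; `copiesOf`, `byCopies` and the sample titles are invented for the example.

```javascript
// Pull the number in front of " copies." and strip the thousands separators.
function copiesOf(line) {
  const m = /([\d,]+) copies\./.exec(line);
  return m ? Number(m[1].replace(/,/g, "")) : NaN;
}

// Numeric comparator, ascending by copy count.
function byCopies(a, b) {
  return copiesOf(a) - copiesOf(b);
}

const sample = [
  "Title A. 100,000 copies.",
  "Title B. 20,000 copies.",
  "Title C. 5,500 copies.",
];
const sorted = [...sample].sort(byCopies);

console.log(sorted);
```

Comparing numbers rather than strings is what keeps 20,000 ahead of 100,000.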
</OLD ANSWER>
you can start with something like this (just copypaste into the firebug console)
// where are the things
var elem = document.getElementById("article").
getElementsByTagName("span")[1].
getElementsByTagName("span")[0];
// extract lines into array
var lines = []
elem.innerHTML.replace(/.+?\d+\s+copies\.\s*<br>/g,
function($0) { lines.push($0) });
// sort an array
// lines.sort(function(a, b) {
// var ma = a.match(/(\d+),(\d+)\s+copies/);
// var mb = b.match(/(\d+),(\d+)\s+copies/);
//
// return parseInt(ma[1] + ma[2]) -
// parseInt(mb[1] + mb[2]);
lines.sort(function(a, b) {
function getNum(p) {
return parseInt(
p.match(/([\d,]+)\s+copies/)[1].replace(/,/g, ""));
}
return getNum(a) - getNum(b);
})
// put it back
elem.innerHTML = lines.join("");
It's not clear to me what it is you're trying to do. When posting questions here, I encourage you to post (a part of) your actual data and clearly indicate what exactly you're trying to match.
But, I am guessing you know very little regex, in which case, why use regex at all? If you study the topic a bit, you will soon know that regex is not some magical tool that produces whatever it is you're thinking of. Regex cannot sort in whatever way. It simply matches text, that's all.
Have a look at this excellent on-line resource: http://www.regular-expressions.info/
And if after reading you think a regex solution to your problem is appropriate, feel free to elaborate on your question and I'm sure I, or someone else is able to give you a hand.
Best of luck.
