matching regex for custom params in URI - javascript

I am building a web framework and REGEX is really hostile today.
I do not like the django way of formatting custom params with angle brackets
url/<param>/... or
<str:token>/
I would prefer the way js and other programs handle this
name/:token/:another_param
After trying for 45 minutes I am giving up. I would like to allow only the characters a-zA-Z0-9_:/. The main issue is that I do not want to allow repeated colons like this
:::id
These are strings I would like to match
empty string (although I could check for that prior to matching)
/
:id
/:id
/:id/
person/:name/:id/:token_person2/image
person////
////
Could someone help me?

You don't want to match :: and apparently also not :/
What you can do is use a single negative lookahead to assert that those 2 strings do not occur.
^(?![/\w:]*:[:/])[/\w:]*$
Explanation
^ Start of string
(?![/\w:]*:[:/]) Negative lookahead, assert not :: or :/ to the right
[/\w:]* Optionally repeat matching one of /, \w (word character) or :
$ End of string
const regex = /^(?![/\w:]*:[:/])[/\w:]*$/;
[
"",
"/",
":id",
"/:id",
"/:id/",
"person/:name/:id/:token_person2/image",
"person////",
"////",
":::id",
"person/:/name",
].forEach(s =>
console.log((regex.test(s) ? "" : "No ") + `Match --> '${s}'`)
)
See a regex demo
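For the framework use case in the question, the validation pattern can be paired with a second, simpler pattern that collects the parameter names. This is only a sketch: parseRoute and ROUTE_PATTERN are hypothetical names that do not appear in the answer, and it assumes a parameter name is the run of word characters right after a colon.

```javascript
// Pattern taken from the answer above; the names here are hypothetical.
const ROUTE_PATTERN = /^(?![/\w:]*:[:/])[/\w:]*$/;

// Returns the list of parameter names, or null when the route is rejected.
function parseRoute(route) {
  if (!ROUTE_PATTERN.test(route)) {
    return null;
  }
  // Every ":" that survived validation is followed by a word character
  // (or sits at the very end of the string, in which case it is skipped here).
  return [...route.matchAll(/:(\w+)/g)].map((m) => m[1]);
}

console.log(parseRoute("person/:name/:id/:token_person2/image"));
console.log(parseRoute(":::id"));
```

Note that the pattern accepts a trailing colon such as "abc:", because nothing follows it for the lookahead to reject; add an extra check if that should be an error.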

Related

How can I get a specific part of a URL using RegEx?

I am trying to get a part of a file download link using RegEx (or other methods). I have pasted the link that I am trying to parse below; the part I am trying to select is the version number (1.7.0.13).
https://minecraft.azureedge.net/bin-linux/bedrock-server-1.7.0.13.zip
I have looked around and thought about trying Named Capture Groups, however I couldn't figure it out. I would like to be able to do this in JavaScript/Node.js, even if it requires a module 👻.
You can use Node.js built-in modules to make the match easier:
URL and path to isolate the file name, then a simple regexp to finish.
const { URL } = require('url')
const path = require('path')
const test = new URL(
'https://minecraft.azureedge.net/bin-linux/bedrock-server-1.7.0.13.zip'
)
/*
test.pathname = '/bin-linux/bedrock-server-1.7.0.13.zip'
path.parse(test.pathname) = { root: '/',
dir: '/bin-linux',
base: 'bedrock-server-1.7.0.13.zip',
ext: '.zip',
name: 'bedrock-server-1.7.0.13' }
match = [ '1.7.0.13', index: 15, input: 'bedrock-server-1.7.0.13' ]
*/
const match = path.parse(test.pathname)
.name
.match(/[0-9.]*$/)
You could use the below regex:
[\d.]+(?=\.\w+$)
This matches dots and digits that are followed by a file extension. You could also make it more accurate:
\d+(?:\.\d+)*(?=\.\w+$)
I'd stick with this:
-(\d+(?:\.\d+)*)(?:\.\w+)$
It matches a dash immediately before the version number
The parentheses create a capture group
Inside it, \d+ matches one or more digits
?: makes a group that does not capture
Inside this group, \.\d+ matches a dot followed by one or more digits
Thanks to *, that group repeats zero or more times
After that, (?:\.\w+)$ is a non-capturing group that matches the extension at the end of the string
So, basically, this format would allow you to capture all the numbers that are after the dash and before the extension, be it 1, 1.7, 1.7.0, 1.7.0.13, 1.7.0.13.5 etc. On the match array, at index [0] you will have the entire regex match, and on [1] you will have your captured group, the number you're looking for.
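A minimal sketch of reading that captured group in JavaScript; extractVersion is a hypothetical helper name, and the URL is the one from the question.

```javascript
// Pattern from the answer above: dash, captured version, extension, end of string.
const versionPattern = /-(\d+(?:\.\d+)*)(?:\.\w+)$/;

// Hypothetical helper: returns the captured version or null when nothing matches.
function extractVersion(url) {
  const match = url.match(versionPattern);
  // match[0] is the whole match ("-1.7.0.13.zip"), match[1] is the capture group.
  return match ? match[1] : null;
}

const url = 'https://minecraft.azureedge.net/bin-linux/bedrock-server-1.7.0.13.zip';
console.log(extractVersion(url)); // 1.7.0.13
```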
Perhaps a regular expression like this is what you need?
var url = 'https://minecraft.azureedge.net/bin-linux9.9.9/bedrock-server-1.7.0.13.zip'
var match = url.match(/(\d+[.\d+]*)(?=\.\w+$)/gi)
console.log( match )
The pattern /(\d+[.\d+]*)(?=\.\w+$)/gi asks for a substring that:
first contains one or more digit characters, i.e. \d+
is immediately followed by any number of characters from the class [.\d+], which matches a dot, a digit or a literal + (inside square brackets, + is not a quantifier)
and finally, (?=\.\w+$) requires a file extension like .zip to follow immediately after the matched string
For more information on special characters like + and *, see this documentation. Hope that helps!

Match pattern except under one condition Regex

I'm trying to match a pattern with regex except when the pattern is escaped.
Test text:
This is AT\&T&reg; is really cool Regex
You can see that with my \& I'm manually escaping, and therefore do not want the regex to match it.
Regex:
const str = 'This is AT\&T&reg; is really cool Regex'
str.replace(/\&(.*?)\;/g, '<sup>&$1;</sup>');
Expected output
This is AT&T<sup>&reg;</sup> is really cool Regex
It's hard to explain, I guess: the regex looks for text that starts with & and ends with ;. However, if the & is preceded by a \ (as in \&), that occurrence should not match, and the search should continue with the next candidate for \&(.*?)\;
You can use negative lookbehind
This regex works fine with the example
/(?<!\\)\&(.*?)\;/g
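A minimal sketch of that lookbehind in use. It assumes an engine with lookbehind support (ES2018 or later) and that the sample text contains the entity &reg; and a literal backslash; the second replace, which drops the escaping backslash, is an extra step that is not part of the answer.

```javascript
// Assumption: the input holds a real backslash, hence the doubled "\\" in the literal.
const input = 'This is AT\\&T&reg; is really cool Regex';

// (?<!\\) only lets "&" match when the character before it is not a backslash.
const wrapped = input.replace(/(?<!\\)&(.*?);/g, '<sup>&$1;</sup>');

// Extra step (not in the answer): remove the backslash from the escaped "\&".
const output = wrapped.replace(/\\&/g, '&');

console.log(output);
```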
Edit 1
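To work around this in JS you can use [^\\], which matches any character except a backslash. The overall regex is /[^\\]\&(.*?)\;/g, and it works for your example. Note that [^\\] consumes the character before the &, so capture it and put it back in the replacement, and that an entity at the very start of the string is not matched.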
If your JavaScript engine has no support for lookbehind assertions (they were added in ES2018), it is possible to add some custom substitution logic to achieve the desired result. I've updated the test string with examples of different kinds of HTML entities for test purposes:
const str = '&T;his is AT\\&T® is & really &12345; &xAB05; \\&cool; Regex'
console.log(str.replace(/&([a-z]+|[0-9]{1,5}|x[0-9a-f]{1,4});/ig, function (m0, m1, index, str) {
return (str.substr(index - 1, 1) !== '\\') ? '<sup>' + m0 + '</sup>' : m0;
}));

How to make regex match pattern from the beginning?

I need a little assistance with a Regular Expressions.
I'm doing the following from JavaScript to "mask" all special URLs that may be composed using the following rule:
They may begin with something like this 0> or 1223> or 1_23>
They may begin with a protocol, ex: http:// or https://
They may also have a www. subdomain
So for instance, for https://www.example.com it should produce https://www. ....
So I came up with the following JS:
var url = "0>https://www.example.com/plugins/page.php?href=https://forum.example.com/topic/some_topic";
m = url.match(/\b((?:[\d_]+>)?.+\:\/\/(?:www.)?)/i);
if (m) {
url = m[1] + " ...";
}
console.log(url);
It works for most cases, except that "repeating" URL in my example, in which case I get this:
0>https://www.example.com/plugins/page.php?href=https:// ...
when I was expecting:
0>https://www. ...
How do I make it pick the match from the beginning? I thought adding \b would do it...
Just make the .+ non-greedy, like this
m = url.match(/\b((?:[\d_]+>)?.+?\:\/\/(?:www.)?)/i);
Note the ? after .+. It makes the quantifier lazy, so the match stops at the first :// after the optional prefix. Without the ?, .+ is greedy and consumes all the characters up to the last :// in the string.
Also, you don't have to escape :, but you do have to escape the . after www. So your RegEx becomes:
m = url.match(/\b((?:[\d_]+>)?.+?:\/\/(?:www\.)?)/i);
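The difference between the two quantifiers can be seen side by side in a short sketch; mask is a hypothetical helper name, and the URL is the one from the question.

```javascript
const url = '0>https://www.example.com/plugins/page.php?href=https://forum.example.com/topic/some_topic';

// Same pattern twice: greedy .+ versus lazy .+?
const greedy = /\b((?:[\d_]+>)?.+:\/\/(?:www\.)?)/i;
const lazy = /\b((?:[\d_]+>)?.+?:\/\/(?:www\.)?)/i;

// Hypothetical helper: keeps the matched prefix and appends " ...".
function mask(input, pattern) {
  const m = input.match(pattern);
  return m ? m[1] + ' ...' : input;
}

console.log(mask(url, greedy)); // runs up to the last "://"
console.log(mask(url, lazy));   // stops at the first "://"
```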

How is RegEx handled differently in VBA and JavaScript?

I'm using a regular expression in Excel VBA to parse the results of a swim meet. The code reads a row of text that was copied from a PDF and outputs the important data into individual cells. Since the format of the string varies throughout the source PDF, the regular expression is quite complicated. Still, I'm able to parse 95% of the data at this point.
Some of the rows that are not being parsed are confusing me, though. VBA is clearly not able to find a match with the regular expression, but when I copy the exact same regex and string into this website, JavaScript is able to find a match without a problem. Is there something different in the way VBA and JavaScript handle regular expressions that might account for this?
Here's the string that VBA refuses to match:
12. NUNEZ CHENG, Walter 74 Club Tennis Las Terr 3:44.57 123
Here's the function I'm using in Excel (mostly successfully):
Function singleLineResults(SourceString As String) As Variant
Dim cSubmatches As Variant
Dim collectionArray(11) As String
Dim cnt As Integer
Dim oMatches As MatchCollection
With New RegExp
.MultiLine = MultiLine
.IgnoreCase = IgnoreCase
.Global = False
'1. JAROSOVA, Lenka 27 Swimmpower Prague 2:26.65 605 34.45 37.70 37.79 36.71
.Pattern = "(\d*)\.?\s?([^,]+),\s([^\d]+)\s?(\d+)\s((?:[A-Z]{3})?)\s?((?:(?!\d\:\d).)*)\s?((?:\d+:)?\d+\.\d+)(?:\s(\d+))?(?:\s((?:\d+:)?\d+.\d+))?(?:\s((?:\d+:)?\d+.\d+))?(?:\s((?:\d+:)?\d+.\d+))?(?:\s((?:\d+:)?\d+.\d+))?(?:Splash Meet Manager 11, Build \d{5} Registered to [\w\s]+ 2014-08-\d+ \d+:\d+ - Page \d+)?$"
Set oMatches = .Execute(SourceString)
If oMatches.Count > 0 Then
For Each submatch In oMatches(0).SubMatches
collectionArray(cnt) = submatch '.Value
cnt = cnt + 1
Next
Else
singleLineResults = Null
End If
End With
singleLineResults = collectionArray()
End Function
Could you add more examples of what actually matches? E.g. the surrounding lines that match and, better yet, examples that are not supposed to match, if any?
I've tried cleaning up the regex a bit, removing the groups that are not needed to match that particular line in order to make the error more obvious, and I changed how one of the groups works, which might actually fix the issue:
(\d*)
\.?\s?
([^,]+)
,\s
([^\d]+)
\s?
(\d+)
\s
(
(?:[A-Z]{3})?
)
\s?
(
# OLD SOLUTION
# (?:
# (?!\d\:\d)
# .
# )*
# NEW SOLUTION
.*?
)
\s?
(
(?:\d+:)?
\d+\.\d+
)
(?:
\s
(\d+)
)?
$
See example on regex101.
The group that puzzles me the most, however, is this one:
(?:[A-Z]{3})?
Why the 3 character limit, when it only matches the first 3 letters in the street name?
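As a sanity check on the JavaScript side only, the simplified pattern above can be written on one line (without the comments) and run against the problem row. This sketch says nothing about how the VBA engine treats the same pattern; it only shows which groups JavaScript fills in.

```javascript
// The simplified pattern from this answer, collapsed onto one line.
const pattern = /(\d*)\.?\s?([^,]+),\s([^\d]+)\s?(\d+)\s((?:[A-Z]{3})?)\s?(.*?)\s?((?:\d+:)?\d+\.\d+)(?:\s(\d+))?$/;

// The row from the question that VBA fails to match.
const row = '12. NUNEZ CHENG, Walter 74 Club Tennis Las Terr 3:44.57 123';

const m = row.match(pattern);
if (m) {
  // m[0] is the whole match; the capture groups start at index 1.
  m.slice(1).forEach((group, i) => console.log(i + 1, JSON.stringify(group)));
}
```

Group 5, the optional three capital letters, comes back empty for this row because the pattern is case-sensitive and "Clu" is not all upper case.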

split line via regex in javascript?

I have this structure of text :
1.6.1 Members................................................................ 12
1.6.2 Accessibility.......................................................... 13
1.6.3 Type parameters........................................................ 13
1.6.4 The T generic type aka <T>............................................. 13
I need to create JS objects :
{
num:"1.6.1",
txt:"Members"
},
{
num:"1.6.2",
txt:"Accessibility"
} ...
That's not a problem.
The problem is that I want to extract values via Regex split via positive lookahead :
Split via the first time you see that next character is a letter
What I have tried:
'1.6.1 Members........... 12'.split(/\s(?=(?:[\w\. ])+$)/i)
This is working fine :
["1.6.1", "Members...........", "12"] // I don't care about the 12.
But If I have 2 words or more :
'1.6.3 Type parameters................ 13'.split(/\s(?=(?:[\w\. ])+$)/i)
The result is :
["1.6.3", "Type", "parameters................", "13"] //again I don't care about 13.
Of course I can join them , but I want the words to be together.
Question :
How can I enhance my regex NOT to split words ?
Desired result :
["1.6.3", "Type parameters"]
or
["1.6.3", "Type parameters........"] // I will remove extras later
or
["1.6.3", "Type parameters........13"]// I will remove extras later
NB
I know I can do split via " " or by other simpler solution but I'm seeking ( for pure knowledge) for an enhancement for my solution which uses positive lookahead split.
nb2 :
The text can contain capital letter in the middle also.
You can use this regex:
/^(\d+(?:\.\d+)*) (\w+(?: \w+)*)/gm
And get your desired matches using matched group #1 and matched group #2.
Online Regex Demo
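A minimal sketch of turning those two groups into the objects the question asks for. The sample lines are shortened copies of the ones in the question, and toc, entryPattern and entries are hypothetical names.

```javascript
// Shortened sample of the table of contents from the question.
const toc = [
  '1.6.1 Members.......... 12',
  '1.6.3 Type parameters.......... 13',
  '1.6.4 The T generic type aka <T>.......... 13',
].join('\n');

// Pattern from the answer: group 1 is the number, group 2 the words after it.
const entryPattern = /^(\d+(?:\.\d+)*) (\w+(?: \w+)*)/gm;

// matchAll yields one match per line; destructuring skips the full match at index 0.
const entries = [...toc.matchAll(entryPattern)].map(([, num, txt]) => ({ num, txt }));

console.log(entries);
```

Because group 2 only accepts word characters and single spaces, the <T> at the end of the last title is left out.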
Update: For String#split you can use this regex:
/ +(?=[A-Z\d])/g
Regex Demo
Update 2: Since chapter names may also contain capital letters, the following more complex regex is needed:
var re = /(\D +(?=[a-z]))| +(?=[a-z\d])/gmi;
var str = '1.6.3 Type Foo Bar........................................................ 13';
var m = str.split( re );
console.log(m[0], ',', m.slice(1, -1).join(''), ',', m.pop() );
//=> 1.6.3 , Type Foo Bar........................................................ , 13
EDIT: Since you added 1.6.1 The .net 4.5 framework.... to the requirements, we can tweak the answer to this:
^([\d.]+) ((?:[^.]|\.(?!\.))+)
And if you want to allow sequences of up to three dots in the title, as in 1.6.1 She said... Boo!..........., it's an easy tweak from there ({3} quantifier):
^([\d.]+) ((?:[^.]|\.(?!\.{3}))+)
Original:
^([\d.]+) ([^.]+)
In the regex demo, see the Groups in the right pane.
To retrieve Groups 1 and 2, something like:
var myregex = /^([\d.]+) ((?:[^.]|\.(?!\.))+)/mg;
var theMatchObject = myregex.exec(yourString);
while (theMatchObject != null) {
// the numbers: theMatchObject[1]
// the title: theMatchObject[2]
theMatchObject = myregex.exec(yourString);
}
OUTPUT
Group 1 Group 2
1.6.1 Members
1.6.2 Accessibility
1.6.3 Type parameters
1.6.4 The T generic type aka <T>
1.6.1 The .net 4.5 framework
Explanation
^ asserts that we are at the beginning of the line
The parentheses in ([\d.]+) capture digits and dots to Group 1
The parentheses in ((?:[^.]|\.(?!\.))+) capture to Group 2...
[^.] one char that is not a dot, | OR...
\.(?!\.) a dot that is not followed by a dot...
+ one or more times
You can use this pattern too:
var myStr = "1.6.1 Members................................................................ 12\n1.6.2 Accessibility.......................................................... 13\n1.6.3 Type parameters........................................................ 13\n1.6.4 The T generic type aka <T>............................................. 13";
console.log(myStr.split(/ (.+?)\.{2,} ?\d+$\n?/m));
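Because the pattern contains a capturing group, split returns the numbers and the titles interleaved, plus an empty string at the end. A sketch of pairing them up, using a shortened sample and the hypothetical names parts and entries:

```javascript
// Shortened sample of the text from the question.
const myStr = [
  '1.6.1 Members.......... 12',
  '1.6.3 Type parameters.......... 13',
  '1.6.4 The T generic type aka <T>.......... 13',
].join('\n');

// Same split as above; filter(Boolean) drops the trailing empty string.
const parts = myStr.split(/ (.+?)\.{2,} ?\d+$\n?/m).filter(Boolean);

// parts alternates number, title, number, title, ...
const entries = [];
for (let i = 0; i + 1 < parts.length; i += 2) {
  entries.push({ num: parts[i], txt: parts[i + 1] });
}

console.log(entries);
```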
About a way with a lookahead :
I don't think it is possible, because the only way to skip a character (here, a space between two words) is to match it as part of the match that starts at the previous space (the one between the number and the first word). In other words, you rely on the fact that a character cannot be matched more than once.
But if everything in the pattern except the space where you want to split is enclosed in a lookahead, the substring checked by the lookahead is not part of the match result (it is only a check, and the corresponding characters are not consumed by the regex engine). You therefore can't skip the following spaces, and the regex engine simply carries on to the next space character.
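A related option, which swaps in a lookbehind rather than the lookahead the question asks about, is to look backwards from each space instead of forwards. The sketch below assumes an engine with lookbehind support (ES2018 or later) and splits only on the space that is preceded, from the start of the string, by nothing but digits and dots.

```javascript
const line = '1.6.3 Type parameters................ 13';

// (?<=^[\d.]+) succeeds only for the space right after the leading number,
// so the spaces inside the title and before the page number are left alone.
const parts = line.split(/(?<=^[\d.]+) /);

console.log(parts);
```

This yields the number and the rest of the line, which corresponds to the last of the acceptable results listed in the question (the dots and page number still have to be trimmed afterwards).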
