While writing an API service for my site, I realized that String.split() won't do it much longer, and decided to try my luck with regular expressions. I have almost done it but I can't find the last bit. Here is what I want to do:
The URL represents a function call:
/api/SECTION/FUNCTION/[PARAMS]
This last part, including the slash, is optional: some functions display a JSON reply without receiving any arguments. For example, /api/sounds/getAllSoundpacks prints a list of available sound packs, whereas /api/sounds/getPack/8Bit prints the detailed information for the 8Bit pack.
Here is the expression I have tried:
req.url.match(/\/(.*)\/(.*)\/?(.*)/);
What am I missing to make the last part optional - or capture it in whole?
This will capture everything after FUNCTION/ in your URL, independent of the appearance of any further / after FUNCTION/:
FUNCTION\/(.+)$
The RegExp will not match if there is no part after FUNCTION.
This regex should work by making last slash and part after optional:
/^\/[^/]*\/[^/]*(?:\/.*)?$/
This matches all of these strings:
/api/SECTION/FUNCTION/abc
/api/SECTION
/api/SECTION/
/api/SECTION/FUNCTION
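If the individual parts are needed rather than a yes/no match, the same idea can be written with capture groups. This is a minimal sketch, assuming the /api prefix is fixed and that SECTION and FUNCTION never contain a slash; parseApiUrl and the returned field names are made up for illustration:

```javascript
// Hypothetical helper: splits /api/SECTION/FUNCTION[/PARAMS] into its parts.
// Assumes SECTION and FUNCTION contain no "/" and the "/api" prefix is fixed.
const API_URL = /^\/api\/([^\/]+)\/([^\/]+)(?:\/(.*))?$/;

function parseApiUrl(url) {
  const m = API_URL.exec(url);
  if (m === null) return null; // not an API call
  // m[3] is undefined when the optional "/PARAMS" part is absent.
  return {
    section: m[1],
    func: m[2],
    params: m[3] === undefined ? null : m[3],
  };
}

console.log(parseApiUrl("/api/sounds/getAllSoundpacks"));
console.log(parseApiUrl("/api/sounds/getPack/8Bit"));
```

Because the optional part is wrapped as (?:\/(.*))?, the slash and the parameters appear or disappear together, and everything after FUNCTION/ is captured as one string.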
Your pattern /(.*)/(.*)/?(.*) was almost correct, it's just a bit too short - it allows 2 or 3 slashes, but you want to accept anything with 3 or 4 slashes. And if you want to capture the last (optional) slash AND any text behind it as a whole, you simply need to create a group around that section and make it optional:
/.*/.*/.*(?:/.+)?
should do the trick.
Demo. (The pattern looks different because multiline mode is enabled, but it still works. It's also a little "better" because it won't match garbage like "///".)
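To see why the original attempt does not split the URL as intended, it helps to look at what its groups actually capture. The snippet below is only an illustration, using the example URL from the question: each (.*) is greedy, so the first group swallows two segments and the last group is left empty.

```javascript
// The pattern from the question, applied to the example URL from the question.
const original = /\/(.*)\/(.*)\/?(.*)/;
const groups = "/api/sounds/getPack/8Bit".match(original).slice(1);

// The first (.*) is greedy: it runs to the end of the string and backtracks
// only as far as the last "/", so it captures "api/sounds/getPack".
console.log(groups); // [ 'api/sounds/getPack', '8Bit', '' ]
```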
I've racked my brain over this JS regex and have so far only managed to get parts of it to work or the whole thing to work in certain circumstances.
I have a string like this:
Some string<br>http://anysubdomain.particulardomain.com<br>Rest of string
The goal is to move the domain part to the end of the string, if it's there. The http part is also optional and can also be https. The domain is always particulardomain.com; the subdomain can be anything.
I've managed to get everything into capture groups when the domain with protocol is present with this regex:
(.*)(https?\:\/\/[a-z\d\-]*\.particulardomain\.com)(.*)
But any attempt at making the domain part and the protocol part within it optional has resulted in no or the wrong matches.
The end result I'm looking for is to have the three parts of the string – beginning, domain, end – in separate capture groups so I can move capture group 2 (the domain part) to the end, or, if there's no domain present, the whole string in the first capture group.
To clarify, here are some examples with the expected output/capture groups:
INPUT:
Some string<br>http://anysubdomain.particulardomain.com<br>Rest of string
OR (no protocol):
Some string<br>anysubdomain.particulardomain.com<br>Rest of string
OUTPUT:
$1: Some string<br>
$2: http://anysubdomain.particulardomain.com
$3: <br>Rest of string
INPUT:
Some string<br>Rest of string
OUTPUT:
$1: Some string<br>Rest of string
$2: empty
$3: empty
One mistake in your regex is that it contains only particular whereas
the source text contains particulardomain, but this is a detail.
Now let's move to the protocol part. You put only one ? (after s),
which means that only s is optional, but both http and :
are still required.
To make the whole protocol optional, you must:
enclose it with a group (either capturing or not),
make this group optional (put ? after it).
And now maybe the most important thing: your regex starts with (.*).
Note that this is the greedy version, which:
initially tries to capture the whole rest of the source string,
then backs off one character at a time, to allow the
following part of the regex to match.
Change it to the reluctant version (.*?), and then the optional
group (https?:)? will match as expected.
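The difference between the greedy and the reluctant version can be checked directly. The sketch below uses the first sample input from the question; the protocol is written as a non-capturing group (?:https?:)? here so that the group numbers stay 1, 2, 3, which is a choice made for this illustration only.

```javascript
const input = "Some string<br>http://anysubdomain.particulardomain.com<br>Rest of string";

// Same pattern twice; only the first group differs: greedy (.*) vs. reluctant (.*?).
const greedy = /(.*)((?:https?:)?\/\/[a-z\d-]+\.particulardomain\.com)(.*)/;
const lazy = /(.*?)((?:https?:)?\/\/[a-z\d-]+\.particulardomain\.com)(.*)/;

const g = input.match(greedy);
const l = input.match(lazy);

// Greedy: group 1 keeps "http:" because the optional protocol may match nothing.
console.log(g[1], "|", g[2]);
// Reluctant: group 1 stops as early as possible, so the protocol lands in group 2.
console.log(l[1], "|", l[2]);
```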
Another detail: \ before : is not needed. It does not do
any harm either, but following the principle "Keep It Simple...",
I recommend deleting it (as I did above).
One more detail: After [a-z\d\-] (subdomain part) you should put
+, not *, as this part may not be empty.
So the whole regex can be:
(.*?)((https?:)?\/\/[a-z\d\-]+\.particulardomain\.com)(.*)
And the last remark: I am in doubt, whether you really need three
capturing groups. Maybe it would be enough to leave only the content
of the middle capturing group, i.e.:
(https?:)?\/\/[a-z\d\-]+\.particulardomain\.com
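Note that the pattern above still requires // before the subdomain, so it does not match the protocol-less input from the question, and when no domain is present it does not match at all. One possible variant (an assumption for illustration, not taken from the answer above) anchors the pattern, makes the whole protocol including the slashes optional, and wraps the domain and the trailing part in one optional group, so that a string without a domain ends up entirely in the first group. moveDomainToEnd is a hypothetical helper name:

```javascript
// Variant pattern (assumption, not from the answer above):
//   group 1: everything before the domain (reluctant)
//   group 2: optional protocol + subdomain + particulardomain.com
//   group 3: everything after the domain
// Groups 2 and 3 sit inside one optional group, so both are undefined
// when the string contains no domain.
const DOMAIN = /^(.*?)(?:((?:https?:\/\/)?[a-z\d-]+\.particulardomain\.com)(.*))?$/;

// Hypothetical helper: moves the domain part, if present, to the end.
function moveDomainToEnd(str) {
  const m = DOMAIN.exec(str);
  if (m === null || m[2] === undefined) return str; // no domain: unchanged
  return m[1] + m[3] + m[2];
}

console.log(moveDomainToEnd("Some string<br>http://anysubdomain.particulardomain.com<br>Rest of string"));
console.log(moveDomainToEnd("Some string<br>anysubdomain.particulardomain.com<br>Rest of string"));
console.log(moveDomainToEnd("Some string<br>Rest of string"));
```

The reluctant first group grows one character at a time until either the domain group matches or the end of the string is reached, which is what puts the whole string into group 1 when there is no domain.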
Found a solution. Since, as stated, the goal is to move the domain to the end of the string, if it's present, I'm just matching the domain and anything after it. If there's no domain, nothing matches and hence nothing gets replaced. The problem was the two .* both at the beginning and the end of the regex. Only the one at the end is needed.
REGEX:
([a-z\d\-:\/]+\.particulardomain\.com)(.*)
Works for the following strings:
Domain present:
Start of string 1234<br>https://subdomain.particulardomain.com<br>End of string 999
Domain without protocol:
Start of string 1234<br>subdomain.particulardomain.com<br>End of string 999
No domain:
Start of string 1234<br>End of string 999
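For completeness, here is how the replacement itself could look with this regex. It is a small sketch; moveDomain is a hypothetical helper name, and the replacement string "$2$1" simply swaps the domain with whatever follows it.

```javascript
const DOMAIN_AND_REST = /([a-z\d\-:\/]+\.particulardomain\.com)(.*)/;

// Hypothetical helper: puts the text after the domain ($2) before the domain ($1).
// If the regex does not match, replace() returns the string unchanged.
function moveDomain(str) {
  return str.replace(DOMAIN_AND_REST, "$2$1");
}

console.log(moveDomain("Start of string 1234<br>https://subdomain.particulardomain.com<br>End of string 999"));
console.log(moveDomain("Start of string 1234<br>subdomain.particulardomain.com<br>End of string 999"));
console.log(moveDomain("Start of string 1234<br>End of string 999"));
```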
Thanks everyone for helping me rethink the problem!
I see good answers here. As you explained, you need three groups and want to move the domain to the end of the string (though it is not clear whether you mean the entire URL or only the domain, e.g. particulardomain.com).
You can do this:
// I don't know whether the <br> tags matter for your problem; I assume they do not.
// This is your input:
let str = "Start of string 1234<br>https://subdomain.particulardomain.com<br>End of string 99";
let group = str.split("<br>");
// Validate each part separately with the regExp you made, using a built-in array function:
let indexOfDomain = group.findIndex(part => /^[a-z\d\-:\/]+\.particulardomain\.com$/.test(part));
if (indexOfDomain !== -1) {
  // Move the domain part to the end of the array.
  group.push(group.splice(indexOfDomain, 1)[0]);
}
TO KEEP IN MIND:
Your solution will not work 100% of the time. Why?
Your regExp:
([a-z\d\-:\/]+\.particulardomain\.com)(.*)
will match http, https, or any other run of those characters (even one that is not a protocol), and it will not match this input at all (test it and leave a comment if you like):
Start of string 1234<br>End of string 999
The regExp from @Valdi_Bo's answer:
(.*?)((https?:)?\/\/[a-z\d\-]+\.particulardomain\.com)(.*)
fits what you described in the question.
It does not fit all of your inputs, though; maybe he did not test it against all of them, since you did not explain them in your question the way you did in your own answer.
In conclusion, what you need in the end is to extract the domain (I don't know whether that means the entire URL, as you mix up the two ideas). If you are not going to use the capture groups, do a split and then validate each part with the regExp; it will be easier.
I seem to have a love/hate relationship with RegEx in that I love how incredibly powerful it is, but at the same time, I don't quite understand all of the nuances of it yet.
I've got a rather lengthy JSON feed that I need to parse, capturing ALL of the matches between two specific strings. I've included a link to the regex101.com example with a few of the JSON results.
regex101.com Example
I'm trying to match every string between each /content/usergenerated and /jcr:content
...
I guess what I should really be trying to match is a string that starts with /content/webAppName/en/home and ends before /jcr:content
The path that I care about will always start with /content/webAppName/en/home
You have to use a "positive look-ahead", which lets a pattern match only if it is followed by something else, without including that something in the match:
https://regex101.com/r/fU1iD1/4
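As a rough sketch of the look-ahead approach in JavaScript: the real feed is only available behind the regex101 link, so the sample data and the property name path below are invented for illustration. The pattern starts at /content/webAppName/en/home and stops right before /jcr:content, which the look-ahead checks but does not consume.

```javascript
// Hypothetical sample data; the actual JSON feed is not shown in the question.
const feed = JSON.stringify([
  { path: "/content/usergenerated/content/webAppName/en/home/forum/topic-1/jcr:content" },
  { path: "/content/usergenerated/content/webAppName/en/home/blog/jcr:content" },
]);

// [^"]*? keeps the match inside one JSON string value;
// (?=\/jcr:content) requires "/jcr:content" to follow without matching it.
const PATHS = /\/content\/webAppName\/en\/home[^"]*?(?=\/jcr:content)/g;

const paths = feed.match(PATHS) || [];
console.log(paths);
```

With the g flag, String.prototype.match returns every match at once, so no loop is needed when only the matched text is of interest.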
Just wrap the two things you're looking to remove in parentheses, and then leave them out of the output. So...
(\/content\/usergenerated)(.*)(\/jcr\:content)
replaced by
$2
which is everything in the middle of those two.
edit: Sorry, didn't look at your example :) - there was a deleted answer that said to add the g modifier, which looks like it works.
/content/usergenerated/content/webAppName/en/home([a-zA-Z/-]+)/jcr:content
This should work. It matches 3 out of 4; I don't know why it doesn't match one of them. You could call exec() in a loop until it returns null and read element [1] of each result, which contains the data for the first and only capture group.
All the best.
PS: I used the gmi flags for the regex.
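The exec() loop mentioned above could look roughly like this. The sample text is invented, since the real feed is not shown, and captureAll is a hypothetical helper name. One thing the sketch makes visible: the character class [a-zA-Z/-] contains no digits or underscores, so a path containing either is skipped, which is one possible (unconfirmed) reason for a missing match.

```javascript
const PATH = /\/content\/usergenerated\/content\/webAppName\/en\/home([a-zA-Z\/-]+)\/jcr:content/gmi;

// Hypothetical helper: collects capture group 1 of every match.
function captureAll(text) {
  const found = [];
  PATH.lastIndex = 0; // a global regex remembers its position between calls
  let m;
  while ((m = PATH.exec(text)) !== null) {
    found.push(m[1]); // m[1] is the first and only capture group
  }
  return found;
}

// Invented sample data; the third path contains "_" and "1" and is not matched.
const sample = [
  "/content/usergenerated/content/webAppName/en/home/forum/general/jcr:content",
  "/content/usergenerated/content/webAppName/en/home/blog/jcr:content",
  "/content/usergenerated/content/webAppName/en/home/topic_1/jcr:content",
].join("\n");

console.log(captureAll(sample));
```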
Consider my few input strings.
http://local.app.com/local/frontend/v12/#/abcde/
http://local.app.com/local/frontend/v12/#/abcde/!/fghij/
http://local.app.com/local/frontend/v12/#/abcde/!/ghijk/!/klmno/
I have written this regex which works fine for input string 1.
(?:([a-zA-Z0-9.://_]*)(/#/(?=([a-zA-Z0-9]{5})/)))
Output:
http://local.app.com/local/frontend/v12/#/,http://local.app.com/local/frontend/v12,/#/,abcde
But when I extend it to support repetitive !/.../ place holder for input string 1,2 and 3, it doesn't work and gives empty string rather than token.
(?:([a-zA-Z0-9.://_]*)(/#/(?=([a-zA-Z0-9]{5})/))(!/(?=([a-zA-Z0-9]{5})/))*)
Output:
http://local.app.com/local/frontend/v12/#/,http://local.app.com/local/frontend/v12,/#/,abcde,,
A lookahead (?=...) in fact matches only a position, defined by what you specify after the ?=.
It does not also capture the text that matches the lookahead's specification.
Try
(.+? # (/[a-zA-Z0-9]{5}/) (!/([a-zA-Z0-9]{5})/)* )
(hope I didn't make a typo, can't test it right now.)
This should capture the complete input, but the various captures inside give you access to the captured "tokens".
You can, in addition, give names to the various captures inside, making it easier to identify them in the match:
(.+?#(/(?<tokenFirst>[a-zA-Z0-9]{5})/)(!/(?<tokenMore>[a-zA-Z0-9]{5})/)*)
Success
Hope this will clarify my comment and earlier remarks.
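If this is run in JavaScript, one more detail matters: a capture group inside a repetition keeps only its last iteration, so a pattern like the one above would expose just the final !/.../ token. One way around that, sketched here under the assumption that every token is exactly five alphanumeric characters as in the examples, is to capture the whole repeated tail once and pull the tokens out of it in a second pass. extractTokens and the returned field names are hypothetical:

```javascript
// Group 1: base URL, group 2: first token, group 3: the whole "!/xxxxx/" tail.
const URL_TOKENS = /^(.+?)\/#\/([a-zA-Z0-9]{5})\/((?:!\/[a-zA-Z0-9]{5}\/)*)$/;

// Hypothetical helper: returns the base URL and all tokens, or null if no match.
function extractTokens(url) {
  const m = URL_TOKENS.exec(url);
  if (m === null) return null;
  const [, base, first, tail] = m;
  // Second pass: collect every token from the repeated tail.
  const more = Array.from(tail.matchAll(/!\/([a-zA-Z0-9]{5})\//g), t => t[1]);
  return { base, tokens: [first, ...more] };
}

console.log(extractTokens("http://local.app.com/local/frontend/v12/#/abcde/"));
console.log(extractTokens("http://local.app.com/local/frontend/v12/#/abcde/!/ghijk/!/klmno/"));
```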
I 'borrowed' a regex from this website: http://daringfireball.net/2010/07/improved_regex_for_matching_urls that is almost complete, but I also want it to match a bare example.com.
I know that stackoverflow is not doyourhomework.com, but I have spent a long time on this without results. Here is a fiddle to test: http://jsfiddle.net/BGnMm/25/ and you can see at the end that example.com is not turned into a link.
var reg=/\b((?:[a-z][\w-]+:(?:\/*)|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi;
var allurl="http:foo.com/blah_blah http://foo.com/blah_blah/ (Something like http://foo.com/blah_blah) http://foo.com/blah_blah_(wikipedia) http://foo.com/more_(than)_one_(parens) (Something like http://foo.com/blah_blah_(wikipedia)) http://foo.com/blah_(wikipedia)#cite-1 http://foo.com/blah_(wikipedia)_blah#cite-1 http://foo.com/unicode_(✪)_in_parens http://foo.com/(something)?after=parens http://foo.com/blah_blah. http://foo.com/blah_blah/. <http://foo.com/blah_blah> <http://foo.com/blah_blah/> http://foo.com/blah_blah, http://www.extinguishedscholar.com/wpglob/?p=364. http://✪df.ws/1234 rdar://1234 rdar:/1234 x-yojimbo-item://6303E4C1-6A6E-45A6-AB9D-3A908F59AE0E message://%3c330e7f840905021726r6a4ba78dkf1fd71420c1bf6ff#mail.gmail.com%3e http://➡.ws/䨹 www.c.ws/䨹 <tag>http://example.com</tag> Just a www.example.com link. http://example.com/something?with,commas,in,url, but not at end What about <mailto:gruber#daringfireball.net?subject=TEST> (including brokets). mailto:name#example.com bit.ly/foo “is.gd/foo/” WWW.EXAMPLE.COM http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))/Web_ENG/View_DetailPhoto.aspx?PicId=752 http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55)) http://lcweb2.loc.gov/cgi-bin/query/h?pp/horyd:#field(NUMBER+#band(thc+5a46634)) 6:00p filename.txt http://example.com/quotes-are-“part” ✪df.ws/1234 example.com example.com/";
document.write(allurl.replace(reg,"<a href='$1' >$1</a><br />"));
Add an alternation operator (|) after the {2,4}\/, i.e.
var reg=/\b((?:[a-z][\w-]+:(?:\/*)|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/|)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi;
There's something you should understand about this. The first non-captured group, (?: … ), looks for "indicators" of URLs. One indicator, for example, is the www (followed by up to 3 digits). You, however, are asking for a way to identify URLs without any indicator at all. So, what we've done above is add a clause, "or an empty match," as a "valid" indicator. The consequence of this is that your regular expression is less selective now: all sorts of strings, not only example.com but also filename.txt, 3.141593, and omg...really are identified as URLs! Your only other (readily available) option is to be more selective about suffixes, e.g. require specific suffixes (com|org|net), but then this takes away from the generality of the original regex, which doesn't specify any suffixes at all.
In other words, you are probably faced with a limitation of logic, not a limitation of regex-writing skills or the regex language itself.
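The suffix-based alternative can be illustrated with a much smaller pattern. This is a simplified sketch and not the full URL regex from the question: it only looks for bare host names, and the suffix list (com|org|net) is an assumed example that a real application would have to extend.

```javascript
// Simplified sketch, NOT the full URL regex from the question: it only finds
// bare host names whose last label is in a fixed (assumed) suffix list.
const BARE_DOMAIN = /\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:com|org|net)\b/gi;

const text = "see example.com, filename.txt, 3.141593 and omg...really";
const hits = text.match(BARE_DOMAIN) || [];

console.log(hits); // [ 'example.com' ]
```

The price is the one described above: any domain whose suffix is not on the list is silently ignored.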
Please check if
var reg=/\b((?:[a-z][\w-]+:(?:\/*)|(?:www\d{0,3}[.])|[a-z0-9.\-]+[.][a-z]{2,4}\/{0,1})(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))*(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi;
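suits your needs. Compared with your regex, the www part is now wrapped in its own non-capturing group, the slash after the TLD is optional (\/{0,1}), and the path part may appear zero times (* instead of +). Sorry for the first answer, did not notice the texts.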
In Jeff Roberson's jQuery Regular Expressions Review he proposes changing the rts regular expression in jQuery's ajax.js from /(\?|&)_=.*?(&|$)/ to /([?&])_=[^&\r\n]*(&?)/. In both versions, what is the purpose of the second capture group? The code does a replacement of the current random timestamp with a new random timestamp:
var ts = jQuery.now();
// try replacing _= if it is there
var ret = s.url.replace(rts, "$1_=" + ts + "$2");
Doesn't it only replace what it matches? I am thinking this does the same:
var ret = s.url.replace(/([?&])_=[^&\r\n]*/, "$1_=" + ts);
Can someone explain the purpose of the second capture group?
It's to pick up the next delimiter in the query string on the URL, so that it still works properly as a query string. Thus if the url is
http://foo.bar/what/ever?blah=blah&_=12345&zebra=banana
then the second group picks up the "&" before "zebra".
That's an awesome blog post by the way and everybody should read it.
edit — now that I think about it, I'm not sure why it's necessary to bother with replacing that second delimiter. In the "fixed" expression, that greedy * will pick up the whole parameter value and stop at the delimiter (or the end of the string) anyway.
I think you're right. It was needed in the original because matching the ampersand or end-of-string was how the .*? knew when to stop. In Jeff's version that's no longer necessary.
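This is easy to check with the example URL from above. The sketch below uses a fixed number in place of jQuery.now() so the results are predictable; the variable names are made up for the illustration.

```javascript
const url = "http://foo.bar/what/ever?blah=blah&_=12345&zebra=banana";
const ts = 67890; // stand-in for jQuery.now()

// Jeff's version, with and without the second capture group: same result.
const withGroup = url.replace(/([?&])_=[^&\r\n]*(&?)/, "$1_=" + ts + "$2");
const withoutGroup = url.replace(/([?&])_=[^&\r\n]*/, "$1_=" + ts);

// The original lazy version needs something after .*? to tell it where to stop.
// With the terminator removed, .*? matches nothing and the old value stays behind.
const lazyOriginal = url.replace(/(\?|&)_=.*?(&|$)/, "$1_=" + ts + "$2");
const lazyNoTerminator = url.replace(/(\?|&)_=.*?/, "$1_=" + ts);

console.log(withGroup);
console.log(withoutGroup);
console.log(lazyOriginal);
console.log(lazyNoTerminator);
```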
As the author of the article I can't tell you the reason for the second capture group. My intent with the article was to take existing regexes and simply make them more efficient - i.e. they should all match the same text - just do it faster. Unfortunately I did not have time to delve deeply into the code to see exactly how each and every one of them was being used. I assumed that the capture group for this one was there for a reason so I did not mess with it.