I am trying to figure out the correct regex to replace the last segment of a URL with a modified version of that last segment. (I know that there are similar threads out there, but none seemed to help...)
Example:
https://www.test.com/one/two/three/mypost/
--->
one/two/three?id=mypost
https://www.test.com/one/mypost/
--->
one?id=mypost
Now I am stuck here:
https://regex101.com/r/9GqYaU/1
I can get the last segment in capturing group 2 but how would I replace it?
I think I will have to do something like this:
const url = 'https://www.test.com/one/two/three/mypost/'
const regex = /(http[s]?:\/\/)([^\/]+\/)*(?=\/$|$)/
const path = url.replace(regex, `${myUrlWithoutTheLastSegmentAnd WithoutHTTPS}?id=$2`)
return path
But I have no idea how to get the url without the last segment. I have currently only access to the whole string or group 1 (which is useless in this case) and then group 2, but not the string without group 2.
I would be very glad for any help here. Sometimes I just lack the knowledge of what is possible with regex and how to achieve it.
Thank you in advance.
Cheers
You could use the URL class to extract the pathname and substring to remove the first '/'.
Then, you could put the last part of the pathname in a group and use it as a reference $1 for the replacement.
const url = new URL('https://www.test.com/one/two/three/mypost/').pathname.substring(1)
console.log(url.replace(/\/([^/]*)\/$/, '?id=$1'))
I came across your question yesterday and agree with going down the route of parsing the URL. Once you get there you could even use JavaScript array methods which I prefer to string methods like:
pathname.split("/").filter(p => p.length).pop()
This would separate each folder, ignore any with no length (i.e. handle a trailing slash) and return the last one (mypost).
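The array approach above can be taken one step further to build the whole target string. This is only a minimal sketch: the helper name toQueryPath is made up for illustration, and it assumes the input is always an absolute URL with at least one path segment.

```javascript
// Hypothetical helper (the name is an assumption, not from the question).
function toQueryPath(href) {
  // Parse the URL, split the path into folders, and drop empty parts
  // (the leading slash and any trailing slash produce empty strings).
  const segments = new URL(href).pathname.split("/").filter(p => p.length);
  // Remove the last folder and keep it for the query string.
  const last = segments.pop();
  return `${segments.join("/")}?id=${last}`;
}

console.log(toQueryPath("https://www.test.com/one/two/three/mypost/"));
console.log(toQueryPath("https://www.test.com/one/mypost/"));
```

With the two example URLs from the question this should print one/two/three?id=mypost and one?id=mypost.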
Anyway, I am also learning regex so sometimes when I find a question like this I just try to find the answer anyway as the best way of learning is doing. It took 24 hours 😂 I came up with this:
/(https?:\/\/).+?([a-z-]*)\/?$/gm
(https?:\/\/) you know what this does. Small correction, you don't need the square brackets. Question mark matches 0 or 1 of the preceding character. As we're only matching s this just works. If you wanted to match s or z you would use [sz]?. I think.
.+? is the cool one, which I think I will use in future now that I have found it. The question mark here has a different meaning: it makes .+ (which means one or more of any character) non-greedy. That means it matches as few characters as possible and stops as soon as the rest of the pattern can match. Which is...
([a-z-]*) any number of letters or a hyphen. You should maybe change this to include numbers and upper case.
\/? Optional slash
$ all this must apply at the end of the string.
Here is a demo
https://regex101.com/r/mQNkIS/1
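The pattern above can also be tried directly in JavaScript. In this rough sketch the g and m flags are dropped because only one URL is tested at a time, and the widened character class [\w-] is my own assumption about what "include numbers and upper case" could look like.

```javascript
// The pattern from the answer, without the g and m flags.
const narrow = /(https?:\/\/).+?([a-z-]*)\/?$/;
// Assumed variant: \w also allows digits, upper case and underscore.
const wide = /(https?:\/\/).+?([\w-]*)\/?$/;

// Return capturing group 2 (the last segment), or null when nothing matches.
function lastSegment(pattern, url) {
  const m = url.match(pattern);
  return m ? m[2] : null;
}

console.log(lastSegment(narrow, "https://www.test.com/one/two/three/mypost/")); // mypost
console.log(lastSegment(narrow, "https://www.test.com/one/My-Post2/")); // empty string
console.log(lastSegment(wide, "https://www.test.com/one/My-Post2/")); // My-Post2
```

The second call shows why the character class matters: with only [a-z-] the group comes back empty for a segment ending in a digit.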
I was a bit surprised, that actually no one had the exact same issue in javascript...
I tried several different solutions none of them parse the content correctly.
The closest one I tried : (I stole its regex query from a PHP solution)
const test = `abc?aaa.abcd?.aabbccc!`;
const sentencesList = test.split("/(\?|\.|!)/");
But the result is just
["abc?aaa.abcd?.aabbccc!"]
What I want to get is
['abc?', 'aaa.', 'abcd?','.', 'aabbccc!']
I am so confused.. what exactly is wrong?
/[a-z]*[?!.]/g will do what you want:
const test = `abc?aaa.abcd?.aabbccc!`;
console.log(test.match(/[a-z]*[?!.]/g))
To help you out: what you wrote is not a regex. test.split("/(\?|\.|!)/"); passes a plain string (11 characters as typed), so split searches for that literal text and never finds it. A regex would be, for example, test.split(/(\?|\.|!)/);. This still would not be the regex you're looking for.
The problem with this regex is that it's looking for a ?, ., or ! character only, and capturing that lone character. What you want to do is find any number of characters, followed by one of those three characters.
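Next, String.split does accept a regex, but it is the wrong tool here: it cuts the string at every match, and with a capturing group it returns the delimiters as separate elements ('abc', '?', 'aaa', '.', ...) instead of keeping them attached to the text before them. You'll want to use a function that returns the matches themselves (such as String.match).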
Putting this all together, you'll want to start out your regex with something like this: /.*?/. The dot means any character matches, the asterisk means 0 or more, and the question mark means "non-greedy", or try to match as few characters as possible, while keeping a valid match.
To search for your three characters, you would follow this up with /[?!.]/ to indicate you want one of these three characters (so far we have /.*?[?!.]/). Lastly, you want to add the g flag so it searches for every instance, rather than only the first. /.*?[?!.]/g. Now we can use it in match:
const rawText = `abc?aaa.abcd?.aabbccc!`;
const matchedArray = rawText.match(/.*?[?!.]/g);
console.log(matchedArray);
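To see the difference between the quoted string and a real regex side by side, here is a small sketch. The variable names are only for illustration.

```javascript
const test = "abc?aaa.abcd?.aabbccc!";

// A string argument is searched for literally. That text never occurs
// in `test`, so split returns the whole input as a single element.
const withString = test.split("/(\?|\.|!)/");

// match with the g flag returns every whole match: any run of characters
// (as few as possible) followed by one of ? ! .
const withMatch = test.match(/.*?[?!.]/g);

console.log(withString); // [ 'abc?aaa.abcd?.aabbccc!' ]
console.log(withMatch);  // [ 'abc?', 'aaa.', 'abcd?', '.', 'aabbccc!' ]
```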
The following code works; I do not think we need pattern matching. (I take that back: I have been answering in Java.)
final String S = "A sentence may end with a period. Does it end any other way? Of course!";
final String[] simpleSentences = S.split("[?!.]");
//now simpleSentences array has three elements in it.
So I'm making a markdown editor, and I want some function like "This is *italics*".replace("*$1*","<i>$1</i>");
Any easy way to do this? (Client Side, this'll be hosted on Github Pages or something, so a random npm package probably won't help)
Edit: An equal number of people have upvoted and downvoted this. It would help if you tell me why you downvoted.
Short answer: 'This is *italics*'.replace(/\*(.+)\*/, '<i>$1</i>');
Explanation: Using RegExp is the easiest way to go about this, specifically the grouping section.
Let's strip down /\*(.+)\*/:
The starting and ending / are defining that the thing in between is actually a RegExp
We need to check for asterisks at the start and at the end, but * is a quantifier in RegExp syntax, so we need to escape each one with a \ (basically saying "hey, the next character is not a special symbol, but something literal")
Next we need to specify that we want any character between those asterisks (that's the .), appearing one or more times (that's the +)
Finally we need to group this and tell the RegExp that what we want to remember is the thing between the asterisks and not the whole thing; that's where the parentheses come into action.
Using those parentheses, we can write $n (where n is the number of the capturing group, in this case 1) in the replacement string to insert what that group matched
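One thing to watch for with the short answer above: .+ is greedy and there is no g flag, so a line with two italic spans is treated as one long span. A possible variation, offered as a sketch rather than a complete markdown parser, is to make the quantifier lazy and add the g flag. The function name toItalics is made up for this example.

```javascript
// Hypothetical helper; only handles *single asterisk* italics.
function toItalics(text) {
  // .+? stops at the nearest closing asterisk; g replaces every pair.
  return text.replace(/\*(.+?)\*/g, "<i>$1</i>");
}

const sample = "This is *italics* and *more*";

// Greedy version from the short answer, for comparison.
console.log(sample.replace(/\*(.+)\*/, "<i>$1</i>"));
// This is <i>italics* and *more</i>

console.log(toItalics(sample));
// This is <i>italics</i> and <i>more</i>
```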
Been struggling for the last hour to try and get this regexp to work but cannot seem to crack it.
It must be a regexp and I cannot use split etc as it is part of a bigger regexp that searches for numerous other strings using .test().
(public\/css.*[!\/]?)
public/css/somefile.css
public/css/somepath/somefile.css
public/css/somepath/anotherpath/somefile.css
Here I am trying to look for path starting with public/css followed by any character except for another forward slash.
so "public/css/somefile.css" should match but the other 2 should not.
A better solution may be to somehow specify the number of levels to match after the prefix using something like
(public\/css\/{1,2}.*)
but I can't seem to figure that out either, some help with this would be appreciated.
edit
No idea why this question has been marked down twice, I have clearly stated the requirement with sample code and test cases and also attempted to solve the issue, why is it being marked down ?
You can use this regex:
/^(public\/css\/[^\/]*?)$/gm
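^ : Start of the line (with the m flag, the start of each line)
[^/] : Any character except /
*? : Zero or more of the preceding token, as few as possible (lazy)
$ : End of the line (with the m flag, the end of each line)
g : Global flag
m : Multi-line flag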
Something like this?
/public\/css\/[^\/]+$/
This will match
public/css/[Any characters except for /]$
$ is matching the end of the string in regex.
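Since the question mentions .test(), here is a quick check of that pattern against the three sample paths. It is only a sketch; in the real code the pattern would be one part of the bigger regex.

```javascript
const pattern = /public\/css\/[^\/]+$/;

const paths = [
  "public/css/somefile.css",
  "public/css/somepath/somefile.css",
  "public/css/somepath/anotherpath/somefile.css",
];

// After public/css/ only non-slash characters may follow up to the end,
// so any deeper path fails the test.
const results = paths.map(p => pattern.test(p));
console.log(results); // [ true, false, false ]
```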
I can't post the exact data I'm trying to extract, but here's a basic scenario with the same outcome. I'm grabbing the body of a page and trying to extract a bit.ly link from it. So let's say, for example, this is the chunk of data I'm trying to grab the link from.
String:
http://bit.ly/Pq8AkS</div><div class="shareUnit"><div class="-cx-PRIVATE-fbTimelineExternalShareUnit__wrapper"><div><div class="-cx-PRIVATE-fbTimelineExternalShareUnit__root -cx-PRIVATE-fbTimelineExternalShareUnit__hasImage"><a class="-cx-PRIVATE-fbTimelineExternalShareUnit__video -cx-PRIVATE-fbTimelineExternalShareUnit__image -cx-PRIVATE-fbTimelineExternalShareUnit__content" ajaxify="/ajax/flash/expand_inline.php?target_div=uikk85_59&share_id=271663136271285&max_width=403&max_height=403&context=timelineSingle" rel="async" href="#" onclick="CSS.addClass(this, "-cx-PRIVATE-fbTimelineExternalShareUnit__loading");CSS.removeClass(this, "-cx-PRIVATE-fbTimelineExternalShareUnit__video");"><i class="-cx-PRIVATE-fbTimelineExternalShareUnit__play"></i><img class="img" src="http://external.ak.fbcdn.net/safe_image.php?d=AQDoyY7_wjAyUtX2&w=155&h=114&url=http%3A%2F%2Fi1.ytimg.com%2Fvi%2FDre21lBu2zU%2Fmqdefault.jpg" alt="" /></a>
Now, I can get what I'm looking for with the following code, but the link isn't always going to be exactly 6 characters long. So this causes an issue...
Body = document.getElementsByTagName("body")[0].innerHTML;
regex = /2Fbit.ly%2F(.{6})&h/g;
Matches = regex.exec(Body);
Here's what I was originally trying, but the problem I have is that it grabs too much data. It's going all the way to the last "&h" in the string above instead of stopping at the first one it hits.
Body = document.getElementsByTagName("body")[0].innerHTML;
regex = /2Fbit.ly%2F(.*)&h/g;
Matches = regex.exec(Body);
So basically the main part of the string I'm trying to focus on is "%2Fbit.ly%2FPq8AkS&h" so that I can get the "Pq8AkS" out of it. When I use the (.*) it's grabbing everything between "%2F" and the very last "&h" in the large string above.
You should not be using a regex on HTML. Use DOM functions to get the desired link object, then get the href attribute from that, then you can use a regex on just the href.
By default .* is greedy meaning that it matches the most it can match and still find a match. If you want it to be non-greedy (match the least possible), you can use this .*? instead like this:
regex = /2Fbit.ly%2F(.*?)&h/;
I also don't think you want the g flag on the regex as there should only be one match in the right URL.
If you show the rest of your HTML, we could offer advice on finding the right link object rather than trying to match the entire body HTML.
FYI, another trick for a non-greedy match is to do something like this:
regex = /2Fbit.ly%2F([^&]*)&h/;
Which matches a series of characters that are not & followed by &h which accomplishes the same goal as long as & can't be in the matched sequence.
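The three variants can be compared on a short stand-in string. The sample below is invented for illustration (it is not the asker's real data), and the dot in bit.ly is escaped here so it only matches a literal dot.

```javascript
// Made-up sample containing two "&h" occurrences.
const body = "u=http%3A%2F%2Fbit.ly%2FPq8AkS&h=AQD&s=1&h=2";

// Greedy: runs to the LAST "&h".
const greedy = /2Fbit\.ly%2F(.*)&h/.exec(body)[1];
// Lazy: stops at the FIRST "&h".
const lazy = /2Fbit\.ly%2F(.*?)&h/.exec(body)[1];
// Negated class: cannot cross any "&" at all.
const negated = /2Fbit\.ly%2F([^&]*)&h/.exec(body)[1];

console.log(greedy);  // Pq8AkS&h=AQD&s=1
console.log(lazy);    // Pq8AkS
console.log(negated); // Pq8AkS
```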
By default + and * are greedy and match as much as possible. You need a non-greedy match for your (.+). A quick search gives the solution as
? directly following a quantifier makes the quantifier non-greedy (makes it match minimum instead of maximum of the interval defined).
So try changing your regex= line to
regex = /2Fbit.ly%2F(.*?)&h/g;
Edit: @jfriend00's answer below is more complete.
In Jeff Roberson's jQuery Regular Expressions Review he proposes changing the rts regular expression in jQuery's ajax.js from /(\?|&)_=.*?(&|$)/ to /([?&])_=[^&\r\n]*(&?)/. In both versions, what is the purpose of the second capture group? The code does a replacement of the current random timestamp with a new random timestamp:
var ts = jQuery.now();
// try replacing _= if it is there
var ret = s.url.replace(rts, "$1_=" + ts + "$2");
Doesn't it only replace what it matches? I am thinking this does the same:
var ret = s.url.replace(/([?&])_=[^&\r\n]*/, "$1_=" + ts);
Can someone explain the purpose of the second capture group?
It's to pick up the next delimiter in the query string on the URL, so that it still works properly as a query string. Thus if the url is
http://foo.bar/what/ever?blah=blah&_=12345&zebra=banana
then the second group picks up the "&" before "zebra".
That's an awesome blog post by the way and everybody should read it.
edit — now that I think about it, I'm not sure why it's necessary to bother with replacing that second delimiter. In the "fixed" expression, that greedy * will pick up the whole parameter value and stop at the delimiter (or the end of the string) anyway.
I think you're right. It was needed in the original because matching the ampersand or end-of-string was how the .*? knew when to stop. In Jeff's version that's no longer necessary.
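This can be checked with the example URL from above. In the sketch below, ts is a fixed stand-in for jQuery.now() so the output is predictable, and the variable names are mine.

```javascript
const url = "http://foo.bar/what/ever?blah=blah&_=12345&zebra=banana";
const ts = 99999; // assumed fixed value instead of jQuery.now()

const original = /(\?|&)_=.*?(&|$)/;
const revised = /([?&])_=[^&\r\n]*(&?)/;
const shortened = /([?&])_=[^&\r\n]*/;

// All three give the same URL with the new timestamp.
const viaOriginal = url.replace(original, "$1_=" + ts + "$2");
const viaRevised = url.replace(revised, "$1_=" + ts + "$2");
const viaShortened = url.replace(shortened, "$1_=" + ts);

// The original pattern consumes the following "&", so leaving out $2
// there would glue the next parameter onto the timestamp.
const originalWithoutGroup2 = url.replace(original, "$1_=" + ts);

console.log(viaOriginal);
console.log(viaRevised);
console.log(viaShortened);
console.log(originalWithoutGroup2);
```

The first three lines should all read http://foo.bar/what/ever?blah=blah&_=99999&zebra=banana, while the last one loses the "&" before zebra.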
As the author of the article I can't tell you the reason for the second capture group. My intent with the article was to take existing regexes and simply make them more efficient - i.e. they should all match the same text - just do it faster. Unfortunately I did not have time to delve deeply into the code to see exactly how each and every one of them was being used. I assumed that the capture group for this one was there for a reason so I did not mess with it.