Regex to remove all but file name from links

I am trying to write a regexp that removes file paths from links and images.
href="path/path/file" to href="file"
href="/file" to href="file"
src="/path/file" to src="file"
and so on...
I thought that I had it working, but it messes up if there are two paths in the string it is working on. I think my expression is too greedy. It finds the very last file in the entire string.
This is my code that shows the expression messing up on the test input:
<script type="text/javascript" src="/javascripts/jquery.js"></script>
<script type="text/javascript">
$(document).ready(function(){
var s = '<img src="/one/two/keep.this">';
var t = s.replace(/(src|href)=("|').*\/(.*)\2/gi,"$1=$2$3$2");
alert(t);
});
</script>
It gives the output:
The correct output should be:
<img src="keep.this">
Thanks for any tips!

It doesn't have to be a regular expression (assuming / delimiters):
var fileName = url.split('/').pop(); //pop takes the last element

I would suggest running separate regex replacements, one for a links and another for img. That is easier and clearer, and thus more maintainable.

This seems to work in case anyone else has the problem:
var t = s.replace(/(src|href)=('|")([^ \2]*\/)*\/?([^ \2]*)\2/gi,"$1=$2$4$2");
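One caveat about the pattern above: inside a character class, \2 is not a backreference, so [^ \2] does not exclude the quote character. It mostly excludes the space, which means the pattern relies on attribute values containing no spaces. A minimal sketch of a variant that excludes quotes and slashes directly (stripPaths is a hypothetical name, not from the original post):

```javascript
// Hypothetical helper: keep only the file name in src/href attribute values.
function stripPaths(html) {
  // $1 = attribute name, $2 = opening quote, $3 = last path segment.
  // [^"'/]* cannot cross a quote or a slash, so a match stays inside one attribute.
  return html.replace(/(src|href)=(["'])(?:[^"'/]*\/)*([^"'/]*)\2/gi, '$1=$2$3$2');
}

// The class [^ \2] happily matches a double quote:
console.log(/[^ \2]/.test('"'));

console.log(stripPaths('<img src="/one/two/keep.this"><a href="a/b/c.html">x</a>'));
console.log(stripPaths('<img src="/my files/pic one.png">'));
```

This is still a regex over HTML, so it assumes quoted attribute values and will not cope with every markup variation.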

Try adding ? to make the * quantifiers non-greedy. You want them to stop matching when they encounter the ending quote character. The greedy versions will barrel right on past the ending quote if there's another quote later in the string, finding the longest possible match; the non-greedy ones will find the shortest possible match.
/(src|href)=("|').*?\/([^/]*?)\2/gi
Also I changed the second .* to [^/]* to allow the first .* to still match the full path now that it's non-greedy.
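To see the difference, here is a small sketch that runs the original greedy pattern and the non-greedy one on a string with two paths. The test string is made up for illustration:

```javascript
// Made-up input with two attributes, each containing a path.
var s = '<img src="/one/two/keep.this"> <a href="/three/four/and.this">link</a>';

// Greedy: the first .* runs past the closing quote and backtracks only as far
// as the last slash that still allows a match, swallowing the second tag.
var greedy = s.replace(/(src|href)=("|').*\/(.*)\2/gi, '$1=$2$3$2');

// Lazy .*? plus [^/]* for the file name: each attribute is matched on its own.
var lazy = s.replace(/(src|href)=("|').*?\/([^/]*?)\2/gi, '$1=$2$3$2');

console.log(greedy); // <img src="and.this">link</a>
console.log(lazy);   // <img src="keep.this"> <a href="and.this">link</a>
```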

Related

Regex: Replace last segment of url

I try to figure out the correct regex to replace the last segment of an url with a modified version of that very last segment. (I know that there are similar threads out there, but none seemed to help...)
Example:
https://www.test.com/one/two/three/mypost/
--->
one/two/three?id=mypost
https://www.test.com/one/mypost/
--->
one?id=mypost
Now I am stuck here:
https://regex101.com/r/9GqYaU/1
I can get the last segment in capturing group 2 but how would I replace it?
I think I will have to do something like this:
const url = 'https://www.test.com/one/two/three/mypost/'
const regex = /(http[s]?:\/\/)([^\/]+\/)*(?=\/$|$)/
const path = url.replace(regex, `${myUrlWithoutTheLastSegmentAnd WithoutHTTPS}?id=$2`)
return path
But I have no idea how to get the url without the last segment. I have currently only access to the whole string or group 1 (which is useless in this case) and then group 2, but not the string without group 2.
I would be very glad for any help here. Sometimes I just lack the knowledge of what is possible with regex and how to achieve it.
Thank you in advance.
Cheers

You could use the URL class to extract the pathname and substring to remove the first '/'.
Then, you could put the last part of the pathname in a group and use it as a reference $1 for the replacement.
const url = new URL('https://www.test.com/one/two/three/mypost/').pathname.substring(1)
console.log(url.replace(/\/([^/]*)\/$/, '?id=$1'))

I came across your question yesterday and agree with going down the route of parsing the URL. Once you get there you could even use JavaScript array methods which I prefer to string methods like:
pathname.split("/").filter(p => p.length).pop()
This would separate each folder, ignore any with no length (i.e. handle a trailing slash) and return the last one (mypost).
Anyway, I am also learning regex so sometimes when I find a question like this I just try to find the answer anyway as the best way of learning is doing. It took 24 hours 😂 I came up with this:
/(https?:\/\/).+?([a-z-]*)\/?$/gm
(https?:\/\/) you know what this does. Small correction, you don't need the square brackets. Question mark matches 0 or 1 of the preceding character. As we're only matching s this just works. If you wanted to match s or z you would use [sz]?. I think.
.+? this is the cool one I think I will use in future now I found it. The question mark here has a different meaning - it makes .+ (which means one or more of any character) non-greedy. That means it stops applying once it reaches the next rule. Which is...
([a-z-]*) any number of letters or a hyphen. You should maybe change this to include numbers and upper case.
\/? Optional slash
$ all this must apply at the end of the string.
Here is a demo
https://regex101.com/r/mQNkIS/1
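For comparison, the parsing route mentioned at the start of this answer can produce the output the question asks for without any regex. This is only a sketch: lastSegmentToId is a hypothetical name, and it assumes the URL has at least one path segment.

```javascript
// Hypothetical helper: move the last path segment into an ?id= query parameter.
function lastSegmentToId(href) {
  // URL is built into browsers and Node.js; pathname looks like "/one/two/three/mypost/".
  var segments = new URL(href).pathname.split('/').filter(function (p) {
    return p.length > 0; // drop the empty strings left by leading/trailing slashes
  });
  var last = segments.pop(); // removes and returns the final segment
  return segments.join('/') + '?id=' + last;
}

console.log(lastSegmentToId('https://www.test.com/one/two/three/mypost/')); // one/two/three?id=mypost
console.log(lastSegmentToId('https://www.test.com/one/mypost/'));           // one?id=mypost
```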

Js ReGex non-capturing group not working

My problem is that I need to capture a script src, but only if there is a script tag before the src.
Here are my regex and the options I tried:
String: <script src="http://example.net"></script>
Regex: /(?:\<script[^]+src=("|'))([^]+)(?="|')/g
Match: <script src="http://example.net
Second option:
String: <script src="http://example.net"></script>
Regex: /(?!\<script[^]+src=("|'))([^]+)(?="|')/g
Match: script src="http://example.net
What i need to get is: http://example.net
I really do appreciate any help.
This is the tool i'm using for testing: http://www.regexr.com/
Thanks,

Regular expression is not the right tool for parsing HTML, but to fix the problem you can use the exec() method in a loop to grab all your submatches and then push the match results of the captured group into an array.
var s = '<script src="http://foo.net"></script><script src="http://bar.com"></script>';
var re = /<script[^>]+?src=['"]([^'"]+)['"]/g,
    matches = [],
    m; // declared here so the loop does not create an implicit global
while ((m = re.exec(s)) !== null) {
    matches.push(m[1]); // m[1] is the captured src value
}
console.log(matches); //=> [ 'http://foo.net', 'http://bar.com' ]

Not sure exactly what you're trying to do or where you got that syntax.
If you want the values of the src attribute in all script tags, why not just search for /<script[^>]*\ssrc="([^"]*)"/ and examine the first subexpression match?

As far as I know, the [^]+ syntax is a JavaScript idiom that does not carry over to most other regex flavors. It means "anything that is not nothing" (i.e. any character, newlines included), one or several times.
If you want to match the characters between the tag name and the attribute you want without running past the end of the tag, you need to use [^>]+? (as you can see) with a lazy quantifier.
For the second ugly [^], since it is between quotes, you only need to replace it with [^"'] that excludes quotes.
The result you need is not the whole match but the content of the capture group.
<script[^>]+?src=["']([^"']+)["']

Here's a start for you:
/<script src=\"(.*)(?=\")/g
Retrieve the value of the first capturing group returned by this expression.

Here is the regexr.com result:
String: <script src="http://example.net"></script>
Regex: /(?:<script src=")([^"]+)/g
group#1: http://example.net
And here is the example javascript code:
s = '<script src="http://example.net"></script>';
url = s.split(/(?:<script src=")([^"]+)/g)[1];
Since JavaScript doesn't support lookbehind assertions (AFAIK), you can't both match only the URL and check that there is a script tag before it. As an alternative to lookbehind assertions, this is the fastest and easiest solution that I know.
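For readers on newer engines: lookbehind assertions were added to the language in ES2018, so where they are available both styles work. A hedged sketch, assuming an engine with lookbehind and String.prototype.matchAll (the sample string is made up):

```javascript
// Made-up input: one script tag and one img tag, both with a src attribute.
var s = '<script src="http://example.net"></script><img src="http://example.org/x.png">';

// Capture-group approach: match the tag prefix, then read group 1 of each match.
var viaGroup = Array.from(s.matchAll(/<script[^>]+?src=["']([^"']+)["']/g), function (m) {
  return m[1];
});

// Lookbehind approach (ES2018): the match itself is only the URL.
var viaLookbehind = s.match(/(?<=<script[^>]*\ssrc=["'])[^"']+/g);

console.log(viaGroup);      // [ 'http://example.net' ]
console.log(viaLookbehind); // [ 'http://example.net' ]
```

Both skip the img tag because the pattern requires <script before the attribute.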

Using Regex to match the middle of a path?

I have a var showNo that's an input for the beginning of a directory.
example: var showNo = "101B"
After showNo are characters that include spaces and other junk set up on the network set by another department.
example: /101B A Trip to the Beach/
I need to use sub directories inside of this one:
example: /101B A Trip to the Beach/assets/tools/
Is there a way to use regex and the variable to avoid scanning all of the directories and trying to match a substring of the first 4 characters?

var directory = str.match(/\/101B[^\/]+\//)[0];
This will match the first directory name that starts with your variable.
More importantly, the idea is as follows:
1. Match the string literal: a directory slash followed by the four characters of your variable.
2. Then match any character that is not a directory slash. The "is not" is indicated by the ^.
3. Then repeat step 2 an additional 0 or more times.
4. Finally, match the closing directory slash.
I suspect you had trouble with the "anything that is NOT" character class. It is sometimes tricky but once you get it it is a very useful short cut.
--edit--
Actually on re reading I suspect you had trouble with using the variable inside the regex, correct?
That's easy enough, too, once you know how.
You can construct it as a string first:
var regex_string = "/" + showNo + "[^/]+/";
And then "compile" it into a regex, which you can use as normal:
var regex_dir = RegExp(regex_string);
var directory = str.match(regex_dir);
Hope this helps!
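One thing to watch when building a regex from a variable is that characters such as . or ( in the variable would be treated as regex syntax. Below is a minimal sketch that escapes them first and also handles the no-match case. findShowDirectory is a hypothetical name and the sample path is invented:

```javascript
// Hypothetical helper: find the first directory whose name starts with showNo.
function findShowDirectory(path, showNo) {
  // Escape regex metacharacters so showNo is matched literally.
  var escaped = showNo.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  var match = path.match(new RegExp('/' + escaped + '[^/]+/'));
  return match ? match[0] : null; // match() returns null when nothing is found
}

console.log(findShowDirectory('/shows/101B A Trip to the Beach/assets/tools/', '101B'));
// /101B A Trip to the Beach/
console.log(findShowDirectory('/shows/101B A Trip to the Beach/assets/tools/', '999Z'));
// null
```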

How to get data from string using Javascript Regex

I can't post the exact data i'm trying to extract but here's a basic scenario with the same outcome. I'm grabbing the body of a page and trying to extract a bit.ly link from it. So let's say for example, this is the chunk of data where i'm trying to grab the link from.
String:
http://bit.ly/Pq8AkS</div><div class="shareUnit"><div class="-cx-PRIVATE-fbTimelineExternalShareUnit__wrapper"><div><div class="-cx-PRIVATE-fbTimelineExternalShareUnit__root -cx-PRIVATE-fbTimelineExternalShareUnit__hasImage"><a class="-cx-PRIVATE-fbTimelineExternalShareUnit__video -cx-PRIVATE-fbTimelineExternalShareUnit__image -cx-PRIVATE-fbTimelineExternalShareUnit__content" ajaxify="/ajax/flash/expand_inline.php?target_div=uikk85_59&share_id=271663136271285&max_width=403&max_height=403&context=timelineSingle" rel="async" href="#" onclick="CSS.addClass(this, "-cx-PRIVATE-fbTimelineExternalShareUnit__loading");CSS.removeClass(this, "-cx-PRIVATE-fbTimelineExternalShareUnit__video");"><i class="-cx-PRIVATE-fbTimelineExternalShareUnit__play"></i><img class="img" src="http://external.ak.fbcdn.net/safe_image.php?d=AQDoyY7_wjAyUtX2&w=155&h=114&url=http%3A%2F%2Fi1.ytimg.com%2Fvi%2FDre21lBu2zU%2Fmqdefault.jpg" alt="" /></a>
Now, I can get what i'm looking for with the following code but the link isn't always going to be exactly 6 characters long. So this causes an issue...
Body = document.getElementsByTagName("body")[0].innerHTML;
regex = /2Fbit.ly%2F(.{6})&h/g;
Matches = regex.exec(Body);
Here's what I was originally trying, but the problem is that it grabs too much data. It goes all the way to the last "&h" in the string above instead of stopping at the first one it hits.
Body = document.getElementsByTagName("body")[0].innerHTML;
regex = /2Fbit.ly%2F(.*)&h/g;
Matches = regex.exec(Body);
So basically the main part of the string i'm trying to focus on is "%2Fbit.ly%2FPq8AkS&h" so that I can get the "Pq8AkS" out of it. When I use the (.*) it's grabbing everything between "%2F" and the very last "&h" in the large string above.

You should not be using a regex on HTML. Use DOM functions to get the desired link object, then get the href attribute from that, then you can use a regex on just the href.
By default .* is greedy meaning that it matches the most it can match and still find a match. If you want it to be non-greedy (match the least possible), you can use this .*? instead like this:
regex = /2Fbit.ly%2F(.*?)&h/;
I also don't think you want the g flag on the regex as there should only be one match in the right URL.
If you show the rest of your HTML, we could offer advice on finding the right link object rather than trying to match the entire body HTML.
FYI, another trick for a non-greedy match is to do something like this:
regex = /2Fbit.ly%2F([^&]*)&h/;
Which matches a series of characters that are not & followed by &h which accomplishes the same goal as long as & can't be in the matched sequence.
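Here is a quick side-by-side of the three variants on a shortened, made-up string with more than one "&h" in it. The dot in bit.ly is escaped in this sketch so that it only matches a literal dot:

```javascript
// Shortened sample modeled on the question's data; it contains "&h" twice.
var body = 'url=http%3A%2F%2Fbit.ly%2FPq8AkS&h=114&w=155&hash=abc';

var greedy  = /2Fbit\.ly%2F(.*)&h/.exec(body)[1];     // runs to the LAST "&h"
var lazy    = /2Fbit\.ly%2F(.*?)&h/.exec(body)[1];    // stops at the FIRST "&h"
var negated = /2Fbit\.ly%2F([^&]*)&h/.exec(body)[1];  // cannot cross any "&"

console.log(greedy);  // Pq8AkS&h=114&w=155
console.log(lazy);    // Pq8AkS
console.log(negated); // Pq8AkS
```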

By default + and * are greedy and match as much as possible. You need a non-greedy match for your (.+). A quick search gives the solution as
? directly following a quantifier makes the quantifier non-greedy (makes it match minimum instead of maximum of the interval defined).
So try changing your regex= line to
regex = /2Fbit.ly%2F(.*?)&h/g;
Edit: @jfriend00's answer is more complete.

JavaScript negative lookbehind issue

I've got some JavaScript that looks for Amazon ASINs within an Amazon link, for example
http://www.amazon.com/dp/B00137QS28
For this I use the following regex: /([A-Z0-9]{10})
However, I don't want it to match artist links which look like:
http://www.amazon.com/Artist-Name/e/B000AQ1JZO
So I need to exclude any links where there's a '/e' before the slash and the 10-character alphanumeric code. I thought the following would do that: (?<!/e)([A-Z0-9]{10}), but it turns out negative lookbehinds don't work in JavaScript. Is that right? Is there another way to do this instead?
Any help would be much appreciated!
As a side note, be aware there are plenty of Amazon link formats, which is why I want to blacklist rather than whitelist, eg, these are all the same page:
http://www.amazon.com/gp/product/B00137QS28/
http://www.amazon.com/dp/B00137QS28
http://www.amazon.com/exec/obidos/ASIN/B00137QS28/
http://www.amazon.com/Product-Title-Goes-Here/dp/B00137QS28/

In your case an expression like this would work:
/(?!\/e)..\/([A-Z0-9]{10})/

([A-Z0-9]{10}) will work equally well on the reverse of its input, so you can
reverse the string,
use positive lookahead,
reverse it back.

You need to use a lookahead to filter the /e/* ones out. Then trim the two leading characters and the slash from each of the matches.
var source = 'http://www.amazon.com/dp/B00137QS28'; // the source you're matching against the RegExp
var matches = source.match(/(?!\/e)..\/[A-Z0-9]{10}/g) || [];
var ids = matches.map(function (match) {
    return match.slice(3); // drop the two leading characters and the slash
});
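On engines that implement ES2018, negative lookbehind is available, so the pattern the question was reaching for can be written almost directly. A sketch under that assumption (extractAsin is a hypothetical name; the links are the ones from the question):

```javascript
var links = [
  'http://www.amazon.com/dp/B00137QS28',
  'http://www.amazon.com/Artist-Name/e/B000AQ1JZO',
  'http://www.amazon.com/Product-Title-Goes-Here/dp/B00137QS28/'
];

// Hypothetical helper: return the ASIN, or null for artist (/e/) links.
function extractAsin(link) {
  // (?<!\/e) rejects a slash that is immediately preceded by "/e".
  var m = /(?<!\/e)\/([A-Z0-9]{10})/.exec(link);
  return m ? m[1] : null;
}

var asins = links.map(extractAsin);
console.log(asins); // [ 'B00137QS28', null, 'B00137QS28' ]
```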
