Match link patterns in HTML code with a RegEx

I'm using a linkify function that detects link-like patterns with a regex and replaces them with a tags to produce clickable links.
The regex looks like this:
// http://, https://, ftp://
var urlPattern = /\b(?![^<]*>|[^<>]*<\/)(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*[a-z0-9-+&@#\/%=~_|]/gim;
/* Some explanations:
(?! # Negative lookahead start (will cause match to fail if contents match)
[^<]* # Any number of non-'<' characters
> # A > character
| # Or
[^<>]* # Any number of non-'<' and non-'>' characters
</ # The characters < and /
) # End negative lookahead.
*/
and replaces the link like this:
return textInput.replace(urlPattern, '<a target="_blank" rel="noopener" href="$&">$&</a>')
The regex works perfectly for in-text links. However, I am also using it on HTML code, such as
<ul><li>Link: https://www.link.com</li></ul> //linkify not working
<ul><li>Link: https://www.link.com <br/></li></ul> //linkify working
where only the second example works. I don't know why the behavior differs and would be very glad to get some help. What should my regex look like to linkify list elements without the break?

If I understood your issue correctly, I think this regex should detect the links in both scenarios:
\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)
Essentially, the first part, \b(?![^<]*>), skips any position that sits inside a tag, that is, one followed by a > with no < before it.
Then we grab the parts of interest: the protocol is a non-capturing group, as in your original expression, so it can be stripped later if it is not needed. The capturing group takes the remaining part of the URL.
Because of the way the regex is built, we can now choose between the entire URL (the full match) and just the part after the protocol (group 1).
To log the two parts we can use this snippet:
const str = '<ul><li>Link: https://www.link.com</li></ul>';
const myRegexp = /\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)/gim;
const match = myRegexp.exec(str);
console.log(match[0]); // full match: https://www.link.com
console.log(match[1]); // group 1, without the protocol: www.link.com
Possible variations:
In a situation like the one presented above, you can simplify your regex further to:
(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)
which gives the same output.
If the full URL is enough, you can remove the parentheses of the capturing group:
(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*
PS - I'm assuming that your examples were meant to be:
<ul><li>Link: https://www.link.com</li></ul>
<ul><li>Link: https://www.link.com <br/></li></ul>
i.e. with https, http or ftp which makes the second case work with your original regex
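For reference, here is a minimal sketch of the asker's linkify function using the simplified lookahead from this answer. The name linkify, the sample strings, and reading the doubled # in the character class as @# are assumptions for illustration, not part of the original posts.

```javascript
// Sketch only: linkify and the sample inputs are illustrative names.
// (?![^<]*>) rejects positions inside a tag, i.e. positions where a '>'
// appears before the next '<'. The '@#' in the class is an assumption.
const urlPattern =
  /\b(?![^<]*>)(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*[a-z0-9-+&@#\/%=~_|]/gim;

function linkify(textInput) {
  return textInput.replace(
    urlPattern,
    '<a target="_blank" rel="noopener" href="$&">$&</a>'
  );
}

const withoutBreak = linkify('<ul><li>Link: https://www.link.com</li></ul>');
const withBreak = linkify('<ul><li>Link: https://www.link.com <br/></li></ul>');
const insideTag = linkify('<img src="https://www.link.com/a.png">');

console.log(withoutBreak); // URL wrapped in an a tag
console.log(withBreak);    // URL wrapped in an a tag
console.log(insideTag);    // unchanged: the URL sits inside a tag
```

One trade-off to be aware of: the original pattern's second alternative, [^<>]*<\/, is what rejected a URL directly followed by a closing tag. Without it, a URL that is already the text of an existing a element would be wrapped a second time.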

Related

jQuery parameters replace and upload between two strings in all URL's

WordPress' FacetWP plugin has a 'facetwp-loaded' jQuery event that allows for changes when facets are refreshed.
This is the 'facetwp-loaded' event's usage from FacetWP's documentation:
(function($) {
$(document).on('facetwp-loaded', function() {
// Changes go here
});
})(jQuery);
Facets produce URL's like:
http://website.com/hotels?fwp_location=worldwide
or
http://website.com/hotels/worldwide?fwp_location=europe
So I would like to make a global regex redirection that replaces whatever is between hotels and = with /
In the above examples, that would result in:
http://website.com/hotels/worldwide
or
http://website.com/hotels/europe
Can someone help me with this?
Thanks in advance
UPDATE
I've tried different Regex methods, but it seems to need jQuery parameter replace/update.

I don't have a way to test this with the WordPress regex engine, so you'll have to check it, but it works in the R regex engine. Hopefully WordPress supports Perl-style regular expressions.
Regex: Match: (?<=hotels).*?= and replace with /
In this case the piece of the string we want to remove is preceded by "hotels" and ends with an equals sign, so we want to match everything immediately after "hotels" up to and including the equals sign. To start matching immediately after "hotels" without including it, we use a lookbehind before the match. (?<=hotels) means: look backwards from the current position in the string and check whether "hotels" precedes it. So when the engine reaches the character right after "hotels", it looks back and sees "hotels" (without consuming it, because it is a lookbehind). . matches any character, * means zero or more of it, and ? makes the * lazy, so it matches as few characters as possible until the next part of the pattern, in this case =, can match.
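As a rough illustration in JavaScript, assuming an engine with lookbehind support (ES2018 or later), the same pattern can be applied with String.prototype.replace. The helper name rewriteFacetUrl is hypothetical, and the sketch only rewrites a URL string; it does not touch FacetWP or perform the redirect.

```javascript
// Sketch only: rewriteFacetUrl is a hypothetical helper name.
function rewriteFacetUrl(url) {
  // (?<=hotels) asserts that "hotels" precedes the match without consuming it;
  // .*?= lazily matches everything up to and including the first "=".
  return url.replace(/(?<=hotels).*?=/, '/');
}

const first = rewriteFacetUrl('http://website.com/hotels?fwp_location=worldwide');
const second = rewriteFacetUrl('http://website.com/hotels/worldwide?fwp_location=europe');

console.log(first);  // http://website.com/hotels/worldwide
console.log(second); // http://website.com/hotels/europe
```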

Regex to detect urls without www and http

Could you update my regex to meet the following requirements:
Must match URLs without www and http
If the URL contains a query, match that too
The URL ends at a space, a comma (,) or the end of the string
Match only top-level domains from a list
var srg = new RegExp(/(^|[\s])([\w\.]+\.(com|cc|net))/ig);
For example, it must match:
jsfiddle.net
jmitty.cc:8080/test3s.html
www.ru,sample.com,google.com/?l=en
very.secure.dotster.com/i?ewe
As a result I need
<a>jsfiddle.net</a>
<a>jmitty.cc:8080/test3s.html</a>
<a>www.ru</a>,<a>sample.com</a>,<a>google.com/?l=en</a>
<a>very.secure.dotster.com/i?ewe</a>
Fiddle http://jsfiddle.net/tYnU7/

Well, I guess you can change some little things in your regex:
([\w\.]+\.(?:com|cc|net|ru)[^,\s]*)
Replace with:
<a>$1</a>
I'm not sure why you had (^|[\s]) at the beginning; it didn't seem useful to me, so I removed it. If you had your reasons, you can put it back.
I added ru to the extensions so that www.ru matches, as you required, and added [^,\s]* to continue matching until a comma or whitespace is encountered.
Your updated fiddle is here.
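Since the fiddle link did not survive, here is a small sketch of that regex applied to the sample inputs from the question. The helper name wrapUrls and the bare a tag in the replacement (taken from the asker's desired output) are assumptions.

```javascript
// Sketch only: wrapUrls is a hypothetical helper name.
// [\w.]+ matches the host, \.(?:com|cc|net|ru) one of the listed TLDs,
// and [^,\s]* continues through port, path and query until a comma or whitespace.
const tldPattern = /([\w.]+\.(?:com|cc|net|ru)[^,\s]*)/gi;

function wrapUrls(text) {
  return text.replace(tldPattern, '<a>$1</a>');
}

const samples = [
  'jsfiddle.net',
  'jmitty.cc:8080/test3s.html',
  'www.ru,sample.com,google.com/?l=en',
  'very.secure.dotster.com/i?ewe',
];
const wrapped = samples.map(wrapUrls);

wrapped.forEach(function (line) {
  console.log(line);
});
// <a>jsfiddle.net</a>
// <a>jmitty.cc:8080/test3s.html</a>
// <a>www.ru</a>,<a>sample.com</a>,<a>google.com/?l=en</a>
// <a>very.secure.dotster.com/i?ewe</a>
```

Note that the TLD alternation is not anchored at its end, so a host such as foo.community would also match on the .com prefix; a stricter version would need a boundary check after the TLD.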

This is a very complex problem with no perfect answer, but if you don't need perfection, check out Jeff Roberson's Linkify page and this post by Jan Goyvaerts discussing Jeff Atwood's blog post, "The Problem with URLs".

/
(?:^|\b) # match word boundary or beginning of line
( # begin capture
[\w.]+ # domain part
\.[a-z]{2,3} # domain suffix
(?:\:[0-9]{1,5})? # optional port
(?:\/.*)? # path details
) # end capture
(?:[,\s]|$) # comma, space or eol
/ig
Some details:
[\w.]+ may need more work depending on what you classify as acceptable domain characters (I've heard they're accepting unicode characters now?)
You can change [a-z]{2,3} into a list of acceptable top-level domains (e.g. (?:com|org|net|info|edu)). In your example you only list com, cc & net, but your result shows www.ru as captured.
(?:\/.*)? is greedy by default, but should be okay since you want query information.
And the fiddle
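Oh, and if you want your links clickable (because those without a protocol don't work):
var r = t.replace(srg, function(match, b, m, e){
// assumes srg captures three groups: leading boundary (b), URL (m), trailing delimiter (e)
return b + '<a href="http://' + m + '">' + m + '</a>' + e;
});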
Which is demonstrated here

How to turn urls padded by space into links

I have the following code that is used to turn http URLs in text into anchor tags. It's looking for anything starting with http, surrounded by white space (or the beginning/end of input)
function linkify (str) {
var regex = /(^|\s)(https?:\/\/\S+)($|\s)/ig;
return str.replace(regex,'$1<a href="$2">$2</a>$3')
}
// This works
linkify("Go to http://www.google.com and http://yahoo.com");
// This doesn't, yahoo.com doesn't become a link
linkify("Go to http://www.google.com http://yahoo.com");
The case where it doesn't work is if I only have a single space between two links. I'm assuming it's because the space in between the two links can't be used to match both URLs, after the first match, the space after the URL has already been consumed.
To play with: http://jsfiddle.net/NgMw8/
Can somebody suggest a regex way of doing this? I could scan the string myself, but I'm looking for a regex way of doing it (or some way that doesn't require scanning the string myself and building a new string on my own).

Don't consume the final \s. This way, the space stays available and the second URL can match it as its preceding \s, as required:
function linkify (str) {
var regex = /(^|\s)(https?:\/\/\S+)/ig;
return str.replace(regex,'$1<a href="$2">$2</a>')
}
http://jsfiddle.net/NgMw8/3/

Just use a positive lookahead when matching your final $|\s, like so:
var regex = /(^|\s)(https?:\/\/\S+)(?=($|\s))/ig;
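A minimal sketch of the lookahead variant in a complete function: because (?=\s|$) only asserts the trailing whitespace without consuming it, the single space between two URLs is still available as the leading \s of the next match. The function name linkifySpaced and the anchor markup in the replacement are assumptions for illustration.

```javascript
// Sketch only: linkifySpaced is a hypothetical name.
function linkifySpaced(str) {
  // Group 1: start of input or leading whitespace (kept in the output).
  // Group 2: the URL itself. The lookahead checks the trailing boundary
  // without consuming it, so adjacent URLs can share one space.
  const regex = /(^|\s)(https?:\/\/\S+)(?=\s|$)/gi;
  return str.replace(regex, '$1<a href="$2">$2</a>');
}

const linked = linkifySpaced('Go to http://www.google.com http://yahoo.com');
console.log(linked);
// Go to <a href="http://www.google.com">http://www.google.com</a> <a href="http://yahoo.com">http://yahoo.com</a>
```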

None of these will work if an HTML element is stuck to the URL ...
See a similar question and its answers HERE
Some solutions can handle a URL like "test.com/anothertest?get=letsgo" and prepend http://
A workaround may be needed to handle https and miscellaneous TLDs ...

How to get data from string using Javascript Regex

I can't post the exact data I'm trying to extract, but here's a basic scenario with the same outcome. I'm grabbing the body of a page and trying to extract a bit.ly link from it. So let's say, for example, this is the chunk of data I'm trying to grab the link from.
String:
http://bit.ly/Pq8AkS</div><div class="shareUnit"><div class="-cx-PRIVATE-fbTimelineExternalShareUnit__wrapper"><div><div class="-cx-PRIVATE-fbTimelineExternalShareUnit__root -cx-PRIVATE-fbTimelineExternalShareUnit__hasImage"><a class="-cx-PRIVATE-fbTimelineExternalShareUnit__video -cx-PRIVATE-fbTimelineExternalShareUnit__image -cx-PRIVATE-fbTimelineExternalShareUnit__content" ajaxify="/ajax/flash/expand_inline.php?target_div=uikk85_59&share_id=271663136271285&max_width=403&max_height=403&context=timelineSingle" rel="async" href="#" onclick="CSS.addClass(this, "-cx-PRIVATE-fbTimelineExternalShareUnit__loading");CSS.removeClass(this, "-cx-PRIVATE-fbTimelineExternalShareUnit__video");"><i class="-cx-PRIVATE-fbTimelineExternalShareUnit__play"></i><img class="img" src="http://external.ak.fbcdn.net/safe_image.php?d=AQDoyY7_wjAyUtX2&w=155&h=114&url=http%3A%2F%2Fi1.ytimg.com%2Fvi%2FDre21lBu2zU%2Fmqdefault.jpg" alt="" /></a>
Now, I can get what I'm looking for with the following code, but the link isn't always going to be exactly 6 characters long, which causes an issue...
Body = document.getElementsByTagName("body")[0].innerHTML;
regex = /2Fbit.ly%2F(.{6})&h/g;
Matches = regex.exec(Body);
Here's what I was originally trying, but the problem is that it grabs too much data. It goes all the way to the last "&h" in the string above instead of stopping at the first one it hits.
Body = document.getElementsByTagName("body")[0].innerHTML;
regex = /2Fbit.ly%2F(.*)&h/g;
Matches = regex.exec(Body);
So basically the main part of the string I'm trying to focus on is "%2Fbit.ly%2FPq8AkS&h" so that I can get the "Pq8AkS" out of it. When I use (.*), it grabs everything between "%2F" and the very last "&h" in the large string above.

You should not be using a regex on HTML. Use DOM functions to get the desired link object, then get the href attribute from that, then you can use a regex on just the href.
By default .* is greedy meaning that it matches the most it can match and still find a match. If you want it to be non-greedy (match the least possible), you can use this .*? instead like this:
regex = /2Fbit.ly%2F(.*?)&h/;
I also don't think you want the g flag on the regex as there should only be one match in the right URL.
If you show the rest of your HTML, we could offer advice on finding the right link object rather than trying to match the entire body HTML.
FYI, another trick for a non-greedy match is to do something like this:
regex = /2Fbit.ly%2F([^&]*)&h/;
Which matches a series of characters that are not & followed by &h which accomplishes the same goal as long as & can't be in the matched sequence.
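The difference between the three patterns can be seen in a short sketch. The body string here is a shortened, hypothetical sample containing two "&h" occurrences, not the asker's real page, and the dot in bit.ly is escaped so that it only matches a literal dot.

```javascript
// Sketch only: a hypothetical, shortened sample with two "&h" occurrences.
const body = 'url=http%3A%2F%2Fbit.ly%2FPq8AkS&h=114&w=155&h=403';

// Greedy: .* runs to the end, then backtracks to the LAST "&h".
const greedy = /2Fbit\.ly%2F(.*)&h/.exec(body)[1];

// Lazy: .*? grows one character at a time and stops at the FIRST "&h".
const lazy = /2Fbit\.ly%2F(.*?)&h/.exec(body)[1];

// Negated class: [^&]* cannot cross an "&", so it also stops at the first one.
const negated = /2Fbit\.ly%2F([^&]*)&h/.exec(body)[1];

console.log(greedy);  // Pq8AkS&h=114&w=155
console.log(lazy);    // Pq8AkS
console.log(negated); // Pq8AkS
```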

By default + and * are greedy and match as much as possible. You need a non-greedy match for your (.+). A quick search gives the solution as
? directly following a quantifier makes the quantifier non-greedy (makes it match minimum instead of maximum of the interval defined).
So try changing your regex= line to
regex = /2Fbit.ly%2F(.*?)&h/g;
Edit: @jfriend00's answer below is more complete.

Extending an existing regex to drop punctuation after URL links

I have an existing replace that matches http within a text string and creates a working URL from the text.
Working Example:
var Text = "Visit Gmail at http://gmail.com"
var linkText = Text.replace(/http:\/\/\S+/gi, '<a href="$&">$&</a>');
document.write(linkText);
Output:
Visit Gmail at http://gmail.com
Problem:
The problem arises when the link appears at the end of a sentence and the punctuation incorrectly becomes appended to the end of the URL.
Can someone advise on a way of extending my regex (or maybe adding a second replacement after this has been transformed) to overcome this?
I think the right answer will include adding something along the lines of /\W$/g to my original regex, but I can't see how this can be applied to just one word within the whole string.
As always, very grateful for any help.
Thanks,
Pete
Examples of problem links
http://gmail.com/.
http://gmail.com,
http://gmail.com/?
http://gmail.com!
All of these should resolve the link to http://gmail.com
Note how some could end in a slash then punctuation and others with punctuation directly after the domain name.

Try
/http:\/\/(.(?![.?] |$))*/
My logic is, if the last char is a dot, or question mark followed by either a space or end of string, you don't need it.
var Text = "Visit Gmail at http://gmail.com"
var linkText = Text.replace(/http:\/\/(.(?![.?](?:\s|$)))*./gi, '<a href="$&">$&</a>');
document.write(linkText);
Gives
"Visit Gmail at http://gmail.com"
Edit:
This may be better (it doesn't match white space now)
http:\/\/(.(?!(?:[.?](?: |$))))*.

Why not just use a negated character class?
/http:\/\/\S+[^\s.,?!]/gi

You could account for trailing unwanted characters, whether you strip them or not.
The replacement for both is capture group 1: <a href="$1">$1<\/a>
This also assumes you can use lookbehind, though I'm not sure whether client-side JS supports lookbehind assertions.
Strip unwanted chars
/(http:\/\/\S+)(?<![\/.,?!])[\/.,?!]*/
Or, leave unwanted characters
/(http:\/\/\S+)(?<![\/.,?!])/
Alternate, using lookahead
Strip
/(http:\/\/\S+?(?=[\/.,?!]+(?:\s|$)|\s|$))[\/.,?!]*/
Leave
/(http:\/\/\S+?(?=[\/.,?!]+(?:\s|$)|\s|$))/
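As a sketch of how the lookahead "Leave" variant behaves on the problem links from the question: the g flag and the helper name linkifyTrimmed are additions for illustration. The lazy \S+? stops as soon as what follows is only trailing punctuation and then whitespace or end of string, so that punctuation stays outside the link.

```javascript
// Sketch only: linkifyTrimmed is a hypothetical helper name.
function linkifyTrimmed(text) {
  // \S+? expands lazily until the lookahead sees either trailing
  // punctuation followed by whitespace/end, or whitespace/end directly.
  const regex = /(http:\/\/\S+?(?=[\/.,?!]+(?:\s|$)|\s|$))/gi;
  return text.replace(regex, '<a href="$1">$1</a>');
}

const trimmed = [
  'Visit Gmail at http://gmail.com/.',
  'http://gmail.com, and more',
  'http://gmail.com/?',
  'http://gmail.com!',
  'See http://gmail.com/path?x=1, ok',
].map(linkifyTrimmed);

trimmed.forEach(function (line) {
  console.log(line);
});
// Visit Gmail at <a href="http://gmail.com">http://gmail.com</a>/.
// <a href="http://gmail.com">http://gmail.com</a>, and more
// <a href="http://gmail.com">http://gmail.com</a>/?
// <a href="http://gmail.com">http://gmail.com</a>!
// See <a href="http://gmail.com/path?x=1">http://gmail.com/path?x=1</a>, ok
```

Punctuation inside the URL, such as the ? and / in the last sample, is kept because it is followed by more URL characters, not by whitespace or the end of the string.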
