Extending an existing regex to drop punctuation after URL links - javascript

I have an existing replace that matches http within a text string and creates a working URL from the text.
Working Example:
var Text = "Visit Gmail at http://gmail.com"
var linkText = Text.replace(/http:\/\/\S+/gi, '$&');
document.write(linkText);
Output:
Visit Gmail at http://gmail.com
Problem:
The problem arises when the link appears at the end of a sentence and the punctuation incorrectly becomes appended to the end of the URL.
Can someone advise on a way of extending my regex (or maybe adding a second replacement after this has been transformed) to overcome this?
I think the right answer will include adding something along the lines of /\W$/g to my original regex, but I can't see how this can be applied to just one word within the whole string.
As always, very grateful for any help.
Thanks,
Pete
Examples of problem links
http://gmail.com/.
http://gmail.com,
http://gmail.com/?
http://gmail.com!
All of these should resolve the link to http://gmail.com
Note how some could end in a slash then punctuation and others with punctuation directly after the domain name.

Try
/http:\/\/(.(?![.?] |$))*/
My logic is, if the last char is a dot, or question mark followed by either a space or end of string, you don't need it.
var Text = "Visit Gmail at http://gmail.com"
var linkText = Text.replace(/http:\/\/(.(?![.?](?:\s|$)))*./gi, '$&');
document.write(linkText);
Gives
"Visit Gmail at http://gmail.com"
Edit:
This may be better (it doesn't match white space now)
http:\/\/(.(?!(?:[.?](?: |$))))*.

Why not just use a negative character class?
/http://\S+[^.,?!]/gi

You could account for trailing unwanted characters, whether stripping them or not.
The replacement for both is capture buffer 1: <a href="$1">$1<\/a>
This also asumes you can do lookbehind. though I'm not sure if client side JS can do lookbehind assertions.
Strip unwanted chars
/(http:\/\/\S+)(?<![\/.,?!])[\/.,?!]*/
Or, leave unwanted characters
/(http:\/\/\S+)(?<![\/.,?!])/
Alternate, using lookahead
Strip
/(http:\/\/\S+?(?=[\/.,?!]+(?:\s|$)|\s|$))[\/.,?!]*/
Leave
/(http:\/\/\S+?(?=[\/.,?!]+(?:\s|$)|\s|$))/

Related

Match link patterns in HTML code with a RegEx

I'm using a linkify function, which detects link-like patterns by using regex and replaces those with a-tags to reveal a clickable link.
The regex looks like that:
// http://, https://, ftp://
var urlPattern = /\b(?![^<]*>|[^<>]*<\/)(?:https?|ftp):\/\/[a-z0-9-+&##\/%?=~_|!:,.;]*[a-z0-9-+&##\/%=~_|]/gim;
/* Some explanations:
(?! # Negative lookahead start (will cause match to fail if contents match)
[^<]* # Any number of non-'<' characters
> # A > character
| # Or
[^<>]* # Any number of non-'<' and non-'>' characters
</ # The characters < and /
) # End negative lookahead.
*/
and replaces the link like this:
return textInput.replace(urlPattern, '<a target="_blank" rel="noopener" href="$&">$&</a>')
The regex works perfectly for in-text links. However, I am using it in HTML-Code also, such as
<ul><li>Link: https://www.link.com</li></ul> //linkify not working
<ul><li>Link: https://www.link.com <br/></li></ul> //linkify working
where just the secont example is working. I dont't know why the behavior is different and would be very glad to get some help from you. What should my regex look like, to linkify without the break in list elements?
If I understood correctly your issue I think that this regex should be ok to detect the links in both the scenarios:
\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&##\/%?=~_|!:,.;]*)
Essentially with the first part we are segmenting in this way:
Then we go and grab the different parts of interest: the first part is a non-capturing group as in your original expression to strip the protocol later, if really unneeded. The last part takes the remaining part of the URL
For the way we created the regex we can now decide if taking the entire URL or just the second part. This is evident looking to the bottom-right of this screenshot:
Now in order to log the two parts we can take this nice snippet:
const str = '<ul><li>Link: https://www.link.com</li></ul>';
var myRegexp = /\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&##\/%?=~_|!:,.;]*)/gim;
var match = myRegexp.exec(str);
console.log(match[0]);
console.log(match[1]);
Possible variations:
in a situation like the one presented above you can simplify further your regex to:
(?:https?|ftp):\/\/([a-z0-9-+&##\/%?=~_|!:,.;]*)
getting the same output
if the full URL is enough you can remove the round parentheses of the second group
(?:https?|ftp):\/\/[a-z0-9-+&##\/%?=~_|!:,.;]*
PS - I'm assuming that your examples were meant to be:
<ul><li>Link: https://www.link.com</li></ul>
<ul><li>Link: https://www.link.com <br/></li></ul>
i.e. with https, http or ftp which makes the second case work with your original regex

Extracting both the full match, and the last token match in a regexp

I have a little interesting issue here. I have a plaintext URL coming from Excel and I need to change it to an HTML URL with a unique body. Here is the regex code for javascript:
text = text.toString().replace(/=hyperlink\(([#\\\w\s\(\)-\.\/]+)\)/g, "<a href='file:///$1'>$1</a>");
This works perfectly fine for what it does. Example, text is:
=hyperlink("\\share\folder\log\2013\13-05-13\13-05-13.txt")
regex turns it into
\\share\folder\log\2013\13-05-13\13-05-13.txt
However, I need the inner HTML to be just the text file name:
13-05-13.txt
To further complicate the matter, the original text the regex is going through is not a single occurrence. It is an entire spreadsheet with 100's of rows that contain this. So the regex will be matching and replacing 100's of these strings in one operation.
Hopefully it is possible to get this all done in one regexp on the entire string, but I suppose I could loop through each line of the string first...
If there is no way to do this with one regex engine, what do you think the best approach is? (no PHP/Python/Server side. Just Javascript, HTML, Jquery, etc).
I guess you could use this regex:
=hyperlink\("([#\\\w\s\(\)\-\.\/]+\\([^"]+))"\)
And this new replace:
$2
I'm not sure how your regex was working, but I added the quotes in the regex and replaced the single quotes by double quotes in the replace. Revert those if need be.
Demo

How to turn urls padded by space into links

I have the following code that is used to turn http URLs in text into anchor tags. It's looking for anything starting with http, surrounded by white space (or the beginning/end of input)
function linkify (str) {
var regex = /(^|\s)(https?:\/\/\S+)($|\s)/ig;
return str.replace(regex,'$1$2$3')
}
// This works
linkify("Go to http://www.google.com and http://yahoo.com");
// This doesn't, yahoo.com doesn't become a link
linkify("Go to http://www.google.com http://yahoo.com");
The case where it doesn't work is if I only have a single space between two links. I'm assuming it's because the space in between the two links can't be used to match both URLs, after the first match, the space after the URL has already been consumed.
To play with: http://jsfiddle.net/NgMw8/
Can somebody suggest a regex way of doing this? I could scan the string myself, looking for a regex way of doing it (or some way that doesn't require scanning the string my self and building a new string on my own.
Don't capture the final \s. This way, the second url will match the preceding \s, as required:
function linkify (str) {
var regex = /(^|\s)(https?:\/\/\S+)/ig;
return str.replace(regex,'$1$2')
}
http://jsfiddle.net/NgMw8/3/
Just use a positive lookahead when matching your final $|\s, like so:
var regex = /(^|\s)(https?:\/\/\S+)(?=($|\s))/ig;
None will work if there are any html element stuck to the url ...
Similar question and it's answers HERE
Some solutions can handle url like "test.com/anothertest?get=letsgo" and append http://
Workaround may be done to handle https and miscellaneous tld ...

How to get data from string using Javascript Regex

I can't post the exact data i'm trying to extract but here's a basic scenario with the same outcome. I'm grabbing the body of a page and trying to extract a bit.ly link from it. So let's say for example, this is the chunk of data where i'm trying to grab the link from.
String:
http://bit.ly/Pq8AkS</div><div class="shareUnit"><div class="-cx-PRIVATE-fbTimelineExternalShareUnit__wrapper"><div><div class="-cx-PRIVATE-fbTimelineExternalShareUnit__root -cx-PRIVATE-fbTimelineExternalShareUnit__hasImage"><a class="-cx-PRIVATE-fbTimelineExternalShareUnit__video -cx-PRIVATE-fbTimelineExternalShareUnit__image -cx-PRIVATE-fbTimelineExternalShareUnit__content" ajaxify="/ajax/flash/expand_inline.php?target_div=uikk85_59&share_id=271663136271285&max_width=403&max_height=403&context=timelineSingle" rel="async" href="#" onclick="CSS.addClass(this, "-cx-PRIVATE-fbTimelineExternalShareUnit__loading");CSS.removeClass(this, "-cx-PRIVATE-fbTimelineExternalShareUnit__video");"><i class="-cx-PRIVATE-fbTimelineExternalShareUnit__play"></i><img class="img" src="http://external.ak.fbcdn.net/safe_image.php?d=AQDoyY7_wjAyUtX2&w=155&h=114&url=http%3A%2F%2Fi1.ytimg.com%2Fvi%2FDre21lBu2zU%2Fmqdefault.jpg" alt="" /></a>
Now, I can get what i'm looking for with the following code but the link isn't always going to be exactly 6 characters long. So this causes an issue...
Body = document.getElementsByTagName("body")[0].innerHTML;
regex = /2Fbit.ly%2F(.{6})&h/g;
Matches = regex.exec(Body);
Here's what I was orginally trying but the problem I have is that it grabs too much data. It's going all the way to the last "&h" in the string above instead of stopping at the first one it hits.
Body = document.getElementsByTagName("body")[0].innerHTML;
regex = /2Fbit.ly%2F(.*)&h/g;
Matches = regex.exec(Body);
So basically the main part of the string i'm trying to focus on is "%2Fbit.ly%2FPq8AkS&h" so that I can get the "Pq8AkS" out of it. When I use the (.*) it's grabbing everything between "%2F" and the very last "&h" in the large string above.
You should not be using a regex on HTML. Use DOM functions to get the desired link object, then get the href attribute from that, then you can use a regex on just the href.
By default .* is greedy meaning that it matches the most it can match and still find a match. If you want it to be non-greedy (match the least possible), you can use this .*? instead like this:
regex = /2Fbit.ly%2F(.*?)&h/;
I also don't think you want the g flag on the regex as there should only be one match in the right URL.
If you show the rest of your HTML, we could offer advice on finding the right link object rather than trying to match the entire body HTML.
FYI, another trick for a non-greedy match is to do something like this:
regex = /2Fbit.ly%2F([^&]*)&h/;
Which matches a series of characters that are not & followed by &h which accomplishes the same goal as long as & can't be in the matched sequence.
By default + and * are greedy and match as much as possible. You need a non-greedy match for your (.+). A quick search gives the solution as
? directly following a quantifier makes the quantifier non-greedy (makes it match minimum instead of maximum of the interval defined).
So try changing your regex= line to
regex = /2Fbit.ly%2F(.*?)&h/g;
Edit: #jfriend00's answer below is more complete.

Regex wordwrap with UTF8 characters in JS

i've already read all tha articles in here wich touch a similar problem but still don't get any solution working. In my case i wanna wrap each word of a string with a span. The words contain special characters like 'äüö...'
What i am doing at the moment is:
var textWrap = text.replace(/\b([a-zA-Z0-9ßÄÖÜäöüÑñÉéÈèÁáÀàÂâŶĈĉĜĝŷÊêÔôÛûŴŵ-]+)\b/g, "<span>$1</span>");
But what happens is that if the äüñ or whatever NON-Ascii character is at the end or at the beginning it also acts like a boundary. Being within a word these characters do't act as a boundary.
'Ärmelkanal' becomes Ä<span>rmelkanal</span> but should be <span>Ärmelkanal</span>
'Käse'works fine... becomes <span>Käse</span>
'diré' becomes <span>dir</span>é but should be <span>diré</span>
Any advice would be very appreciated. I need to do that on clientside :-( BTW did i mention that i hate regular expressions ;-)
Thank You very much!
The problem is that JavaScript recognizes word boundaries only before/after ASCII letters (and numbers/underscore). Just drop the \b anchors and it should work.
result = subject.replace(/[a-zA-Z0-9ßÄÖÜäöüÑñÉéÈèÁáÀàÂâŶĈĉĜĝŷÊêÔôÛûŴŵ-]+/g, "<span>$&</span>");

Categories