How to turn urls padded by space into links

How to turn urls padded by space into links - javascript

I have the following code that is used to turn http URLs in text into anchor tags. It's looking for anything starting with http, surrounded by white space (or the beginning/end of input)
function linkify (str) {
var regex = /(^|\s)(https?:\/\/\S+)($|\s)/ig;
return str.replace(regex,'$1$2$3')
}
// This works
linkify("Go to http://www.google.com and http://yahoo.com");
// This doesn't, yahoo.com doesn't become a link
linkify("Go to http://www.google.com http://yahoo.com");
The case where it doesn't work is if I only have a single space between two links. I'm assuming it's because the space in between the two links can't be used to match both URLs, after the first match, the space after the URL has already been consumed.
To play with: http://jsfiddle.net/NgMw8/
Can somebody suggest a regex way of doing this? I could scan the string myself, looking for a regex way of doing it (or some way that doesn't require scanning the string my self and building a new string on my own.

Don't capture the final \s. This way, the second url will match the preceding \s, as required:
function linkify (str) {
var regex = /(^|\s)(https?:\/\/\S+)/ig;
return str.replace(regex,'$1$2')
}
http://jsfiddle.net/NgMw8/3/

Just use a positive lookahead when matching your final $|\s, like so:
var regex = /(^|\s)(https?:\/\/\S+)(?=($|\s))/ig;

None will work if there are any html element stuck to the url ...
Similar question and it's answers HERE
Some solutions can handle url like "test.com/anothertest?get=letsgo" and append http://
Workaround may be done to handle https and miscellaneous tld ...

Related

Match link patterns in HTML code with a RegEx

I'm using a linkify function, which detects link-like patterns by using regex and replaces those with a-tags to reveal a clickable link.
The regex looks like that:
// http://, https://, ftp://
var urlPattern = /\b(?![^<]*>|[^<>]*<\/)(?:https?|ftp):\/\/[a-z0-9-+&##\/%?=~_|!:,.;]*[a-z0-9-+&##\/%=~_|]/gim;
/* Some explanations:
(?! # Negative lookahead start (will cause match to fail if contents match)
[^<]* # Any number of non-'<' characters
> # A > character
| # Or
[^<>]* # Any number of non-'<' and non-'>' characters
</ # The characters < and /
) # End negative lookahead.
*/
and replaces the link like this:
return textInput.replace(urlPattern, '<a target="_blank" rel="noopener" href="$&">$&</a>')
The regex works perfectly for in-text links. However, I am using it in HTML-Code also, such as
<ul><li>Link: https://www.link.com</li></ul> //linkify not working
<ul><li>Link: https://www.link.com <br/></li></ul> //linkify working
where just the secont example is working. I dont't know why the behavior is different and would be very glad to get some help from you. What should my regex look like, to linkify without the break in list elements?

If I understood correctly your issue I think that this regex should be ok to detect the links in both the scenarios:
\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&##\/%?=~_|!:,.;]*)
Essentially with the first part we are segmenting in this way:
Then we go and grab the different parts of interest: the first part is a non-capturing group as in your original expression to strip the protocol later, if really unneeded. The last part takes the remaining part of the URL
For the way we created the regex we can now decide if taking the entire URL or just the second part. This is evident looking to the bottom-right of this screenshot:
Now in order to log the two parts we can take this nice snippet:
const str = '<ul><li>Link: https://www.link.com</li></ul>';
var myRegexp = /\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&##\/%?=~_|!:,.;]*)/gim;
var match = myRegexp.exec(str);
console.log(match[0]);
console.log(match[1]);
Possible variations:
in a situation like the one presented above you can simplify further your regex to:
(?:https?|ftp):\/\/([a-z0-9-+&##\/%?=~_|!:,.;]*)
getting the same output
if the full URL is enough you can remove the round parentheses of the second group
(?:https?|ftp):\/\/[a-z0-9-+&##\/%?=~_|!:,.;]*
PS - I'm assuming that your examples were meant to be:
<ul><li>Link: https://www.link.com</li></ul>
<ul><li>Link: https://www.link.com <br/></li></ul>
i.e. with https, http or ftp which makes the second case work with your original regex

Regexp javascript - url match with localhost

I'm trying to find a simple regexp for url validation, but not very good in regexing..
Currently I have such regexp: (/^https?:\/\/\w/).test(url)
So it's allowing to validate urls as http://localhost:8080 etc.
What I want to do is NOT to validate urls if they have some long special characters at the end like: http://dodo....... or http://dododo&&&&&
Could you help me?

How about this?
/^http:\/\/\w+(\.\w+)*(:[0-9]+)?\/?(\/[.\w]*)*$/
Will match: http://domain.com:port/path or just http://domain or http://domain:port
/^http:\/\/\w+(\.\w+)*(:[0-9]+)?\/?$/
match URLs without path
Some explanations of regex blocks:
Domain: \w+(\.\w+)* to match text with dots: localhost or www.yahoo.com (could be as long as Path or Port section begins)
Port: (:[0-9]+)? to match or to not match a number starting with semicolon: :8000 (and it could be only one)
Path: \/?(\/[.\w]*)* to match any alphanums with slashes and dots: /user/images/0001.jpg (until the end of the line)
(path is very interesting part, now I did it to allow lone or adjacent dots, i.e. such expressions could be possible: /. or /./ or /.../ and etc. If you'd like to have dots in path like in domain section - without border or adjacent dots, then use \/?(\/\w+(.\w+)*)* regexp, similar to domain part.)
* UPDATED *
Also, if you would like to have (it is valid) - characters in your URL (or any other), you should simply expand character class for "URL text matching", i.e. \w+ should become [\-\w]+ and so on.

If you want to match ABCD then you may leave the start part..
For Example to match http://localhost:8080
' just write
/(localhost).
if you want to match specific thing then please focus the term that you want to search, not the starting and ending of sentence.
Regular expression is for searching the terms, until we have a rigid rule for the same. :)
i hope this will do..

It depends on how complex you need the Regex to be. A simple way would be to just accept words (and the port/domain):
^https?:\/\/\w+(:[0-9]*)?(\.\w+)?$
Remember you need to use the + character to match one or more characters.
Of course, there are far better & more complicated solutions out there.

^https?:\/\/localhost:[0-9]{1,5}\/([-a-zA-Z0-9()#:%_\+.~#?&\/=]*)
match:
https://localhost:65535/file-upload-svc/files/app?query=abc#next
not match:
https://localhost:775535/file-upload-svc/files/app?query=abc#next
explanation
it can only be used for localhost
it also check the value for port number since it should be less than 65535 but you probably need to add additional logic

You can use this. This will allow localhost and live domain as well.
^https?:\/\/\w+(\.\w+)*(:[0-9]+)?(\/.*)?$

I'm pretty late to the party but now you should consider validating your URL with the URL class. Avoid the headache of regex and rely on standard
let isValid;
try {
new URL(endpoint); // Will throw if URL is invalid
isValid = true;
} catch (err) {
isValid = false;
}

^https?:\/\/(localhost:([0-9]+\.)+[a-zA-Z0-9]{1,6})?$
Will match the following cases :
http://localhost:3100/api
http://localhost:3100/1
http://localhost:3100/AP
http://localhost:310
Will NOT match the following cases :
http://localhost:3100/
http://localhost:
http://localhost
http://localhost:31

How to get data from string using Javascript Regex

I can't post the exact data i'm trying to extract but here's a basic scenario with the same outcome. I'm grabbing the body of a page and trying to extract a bit.ly link from it. So let's say for example, this is the chunk of data where i'm trying to grab the link from.
String:
http://bit.ly/Pq8AkS</div><div class="shareUnit"><div class="-cx-PRIVATE-fbTimelineExternalShareUnit__wrapper"><div><div class="-cx-PRIVATE-fbTimelineExternalShareUnit__root -cx-PRIVATE-fbTimelineExternalShareUnit__hasImage"><a class="-cx-PRIVATE-fbTimelineExternalShareUnit__video -cx-PRIVATE-fbTimelineExternalShareUnit__image -cx-PRIVATE-fbTimelineExternalShareUnit__content" ajaxify="/ajax/flash/expand_inline.php?target_div=uikk85_59&share_id=271663136271285&max_width=403&max_height=403&context=timelineSingle" rel="async" href="#" onclick="CSS.addClass(this, "-cx-PRIVATE-fbTimelineExternalShareUnit__loading");CSS.removeClass(this, "-cx-PRIVATE-fbTimelineExternalShareUnit__video");"><i class="-cx-PRIVATE-fbTimelineExternalShareUnit__play"></i><img class="img" src="http://external.ak.fbcdn.net/safe_image.php?d=AQDoyY7_wjAyUtX2&w=155&h=114&url=http%3A%2F%2Fi1.ytimg.com%2Fvi%2FDre21lBu2zU%2Fmqdefault.jpg" alt="" /></a>
Now, I can get what i'm looking for with the following code but the link isn't always going to be exactly 6 characters long. So this causes an issue...
Body = document.getElementsByTagName("body")[0].innerHTML;
regex = /2Fbit.ly%2F(.{6})&h/g;
Matches = regex.exec(Body);
Here's what I was orginally trying but the problem I have is that it grabs too much data. It's going all the way to the last "&h" in the string above instead of stopping at the first one it hits.
Body = document.getElementsByTagName("body")[0].innerHTML;
regex = /2Fbit.ly%2F(.*)&h/g;
Matches = regex.exec(Body);
So basically the main part of the string i'm trying to focus on is "%2Fbit.ly%2FPq8AkS&h" so that I can get the "Pq8AkS" out of it. When I use the (.*) it's grabbing everything between "%2F" and the very last "&h" in the large string above.

You should not be using a regex on HTML. Use DOM functions to get the desired link object, then get the href attribute from that, then you can use a regex on just the href.
By default .* is greedy meaning that it matches the most it can match and still find a match. If you want it to be non-greedy (match the least possible), you can use this .*? instead like this:
regex = /2Fbit.ly%2F(.*?)&h/;
I also don't think you want the g flag on the regex as there should only be one match in the right URL.
If you show the rest of your HTML, we could offer advice on finding the right link object rather than trying to match the entire body HTML.
FYI, another trick for a non-greedy match is to do something like this:
regex = /2Fbit.ly%2F([^&]*)&h/;
Which matches a series of characters that are not & followed by &h which accomplishes the same goal as long as & can't be in the matched sequence.

By default + and * are greedy and match as much as possible. You need a non-greedy match for your (.+). A quick search gives the solution as
? directly following a quantifier makes the quantifier non-greedy (makes it match minimum instead of maximum of the interval defined).
So try changing your regex= line to
regex = /2Fbit.ly%2F(.*?)&h/g;
Edit: #jfriend00's answer below is more complete.

Extending an existing regex to drop punctuation after URL links

I have an existing replace that matches http within a text string and creates a working URL from the text.
Working Example:
var Text = "Visit Gmail at http://gmail.com"
var linkText = Text.replace(/http:\/\/\S+/gi, '$&');
document.write(linkText);
Output:
Visit Gmail at http://gmail.com
Problem:
The problem arises when the link appears at the end of a sentence and the punctuation incorrectly becomes appended to the end of the URL.
Can someone advise on a way of extending my regex (or maybe adding a second replacement after this has been transformed) to overcome this?
I think the right answer will include adding something along the lines of /\W$/g to my original regex, but I can't see how this can be applied to just one word within the whole string.
As always, very grateful for any help.
Thanks,
Pete
Examples of problem links
http://gmail.com/.
http://gmail.com,
http://gmail.com/?
http://gmail.com!
All of these should resolve the link to http://gmail.com
Note how some could end in a slash then punctuation and others with punctuation directly after the domain name.

Try
/http:\/\/(.(?![.?] |$))*/
My logic is, if the last char is a dot, or question mark followed by either a space or end of string, you don't need it.
var Text = "Visit Gmail at http://gmail.com"
var linkText = Text.replace(/http:\/\/(.(?![.?](?:\s|$)))*./gi, '$&');
document.write(linkText);
Gives
"Visit Gmail at http://gmail.com"
Edit:
This may be better (it doesn't match white space now)
http:\/\/(.(?!(?:[.?](?: |$))))*.

Why not just use a negative character class?
/http://\S+[^.,?!]/gi

You could account for trailing unwanted characters, whether stripping them or not.
The replacement for both is capture buffer 1: <a href="$1">$1<\/a>
This also asumes you can do lookbehind. though I'm not sure if client side JS can do lookbehind assertions.
Strip unwanted chars
/(http:\/\/\S+)(?<![\/.,?!])[\/.,?!]*/
Or, leave unwanted characters
/(http:\/\/\S+)(?<![\/.,?!])/
Alternate, using lookahead
Strip
/(http:\/\/\S+?(?=[\/.,?!]+(?:\s|$)|\s|$))[\/.,?!]*/
Leave
/(http:\/\/\S+?(?=[\/.,?!]+(?:\s|$)|\s|$))/

JavaScript negative lookbehind issue

I've got some JavaScript that looks for Amazon ASINs within an Amazon link, for example
http://www.amazon.com/dp/B00137QS28
For this I use the following regex: /([A-Z0-9]{10})
However, I don't want it to match artist links which look like:
http://www.amazon.com/Artist-Name/e/B000AQ1JZO
So I need to exclude any links where there's a '/e' before the slash and the 10-character alphanumeric code. I thought the following would do that: (?<!/e)([A-Z0-9]{10}), but it turns out negative lookbehinds don't work in JavaScript. Is that right? Is there another way to do this instead?
Any help would be much appreciated!
As a side note, be aware there are plenty of Amazon link formats, which is why I want to blacklist rather than whitelist, eg, these are all the same page:
http://www.amazon.com/gp/product/B00137QS28/
http://www.amazon.com/dp/B00137QS28
http://www.amazon.com/exec/obidos/ASIN/B00137QS28/
http://www.amazon.com/Product-Title-Goes-Here/dp/B00137QS28/

In your case an expression like this would work:
/(?!\/e)..\/([A-Z0-9]{10})/

([A-Z0-9]{10}) will work equally well on the reverse of its input, so you can
reverse the string,
use positive lookahead,
reverse it back.

You need to use a lookahead to filter the /e/* ones out. Then trim the leading /e/ from each of the matches.
var source; // the source you're matching against the RegExp
var matches = source.match(/(?!\/e)..\/[A-Z0-9]{10}/g) || [];
var ids = matches.map(function (match) {
return match.substr(3);
});

We Keep Coding

JavaScript is the programming language of the Web.

How to turn urls padded by space into links - javascript

Don't capture the final \s. This way, the second url will match the preceding \s, as required: function linkify (str) { var regex = /(^|\s)(https?:\/\/\S+)/ig; return str.replace(regex,'$1$2') } http://jsfiddle.net/NgMw8/3/

Just use a positive lookahead when matching your final $|\s, like so: var regex = /(^|\s)(https?:\/\/\S+)(?=($|\s))/ig;

None will work if there are any html element stuck to the url ... Similar question and it's answers HERE Some solutions can handle url like "test.com/anothertest?get=letsgo" and append http:// Workaround may be done to handle https and miscellaneous tld ...

Related

Match link patterns in HTML code with a RegEx

Regexp javascript - url match with localhost

How to get data from string using Javascript Regex

Extending an existing regex to drop punctuation after URL links

JavaScript negative lookbehind issue

Categories

Resources