regex to find url in a text

regex to find url in a text - javascript

I have to find the first url in the text with a regular expression:
for example:
I love this website:http://www.youtube.com/music it's fantastic
or
[ es. http://www.youtube.com/music] text

I looked into this issue last year and developed a solution that you may want to look at - See: URL Linkification (HTTP/FTP) This link is a test page for the Javascript solution with many examples of difficult-to-linkify URLs.
My regex solution, written for both PHP and Javascript - is not simple (but neither is the problem as it turns out.) For more information I would recommend also reading:
The Problem With URLs by Jeff Atwood, and
An Improved Liberal, Accurate Regex Pattern for Matching URLs by John Gruber
The comments following Jeff's blog post are a must read if you want to do this right...
Note that this question gets asked a lot. Maybe do a search next time :)

You can't do this perfectly with a regular expression. You may be interested in this blog post. There is a bit more information on Regex Guru, but even those look very fragile. You will need to have additional checks outside of your regular expression to catch the edge cases.

Identifying URLs is tricky because they are often surrounded by punctuation marks and because users frequently do not use the full form of the URL. Many JavaScript functions exist for replacing URLs with hyperlinks, but I was unable to find one that works as well as the urlize filter in the Python-based web framework Django. I therefore ported Django's urlize function to JavaScript: https://github.com/ljosa/urlize.js
It actually would not pick up the URL in your example because there is a colon right before the URL. But if we modify the example a little:
urlize("I love this website: http://www.youtube.com/music it's fantastic", true, true)
=> 'I love this website: http://www.youtube.com/music it's fantastic"'
Note the second argument which, if true, inserts rel="nofollow" and the third argument which, if true, quotes characters that have special meaning in HTML.

This might work->
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
Found it somewhere
Will find links ->
http://foo.com/blah_blah/
(Something like http://foo.com/blah_blah)
http://foo.com/blah_blah_(wikipedia)
Hope this works....

i am using this regex : :) ( its translated ABNF )
[a-zA-Z]([a-zA-Z]|[0-9]|\+|\-|\.)*:\/\/((([a-zA-Z]|[0-9]|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:)*#)?(\[((([0-9A-Fa-f]{1,4}:){6}([0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9]))|::([0-9A-Fa-f]{1,4}:){5}([0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9]))|([0-9A-Fa-f]{1,4})?::([0-9A-Fa-f]{1,4}:){4}([0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9]))|(([0-9A-Fa-f]{1,4}:){0,1}[0-9A-Fa-f]{1,4})?::([0-9A-Fa-f]{1,4}:){3}([0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9]))|(([0-9A-Fa-f]{1,4}:){0,2}[0-9A-Fa-f]{1,4})?::([0-9A-Fa-f]{1,4}:){2}([0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9]))|(([0-9A-Fa-f]{1,4}:){0,3}[0-9A-Fa-f]{1,4})?::[0-9A-Fa-f]{1,4}:([0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9]))|(([0-9A-Fa-f]{1,4}:){0,4}[0-9A-Fa-f]{1,4})?::([0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9]))|(([0-9A-Fa-f]{1,4}:){0,5}[0-9A-Fa-f]{1,4})?::[0-9A-Fa-f]{1,4}|(([0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4})?::)|v[0-9A-Fa-f]\.(([a-zA-Z]|[0-9]|-|\.|_|~)|[!$&'\(\)\*\+,;=]|:))\]|(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])|(([a-zA-Z]|[0-9]|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=])*)(:[0-9]*)?(((\/(([a-zA-Z]|[0-9]|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:|#)*)*|\/((([a-zA-Z]|[0-9]|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:|#){1}(\/(([a-zA-Z]|[0-9]|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:|#)*)*)?|(([a-zA-Z]|[0-9]|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:|#){1}(\/(([a-zA-Z]|[0-9]|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:|#)*)*|(([a-zA-Z]|[0-9]|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|#){1}(\/(([a-zA-Z]|[0-9]|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:|#)*)*))?\/?(\?((([a-zA-Z]|[0-9]|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?(\#((([a-zA-Z]|[0-9]|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?

You can use the following regex expression for extracting any type of url coming in message.
String regex = "(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9#:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9#:%_\+.~#?&/=]*)";

Typescript/Angular
This works for me:
const regExpressionUrl = new RegExp(/(https?:\/\/[^\s]+)/g); //detect URL
Ref: https://www.regextester.com/96249%7CRegular

Related

Unicode characters cannot be decoded

I use browserless.js (headless Chrome) to fetch the html code of a website, and then use a regular expression to find certain image URLs.
One example is the following:
https://vignette.wikia.nocookie.net/moviepedia/images/8/88/Adrien_Brody.jpg/revision/latest/top-crop/width/360/height/450?cb\u003d20141113231800\u0026path-prefix\u003dde
There are unicode characters such as \u003d, which should be decoded (in this case to =). The reason is that I want to include these images in a site, and without decoding some of them cannot be displayed (like that one above, just paste the URL; it gives broken-image.webp).
I have tried lots of things, but nothing works.
JSON.parse(JSON.stringify(...))
String.prototype.normalize()
decodeURIComponent
Curiously, the regular expression for "\u003d" (i.e. "\\u003d" in js) does not match that string above, but "u003d" does.
This is all very weird, and my current guess is that browserless is responsible for some weird formatting behind the scenes. Namely, when I console log the URL and copy paste it somewhere else, every method mentioned above works for decoding.
I hope that someone can help me on this.

Just to mark this one as answered. Thomas replied:
JSON.parse(`"${url}"`)

Detect URL in text without hyperlink and replace normal URL with hyperlink?

I have spent considerable amount of time searching for the solution or trying one, But I did not found one. So my usecase is:
I have a text which can have simple url(with or without http/s) or it can also have hyperlinked url.
What regex should do
It should leave hyperlink url as it is and convert the non hyperlinked url to a hyperlinked URL.
Example Text
I am learning regex from www.codeburst.com and trying regex at Regexr.
Expected Solution
I am learning regex from www.codeburst.com and trying regex at Regexr.
I have tried
this regex, but it it not working as expected.
/((?!href).((https?:\/\/)||(www\.)|(mailto:)).+)/gi

You probably need a negative lookbehind (?<!href=") which was added to ECMAScript recently, see this answer
be careful with double || which renders tokend behind this useless (hungry match)
also be careful with .+ which matches everything after (including newline with /s regex option)
I would start with
(?<!href=")(((https?:\/\/)|(www\.)|(mailto:))\S+)

(https?:\/\/)?(www\.)?[-a-zA-Z0-9#:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9#:%_\+.~#?&//=]*)
The breakout of regex is as follows,
(https?:\/\/)? checks for http:\\ or https:\\ or no http
(www\.)?checks for www. or no www.
I have checked above regex with following test cases:
href="https://www.regexr.com"
href="http://www.regexr.com"
href="mailto:demo#demo.com"

Facebook-Like URLS

I have been searching for a method of encoding url just like facebook's. All I have been able to find is these methods:
escape
encodeURI
encodeURIComponent
The goal is to encode a string in latin characters, for example:
¿Cómo estás?
Facebook results in the next url
When I use the 3 functions I talked abour earlier I get nothing similar
escape("¿Cómo estás?"); //"%BFC%F3mo%20est%E1s%3F"
encodeURI("¿Cómo estás?");//"%C2%BFC%C3%B3mo%20est%C3%A1s?"
encodeURIComponent("¿Cómo estás?"); //"%C2%BFC%C3%B3mo%20est%C3%A1s%3F"
I need you to guide me to the solution, this is something im doing more than anything for SEO purposes. Do I have to code a function myself?
Thanks for your time.

So my first guess is you are encoding in UTF-8 where Facebook may be encoding in ISO.
https://en.wikipedia.org/wiki/ISO/IEC_8859-1 vs https://en.wikipedia.org/wiki/UTF-8
As far as SEO goes - I do not see the relation, the example is just a query string - so wouldn't be crawled by a search engine. I may be misunderstanding - hopefully that points you in the right direction though.

After I decided to make my search just encoding my string using encodeURIComponent I discovered that when you window.location.href your encoded string your URL actually looks like facebooks. What a surprise, It looks very different in the console.

Javascript mailto string loses 'encodeURI' encoding

Trying to create a simple 'mailto' function using javascript. I just need to be able to send some links (like: See this article bla bla).
Some of the links I need to send include spaces, danish chars. So I've been using the
encodeURI() function.
The problem arises when I try to mail the link (sample code below)
var _encodedPath = encodeURI(path);
var _tempString = "mailto:someemail#somewhere.dk?subject=Shared%20from%20some%20page&body=" + _encodedPath;
If I output the _tempString to the console I get the correct encoded string. However when using the same string in 'mailto' the string loses it's encoding and returns to the way it was before.
Any clue as to why this is?
Thanks in advance :)

The link is decoded when you click it - that's normal. Since you have an http link within a mailto link, it should be encoded twice.
Email clients do their best to make things that look like links clickable. They typically decide where the link ends in a somewhat arbitrary and unpredictable manner.
In email, the best way to keep a link contiguous is to enclose it in angle-brackets like this:
<http://www.example.com/url with spaces>
But this isn't foolproof. Email is fragile and you can't control the content well enough with a mailto link. It might be better to try to reduce the complexity of the url - perhaps by providing or utilizing a url-shortener service. Any url longer than 74 or so characters is likely to be mangled by some email clients.

You should use encodeURIComponent instead of encodeURI.
More information here.

this site helped me solving any troubles with mailto links:
http://www.1ngo.de/web/formular.html
may be it's not the nicest way, but it always works with every browser i know. And it also has very cool algorithm implemented to format the content so that everything should be alright. Just try it and play around a little with code by quoting out parts of the code and you will understand very fast what exactly happens there and how to modify it for your wishes. Althoug it's a little late I hope this one helps anybody checking this question.
althoug it's in german, you just need to copy the code shown there and run it and experiment with it.

Advanced Javascript Obfuscation

I've been training heavily in JS obfuscation, starting to know my way around all advanced concepts, but I recently found an obfuscated code, I believe it is some form of "Native Javascript Code", I just can't find ANY documentation on this type of obfuscation :
Here is a small extract :
'\141\75\160\162\157\155\160\164\50\47\105\156\164\162\145\172\40'
It is called this way :
eval(eval('\141\75\160\162\157\155\160\164\50\47\105\156\164\162\145\172\40'))
Since the code is the work of another and I encoutered it in a JS challenge I'm not posting the full code, so the example I gave won't work, but the full code does work.
So here is my question:
What type of code is this? And where can I learn more about it?
Any suggestions appreciated :)

It's just a string with the characters escaped. You can read it in the JavaScript console in any browser:
console.log('\141\75\160\162\157\155\160\164\50\47\105\156\164\162\145\172\40')
will print:
"a=prompt('Entrez "

It's just escaped characters, one part outputting the string of a query and another actually running the returned string - try calling it in a console.
eval('\160\162\157\155\160\164\50\47\105\156\164\162\145\172\47\51')
Might help?

These numbers is the ascii codes (http://www.asciitable.com/index/asciifull.gif) of characters (in Octal representation).
You can convert it to characters. This is used when somebody wants to make an XSS attack, or wants to hide the js code.
So the string what you written represents:
a=prompt('Entrez
The js engines, browsers can translate the octal format to the 'real' string. With eval function it could run. (in case the 'translated' code has no syntax errors)

We Keep Coding

JavaScript is the programming language of the Web.

regex to find url in a text - javascript

I have to find the first url in the text with a regular expression: for example: I love this website:http://www.youtube.com/music it's fantastic or [ es. http://www.youtube.com/music] text

You can't do this perfectly with a regular expression. You may be interested in this blog post. There is a bit more information on Regex Guru, but even those look very fragile. You will need to have additional checks outside of your regular expression to catch the edge cases.

This might work-> \b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))) Found it somewhere Will find links -> http://foo.com/blah_blah/ (Something like http://foo.com/blah_blah) http://foo.com/blah_blah_(wikipedia) Hope this works....

You can use the following regex expression for extracting any type of url coming in message. String regex = "(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9#:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9#:%_\+.~#?&/=]*)";

Typescript/Angular This works for me: const regExpressionUrl = new RegExp(/(https?:\/\/[^\s]+)/g); //detect URL Ref: https://www.regextester.com/96249%7CRegular

Related

Unicode characters cannot be decoded

Detect URL in text without hyperlink and replace normal URL with hyperlink?

Facebook-Like URLS

Javascript mailto string loses 'encodeURI' encoding

Advanced Javascript Obfuscation

Categories

Resources