Getting http URL using regex when there are multiple http URLs - javascript

I want to extract different .swf files from different sites for a project. Different sites use different source methods so I can't use src= or data= in my regex.
I'm able to match the file name with /[\w-]+.swf/g , but when I try to match the full path( http(.*?).swf ) starting with http it matches another http before the path (the first one in the code). Also I can't use src= or data= etc, it must be only the link.
Basically, is there a way to limit the match to the first http found when searching backwards?
If anyone cares to take a look then here's the code: http://pastebin.com/kT20UqqJ .
And here's a good place to test regex: http://regex.larsolavtorvik.com/

Try the following one:
var regex = /http:[\.\/\w-%]+\.swf/g
You need to escape the . (otherwise it matches an arbitrary character) and the / (since it is the expression delimiter).
You can see the working Example here.

If you have URL-encoded characters (like white space), you will also have a % in your URL.
Here is an example which will work in this case: /http:[\./\w%-]+\.swf/g
Here is a tool where you can test the regex: http://regexpal.com/
And one where you can check its performance: http://regexter.com/
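To see why the character class keeps the match from starting at an earlier http, here is a minimal sketch. The html string and its URLs are made up for illustration; only the pattern comes from the answer above.

```javascript
// Hypothetical page fragment: an unrelated http link precedes the .swf URLs.
const html = '<a href="http://example.com/page.html">x</a>' +
  '<embed src="http://example.com/media/my-game.swf">' +
  '<param value="http://cdn.example.com/a%20b/intro.swf">';

// The class allows only dots, slashes, word characters, % and -.
// It cannot cross a quote, a space, or the colon of a second "http:",
// so a match can only start at the http that belongs to the .swf path.
const swfPattern = /http:[.\/\w%-]+\.swf/g;

// match() returns null when nothing is found, hence the fallback.
const swfUrls = html.match(swfPattern) || [];
console.log(swfUrls);
```

The first link is skipped because no .swf can be reached from it without crossing a character outside the class.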

Related

Regex to filter out URL protocol but not domain

I have some weird cases where I'm getting url's in the below format:
http://http://www.somedomain.com/testing/ased=http://something
I need them to come out like this: www.somedomain.com/testing/ased=http://something
They may or may not have www and they could be other things besides .com. They could also be https but the big issue is I need to filter out the protocol before the domain, but if the domain has a protocol in it, keep that.
I was using a replace with /.*?:\/\//g and replacing it with an empty string, but that doesn't work on the weird ones where there's a protocol in the domain.
Thanks for any assistance!
You can use some RegEx and non-capturing groups to do this. I would caution that you might want to look into where the data is coming from and try to clean it before it gets to you but I also understand that sometimes that isn't an option.
Here is some RegEx that will do what you need
/(?:http(s)?:\/\/)+(?<resource>.*)/gm
This breaks down as follows:
(?:http(s)?:\/\/)+
This matches http:// and https:// one or more times and puts it into a non-capturing group.
(?<resource>.*)
This grabs everything after the first non-capturing group and puts it into a named-group "resource"
Here is a JavaScript code snippet that puts that into action:
// get your url
let string = "http://https://www.somedomain.com/testing/ased=http://something"
// build your regex
let regex = /(?:http(s)?:\/\/)+(?<resource>.*)/gm;
// get the matched groups
const { groups: { resource } } = regex.exec(string)
// log the resource group
console.log(resource)
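If only the cleaned string is needed, the same idea can also be sketched with replace() and a start anchor. This is a variation on the answer above, not its exact method, and stripLeadingProtocols is a hypothetical helper name.

```javascript
// Hypothetical helper: removes every protocol prefix stacked at the start
// of the string and leaves any later "http://" untouched.
function stripLeadingProtocols(url) {
  // ^ anchors at the start; the group repeats for http:// or https://.
  return url.replace(/^(?:https?:\/\/)+/, '');
}

const cleaned = stripLeadingProtocols(
  'http://http://www.somedomain.com/testing/ased=http://something'
);
console.log(cleaned); // www.somedomain.com/testing/ased=http://something
```

Because of the ^ anchor, a protocol that appears later in the string (as in ased=http://something) is never touched.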

Regular Expression - Extract subdomain & domain

I'm trying to form a regular expression (javascript/node.js) which will extract the sub-domain & domain part from any given URL. This is what I ended up with:
[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)
Right now, I'm just considering http, https for protocol & exclude "www." portion from the subdomain+domain portion of an URL. I checked the expression & it almost works. But, here is the issue:
Success
'http://mplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)
'http://lplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)
Failure
'http://play.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)
'http://tplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)
I just use the first element from the result array. I'm not able to understand why "play." & "tplay." doesn't work. Could anyone please help me in this regard?
Does "/p" and "/t" have any meaning for the regular expression evaluator?
Is there any other way of extracting sub-domain & domain from any given URL using a regular expression?
Edit -
Example:
https://play.google.com/store/apps/details?id=com.skgames.trafficracer => play.google.com
https://mail.google.com/mail/u/0/#inbox => mail.google.com
Your regex doesn't seem correct. Try this regex:
/^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)/img
RegEx Demo
You are about the one millionth person to try to parse URLs in JavaScript. I'm a little bit surprised you didn't see any of the existing questions on SO dating back years. The last thing you want to do is write yet another broken regexp, with all due respect to those that provided answers to your question.
There are many well documented libraries and approaches to handling this. Google it. The simplest way is to create an a element in memory, assign it an href, and then access its hostname and other properties. See http://tutorialzine.com/2013/07/quick-tip-parse-urls/. If that does not float your boat, then use a library like uri.js.
If you really don't want to use a library, and insist on reinventing the wheel, then at least do something like the following:
function get_domain_from_url(url) {
var a = document.createElement('a');
a.setAttribute('href', url);
return a.hostname;
}
Essentially, you are delegating the extraction of the subdomain/domain part of the URL to the browser's URL parsing logic, which is MUCH better than anything you will ever write.
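Outside a browser there is no document object, so the anchor trick is not available. A comparable sketch, assuming an environment that provides the standard URL constructor (modern browsers and Node.js), could look like this. getHostname is a hypothetical name, and unlike some of the regexes in this thread it does not strip a leading www.

```javascript
// Hypothetical helper: delegates parsing to the built-in URL class.
// Returns null when the input cannot be parsed as an absolute URL.
function getHostname(url) {
  try {
    return new URL(url).hostname;
  } catch (err) {
    return null;
  }
}

console.log(getHostname('https://play.google.com/store/apps/details?id=com.skgames.trafficracer'));
console.log(getHostname('https://mail.google.com/mail/u/0/#inbox'));
```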
Also see Parse URL with jquery/ javascript?, Parse URL with Javascript, How do I parse a URL into hostname and path in javascript?, or parse URL with JavaScript or jQuery. How did you miss those? Sorry, I have to vote to close this as a duplicate.
The same RegExp as in anubhava's answer, only added support for protocol-relative URLs like //google.com:
/^(?:https?:)?(?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)/im
RegEx Demo
Here's a solution ignoring everything before ://
.*\://?([^\/]+)
In case you want to ignore www.
.*\://(?:www\.)?([^\/]+)
Your regex expression works pretty well. You only need to remove the brackets. The final expression is:
^(?:http:\/\/|www\.|https:\/\/)([^\/]+)
Hope it's useful!
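A short sketch of why removing the brackets matters: inside [^...] the text is a set of single characters, not a group of alternatives, so the original pattern skips every leading character that happens to be in that set (h, t, p, s, w, :, /, . and so on). The variable names below are only for illustration.

```javascript
// Original pattern: [^...] is a negated character class.
const bracketed = /[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i;
// Pattern from the answer above: an anchored non-capturing group.
const grouped = /^(?:http:\/\/|www\.|https:\/\/)([^\/]+)/i;

const url = 'http://play.google.co.in/sadfask/asdkfals?dk=10';

// "p" is in the excluded set, so the class first matches at "l"
// and the capture starts one character later.
const bracketedHost = url.match(bracketed)[1];
const groupedHost = url.match(grouped)[1];

console.log(bracketedHost); // ay.google.co.in
console.log(groupedHost);   // play.google.co.in
```

The "mplay" and "lplay" URLs only appeared to work because "m" and "l" are not in the excluded set.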
I know I am late to the party but I want to answer the question with some extra useful info.
Get the domain name from a link using regex.
^(https?:\/\/)?(www\.)?([^\/]+)
Here is the link to above regex.
If you want to get the subdomain, split the result from one of the matches of above regex with the first occurrence of .
Note: in the single-run timings below, the regex came out about 15x faster than the built-in module. Keep in mind that the second timing also includes the require('url') call, so treat the numbers as a rough indication only.
javascript Example with Regex:
console.time('time2');
const pttrn = /^(https?:\/\/)?(www\.)?([^\/]+)/gm
const urlInfo = pttrn.exec("https://www.google.co.in/imghp");
console.timeEnd('time2');
//time2: 0.055ms
console.log(urlInfo[0]) // https://www.google.co.in
console.log(urlInfo[1]) // https://
console.log(urlInfo[2]) // www.
console.log(urlInfo[3]) // google.co.in
Nodejs with built-in url module
console.time('time');
const url = require('url');
const urlInfo = url.parse("https://www.google.co.in/imghp");
console.timeEnd('time');
//time: 0.840ms;
console.log(urlInfo.hostname) //www.google.co.in

Regexp javascript - url match with localhost

I'm trying to find a simple regexp for url validation, but not very good in regexing..
Currently I have such regexp: (/^https?:\/\/\w/).test(url)
So it's allowing to validate urls as http://localhost:8080 etc.
What I want to do is NOT to validate urls if they have some long special characters at the end like: http://dodo....... or http://dododo&&&&&
Could you help me?
How about this?
/^http:\/\/\w+(\.\w+)*(:[0-9]+)?\/?(\/[.\w]*)*$/
Will match: http://domain.com:port/path or just http://domain or http://domain:port
/^http:\/\/\w+(\.\w+)*(:[0-9]+)?\/?$/
match URLs without path
Some explanations of regex blocks:
Domain: \w+(\.\w+)* to match text with dots: localhost or www.yahoo.com (could be as long as Path or Port section begins)
Port: (:[0-9]+)? to match or to not match a number starting with a colon: :8000 (and it could be only one)
Path: \/?(\/[.\w]*)* to match any alphanums with slashes and dots: /user/images/0001.jpg (until the end of the line)
(The path is a very interesting part. I wrote it to allow lone or adjacent dots, i.e. expressions such as /. or /./ or /.../ are possible. If you'd like dots in the path to behave as in the domain section, without leading, trailing or adjacent dots, then use the \/?(\/\w+(\.\w+)*)* regexp, similar to the domain part.)
* UPDATED *
Also, if you would like to have (it is valid) - characters in your URL (or any other), you should simply expand character class for "URL text matching", i.e. \w+ should become [\-\w]+ and so on.
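A quick way to check the first pattern against the inputs from the question is sketched below. The sample list is an assumption put together from URLs mentioned in this thread.

```javascript
// Pattern from the answer above: domain, optional port, optional path.
const urlPattern = /^http:\/\/\w+(\.\w+)*(:[0-9]+)?\/?(\/[.\w]*)*$/;

const samples = [
  'http://localhost:8080',
  'http://domain.com:8000/path',
  'http://www.yahoo.com/user/images/0001.jpg',
  'http://dodo.......',
  'http://dododo&&&&&'
];

// Map each sample to true/false so the accepted and rejected inputs are visible.
const results = {};
for (const sample of samples) {
  results[sample] = urlPattern.test(sample);
}
console.log(results);
```

The trailing dots and ampersands are rejected because nothing in the pattern can consume them before the $ anchor.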
If you want to match ABCD, then you may leave out the start part.
For example, to match http://localhost:8080, just write
/(localhost)/
If you want to match a specific thing, focus on the term that you want to search for, not on the start and end of the sentence.
A regular expression is for searching for terms, until we have a rigid rule for the same. :)
I hope this will do.
It depends on how complex you need the Regex to be. A simple way would be to just accept words (and the port/domain):
^https?:\/\/\w+(:[0-9]*)?(\.\w+)?$
Remember you need to use the + character to match one or more characters.
Of course, there are far better & more complicated solutions out there.
^https?:\/\/localhost:[0-9]{1,5}\/([-a-zA-Z0-9()@:%_\+.~#?&\/=]*)
match:
https://localhost:65535/file-upload-svc/files/app?query=abc#next
not match:
https://localhost:775535/file-upload-svc/files/app?query=abc#next
explanation
it can only be used for localhost
it also limits the port number to at most five digits. A valid port must not exceed 65535, so you probably need to add additional logic to check the value (99999, for example, still passes the pattern)
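One possible shape for that additional logic is sketched here. It reuses the pattern above with two assumed changes: a capturing group around the port digits and a $ anchor at the end. isLocalhostUrl is a hypothetical helper name.

```javascript
// Assumed variant of the pattern above: the port is captured as group 1
// and the whole string must be consumed ($).
const localhostPattern =
  /^https?:\/\/localhost:([0-9]{1,5})\/([-a-zA-Z0-9()@:%_\+.~#?&\/=]*)$/;

// Hypothetical helper: the regex checks the shape, the numeric
// comparison checks that the port fits into the valid range.
function isLocalhostUrl(url) {
  const match = localhostPattern.exec(url);
  return match !== null && Number(match[1]) <= 65535;
}

console.log(isLocalhostUrl('https://localhost:65535/file-upload-svc/files/app?query=abc#next'));
console.log(isLocalhostUrl('https://localhost:775535/file-upload-svc/files/app?query=abc#next'));
console.log(isLocalhostUrl('https://localhost:99999/'));
```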
You can use this. This will allow localhost and live domain as well.
^https?:\/\/\w+(\.\w+)*(:[0-9]+)?(\/.*)?$
I'm pretty late to the party, but now you should consider validating your URL with the URL class. Avoid the headache of regex and rely on the standard.
let isValid;
try {
new URL(endpoint); // Will throw if URL is invalid
isValid = true;
} catch (err) {
isValid = false;
}
^https?:\/\/(localhost:([0-9]+\.)+[a-zA-Z0-9]{1,6})?$
Will match the following cases :
http://localhost:3100/api
http://localhost:3100/1
http://localhost:3100/AP
http://localhost:310
Will NOT match the following cases :
http://localhost:3100/
http://localhost:
http://localhost
http://localhost:31

How to identify all URLs that contain a (domain) substring?

If I am correct, the following code will only match a URL that is exactly as presented.
However, what would it look like if you wanted to identify subdomains as well as urls that contain various different query strings - in other words, any address that contains this domain:
var url = /test.com/
if (window.location.href.match(url)){
alert("match!");
}
If you want this regex to match "test.com" literally, you need to escape the ".", which means "any character" in regex syntax, and any "/" that is part of the text to match, since "/" is the delimiter of a regex literal.
Escaped: \/test\.com\/
Take a look here for more info.
No, your pattern will actually match on all strings containing test.com.
The regular expression /test.com/ says to match test[ANY CHARACTER]com anywhere in the string.
It is better to use example.com for example links, so I replaced test with example.
Some example matches could be
http://example.com
http://examplexcom.xyz
http://example!com.xyz
http://example.com?q=123
http://sub.example.com
http://fooexample.com
http://example.com/asdf/123
http://stackoverflow.com/?site=example.com
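The list above can be reproduced with a small sketch. It also shows one way to restrict the check to the domain and its subdomains by comparing the parsed hostname instead of searching the whole address. isOnDomain is a hypothetical helper, and the standard URL constructor is assumed to be available.

```javascript
// Unescaped dot: matches "example", any one character, then "com", anywhere.
const loose = /example.com/;

// Hypothetical helper: parses the address and compares only the hostname,
// accepting the domain itself and any of its subdomains.
function isOnDomain(href, domain) {
  const host = new URL(href).hostname;
  return host === domain || host.endsWith('.' + domain);
}

console.log(loose.test('http://examplexcom.xyz'));                              // true
console.log(isOnDomain('http://sub.example.com', 'example.com'));               // true
console.log(isOnDomain('http://fooexample.com', 'example.com'));                // false
console.log(isOnDomain('http://stackoverflow.com/?site=example.com', 'example.com')); // false
```

In the browser the same comparison can be made on window.location.hostname directly, without parsing window.location.href.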
I think you need to use the g flag. It enables "global" matching. When using the replace() method, specify this modifier to replace all matches, rather than only the first one:
var url = /test.com/g;
If you want to test if an URL is valid this is the one I use. Fairly complex, because it takes care also of numeric domain & a few other peculiarities :
var urlMatcher = /(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-F\d]{2,2})+(:([\d\w]|%[a-fA-F\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-F\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)?/;
Takes care of parameters and anchors etc... dont ask me to explain the details pls.

regex matching image url with spaces

I need to match a image url like this:
http://site.com/site.com/files/images/img (5).jpg
Something like this works fine:
.replace(/(http:\/\/([ \S]+\.(jpg|png|gif)))/ig, "<div style=\"background: url($1)\"></div>")
Except if I have something like this:
http://site.com/site.com/files/audio/audiofile.mp3 http://site.com/site.com/files/images/img (5).jpg
How do I match only the image?
Thanks!
Edit: And I'm using javascript.
Assuming images will always be in the 'images' directory, try:
http://.*/images/(.*?)\.(jpe?g|gif|png)
If you can't assume an images directory:
http://.*/(.*?)\.(jpe?g|gif|png)
Group 1 and 2 should have what you want (file name and extension).
I tested the regular expression here and here and it appears to do what you want.
Proper URLs should not have spaces in them, they should have %20 or a plus '+' instead. If you had them written with those alternatives then your matching would be much easier.
Why not:
/([^/]+\.(jpg|png|gif))$
Using
http:\/\/.*\/(.*)\.(jpg|png|gif)
should do the trick if all you want is the name of the image. The first group is the file name and the second group is the file extension.
Can you assume that the urls will be space delimited, or return delimited?
As in, can you assume this input?
site.com/images/images/lol (5).jpg
site.com/images/other/radio.mp3
site.com/images/images/copter (3).jpg
If you are going to have your delimiter as part of your string to return, things get tricky. What kind of volume are you talking about here? Could you do it semi-manually at all, or does the process have to be automated?
This would be an approach:
^((\w+):)?\/\/((\w|\.)+(:\d+)?)[^:]+\.(jpe?g|gif|png)$
Matching on the colon (:).
In this case it's only accepted for the protocol and port (optional).
This will not match:
http://site.com/site.com/files/audio/audiofile.mp3 http://site.com/site.com/files/images/img (5).jpg
This will match (colon in second http:// removed)
"/audiofile.mp3 http/" will count as a folder in "/audio/"
http://site.com/site.com/files/audio/audiofile.mp3 http//site.com/site.com/files/images/img (5).jpg
It's not foolproof. There are other characters that are not allowed in filenames ( * | " < > )
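Another way to approach the original question, sketched here as an alternative to the answers above rather than a restatement of them, is to forbid the match from running across the start of a second URL. The negative lookahead below is that swapped-in technique. It assumes that every URL starts with http:// and that file names never contain that sequence.

```javascript
const input = 'http://site.com/site.com/files/audio/audiofile.mp3 ' +
  'http://site.com/site.com/files/images/img (5).jpg';

// Each consumed character (space or non-space) must not be the start of
// another "http://", so a match can never begin at the mp3 URL and end
// at the jpg extension.
const imagePattern = /http:\/\/(?:(?!http:\/\/)[ \S])+?\.(?:jpe?g|png|gif)/gi;

const images = input.match(imagePattern) || [];
console.log(images);
```

Spaces inside the file name are still allowed, which is what the question needs for "img (5).jpg".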
