JavaScript regular expression for matching URL path components


What JavaScript regular expression should I use to match individual components of a URL path? By path, I mean the path of the resource on the server, e.g. if the URL is 'http://example.com/directory/resource?start=0', the path is '/directory/resource'. By path components, I mean the /-separated parts of the path.
Let's say we have the URL 'http://example.com/component1/component2'. What I would like is to be able to match 'component1' or 'component2' with a grouped regular expression for each, so each component can be extracted, i.e. something like this: 'http://example.com/($component-regex)/($component-regex)' ($component-regex being the regular expression we need to devise). In this example, there would be two matched groups: 'component1' and 'component2'.
Please come up with a regex that's considered safe by JSLint :) For example, it considers [^/]+ insecure.

You don't need a regex for this:
var components = url.split(/[?#]/)[0].split("/").slice(3);
Okay, you do need a regex to split on one of two possible characters, but you could do it without any regex with this:
var components = url.split("#")[0].split("?")[0].split("/").slice(3);
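As a small sketch of why slice(3) works here, the one-liner can be wrapped in a function. The name pathComponents is made up for this illustration, and the sketch assumes an absolute URL with a scheme, such as http://host/...:

```javascript
// Hypothetical helper wrapping the one-liner above.
// Assumes an absolute URL of the form scheme://host/path.
function pathComponents(url) {
  // Drop the query string and the fragment first.
  var withoutQuery = url.split(/[?#]/)[0];
  // "http://example.com/a/b" splits into ["http:", "", "example.com", "a", "b"],
  // so the first three entries (scheme, empty string, host) are skipped.
  return withoutQuery.split("/").slice(3);
}

var parts = pathComponents("http://example.com/directory/resource?start=0");
// parts is ["directory", "resource"]
```

Note that a trailing slash in the path leaves an empty string as the last component, so you may want to filter those out.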

Related

Is there something like glob but for URLs, in JavaScript?

I need to match URLs against string patterns but I want to avoid RegExp to keep the patterns simple and readable.
I'd like to be able to have patterns like http://*.example.org/*, which should be the equivalent of /^http:\/\/.*\.example\.org\/.*$/ in RegExp. That RegExp should also illustrate why I want to keep it more readable.
Basically I'd like glob-like patterns that work for URLs. The problem is: normal glob implementations treat / as a delimiter. That means http://foo.example.org/bar/bla wouldn't match my simple pattern.
So, an implementation of glob that can ignore slashes would be great. Is there such a thing or something similar?

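You can start with a function like this for glob-like behavior:
function glob(pattern, input) {
    // Escape regex metacharacters, turn each * into .*, and anchor both ends
    // so the pattern has to match the whole input rather than a substring.
    var re = new RegExp('^' + pattern.replace(/([.?+^$[\]\\(){}|\/-])/g, "\\$1").replace(/\*/g, '.*') + '$');
    return re.test(input);
}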
Then call it as:
glob('http://*.example.org/*', 'http://foo.example.org/bar/bla');
true
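If the same pattern is tested against many URLs, it may be worth compiling it once. The following is only a sketch of that idea; globToRegExp is a name invented here, and it anchors the pattern at both ends so that it has to cover the whole input:

```javascript
// Hypothetical helper: compile a glob-like pattern into an anchored RegExp.
function globToRegExp(pattern) {
  // Escape regex metacharacters in the literal pieces between the wildcards.
  var pieces = pattern.split("*").map(function (piece) {
    return piece.replace(/[.?+^$[\]\\(){}|\/-]/g, "\\$&");
  });
  // Each * becomes .* which, unlike a file glob, also crosses slashes.
  return new RegExp("^" + pieces.join(".*") + "$");
}

var exampleOrg = globToRegExp("http://*.example.org/*");
var matchesDeepPath = exampleOrg.test("http://foo.example.org/bar/bla"); // true
var matchesOtherTld = exampleOrg.test("http://foo.example.com/bar");     // false
```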

Solved the problem by writing a lib for it:
https://github.com/lnwdr/calmcard
This matches arbitrary strings with simple wildcards.

express js: Conditional route parameters with RegEx

I need to match a route that has this form: /city-state-country
Where city can be in formats: san-francisco (multiword separated by '-') or newtown (single word).
And also some countries have state missing, so '-state' param in route should be optional.
How can I strictly match my route pattern, meaning that it will take either 2 or 3 parameters separated by '-'?
I had something like this:
app.get(/([A-Za-z\-\']+)-([A-Za-z\']+)-([A-Za-z\']+)/, routes.index_location);
but it didn't work.
Ultimately, cases like these should not work:
/c/san-jose-ca-us
/san-jose-ca-us-someweirdstuff

san-jose-ca-us-someweirdstuff can be parsed as san-jose-ca (city) - us (state) - someweirdstuff (country), so it's a perfectly valid case.
Unless you missed something, the task is impossible in general. We know that us isn't a state, but a regexp doesn't.
You can try to limit the number of dashes in the city to one, or enumerate all possible countries, or do something like that... Anyway, this has nothing to do with regular expressions, really.

Actually, there is a way, but it takes a multi-step process. In the first pass, replace the dash in front of a two-letter state (the optional part) with a different delimiter. In the second pass, do the same for the dash in front of the country, so that everything before the first new delimiter is the city. In the third pass, replace the remaining dashes, which all belong to the city, with some other character. In the final pass, turn the temporary delimiter back into the dash you expect.
For instance:
replace /-(al|ca|az...)/ with ~$1: san-jose-ca-us = san-jose~ca-us
replace /-([^-]+)$/ with ~$1: san-jose~ca-us = san-jose~ca~us
replace /-/g with *: san-jose~ca~us = san*jose~ca~us
replace /~/g with -: san*jose~ca~us = san*jose-ca-us
etc.
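The first two passes can be sketched in JavaScript roughly as follows. This is an illustration under assumptions, not a complete solution: the STATES list is a made-up three-entry placeholder, parseLocation is a hypothetical name, and a state is only recognized when it sits directly before the last segment.

```javascript
// Hypothetical, deliberately tiny list of state codes; a real list would be complete.
var STATES = ["al", "az", "ca"];

// Hypothetical helper: split "city-state-country" or "city-country" into parts.
function parseLocation(slug) {
  // Pass 1: mark a known state code that is followed only by the last segment.
  var statePattern = new RegExp("-(" + STATES.join("|") + ")(?=-[^-]+$)");
  var marked = slug.replace(statePattern, "~$1");
  // Pass 2: mark the country, i.e. the last dash-separated segment.
  marked = marked.replace(/-([^-]+)$/, "~$1");
  var parts = marked.split("~");
  if (parts.length < 2) {
    return null; // no country found
  }
  return {
    city: parts[0],
    state: parts.length === 3 ? parts[1] : null,
    country: parts[parts.length - 1]
  };
}

var withState = parseLocation("san-jose-ca-us");
// { city: "san-jose", state: "ca", country: "us" }
var withoutState = parseLocation("newtown-uk");
// { city: "newtown", state: null, country: "uk" }
```

Without a list of valid countries this still accepts san-jose-ca-us-someweirdstuff, treating everything before the last dash as the city.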

If you only need to keep your information in a one-level hierarchy, you can try an underscore delimiter instead, so your URL would look like: city_state_country

Regex if substring exist then match another part of the string

I am using the YUI3 library and am using a filter to match and replace parts of a URL.
Because filter is not very flexible, I am only able to provide a regex expression for searching and then a string for replacing the matches:
filter: {
searchExp : "-min\\.js",
replaceStr: "-debug.js"
}
In my case, I have a URL that looks like this:
http://site.com/assets/js?yui-3.9.0/widget-base/assets/skins/sam/widget-base.css&yui-3.9.0/cssbutton/cssbutton-min.css
I would like to match /assets/js if there are .css files. If the parameters contain a CSS file, then it will always only contain CSS files.
So far, I have written a small regex to check for the presence of .css at the very end:
.*\.css$
However, now, if we have a match, I would like to return /assets/js as the match. Is this something that is doable with regex?
Personally, I would rather this be done with a simple function and a simple if/else, but due to the limitations (I can only use regex), I need to find a regex solution to this.

This is a bit hacked together, but should do the job:
var t = new RegExp( "/assets/js(([^\\.]*\\.)*[^\\.]*\\.css)$" )
document.write( "http://site.com/assets/js?yui-3.9.0/widget-base/assets/skins/sam/widget-base.css&yui-3.9.0/cssbutton/cssbutton-min.css".replace( t, "/newthing/$1" ) );
Essentially it searches for /assets/js, followed by any characters, followed by .css at the end of the string. If the whole thing matches, it will replace it with the new text and include the matched pattern (from the first brackets) after it. Everything before /assets is outside the match, so it is left untouched and doesn't need to be included.
I imagine your library uses replace internally, so those strings should work. Specifically,
"/assets/js(([^\\.]*\\.)*[^\\.]*\\.css)$"
"/newthing/$1"
I'm not quite sure what you want to do with the results, but this allows you to change the folder and add suffixes (as well as check for the presence of both tokens in the first place). To add a suffix change the replacement to this:
"/assets/js$1-mysuffix"

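A different technique, not the one used in the answer above, is a lookahead: it lets the match itself be just /assets/js while still requiring that the string ends in .css. The sketch below uses plain String.prototype.replace; whether the YUI filter accepts a lookahead in searchExp is an assumption that would need checking, and the replacement "/assets/css" is only a placeholder.

```javascript
// (?=.*\.css$) asserts that ".css" ends the string without consuming anything,
// so only "/assets/js" is part of the match.
var cssOnly = /\/assets\/js(?=.*\.css$)/;

var cssUrl = "http://site.com/assets/js?yui-3.9.0/widget-base/assets/skins/sam/widget-base.css&yui-3.9.0/cssbutton/cssbutton-min.css";
var jsUrl = "http://site.com/assets/js?yui-3.9.0/widget-base/widget-base-min.js";

var matched = cssOnly.exec(cssUrl)[0];                  // "/assets/js"
var rewritten = cssUrl.replace(cssOnly, "/assets/css"); // only the folder changes
var untouched = jsUrl.replace(cssOnly, "/assets/css");  // no .css at the end, so no change
```

With this form the replacement string does not need $1, because the query part is never consumed by the match.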
What's wrong with this regular expression to find URLs?

I'm working on some JavaScript to extract a URL from a Google search URL, like so:
http://www.google.com/search?client=safari&rls=en&q=thisisthepartiwanttofind.org&ie=UTF-8&oe=UTF-8
Right now, my code looks like this:
var checkForURL = /[\w\d](.org)/i;
var findTheURL = checkForURL.exec(theURL);
I've run this through a couple of regex testers and it seems to work, but in practice the string I get back looks like this:
thisisthepartiwanttofind.org,.org
So where's that trailing ,.org coming from?
I know my pattern isn't super robust but please don't suggest better patterns to use. I'd really just like advice on what in particular I did wrong with this one. Thanks!

Remove the parentheses in the regex if you do not need to capture the .org part (unlikely to be needed, since it is a literal). As per @Mark's comment, add a + to match one or more characters of the class [\w\d]. Also, I would escape the dot:
var checkForURL = /[\w\d]+\.org/i;

What you're actually getting is an array of 2 results, the first being the whole match, the second - the group you defined by using parens (.org).
Compare with:
/([\w\d]+)\.org/.exec('thisistheurl.org')
→ ["thisistheurl.org", "thisistheurl"]
/[\w\d]+\.org/.exec('thisistheurl.org')
→ ["thisistheurl.org"]
/([\w\d]+)(\.org)/.exec('thisistheurl.org')
→ ["thisistheurl.org", "thisistheurl", ".org"]
The result of an .exec of a JS regex is an Array of strings, the first being the whole match and the subsequent representing groups that you defined by using parens. If there are no parens in the regex, there will only be one element in this array - the whole match.
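Applied to the URL from the question, a short sketch shows where the trailing ",.org" comes from. It assumes the pattern with the + and the escaped dot, and keeps the capturing group on purpose:

```javascript
var theURL = "http://www.google.com/search?client=safari&rls=en&q=thisisthepartiwanttofind.org&ie=UTF-8&oe=UTF-8";
var result = /[\w\d]+(\.org)/i.exec(theURL);

// result is an array: index 0 is the whole match, index 1 is the captured group.
var wholeMatch = result[0]; // "thisisthepartiwanttofind.org"
var group = result[1];      // ".org"

// Converting the array to a string joins its elements with commas,
// which produces the ",.org" seen in the question.
var asString = String(result); // "thisisthepartiwanttofind.org,.org"
```

So reading result[0] instead of printing the whole array gives just the matched text.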

You should escape the dot in the (.org) group; otherwise it matches any character. So your regex would become:
/[\w\d]+(\.org)/
To match the url in your example you can use something like this:
https?://([0-9a-zA-Z_.?=&\-]+/?)+
or something more accurate like this (you should choose the right regex according to your needs):
^https?://([0-9a-zA-Z_\-]+\.)+(com|org|net|WhatEverYouWant)(/[0-9a-zA-Z_\-?=&.]+)$

How to identify all URLs that contain a (domain) substring?

If I am correct, the following code will only match a URL that is exactly as presented.
However, what would it look like if you wanted to identify subdomains as well as urls that contain various different query strings - in other words, any address that contains this domain:
var url = /test.com/
if (window.location.href.match(url)){
alert("match!");
}

If you want this regex to match "test.com" literally, you need to escape the ".", which means any character in regex syntax. The surrounding "/" characters are just the delimiters of the regex literal and need no escaping.
Escaped: /test\.com/

No, your pattern will actually match on all strings containing test.com.

The regular expression /test.com/ says to match test[ANY CHARACTER]com anywhere in the string.
It is better to use example.com for example links, so I replaced test with example.
Some example matches could be
http://example.com
http://examplexcom.xyz
http://example!com.xyz
http://example.com?q=123
http://sub.example.com
http://fooexample.com
http://example.com/asdf/123
http://stackoverflow.com/?site=example.com
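To narrow this down to the domain and its subdomains only, one option is to test the hostname rather than the whole address. The following is a sketch under assumptions: isOnExampleDomain is a name invented here, and it relies on the standard URL constructor, which throws on strings that are not valid absolute URLs.

```javascript
// Loose pattern from the question (with example instead of test):
// the unescaped dot matches any character, anywhere in the string.
var loose = /example.com/;

// Hypothetical helper: true when href is on example.com or one of its subdomains.
function isOnExampleDomain(href) {
  var host = new URL(href).hostname;
  // The host must be exactly example.com or end with ".example.com".
  return /(^|\.)example\.com$/.test(host);
}

var looseHit = loose.test("http://examplexcom.xyz");                                // true
var subdomain = isOnExampleDomain("http://sub.example.com");                        // true
var withQuery = isOnExampleDomain("http://example.com?q=123");                      // true
var lookalike = isOnExampleDomain("http://fooexample.com");                         // false
var inQueryOnly = isOnExampleDomain("http://stackoverflow.com/?site=example.com");  // false
```

In a browser, window.location.hostname can be tested with the same pattern directly.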

I think you need to use /g. The g flag enables "global" matching. When using the replace() method, specify this modifier to replace all matches, rather than only the first one:
var url = /test.com/g;

If you want to test whether a URL is valid, this is the one I use. It is fairly complex, because it also takes care of numeric domains and a few other peculiarities:
var urlMatcher = /(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-F\d]{2,2})+(:([\d\w]|%[a-fA-F\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-F\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)?/;
It takes care of parameters, anchors, etc. Please don't ask me to explain the details.
