How to make regex match pattern from the beginning?


I need a little assistance with a regular expression.
I'm doing the following in JavaScript to "mask" special URLs, which may be composed using the following rules:
They may begin with something like 0> or 1223> or 1_23>
They may begin with a protocol, e.g. http:// or https://
They may also have a www. subdomain
So, for instance, for https://www.example.com it should produce https://www. ...
So I came up with the following JS:
var url = "0>https://www.example.com/plugins/page.php?href=https://forum.example.com/topic/some_topic";
m = url.match(/\b((?:[\d_]+>)?.+\:\/\/(?:www.)?)/i);
if (m) {
url = m[1] + " ...";
}
console.log(url);
It works for most cases, except that "repeating" URL in my example, in which case I get this:
0>https://www.example.com/plugins/page.php?href=https:// ...
when I was expecting:
0>https://www. ...
How do I make it pick the match from the beginning? I thought adding \b would do it...

Just make the .+ non-greedy, like this:
m = url.match(/\b((?:[\d_]+>)?.+?\:\/\/(?:www.)?)/i);
Note the ? after .+. It makes the quantifier lazy, so .+? stops at the first :// it can reach from the current position. Without the ?, .+ is greedy: it consumes as much as it can and only backtracks as far as the last :// in the string.
Also, you don't have to escape :, but you do have to escape the . after www (an unescaped . matches any character). So your RegEx becomes:
m = url.match(/\b((?:[\d_]+>)?.+?:\/\/(?:www\.)?)/i);
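To see the difference on the URL from the question, here is a small sketch comparing the greedy and lazy versions. The anchored variant at the end is an assumption on my part, not part of the answer above: it forces the match to start at the very beginning of the string and limits the scheme to characters that are legal in a URL scheme.

```javascript
var url = "0>https://www.example.com/plugins/page.php?href=https://forum.example.com/topic/some_topic";

// Greedy: .+ runs to the end of the string, then backtracks to the LAST "://"
var greedy = url.match(/\b((?:[\d_]+>)?.+:\/\/(?:www\.)?)/i);

// Lazy: .+? grows one character at a time and stops at the FIRST "://"
var lazy = url.match(/\b((?:[\d_]+>)?.+?:\/\/(?:www\.)?)/i);

// Assumed alternative: anchor with ^ and spell out the scheme instead of using .+?
var anchored = url.match(/^((?:[\d_]+>)?[a-z][a-z0-9+.-]*:\/\/(?:www\.)?)/i);

console.log(greedy[1]);   // 0>https://www.example.com/plugins/page.php?href=https://
console.log(lazy[1]);     // 0>https://www.
console.log(anchored[1]); // 0>https://www.
```

The lazy version is the smallest change to the original pattern; the anchored one may be worth considering if the input can contain text before the URL that should not be matched.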


JavaScript to remove whatever is after the tld and before the whitespace

I have a bunch of functions that are filtering a page down to the domains that are attached to email addresses. It's all working great except for one small thing, some of the links are coming out like this:
EXAMPLE.COM
EXAMPLE.ORG.
EXAMPLE.ORG>.
EXAMPLE.COM"
EXAMPLE.COM".
EXAMPLE.COM).
EXAMPLE.COM(COMMENT)"
DEPT.EXAMPLE.COM
EXAMPLE.ORG
EXAMPLE.COM.
I want to figure out one last filter (regex or not) that will remove everything after the TLD. All of these items are in an array.
EDIT
The function I'm using:
function filterByDomain(array) {
var regex = new RegExp("([^.\n]+\.[a-z]{2,6}\b)", 'gi');
return array.filter(function(text){
return regex.test(text);
});
}

You can probably use this regex to match your TLD for each case:
/^[^.\n]+\.[a-z]{2,63}$/gim
Your validation function can be as follows. The g flag is left out here, because a global regex remembers its lastIndex between test() calls and would then skip some entries:
function filterByDomain(array) {
var regex = /^[^.\n]+\.[a-z]{2,63}$/i;
return array.filter(function(text){
return regex.test(text);
});
}
PS: Do read this Q & A to see that up to 63 characters are allowed in a TLD.
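Note that the function above only keeps or drops entries; it does not trim them. If the goal is to cut off whatever follows the TLD, a sketch along these lines might be closer. The name stripAfterTld and the exact pattern are my own assumptions, not something from the question or the answer.

```javascript
// Hypothetical helper: keep the leading host name up to its TLD, drop the rest.
function stripAfterTld(items) {
  // one or more dot-separated labels, ending in an alphabetic TLD of 2 to 63 letters
  var re = /^[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,63}/i;
  var out = [];
  for (var i = 0; i < items.length; i++) {
    var m = items[i].match(re);
    // leave the entry untouched if it does not look like a domain at all
    out.push(m ? m[0] : items[i]);
  }
  return out;
}

var cleaned = stripAfterTld([
  "EXAMPLE.COM",
  "EXAMPLE.ORG.",
  "EXAMPLE.ORG>.",
  "EXAMPLE.COM\".",
  "EXAMPLE.COM(COMMENT)\"",
  "DEPT.EXAMPLE.COM"
]);
console.log(cleaned);
```

The regex has no g flag, so it carries no state from one entry to the next.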

I'd match all leading [\w.] and omit the last dot, if any:
var result = url.match(/^[\w\.]+/).join("");
if(result.slice(-1)==".") result = result.slice(0,-1);
Note that \w should be replaced with something more sophisticated:
_ is part of \w set but should not be in url path
- is not part of \w but can be in url not adjacent to . or -
To keep the regexp simple and the code readable, I'd do it this way
substitute _ for # in url (both # and _ can be only after TLD)
substitute - for _ (_ is part of \w)
after the regexp test, substitute _ back for -
URL like www.-example-.com would still pass, can be detected by searching for [.-]{2,}

Match Url path without query string

I would like to match a path in a Url, but ignoring the querystring.
The regex should include an optional trailing slash before the querystring.
Example urls that should give a valid match:
/path/?a=123&b=123
/path?a=123&b=123
So the string '/path' should match either of the above urls.
I have tried the following regex: (/path[^?]+).*
But this will only match urls like the first example above: /path/?a=123&b=123
Any idea how I would go about getting it to match the second example, without the trailing slash, as well?
Regex is a requirement.

No need for regexp:
url.split("?")[0];
If you really need it, then try this:
\/path\?*.*
EDIT Actually the most precise regexp should be:
^(\/path)(\/?\?{0}|\/?\?{1}.*)$
because you want to match either /path or /path/ or /path?something or /path/?something and nothing else. Note that ? means "at most one" while \? means a question mark.
BTW: What kind of routing library does not handle query strings?? I suggest using something else.
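As a rough sketch of the same idea in a more compact form: an optional trailing slash, followed by an optional query string, anchored at both ends. The names pathRe and pathOnly are made up for illustration.

```javascript
// Matches /path, /path/, /path?..., /path/?... and nothing else
var pathRe = /^\/path\/?(?:\?.*)?$/;

// Hypothetical helper: return the path without trailing slash or query string
function pathOnly(url) {
  var m = /^([^?]*?)\/?(?:\?.*)?$/.exec(url);
  return m ? m[1] : null;
}

console.log(pathRe.test("/path/?a=123&b=123")); // true
console.log(pathRe.test("/path?a=123&b=123"));  // true
console.log(pathRe.test("/pathology"));         // false
console.log(pathOnly("/path/?a=123&b=123"));    // /path
console.log(pathOnly("/path?a=123&b=123"));     // /path
```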

http://jsfiddle.net/bJcX3/
var re = /(\/?[^?]*?)\?.*/;
var p1 = "/path/to/something/?a=123&b=123";
var p2 = "/path/to/something/else?a=123&b=123";
var p1_matches = p1.match(re);
var p2_matches = p2.match(re);
document.write(p1_matches[1] + "<br>");
document.write(p2_matches[1] + "<br>");

javascript regex that gets all subdomains

I have the following RegEx:
[!?\.](.*)\.example\.com
and this sample string:
test foo abc.def.example.com bar ghi.jkl.example.com def
I want the RegEx to produce the following matches: def.example.com and jkl.example.com.
What do I have to change? Should be working on all subdomains of example.com. If possible it should only take the first subdomain-level (abc.def.example.com -> def.example.com).
Tested it on regexpal, not fully working :(

You may use the following expression : [^.\s]+\.example\.com.
Explanation
[^.\s]+ : match anything except a dot or whitespace one or more times
\.example\.com : match example.com
Note that you don't need to escape a dot in a character class

Just on a side note, while HamZa's answer works for your current sample code, if you need to make sure that the domain names are also valid, you might want to try a different approach, since [^.\s]+ will match ANY character that is not a space or a . (for example, that regex will match jk&^%&*(l.example.com as a "valid" subdomain).
Since there are far fewer valid characters for domain name values than there are invalid ones, you might consider using an "additive" approach to the regex, rather than subtractive. This pattern here is probably the one that you are looking for for valid domain names: /(?:[\s.])([a-z0-9][a-z0-9-]+[a-z0-9]\.example\.com)/gi
To break it down a little more . . .
(?:[\s.]) - matches the space or . that would mark the beginning of the lowest-level subdomain
([a-z0-9][a-z0-9-]+[a-z0-9]\.example\.com) - this captures a group of letters, numbers or dashes that must begin and end with a letter or a number (domain name rules), followed by the example.com domain. As written, the subdomain label must be at least three characters long.
gi - makes the regex global (find every match, not just the first) and case insensitive
At this point, it is simply a question of grabbing the matches. Since .match() with the g flag returns only the full matches and drops the capture groups, use .exec() instead:
var domainString = "test foo abc.def.example.com bar ghi.jkl.example.com def";
var regDomainPattern = /(?:[\s.])([a-z0-9][a-z0-9-]+[a-z0-9]\.example\.com)/gi;
var aMatchedDomainStrings = [];
var patternMatch;
// loop as long as .exec() still finds a match, and take index 1 of the result (the capture group, which leaves out the leading space or dot)
while (null != (patternMatch = regDomainPattern.exec(domainString))) {
aMatchedDomainStrings.push(patternMatch[1]);
}
At that point aMatchedDomainStrings should contain all of your valid, first-level, sub-domains.
var domainString = "test foo abc.def.example.com bar ghi.jkl.example.com def";
. . . should get you: def.example.com and jkl.example.com, while:
var domainString = "test foo abc.def.example.com bar ghi.jk&^%&*(l.example.com def";
. . . should get you only: def.example.com
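For reference, the loop can be wrapped in a function. This is only a sketch: the name firstLevelSubdomains is invented, and the pattern is a slightly relaxed variant of the one above (it also accepts one- and two-character labels and a match at the very start of the string), which is an assumption about what is wanted.

```javascript
// Hypothetical wrapper around the exec() loop
function firstLevelSubdomains(text) {
  // (?:^|[\s.]) : start of string, whitespace or a dot before the label
  // label       : starts and ends with a letter or digit, dashes allowed inside
  var re = /(?:^|[\s.])([a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.example\.com)/gi;
  var found = [];
  var m;
  while ((m = re.exec(text)) !== null) {
    found.push(m[1]); // index 1 is the capture group, without the leading separator
  }
  return found;
}

console.log(firstLevelSubdomains("test foo abc.def.example.com bar ghi.jkl.example.com def"));
console.log(firstLevelSubdomains("test foo abc.def.example.com bar ghi.jk&^%&*(l.example.com def"));
```

Because the regex is created inside the function, each call starts with a fresh lastIndex.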

JavaScript Regex to match a URL in a field of text

How can I set up my regex to test whether a URL is contained in a block of text in JavaScript? I can't quite figure out the pattern to use to accomplish this.
var urlpattern = new RegExp( "(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?" );
var txtfield = $('#msg').val() /*this is a textarea*/
if ( urlpattern.test(txtfield) ){
//do something about it
}
EDIT:
So the Pattern I have now works in regex testers for what I need it to do but chrome throws an error
"Invalid regular expression: /(http|ftp|https)://[w-_]+(.[w-_]+)+([w-.,#?^=%&:/~+#]*[w-#?^=%&/~+#])?/: Range out of order in character class"
for the following code:
var urlexp = new RegExp( '(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?' );

Though escaping the dash characters (which can have a special meaning as character range specifiers when inside a character class) should work, one other method for taking away their special meaning is putting them at the beginning or the end of the class definition.
In addition, \+ and \# in a character class are indeed interpreted as + and # respectively by the JavaScript engine; however, the escapes are not necessary and may confuse someone trying to interpret the regex visually.
I would recommend the following regex for your purposes:
(http|ftp|https)://[\w-]+(\.[\w-]+)+([\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
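This can be specified in JavaScript either by passing it to the RegExp constructor as a string, as you did in your example (note that every backslash must then be doubled, because the string literal consumes one level of escaping):
var urlPattern = new RegExp("(http|ftp|https)://[\\w-]+(\\.[\\w-]+)+([\\w.,#?^=%&:/~+#-]*[\\w#?^=%&/~+#-])?");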
or by directly specifying a regex literal, using the // quoting method:
var urlPattern = /(http|ftp|https):\/\/[\w-]+(\.[\w-]+)+([\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-])?/
The RegExp constructor is necessary if you accept a regex as a string (from user input or an AJAX call, for instance), and might be more readable (as it is in this case). I am fairly certain that the // quoting method is more efficient, and is at certain times more readable. Both work.
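A quick way to check what the engine actually receives is to look at the source property. The sketch below uses a deliberately tiny pattern (my own, not the URL regex) to show how a single backslash gets lost in a string literal.

```javascript
// In a string literal, "\w" is just "w" and "\." is just ".", so the backslashes vanish
var broken = new RegExp("[\w-]+\.com");

// Doubling the backslashes keeps them for the regex engine
var fixed = new RegExp("[\\w-]+\\.com");

// A regex literal needs no doubling
var literal = /[\w-]+\.com/;

console.log(broken.source);  // [w-]+.com
console.log(fixed.source);   // [\w-]+\.com
console.log(literal.source); // [\w-]+\.com

console.log(broken.test("example.com")); // false: [w-]+ only accepts "w" and "-"
console.log(fixed.test("example.com"));  // true
```

This is the same loss of backslashes that shows up in the pattern printed by Chrome's error message in the question.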
I tested your original and this modification using Chrome both on <JSFiddle> and on <RegexLib.com>, using the Client-Side regex engine (browser) and specifically selecting JavaScript. While the first one fails with the error you stated, my suggested modification succeeds. If I remove the h from the http in the source, it fails to match, as it should!
Edit
As noted by @noa in the comments, the expression above will not match local network (non-internet) servers or any other servers accessed with a single word (e.g. http://localhost/... or https://sharepoint-test-server/...). If matching this type of URL is desired (which it may or may not be), the following might be more appropriate:
(http|ftp|https)://[\w-]+(\.[\w-]+)*([\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
#------changed----here-------------^
<End Edit>
Finally, an excellent resource that taught me 90% of what I know about regex is Regular-Expressions.info - I highly recommend it if you want to learn regex (both what it can do and what it can't)!

Complete Multi URL Pattern.
UPDATED: Nov. 2020, April & June 2021 (Thanks commenters)
Matches all URI or URL in a string!
Also extracts the protocol, domain, path, query and hash. ([a-z0-9-]+\:\/+)([^\/\s]+)([a-z0-9\-#\^=%&;\/~\+]*)[\?]?([^ \#\r\n]*)#?([^ \#\r\n]*)
https://regex101.com/r/jO8bC4/56
Example JS code with output - every URL is turned into a 5-part array of its 'parts' (protocol, host, path, query, and hash)
var re = /([a-z0-9-]+\:\/+)([^\/\s]+)([a-z0-9\-#\^=%&;\/~\+]*)[\?]?([^ \#\r\n]*)#?([^ \#\r\n]*)/mig;
var str = 'Bob: Hey there, have you checked https://www.facebook.com ?\n(ignore) https://github.com/justsml?tab=activity#top (ignore this too)';
var m;
while ((m = re.exec(str)) !== null) {
if (m.index === re.lastIndex) {
re.lastIndex++;
}
console.log(m);
}
Will give you the following:
["https://www.facebook.com",
"https://",
"www.facebook.com",
"",
"",
""
]
["https://github.com/justsml?tab=activity#top",
"https://",
"github.com",
"/justsml",
"tab=activity",
"top"
]

You have to escape the backslash when you are using new RegExp with a string.
Also, you can put the dash - at the end of a character class to avoid escaping it.
&amp; inside a character class means & or a or m or p or ;, so you just need to put & and ; there; a, m and p are already matched by \w.
So, your regex becomes:
var urlexp = new RegExp( '(http|ftp|https)://[\\w-]+(\\.[\\w-]+)+([\\w-.,#?^=%&:/~+#-]*[\\w#?^=%&;/~+#-])?' );

try (http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?

I've cleaned up your regex (with the backslashes doubled, since the pattern is passed as a string):
var urlexp = new RegExp('(http|ftp|https)://[a-z0-9\\-_]+(\\.[a-z0-9\\-_]+)+([a-z0-9\\-\\.,#\\?^=%&;:/~\\+#]*[a-z0-9\\-#\\?^=%&;/~\\+#])?', 'i');
Tested and works just fine ;)

Try this general regex for many URL formats (forward slashes outside a character class must be escaped in a regex literal):
/(([A-Za-z]{3,9}):\/\/)?([-;:&=\+\$,\w]+@{1})?(([-A-Za-z0-9]+\.)+[A-Za-z]{2,3})(:\d+)?((\/[-\+~%\/\.\w]+)?\/?([&?][-\+=&;%#\.\w]+)?(#[\w]+)?)?/g

The trouble is that the "-" in the character class (the brackets) is being parsed as a range: [a-z] means "any character between a and z." As Vini-T suggested, you need to escape the "-" characters in the character classes, using a backslash.

Try this; it worked for me:
/^((ftp|http[s]?):\/\/)?(www\.)([a-z0-9]+)\.[a-z]{2,5}(\.[a-z]{2})?$/
It is simple and easy to understand.

Split string in JavaScript using a regular expression

I'm trying to write a regex for use in javascript.
var script = "function onclick() {loadArea('areaog_og_group_og_consumedservice', '\x26roleOrd\x3d1');}";
var match = new RegExp("'[^']*(\\.[^']*)*'").exec(script);
I would like match to contain two elements:
match[0] == "'areaog_og_group_og_consumedservice'";
match[1] == "'\x26roleOrd\x3d1'";
This regex matches correctly when testing it at gskinner.com/RegExr/ but it does not work in my JavaScript. The issue can be replicated by testing it here: http://www.regextester.com/.
I need the solution to work with Internet Explorer 6 and above.
Can any regex gurus help?

Judging by your regex, it looks like you're trying to match a single-quoted string that may contain escaped quotes. The correct form of that regex is:
'[^'\\]*(?:\\.[^'\\]*)*'
(If you don't need to allow for escaped quotes, /'[^']*'/ is all you need.) You also have to set the g flag if you want to get both strings. Here's the regex in its regex-literal form:
/'[^'\\]*(?:\\.[^'\\]*)*'/g
If you use the RegExp constructor instead of a regex literal, you have to double-escape the backslashes: once for the string literal and once for the regex. You also have to pass the flags (g, i, m) as a separate parameter:
var rgx = new RegExp("'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'", "g");
var result;
while ((result = rgx.exec(script)) !== null)
console.log(result[0]);
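If only the list of quoted strings is needed, String.prototype.match with the g flag returns all full matches in one call. A minimal sketch, using the script string from the question; the variable names quoted and args are mine, and stripping the surrounding quotes is an extra step the question did not ask for.

```javascript
var script = "function onclick() {loadArea('areaog_og_group_og_consumedservice', '\x26roleOrd\x3d1');}";

// With the g flag, match() returns every full match (or null if there is none)
var quoted = script.match(/'[^'\\]*(?:\\.[^'\\]*)*'/g) || [];

// Optional: drop the surrounding single quotes (plain loop, no ES5 array methods)
var args = [];
for (var i = 0; i < quoted.length; i++) {
  args.push(quoted[i].slice(1, -1));
}

console.log(quoted); // the two quoted arguments, quotes included
console.log(args);   // the same two values without the quotes
```

Keep in mind that \x26 and \x3d are resolved by the JavaScript string literal itself, so the second argument reaches the regex as &roleOrd=1.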

The regex you're looking for is .*?('[^']*')\s*,\s*('[^']*'). The catch here is that, as usual, match[0] is the entire matched text (this is very normal) so it's not particularly useful to you. match[1] and match[2] are the two matches you're looking for.
var script = "function onclick() {loadArea('areaog_og_group_og_consumedservice', '\x26roleOrd\x3d1');}";
var parameters = /.*?('[^']*')\s*,\s*('[^']*')/.exec(script);
alert("you've done: loadArea("+parameters[1]+", "+parameters[2]+");");
The only issue I have with this is that it's somewhat inflexible. You might want to spend a little time to match function calls with 2 or 3 parameters?
EDIT
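In response to your request, here is the regex to match 1, 2, 3, ..., n parameters. Note the non-capturing group (the (?: ) part), which matches any number of instances of the comma followed by another parameter. Be aware that a capturing group inside a repeated group only keeps its last iteration, so match[2] will hold only the final parameter.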
/.*?('[^']*')(?:\s*,\s*('[^']*'))*/

Maybe this:
'([^']*)'\s*,\s*'([^']*)'
