regex html file href/src url pattern - javascript

I'm building an Electron app that extracts all colors used by any website.
To do that, the app downloads the URL (like http://youtube.com) and saves it as HTML.
The app then reads the HTML file and searches for any URL that links to a file which might contain a color value (rgb/rgba/#/hsl), so those files would be css, js, svg, etc. Those URLs are added to an array, which is later used by the electron-download-manager package.
e.g.: ['href="/main.css?v=33.1"', 'src="http://somesite.com/js/regex.js"']
The href=" / src=" prefixes are removed by other functions.
My pattern for the URL is:
/(href|src)=("|')(.*?)(\.|\/)(css|js|svg|json)(.*?)("|')/g
which works fine, except that it doesn't stop matching at the closing quote (' or ").
For the first example below, the match is the whole line: it contains everything after the closing quote, so title="" becomes part of the URL, which makes no sense.
href="https://www.youtube.com/opensearch?locale=de_DE" title="YouTube"><link rel="manifest" href="/manifest.json" // matches everything until json is found
src="bla.css" // works
src='bla.css?ver=123.456' // works
Is there a regex rule which says "stop at this character"?
My rule should be:
(starts with href=", URL, ends with .css/.js, optional file version (?v=123), quote symbol)

A regex that finds any tag with a src or href attribute whose value contains one of the
extensions or sub-dirs css, js, svg, json is this:
/<[\w:]+(?=(?:[^>"']|"[^"]*"|'[^']*')*?\s(href|src)\s*=\s*(?:(['"])\s*((?:(?!\2)[\S\s])*?[.\/](?:css|js|svg|json)(?:(?!\2)[\S\s])*?)\s*\2))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>/
https://regex101.com/r/tKrTSO/1
Where :
Attribute is in group 1
Value is in group 3
Expanded
< [\w:]+ # Any tag
(?= # Assert (a pseudo atomic group)
(?: [^>"'] | " [^"]* " | ' [^']* ' )*?
\s
( href | src ) # (1), href or src attribute
\s* = \s*
(?:
( ['"] ) # (2), Quote
\s*
( # (3 start), value
(?:
(?! \2 )
[\S\s]
)*?
[./] # One of these extensions or sub-dirs
(?: css | js | svg | json )
(?:
(?! \2 )
[\S\s]
)*?
) # (3 end)
\s*
\2
)
)
\s+
(?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
>
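For the narrower question of how to say "stop at this character": a negated character class such as [^"'] can never run past a quote, so the value ends at the first closing quote. The sketch below is a simplified alternative to the full-tag pattern above, not a replacement for it. The function name extractAssetUrls and the sample string are made up for illustration, and it assumes attribute values contain no nested quotes.

```javascript
// Simplified pattern (an assumption, not the answer's regex):
// group 1 = opening quote, group 2 = attribute value.
// [^"'] cannot cross a quote, so the match stops at the closing one.
const assetUrlPattern =
  /(?:href|src)\s*=\s*(["'])([^"']*?[.\/](?:css|js|svg|json)[^"']*)\1/g;

// Hypothetical helper: collect every matching attribute value.
function extractAssetUrls(html) {
  const urls = [];
  for (const match of html.matchAll(assetUrlPattern)) {
    urls.push(match[2]);
  }
  return urls;
}

// Sample built from the lines quoted in the question.
const sampleHtml =
  '<link href="https://www.youtube.com/opensearch?locale=de_DE" title="YouTube">' +
  '<link rel="manifest" href="/manifest.json">' +
  "<script src='bla.css?ver=123.456'></script>";

console.log(extractAssetUrls(sampleHtml));
// [ '/manifest.json', 'bla.css?ver=123.456' ]
```

The opensearch URL is skipped because nothing inside its own quotes looks like one of the extensions, and the pattern is not allowed to look past the closing quote.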

Related

Regex match url with params to specific pattern but not query string

My regex pattern:
const pattern = /^\/(test|foo|bar\/baz|en|ppp){1}/i;
const mat = pattern.exec(myURL);
I want to match:
www.mysite.com/bar/baz/myParam/...anything here
but not
www.mysite.com/bar/baz/?uid=100/..
myParam can be any string, with or without dashes. Only after it may anything else occur, such as a query string, but not immediately after baz.
Tried
/^\/(test|foo|bar\/baz\/[^/?]*|en|ppp){1}/i;
Nothing works.
This, I believe, is what you are asking for:
const myURL = "www.mysite.com/bar/baz/myParam/";
const myURL2 = "www.mysite.com/bar/baz/?uid=100";
const regex = /\/[^\?]\w+/gm;
console.log('with params', myURL.match(regex));
console.log('with queryParams', myURL2.match(regex))
You can test this and experiment further on Regex101, which also explains what each part of the regex does.
If it's not what you were asking for, there is another question related to yours that is answered without regex.
For the 2 example strings, you might use
^[^\/]+\/bar\/baz\/[\w-]+\/.*$
Regex demo
If you want to use the alternations as well, it might look like
^[^\/]+\/(?:test|foo|bar)\/(?:baz|en|ppp)\/[\w-]+\/.*$
^ Start of string
[^\/]+ Match 1+ times any char except a /
\/ Match /
(?:test|foo|bar) Match 1 of the options
\/ Match /
(?:baz|en|ppp) Match 1 of the options
\/ Match /
[\w-]+ Match 1+ times a word char or -
\/ Match /
.* Match 0+ occurrences of any char except a newline
$ End of string
Regex demo
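As a quick check, the first pattern from this answer can be run against the two example strings from the question. This is only a sketch; the third sample URL is made up to show that a query string is accepted once a parameter segment is present.

```javascript
// Pattern from the answer: host, then /bar/baz/, then a non-empty
// parameter segment of word characters or dashes, then a slash.
const pattern = /^[^\/]+\/bar\/baz\/[\w-]+\/.*$/;

const urls = [
  'www.mysite.com/bar/baz/myParam/',
  'www.mysite.com/bar/baz/?uid=100',
  'www.mysite.com/bar/baz/my-param/more?uid=100', // hypothetical extra sample
];

const results = urls.map((url) => pattern.test(url));
console.log(results); // [ true, false, true ]
```

Note that the pattern requires a slash after the parameter, so www.mysite.com/bar/baz/myParam without a trailing slash would not match.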
Using a negative lookahead or lookbehind will solve your problem. The question leaves two interpretations open:
?uid=100 is not allowed after the starting part /bar/baz, so www.mysite.com/test/bar/baz?uid=100 should be valid.
?uid=100 is not allowed anywhere in the string following /bar/baz, which means that www.mysite.com/test/bar/baz/?uid=100 is invalid as well.
Option 1
In short:
\/(test|foo|bar\/baz(?!\/?\?)|en|ppp)(\/[-\w?=]+)*\/?
Explanation of the important parts:
| # OR
bar # 'bar' followed by
\/ # '/' followed by
baz # 'baz'
(?! # (negative lookahead) so, **not** followed by
\/? # 0 or 1 times '/'
\? # '?'
) # END negative lookahead
and
( # START group
\/ # '/'
[-\w?=]+ # any word char, or '-','?','='
)* # END group, occurrence 0 or more times
\/? # optional '/'
Examples Option 1
You can make the lookahead even more specific with something like (?!\/?\?\w+=\w+) to make explicit that ?a=b is not allowed, but that's up to you.
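To see how the lookahead in Option 1 behaves, here is a small sketch that runs the regex as written against a few URLs. The sample URLs are illustrative; the regex is used without the g flag so that test() keeps no state between calls.

```javascript
// Option 1: 'bar/baz' must not be followed by '?' or '/?'.
const option1 = /\/(test|foo|bar\/baz(?!\/?\?)|en|ppp)(\/[-\w?=]+)*\/?/;

const option1Samples = [
  'www.mysite.com/bar/baz/myParam/',     // parameter after baz
  'www.mysite.com/bar/baz/?uid=100',     // query directly after baz/
  'www.mysite.com/bar/baz?uid=100',      // query directly after baz
  'www.mysite.com/test/bar/baz?uid=100', // starts with /test instead
];

const option1Results = option1Samples.map((url) => option1.test(url));
console.log(option1Results); // [ true, false, false, true ]
```

The last sample matches because the match starts at /test, so the lookahead attached to the bar/baz alternative is never consulted.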
Option 2
To make explicit that ?a=b is not allowed anywhere, we can use a negative lookbehind. Let's first find a pattern that rejects ?a=b when it is preceded by bar/baz.
Shorthand:
(?<!bar\/baz\/?)\?\w+=\w+
Explanation:
(?<! # Negative lookbehind: do **not** match preceding
bar\/baz # 'bar/baz'
\/? # optional '/'
)
\? # match '?'
\w+=\w+ # match e.g. 'a=b'
Let's make this part of the complete regex:
\/(test|foo|en|ppp|bar\/baz)(\/?((?<!bar\/baz\/?)\?\w+=\w+|[-\w]+))*\/?$
Explanation:
\/ # match '/'
(test|foo|en|ppp|bar\/baz) # start with 'test', 'foo', 'en', 'ppp', 'bar/baz'
(\/? # optional '/'
((?<!bar\/baz\/?)\?\w+=\w+ # match 'a=b', with negative lookbehind (see above)
| # OR
[-\w]+) # 1 or more word chars or '-'
)* # repeat 0 or more times
\/? # optional match for closing '/'
$ # end anchor
Examples Option 2
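The complete Option 2 regex can be tried out the same way. This sketch assumes a JavaScript engine with lookbehind support (current V8/Node.js has it; some older browsers do not), and the sample URLs are made up for illustration.

```javascript
// Option 2: a '?a=b' part is rejected whenever 'bar/baz' or 'bar/baz/'
// sits directly in front of it, wherever that occurs in the path.
const option2 =
  /\/(test|foo|en|ppp|bar\/baz)(\/?((?<!bar\/baz\/?)\?\w+=\w+|[-\w]+))*\/?$/;

const option2Samples = [
  'www.mysite.com/bar/baz/myParam/',      // no query string
  'www.mysite.com/bar/baz/?uid=100',      // query right after bar/baz/
  'www.mysite.com/test/bar/baz/?uid=100', // same, further into the path
  'www.mysite.com/test/foo?uid=100',      // query not preceded by bar/baz
];

const option2Results = option2Samples.map((url) => option2.test(url));
console.log(option2Results); // [ true, false, false, true ]
```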

Match only subregex, part of regex

Hello, I want to build an autofiller for input in the format "HH:MM".
I want to validate only against the regex /^(0[1-9]|1[012]):[0-5][0-9]$/, but I have no idea how to match a partial input against it. I've looked at Wikipedia and some other sites and can't find a modifier that checks for a 'subregex'. Doesn't this option exist? I finally solved the problem with the code below, but this array could certainly be generated programmatically, so there should already be a solution for what I'm looking for. Or does it not exist, and I should write it myself?
patterns = [ /./, /^[0-9]$/, /^(0?[1-9]|1[012])$/, /^(0[1-9]|1[012]):$/, /^(0[1-9]|1[012]):[0-5]$/, /^(0[1-9]|1[012]):[0-5][0-9]$/]
unless patterns[newTime.length].test(newTime)
newTime = newTime.substring(0, newTime.length - 1)
You could probably accomplish the same thing a bit more efficiently.
Combine the regexes into a cascading optional form, then use the match length, a substring
and a template to autocomplete the time.
Pseudo-code (I don't know JS too well) and the real regex:
# pseudo-code:
# -------------------------
# input = ....;
# template = '00:00';
# rx = ^(?:0(?:[0-9](?::(?:[0-5](?:[0-9])?)?)?)?|1(?:[0-2](?::(?:[0-5](?:[0-9])?)?)?)?)$
# match = regex( input, rx );
# input = input + substr( template, match.length(), -1 );
^
(?:
0
(?:
[0-9]
(?:
:
(?:
[0-5]
(?: [0-9] )?
)?
)?
)?
|
1
(?:
[0-2]
(?:
:
(?:
[0-5]
(?: [0-9] )?
)?
)?
)?
)
$
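The pseudo-code above could be sketched in JavaScript roughly as follows. The function name autocompleteTime is made up, and returning null for input that cannot become a valid time is a design choice of this sketch (the question's code drops the last character instead). As in the answer's regex, a leading 00 is accepted, which differs from the 0[1-9] in the question.

```javascript
// Cascading-optional regex from the answer: matches every valid prefix
// of "HH:MM" (hours 00-12, minutes 00-59), but not the empty string.
const partialTime =
  /^(?:0(?:[0-9](?::(?:[0-5](?:[0-9])?)?)?)?|1(?:[0-2](?::(?:[0-5](?:[0-9])?)?)?)?)$/;

// Template that supplies the characters the user has not typed yet.
const template = '00:00';

// Hypothetical helper: complete a partial input, or return null when
// the input can no longer become a valid time.
function autocompleteTime(input) {
  const match = partialTime.exec(input);
  if (match === null) {
    return null;
  }
  return input + template.slice(match[0].length);
}

console.log(autocompleteTime('1'));    // 10:00
console.log(autocompleteTime('09:3')); // 09:30
console.log(autocompleteTime('12:'));  // 12:00
console.log(autocompleteTime('13'));   // null
```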

regex to match simple URLs does not work properly

I'm trying to write a simple regex to match simple URLs (without URL parameters etc.).
It seems to work, but there are still some problems.
This is my regex:
/(https|http|ftp):\/\/((-|[a-z0-9])+\.)+(com|org|net)\/?((-|[a-z0-9]\/?)+(-|[a-z0-9])*\.(css|js))?/ig
In this little list you can see what does not work properly:
HTTP://q-2Ud.a.q-2Ud.com/
https://q-2Ud.q-2Ud.q-2Ud.com
http://www.q-2Ud.q-2Ud.q-2Ud.com
http://www.q-2Ud.q-2Ud.q-2Ud.com/c ------------------------------------> NOT WORK
http://www.q-2Ud.q-2Ud.q-2Ud.com/cs -----------------------------------> NOT WORK
http://www.q-2Ud.q-2Ud.q-2Ud.com/css ----------------------------------> NOT WORK
http://www.q-2Ud.q-2Ud.q-2Ud.com/csss ---------------------------------> NOT WORK
http://www.q-2Ud.q-2Ud.q-2Ud.com/csss/css -----------------------------> NOT WORK
http://www.q-2Ud.q-2Ud.q-2Ud.com/css/yuyuyu/gyygug.css
http://www.q-2Ud.q-2Ud.q-2Ud.com/h/.css -------------------------------> NOT WORK
http://www.q-2Ud.q-2Ud.q-2Ud.com/.css
http://www.q-2Ud.q-2Ud.q-2Ud.com/k.css
http://www.q-2Ud.q-2Ud.q-2Ud.com/kk.css
http://www.q-2Ud.q-2Ud.q-2Ud.com/kkk.css
http://www.q-2Ud.q-2Ud.q-2Ud.com/f-1.css
http://www.q-2Ud.q-2Ud.q-2Ud.com/o/o.css
http://www.q-2Ud.q-2Ud.q-2Ud.com/d-1/d-2/d-3/d-4/f-1.css
http://www.q-2Ud.q-2Ud.q-2Ud.com/q-2Ud/q-2Ud/q-2Ud/q-2Ud/q-2Ud.js
Demo Here
The path part of your regex only matches when it ends in .css or .js.
Make the \.(css|js) part optional and it should work:
/(https|http|ftp):\/\/((-|[a-z0-9])+\.)+(com|org|net)\/?\.?((-|[a-z0-9]\/?)+(-|[a-z0-9])*\/?(\.css|\.js)?)?/ig
This should catch the ones you are missing.
You just need to arrange the groups a little better while maintaining validity.
The version below is trimmed to capture just the 4 main parts, without delimiters.
edit: If you don't want to match .js or .css without a filename, use this regex ->
(?i)(https|http|ftp)://((?:[a-z0-9-]+\.)+(?:com|org|net))(?:/(?:([a-z0-9-]+(?:/?[a-z0-9-])*(?:\.(css|js))?))?)?
otherwise use this one ->
# /(?i)(https|http|ftp):\/\/((?:[a-z0-9-]+\.)+(?:com|org|net))(?:\/(?:([a-z0-9-]+(?:\/?[a-z0-9-])*)\/?)?(?:\.(css|js))?)?/
(?i)
( https | http | ftp ) # (1)
://
( # (2 start)
(?:
[a-z0-9-]+
\.
)+
(?: com | org | net )
) # (2 end)
(?:
/
(?:
( # (3 start)
[a-z0-9-]+
(?:
/?
[a-z0-9-]
)*
) # (3 end)
/?
)?
(?:
\.
( css | js ) # (4)
)?
)?
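In JavaScript the second regex could be used roughly like this. Two things in the sketch are assumptions: the inline (?i) is replaced by the i flag, and ^ and $ anchors are added so that each URL is checked as a whole instead of being searched for inside a larger text. The helper name parseSimpleUrl is made up.

```javascript
// Groups: (1) scheme, (2) host, (3) path without trailing slash,
// (4) extension without the dot.
const simpleUrl =
  /^(https|http|ftp):\/\/((?:[a-z0-9-]+\.)+(?:com|org|net))(?:\/(?:([a-z0-9-]+(?:\/?[a-z0-9-])*)\/?)?(?:\.(css|js))?)?$/i;

// Hypothetical helper: return the four parts, or null when the URL
// does not match. Groups that did not participate are undefined.
function parseSimpleUrl(url) {
  const match = simpleUrl.exec(url);
  if (match === null) {
    return null;
  }
  const [, scheme, host, path, extension] = match;
  return { scheme, host, path, extension };
}

console.log(parseSimpleUrl('http://www.q-2Ud.q-2Ud.q-2Ud.com/css/yuyuyu/gyygug.css'));
console.log(parseSimpleUrl('http://www.q-2Ud.q-2Ud.q-2Ud.com/h/.css'));
console.log(parseSimpleUrl('http://www.q-2Ud.q-2Ud.q-2Ud.com/c'));
console.log(parseSimpleUrl('http://www.q-2Ud.q-2Ud.q-2Ud.com/k.css?v=1')); // null
```

The last call returns null because the anchored pattern deliberately leaves no room for URL parameters.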

Regular expression for matching Titles (ex: Book title)

I am using the jquery.validate.js plugin to validate a form, and I want a regex that matches titles (of books, non-books, or any product). My attempts failed to match. The regex should allow the following:
=> 'A-Z', 'a-z', whitespace including tabs, special characters such as '(' , ')' , '-' , '_' , comma, dot, semicolon, hyphen, ':' , and all digits
I tried the following regexes:
/^[a-zA-Z0-9.'\-_\s]$/
/^[\d,\w,\s\;\:\()]$/
/^[^.-_#][A-Za-z0-9_ -.]+$/ - this one reports an error when the title starts with an uppercase 'A'
and I referred to the following sites:
http://regexpal.com/ // on this site I checked the above characters, but validation showed an error
http://regexlib.com/DisplayPatterns.aspx?AspxAutoDetectCookieSupport=1
http://www.vogella.com/articles/JavaRegularExpressions/article.html
thanks in advance
This regex should match what you want: /^[A-Za-z0-9\s\-_,\.;:()]+$/
Inside square brackets, a - needs escaping with a backslash (unless it comes first or last), because otherwise it defines a range; escaping the . is optional there. You also need a + or * after the square brackets to say 'one or more' or 'any number of' respectively.
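A short sketch of both points in plain JavaScript, without the validation plugin. The helper name isValidTitle and the sample titles are made up. It also shows why the third attempt from the question rejects a leading 'A': inside [^.-_#] the sequence .-_ is read as a range from '.' to '_', and that range contains all uppercase letters.

```javascript
// Pattern from the answer: one or more allowed characters, nothing else.
const titlePattern = /^[A-Za-z0-9\s\-_,\.;:()]+$/;

// Third attempt from the question, kept as-is to show the range problem.
const questionAttempt = /^[^.-_#][A-Za-z0-9_ -.]+$/;

// Hypothetical helper around the answer's pattern.
function isValidTitle(title) {
  return titlePattern.test(title);
}

console.log(isValidTitle('Book Title: Part 2 (draft_1), v1.0; final-copy')); // true
console.log(isValidTitle('Title #1')); // false, '#' is not allowed
console.log(isValidTitle(''));         // false, '+' needs at least one character

console.log(questionAttempt.test('A title')); // false, 'A' falls inside .-_
console.log(questionAttempt.test('a title')); // true
```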
I think this is the one you want:
^\w++(?:[.,_:()\s-](?![.\s-])|\w++)*$
Description
^ # Start of string
\w++ # Match one or more alnum characters, possessively
(?: # Match either
[.,_:()\s-] # a "special" character
(?![.\s-]) # asserting that it's really single
| # or
\w++ # one or more alnum characters, possessively
)* # zero or more times
$ # End of string

Javascript - regex odd number of quotes

So I have this JavaScript regex:
var reg = new RegExp("(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))");
How can I escape the quotes so that they stay inside the pattern? Right now they end the string early, and the lines after it are treated as a string.
Edit:
regex expanded:
(?xi)
\b
( # Capture 1: entire matched URL
(?:
[a-z][\w-]+: # URL protocol and colon
(?:
/{1,3} # 1-3 slashes
| # or
[a-z0-9%] # Single letter or digit or '%'
# (Trying not to match e.g. "URI::Escape")
)
| # or
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999."
| # or
[a-z0-9.\-]+[.][a-z]{2,4}/ # looks like domain name followed by a slash
)
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
(?: # End with:
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
)
If you just use the regex literal form in JavaScript:
var reg = /regex here/;
Then you can freely use quotes in the regex without escaping anything. You will have to escape any forward slash in the regex by putting a backslash in front of it.
If you want to stick with the string form, you can escape a quote with a backslash in front of it to keep it from acting as a string terminator:
var reg = new RegExp('My dog\'s breath');
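A small sketch comparing the two forms. The patterns here are short stand-ins, not the URL regex from the question. Besides the quotes, the string form has a second pitfall: every regex backslash must be doubled, because the string literal consumes single backslashes first. Flags such as i are passed as the second argument; as far as I know, the inline (?i) used in the question is not standard JavaScript syntax.

```javascript
// Literal form: quotes need no escaping.
const quotedLiteral = /(["'])(.*?)\1/;

// String form: escape the string's own delimiter and double each
// regex backslash (\\1 in the string becomes \1 in the pattern).
const quotedFromString = new RegExp("([\"'])(.*?)\\1");

// Flags go in the second argument.
const wordFromString = new RegExp("\\bdog\\b", "i");

// With single backslashes, "\b" is a backspace character, so this
// pattern looks for backspace + "dog" + backspace.
const brokenWord = new RegExp("\bdog\b", "i");

console.log(quotedLiteral.exec('say "hi" now')[2]);    // hi
console.log(quotedFromString.exec("say 'hi' now")[2]); // hi
console.log(wordFromString.test("My DOG's breath"));   // true
console.log(brokenWord.test("My DOG's breath"));       // false
```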
