Regular Expression to get link and link description - javascript

I'm attempting to develop my own linking syntax as such:
[this is google|google.com]
I know how to get the text between the square brackets (\[(.*?)\]) but I'm not sure how to extract the individual pieces. Also, if someone simply wanted to add square brackets without a link (e.g. [this is google]), it shouldn't be detected as a link.
Can anyone provide me with some direction on this? I need to access both pieces.

Here you go:
\[(.*?)\|(.*?)\]
Basically, just add another capture group. The \| escapes the | in your marvelous syntax. :)
http://regex101.com/r/dS3uJ9
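As a rough sketch of how the two capture groups could be read in JavaScript (the helper name parseLink is made up for illustration and is not part of the answer):

```javascript
// Pattern from the answer above: description before the |, target after it.
const linkPattern = /\[(.*?)\|(.*?)\]/;

// Hypothetical helper: returns both pieces, or null when there is no "|" link.
function parseLink(text) {
  const m = text.match(linkPattern);
  return m ? { description: m[1], url: m[2] } : null;
}

console.log(parseLink('[this is google|google.com]'));
console.log(parseLink('[this is google]')); // no pipe inside the brackets, so no match
```

One caveat: .*? can run across a closing bracket, so in a longer text such as '[plain] and [a|b]' the match starts at the first [ and the first group becomes 'plain] and [a'.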

Not this? Try this regex
(\[(.*?)\|?(.*?)\])

You can use:
s='[this is google|google.com]';
m = s.match(/\[([^|]+)\|([^\]]+)\]/);
Then use m[1] and m[2] for description and link.
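If the syntax has to be found inside longer text, one possible variation (an assumption on my side, not part of the answer) is to also exclude square brackets from both groups, so that a plain [note] earlier on the line cannot be swallowed into the description. The renderLinks helper and the HTML output format below are hypothetical:

```javascript
// Stricter variant: neither piece may contain [ or ], and the description
// may not contain |. The g flag lets replace() handle every link in the text.
const strictLink = /\[([^|\[\]]+)\|([^\[\]]+)\]/g;

// Hypothetical helper: turns each [description|url] into an anchor tag.
// It does no HTML escaping, so it is only suitable for trusted input.
function renderLinks(text) {
  return text.replace(strictLink, (whole, description, url) =>
    `<a href="${url}">${description}</a>`);
}

console.log(renderLinks('[plain] and [this is google|google.com]'));
```

Plain brackets without a | are left untouched, which is the behaviour asked for in the question.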
UPDATE: To show the performance difference between the two regex approaches, I created two regex101 links:
http://regex101.com/r/xZ6nB7 (Using negation)
http://regex101.com/r/rI0eH8 (Using lazy quantifier .*?)
To see the performance difference, click the Launch regex debugger link at the top left of each page.
You will notice my regex displays:
+Match 1 - finished in 11 steps
But .*? link shows this:
+Match 1 - finished in 59 steps
This shows that, for this input, the negation-based regex needs far fewer steps than the lazy .*? approach.
PS: If you click on +Match 1 - finished in 59 steps in the 2nd link, you will see BACKTRACK messages in red.

Related

Regex: Replace last segment of url

I try to figure out the correct regex to replace the last segment of an url with a modified version of that very last segment. (I know that there are similar threads out there, but none seemed to help...)
Example:
https://www.test.com/one/two/three/mypost/
--->
one/two/three?id=mypost
https://www.test.com/one/mypost/
--->
one?id=mypost
Now I am stuck here:
https://regex101.com/r/9GqYaU/1
I can get the last segment in capturing group 2 but how would I replace it?
I think I will have to do something like this:
const url = 'https://www.test.com/one/two/three/mypost/'
const regex = /(http[s]?:\/\/)([^\/]+\/)*(?=\/$|$)/
const path = url.replace(regex, `${myUrlWithoutTheLastSegmentAnd WithoutHTTPS}?id=$2`)
return path
But I have no idea how to get the url without the last segment. I have currently only access to the whole string or group 1 (which is useless in this case) and then group 2, but not the string without group 2.
I would be very glad for any help here. Sometimes I just lack the knowledge of what is possible with regex and how to achieve it.
Thank you in advance.
Cheers
You could use the URL class to extract the pathname and substring to remove the first '/'.
Then, you could put the last part of the pathname in a group and use it as a reference $1 for the replacement.
const url = new URL('https://www.test.com/one/two/three/mypost/').pathname.substring(1)
console.log(url.replace(/\/([^/]*)\/$/, '?id=$1'))
I came across your question yesterday and agree with going down the route of parsing the URL. Once you get there you could even use JavaScript array methods which I prefer to string methods like:
pathname.split("/").filter(p => p.length).pop()
This would separate each folder, ignore any with no length (i.e. handle a trailing slash) and return the last one (mypost).
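Putting the array approach together for the two examples in the question might look like this. It is only a sketch: the function name toQueryPath is made up, and returning an empty string for a URL without path segments is my own assumption.

```javascript
// Hypothetical helper: moves the last path segment into an ?id= parameter.
function toQueryPath(urlString) {
  // Split the pathname into folders and drop empty pieces
  // (leading slash, trailing slash).
  const segments = new URL(urlString).pathname.split("/").filter(p => p.length);
  if (segments.length === 0) return ""; // assumption: nothing to rewrite
  const last = segments.pop();          // e.g. "mypost"
  return `${segments.join("/")}?id=${last}`;
}

console.log(toQueryPath("https://www.test.com/one/two/three/mypost/"));
console.log(toQueryPath("https://www.test.com/one/mypost/"));
```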
Anyway, I am also learning regex so sometimes when I find a question like this I just try to find the answer anyway as the best way of learning is doing. It took 24 hours 😂 I came up with this:
/(https?:\/\/).+?([a-z-]*)\/?$/gm
(https?:\/\/) you know what this does. Small correction, you don't need the square brackets. Question mark matches 0 or 1 of the preceding character. As we're only matching s this just works. If you wanted to match s or z you would use [sz]?. I think.
.+? this is the cool one I think I will use in future now I found it. The question mark here has a different meaning - it makes .+ (which means one or more of any character) non-greedy. That means it stops applying once it reaches the next rule. Which is...
([a-z-]*) any number of letters or a hyphen. You should maybe change this to include numbers and upper case.
\/? Optional slash
$ all this must apply at the end of the string.
Here is a demo
https://regex101.com/r/mQNkIS/1
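For reference, here is roughly how that pattern behaves on the URLs from the question. I dropped the g and m flags in this sketch so that match() returns the capture groups for a single string; the variable names are only illustrative.

```javascript
// Same pattern as above, without the g/m flags.
const lastSegment = /(https?:\/\/).+?([a-z-]*)\/?$/;

const withSlash = "https://www.test.com/one/two/three/mypost/".match(lastSegment);
const withoutSlash = "https://www.test.com/one/mypost".match(lastSegment);

console.log(withSlash[1], withSlash[2]); // protocol and last segment
console.log(withoutSlash[2]);
```

As mentioned in the breakdown, the [a-z-] class would need to be widened before this can handle segments containing digits or upper-case letters.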

Regular Expression to find merge conflicts in file

This is the file which contains merge conflicts,
<<<<<<< HEAD
$conf['some_unit_id'] = '4-qw-gg-ds-sometext';
=======
// Some Snippets Site Info
$conf['site_info'] = array(
'customer_service_phone' => '+1 323223232
'logo_path' => 'https://www.google.com/img/icons/src/logo.svg',
'currency' => 'CAD',
'https://www.youtube.com/user/somewebsite/ogog',
'https://www.instagram.com/somewebsite/',
),
);
>>>>>>> ff6df3435231fdff78fwsd83e7dffa0732eft554
// Somes code
$done['rules'] = TRUE;
I am trying to find the best regular expression to detect merge conflicts in the file. Initially I tried:
/(<* HEAD)/
This detects only HEAD preceded by some < characters.
There are other markers as well, such as:
1. ======
2. >>>>> ff6df3435231fdff78fwsd83e7dffa0732eft554
These two markers must be detected along with the HEAD marker. And if a developer removes only the <<<<<<< HEAD line and leaves the rest (i.e., ======= and >>>>>>> ff6df3435231fdff78fwsd83e7dffa0732eft554), the regular expression should detect that as well.
I am using this regular expression in a pre-commit hook: if any one of the patterns is detected in a file, the commit should be aborted. I need an exact regex to detect merge conflict markers.
Any solution would be appreciated.
Since they're all the same length, you can use a character class:
/^[<=>]{7}( .+)?$/mg
(make sure to use a multiline regex)
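A minimal sketch of how such a check could report the offending lines, for example from a Node-based hook script. The function name and the shape of the returned objects are assumptions for illustration; here the pattern is applied line by line, so the m and g flags are not needed.

```javascript
// Same character-class pattern, tested against one line at a time.
const markerLine = /^[<=>]{7}( .+)?$/;

// Hypothetical helper: returns the 1-based line number and text of each marker.
function findConflictMarkers(text) {
  const hits = [];
  text.split(/\r?\n/).forEach((line, i) => {
    if (markerLine.test(line)) hits.push({ line: i + 1, text: line });
  });
  return hits;
}

const sample = [
  "<<<<<<< HEAD",
  "$conf['some_unit_id'] = '4-qw-gg-ds-sometext';",
  "=======",
  "$conf['currency'] = 'CAD';",
  ">>>>>>> ff6df34",
  "$done['rules'] = TRUE;",
].join("\n");

console.log(findConflictMarkers(sample).map(h => h.line));
```

Because each marker is matched on its own, a leftover ======= or >>>>>>> line is still reported after the <<<<<<< HEAD line has been removed.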
You can use:
^<{7} HEAD(?:(?!={7})[\s\S])*={7}(?:(?!>{7} \w+)[\s\S])*>{7} \w+
Demo & explanation
You might also match all the lines by checking the start of each line to prevent some of the unnecessary backtracking using [\s\S].
First match the <<<<<<< HEAD part, then match all following lines that do not start with ======= and then match it.
Then match all lines that do not start with >>>>>>> followed by matching it and chars [a-z0-9].
^<{7} HEAD(?:\r?\n(?!={7}\r?\n).*)*\r?\n={7}(?:\r?\n(?!>{7} ).*)*\r?\n>{7} [a-z0-9]+
Regex demo
If you want to highlight the markers, you could use a capturing group:
^(<{7} HEAD)(?:\r?\n(?!={7}\r?\n).*)*\r?\n(={7})(?:\r?\n(?!>{7} ).*)*\r?\n(>{7} [a-z0-9]+)
Regex demo
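In JavaScript, the capturing version could be used along these lines. This is a sketch only: the m flag is needed so that ^ matches at the start of a line, and the sample text and variable names are invented for the example.

```javascript
// Capturing pattern from above, with the m flag so ^ can match mid-file.
const conflictBlock = /^(<{7} HEAD)(?:\r?\n(?!={7}\r?\n).*)*\r?\n(={7})(?:\r?\n(?!>{7} ).*)*\r?\n(>{7} [a-z0-9]+)/m;

const fileText = [
  "// before",
  "<<<<<<< HEAD",
  "ours",
  "=======",
  "theirs",
  ">>>>>>> ff6df34",
  "// after",
].join("\n");

const found = fileText.match(conflictBlock);
if (found) {
  // Groups 1 to 3 hold the three marker lines.
  console.log(found[1], "|", found[2], "|", found[3]);
}
```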
If I understand your goal correctly, you want to find the blocks of code where a conflict still needs to be resolved. I hope my suggestion helps.
/^<{7}\sHEAD[\s\S]+?>{7}\s\w+$/gm
Details:
Mode: multiline
^<{7}\sHEAD: the block starts with <<<<<<< HEAD
[\s\S]+?: matches any character, as few times as possible (line breaks included)
>{7}\s\w+$: the block ends with >>>>>>> followed by the commit hash
Demo
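A small sketch of counting whole conflict blocks with this pattern, assuming an environment that supports String.prototype.matchAll. The helper name and the sample text are made up.

```javascript
// Pattern from above; matchAll requires the g flag.
const block = /^<{7}\sHEAD[\s\S]+?>{7}\s\w+$/gm;

// Hypothetical helper: number of complete <<<<<<< HEAD ... >>>>>>> blocks.
function countConflictBlocks(text) {
  return [...text.matchAll(block)].length;
}

const twoBlocks =
  "<<<<<<< HEAD\na\n=======\nb\n>>>>>>> abc\nmid\n" +
  "<<<<<<< HEAD\nc\n=======\nd\n>>>>>>> def\n";

console.log(countConflictBlocks(twoBlocks));
```

Since the pattern only matches complete blocks, it will not flag a file where just the <<<<<<< HEAD line was deleted.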

Matching Multiple Level Deep URL

I have been trying to write a regex in JavaScript that matches URLs that are exactly 2 levels deep, in the following format:
https://myurl.com/x/y/z
https://anotherurl.com/ab9/zx/qs
I have tried a couple of relevant regexes suggested in other answers and tried modifying them for my own purposes, but to no avail: Regex to match 2 level deep URL and not 3 level deep (Google Analytics), https://mixedanalytics.com/blog/regex-match-number-subdirectories-url/
Could someone shed some light? Please pardon my lack of knowledge in regex. I am just starting out.
I think you want a regular expression like this:
/^(?:https|http)\:\/\/[^\/]+\/[^\/]+\/[^\/]+\/[^\/]+$/
The Explanation:-
^ : The start of the string.
(?: ... ) : A non-capturing group.
https|http : Matches both https and http.
\:\/\/ : Matches :// which appear after https.
[^\/]+ : Matches anything except /, and the plus means one or more occurrences(letters or symbols).
\/ : Matches / symbol.
$ : The end of the string.
The rest of the regex repeats the parts described above. If the explanation is unclear, this link describes it better than I can. I didn't write a regex for 2-level-deep URLs because your examples are 3 levels deep, not 2. If you do want 2-level-deep URLs regardless of your examples, use this instead:
/^(?:https|http)\:\/\/[^\/]+\/[^\/]+\/[^\/]+$/
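Since the two patterns differ only in how many /segment parts are repeated, the depth could also be passed in as a parameter. The sketch below assumes a made-up helper named hasDepth and builds the regex with the RegExp constructor, where / does not need escaping:

```javascript
// Hypothetical helper: true if the URL has exactly `depth` path segments
// after the host, with no trailing slash.
function hasDepth(url, depth) {
  const pattern = new RegExp(`^https?://[^/]+(?:/[^/]+){${depth}}$`);
  return pattern.test(url);
}

console.log(hasDepth("https://myurl.com/x/y/z", 3));
console.log(hasDepth("https://anotherurl.com/ab9/zx/qs", 3));
console.log(hasDepth("https://myurl.com/x/y", 3));
```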

javascript regex negation detect Url NOT containing given domain

I need to check some HTML files and extract the URLs that do not point to two specific websites.
After many tests I got this:
/(http|https)?:?(\/\/)\w*\.*\-*[^(mysite.com)]\w*\.?\S*/igm
It works reasonably well, but not perfectly.
For example, as you can see HERE on regexr.com, it matches
// End
but not
www.demo.com
when it should be the other way round. Adding a ? after (\/\/) turns it into a useless "catch all".
Also, if a URL has a " at the beginning and at the end, which clearly happens frequently,
it correctly does not grab the starting " but wrongly grabs the ending one.
Finally, it should also not match theothermysite.net, but I don't understand well how to handle OR with negation :-(
Can anyone help, please?
Joe
Like this?
/((http|https):(\/\/)|www\.)\w*\.*\-*[^(mysite.com)(theothermysite.net)]\w*\.?[^\s\t\r\n\"]*/igm
I just added an "or www" alternative, replaced \S with an equivalent negated class that also excludes ", and added another parenthesised name to the negation, as you already did with mysite.com
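One thing to keep in mind: inside square brackets, [^(mysite.com)] is a negated character class, so it excludes single characters (such as m, y, s or the parentheses) rather than the whole domain name. A different technique that can exclude whole names is a negative lookahead. The following is only a sketch under my own assumptions: it looks for URLs starting with http://, https:// or www., skips the two hosts from the question (with or without a leading www.), and does not cover other subdomains.

```javascript
// Negative lookahead: right after the prefix, the host must not be one of
// the two excluded domains. The inner (?![\w.-]) makes sure the domain name
// ends there, so a host like mysite.com.example.org is not excluded.
const external = /\b(?:https?:\/\/|www\.)(?!(?:www\.)?(?:mysite\.com|theothermysite\.net)(?![\w.-]))[^\s"'<>]+/gi;

// Hypothetical helper: all URLs in the text that do not point to those hosts.
function externalUrls(html) {
  return html.match(external) || [];
}

const html = '<a href="http://mysite.com/a">x</a> // End ' +
  '<a href="https://www.demo.com/page">y</a> www.theothermysite.net';

console.log(externalUrls(html));
```

The closing " is not captured because quotes, whitespace and angle brackets are excluded from the URL body.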

REGEX - links vs hashtags conflict

I'm using some regex to convert links, hashtags, mentions etc. in text I get from APIs (Twitter, Facebook, ...)
It works well, but in the special case where there is an anchor # in a link, the first pattern converts the link first and then the hashtag inside the link tag. For example, converting:
http://www.mytaratata.com/emission/taratata-n89/video/557/edwyn-collins-a-girl-like-you-1995#newsletter
is a mess.
I would just like the regex for Twitter hashtags not to match if it's a link (for example, if it contains a dot)
hello#music -> match
#hello#music -> match
hello.com#music -> no match
I'm thinking about something like this using a negative lookahead, but I can't get it to work:
((?!\.)#.*\w*[a-zA-Z_]+\w*)
I think you want something like this,
^(?!.*?\.).*?(#.*\w*[a-zA-Z_]+\w*)
Get the hashtag from group index 1.
DEMO
OR
^.*?\..*$|(#\w*[a-zA-Z_]+\w*)
DEMO
I would suggest keeping things simple here using this regex:
^[^.\n]+#([^#.\n]+)
RegEx Demo
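Checked against the three examples from the question, this last pattern could be used roughly like so (the helper name extractHashtag is made up for the sketch):

```javascript
// Anchored pattern from above: no dot may appear before the # or in the tag.
const hashtag = /^[^.\n]+#([^#.\n]+)/;

// Hypothetical helper: returns the tag text, or null when the text looks like a link.
function extractHashtag(text) {
  const m = text.match(hashtag);
  return m ? m[1] : null;
}

console.log(extractHashtag("hello#music"));
console.log(extractHashtag("#hello#music"));
console.log(extractHashtag("hello.com#music"));
```

Note that the pattern requires at least one character before the #, so a string that consists only of a single hashtag, such as "#music", would not match as written.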
