Js ReGex non-capturing group not working - javascript

my problem is, i need to capture an script src, but i need to get it only if it has an script tag before the src.
So here follow my regex and the options i tried
String: <script src="http://example.net"></script>
Regex: /(?:\<script[^]+src=("|'))([^]+)(?="|')/g
Match: <script src="http://example.net
Second option:
String: <script src="http://example.net"></script>
Regex: /(?!\<script[^]+src=("|'))([^]+)(?="|')/g
Match: script src="http://example.net
What i need to get is: http://example.net
I really do appreciate any help.
This is the tool i'm using for testing: http://www.regexr.com/
Thanks,

Regular expression is not the right tool for parsing HTML, but to fix the problem you can use the exec() method in a loop to grab all your submatches and then push the match results of the captured group into an array.
var s = '<script src="http://foo.net"></script><script src="http://bar.com"></script>';
var re = /<script[^>]+?src=['"]([^'"]+)['"]/g,
matches = [];
while (m = re.exec(s)) {
matches.push(m[1]);
}
console.log(matches) //=> [ 'http://foo.net', 'http://bar.com' ]

Not sure exactly what you're trying to do or where you got that syntax.
If you want values of the src attribute in all script tags, why not just search for /<script[^>]*\ssrc="([^"]*)"/ and examine the first subexpression match..

This syntax [^]+ as far i know, works only with old versions of internet explorer (but perhaps with newer versions too, you know microsoft) and means all that is not nothing (i.e. everything), one or several times.
If you want to match all the characters until the end of the tag and before the attribute you want, you need to use [^>]+? (as you can see) with a lazy quantifier.
For the second ugly [^], since it is between quotes, you only need to replace it with [^"'] that excludes quotes.
The result you need is not the whole match but the content of the capture group.
<script[^>]+?src=["']([^"']+)["']

Here's a start for you:
/<script src=\"(.*)(?=\")/g
Retrieve the value of the first capturing group returned by this expression.

Here is the regexr.com result:
String: <script src="http://example.net"></script>
Regex: /(?:<script src=")([^"]+)/g
group#1: http://example.net
And here is the example javascript code:
s = '<script src="http://example.net"></script>';
url = s.split(/(?:<script src=")([^"]+)/g)[1];
Since javascript doesn't support lookbehind assertions, - AFAIK - You can't both match only the url and check if there is a script tag before the url. Therefore, As an alternative of lookbehind assertions, this is the fastest and easiest solution that i know.

Related

javascript regex replace multiline strings [duplicate]

var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre.*?<\/pre>/gm );
alert(arr); // null
I'd want the PRE block be picked up, even though it spans over newline characters. I thought the 'm' flag does it. Does not.
Found the answer here before posting. SInce I thought I knew JavaScript (read three books, worked hours) and there wasn't an existing solution at SO, I'll dare to post anyways. throw stones here
So the solution is:
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[\s\S]*?<\/pre>/gm );
alert(arr); // <pre>...</pre> :)
Does anyone have a less cryptic way?
Edit: this is a duplicate but since it's harder to find than mine, I don't remove.
It proposes [^] as a "multiline dot". What I still don't understand is why [.\n] does not work. Guess this is one of the sad parts of JavaScript..
DON'T use (.|[\r\n]) instead of . for multiline matching.
DO use [\s\S] instead of . for multiline matching
Also, avoid greediness where not needed by using *? or +? quantifier instead of * or +. This can have a huge performance impact.
See the benchmark I have made: https://jsben.ch/R4Hxu
Using [^]: fastest
Using [\s\S]: 0.83% slower
Using (.|\r|\n): 96% slower
Using (.|[\r\n]): 96% slower
NB: You can also use [^] but it is deprecated in the below comment.
[.\n] does not work because . has no special meaning inside of [], it just means a literal .. (.|\n) would be a way to specify "any character, including a newline". If you want to match all newlines, you would need to add \r as well to include Windows and classic Mac OS style line endings: (.|[\r\n]).
That turns out to be somewhat cumbersome, as well as slow, (see KrisWebDev's answer for details), so a better approach would be to match all whitespace characters and all non-whitespace characters, with [\s\S], which will match everything, and is faster and simpler.
In general, you shouldn't try to use a regexp to match the actual HTML tags. See, for instance, these questions for more information on why.
Instead, try actually searching the DOM for the tag you need (using jQuery makes this easier, but you can always do document.getElementsByTagName("pre") with the standard DOM), and then search the text content of those results with a regexp if you need to match against the contents.
You do not specify your environment and version of JavaScript (ECMAScript), and I realise this post was from 2009, but just for completeness:
With the release of ECMA2018 we can now use the s flag to cause . to match \n (see https://stackoverflow.com/a/36006948/141801).
Thus:
let s = 'I am a string\nover several\nlines.';
console.log('String: "' + s + '".');
let r = /string.*several.*lines/s; // Note 's' modifier
console.log('Match? ' + r.test(s)); // 'test' returns true
This is a recent addition and will not work in many current environments, for example Node v8.7.0 does not seem to recognise it, but it works in Chromium, and I'm using it in a Typescript test I'm writing and presumably it will become more mainstream as time goes by.
Now there's the s (single line) modifier, that lets the dot matches new lines as well :)
\s will also match new lines :D
Just add the s behind the slash
/<pre>.*?<\/pre>/gms
[.\n] doesn't work, because dot in [] (by regex definition; not javascript only) means the dot-character. You can use (.|\n) (or (.|[\n\r])) instead.
I have tested it (Chrome) and it's working for me (both [^] and [^\0]), by changing the dot (.) with either [^\0] or [^] , because dot doesn't match line break (See here: http://www.regular-expressions.info/dot.html).
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[^\0]*?<\/pre>/gm );
alert(arr); //Working
In addition to above-said examples, it is an alternate.
^[\\w\\s]*$
Where \w is for words and \s is for white spaces
[\\w\\s]*
This one was beyond helpful for me, especially for matching multiple things that include new lines, every single other answer ended up just grouping all of the matches together.

What's wrong with this regular expression to find URLs?

I'm working on a JavaScript to extract a URL from a Google search URL, like so:
http://www.google.com/search?client=safari&rls=en&q=thisisthepartiwanttofind.org&ie=UTF-8&oe=UTF-8
Right now, my code looks like this:
var checkForURL = /[\w\d](.org)/i;
var findTheURL = checkForURL.exec(theURL);
I've ran this through a couple regex testers and it seems to work, but in practice the string I get returned looks like this:
thisisthepartiwanttofind.org,.org
So where's that trailing ,.org coming from?
I know my pattern isn't super robust but please don't suggest better patterns to use. I'd really just like advice on what in particular I did wrong with this one. Thanks!
Remove the parentheses in the regex if you do not process the .org (unlikely since it is a literal). As per #Mark comment, add a + to match one or more characters of the class [\w\d]. Also, I would escape the dot:
var checkForURL = /[\w\d]+\.org/i;
What you're actually getting is an array of 2 results, the first being the whole match, the second - the group you defined by using parens (.org).
Compare with:
/([\w\d]+)\.org/.exec('thisistheurl.org')
→ ["thisistheurl.org", "thisistheurl"]
/[\w\d]+\.org/.exec('thisistheurl.org')
→ ["thisistheurl.org"]
/([\w\d]+)(\.org)/.exec('thisistheurl.org')
→ ["thisistheurl.org", "thisistheurl", ".org"]
The result of an .exec of a JS regex is an Array of strings, the first being the whole match and the subsequent representing groups that you defined by using parens. If there are no parens in the regex, there will only be one element in this array - the whole match.
You should escape .(DOT) in (.org) regex group or it matches any character. So your regex would become:
/[\w\d]+(\.org)/
To match the url in your example you can use something like this:
https?://([0-9a-zA-Z_.?=&\-]+/?)+
or something more accurate like this (you should choose the right regex according to your needs):
^https?://([0-9a-zA-Z_\-]+\.)+(com|org|net|WhatEverYouWant)(/[0-9a-zA-Z_\-?=&.]+)$

JavaScript negative lookbehind issue

I've got some JavaScript that looks for Amazon ASINs within an Amazon link, for example
http://www.amazon.com/dp/B00137QS28
For this I use the following regex: /([A-Z0-9]{10})
However, I don't want it to match artist links which look like:
http://www.amazon.com/Artist-Name/e/B000AQ1JZO
So I need to exclude any links where there's a '/e' before the slash and the 10-character alphanumeric code. I thought the following would do that: (?<!/e)([A-Z0-9]{10}), but it turns out negative lookbehinds don't work in JavaScript. Is that right? Is there another way to do this instead?
Any help would be much appreciated!
As a side note, be aware there are plenty of Amazon link formats, which is why I want to blacklist rather than whitelist, eg, these are all the same page:
http://www.amazon.com/gp/product/B00137QS28/
http://www.amazon.com/dp/B00137QS28
http://www.amazon.com/exec/obidos/ASIN/B00137QS28/
http://www.amazon.com/Product-Title-Goes-Here/dp/B00137QS28/
In your case an expression like this would work:
/(?!\/e)..\/([A-Z0-9]{10})/
([A-Z0-9]{10}) will work equally well on the reverse of its input, so you can
reverse the string,
use positive lookahead,
reverse it back.
You need to use a lookahead to filter the /e/* ones out. Then trim the leading /e/ from each of the matches.
var source; // the source you're matching against the RegExp
var matches = source.match(/(?!\/e)..\/[A-Z0-9]{10}/g) || [];
var ids = matches.map(function (match) {
return match.substr(3);
});

Referencing nested groups in JavaScript using string replace using regex

Because of the way that jQuery deals with script tags, I've found it necessary to do some HTML manipulation using regular expressions (yes, I know... not the ideal tool for the job). Unfortunately, it seems like my understanding of how captured groups work in JavaScript is flawed, because when I try this:
var scriptTagFormat = /<script .*?(src="(.*?)")?.*?>(.*?)<\/script>/ig;
html = html.replace(
scriptTagFormat,
'<span class="script-placeholder" style="display:none;" title="$2">$3</span>');
The script tags get replaced with the spans, but the resulting title attribute is blank. Shouldn't $2 match the content of the src attribute of a script tag?
Nesting of groups is irrelevant; their numbering is determined strictly by the positions of their opening parentheses within the regex. In your case, that means it's group #1 that captures the whole src="value" sequence, and group #2 that captures just the value part.
Try this:
/<script (?:(?!src).)*(?:src="(.*?)")?.*?>(.*?)<\/script>/ig
See here: rubular
As stema wrote, the .*? matches too much. With the negative lookahead (?:(?!src).)* you will match only until a src attribute.
But actually in this case you could also just move the .*? into the optional part:
/<script (?:.*?src="(.*?)")?.*?>(.*?)<\/script>/ig
See here: rubular
The .*? matches too much because the following group is optional, ==> your src is matched from one of the .*? around. if you remove the ? after your first group it works.
Update: As #morja pointed out your solution is to move the first .*? into the optional src part.
Just for completeness: /<script (?:.*?(src="(.*?)"))?.*?>(.*?)<\/script>/ig
You can see it here on rubular (corrected my link also)
If you don't want to use the content of the first capturing group, then make it a non capturing group using (?:)
/<script (?:.*?(?:src="(.*?)"))?.*?>(.*?)<\/script>/ig
Then your wanted result is in $1 and $2.
Could you post the html you are retrieving? Your code works fine in a simple example: jsfiddle (warning: alert box)
My first guess is that one of your script tags does not have a src meaning you are left with a single capture group (the script contents).
I'm thinking that regular expressions by themselves can't do exactly what I'm looking for, so here's my modification to work around the problem:
var scriptTagFormat = /<script\s+((.*?)="(.*?)")*\s*>(.*?)<\/script>/ig;
html = html.replace(
scriptTagFormat,
'<span class="script-placeholder" style="display:none;" $1>$4</span>');
Before, I wanted to avoid setting non-standard attributes on the replacement span. This code blindly copies all attributes instead. Luckily, the non-standard attributes aren't stripped out of the DOM when I insert the HTML, so it will work for my purposes.

Regex to remove all but file name from links

I am trying to write a regexp that removes file paths from links and images.
href="path/path/file" to href="file"
href="/file" to href="file"
src="/path/file" to src="file"
and so on...
I thought that I had it working, but it messes up if there are two paths in the string it is working on. I think my expression is too greedy. It finds the very last file in the entire string.
This is my code that shows the expression messing up on the test input:
<script type="text/javascript" src="/javascripts/jquery.js"></script>
<script type="text/javascript">
$(document).ready(function(){
var s = '<img src="/one/two/keep.this">';
var t = s.replace(/(src|href)=("|').*\/(.*)\2/gi,"$1=$2$3$2");
alert(t);
});
</script>
It gives the output:
The correct output should be:
<img src="keep.this">
Thanks for any tips!
It doesn't have to be a regular expression (assuming / delimiters):
var fileName = url.split('/').pop(); //pop takes the last element
I would suggest run separate regex replacement, one for a links and another for img, easier and clearer, thus more maintainable.
This seems to work in case anyone else has the problem:
var t = s.replace(/(src|href)=('|")([^ \2]*\/)*\/?([^ \2]*)\2/gi,"$1=$2$4$2");
Try adding ? to make the * quantifiers non-greedy. You want them to stop matching when they encounter the ending quote character. The greedy versions will barrel right on past the ending quote if there's another quote later in the string, finding the longest possible match; the non-greedy ones will find the shortest possible match.
/(src|href)=("|').*?\/([^/]*?)\2/gi
Also I changed the second .* to [^/]* to allow the first .* to still match the full path now that it's non-greedy.

Categories