Context: I have some dynamically generated HTML which can have embedded javascript function calls inside. I'm trying to extract the function calls with a regular expression.
Sample HTML string:
<dynamic html>
<script language="javascript">
funcA();
</script>
<a little more dynamic html>
<script language="javascript">
funcB();
</script>
My goal is to extract the text "funcA();" and "funcB();" from the above snippet (either as a single string or an array with two elements would be fine). The regular expression I have so far is:
var regexp = /[\s\S]*<script .*>([\s\S]*)<\/script>[\s\S]*/gm;
Using html_str.replace(regexp, "$1") only returns "funcB();".
Now, this regexp works just fine when there is only ONE set of <script> tags in the HTML, but when there are multiple it only returns the LAST one when using the replace() method. Even removing the '/g' modifier matches only the last function call. I'm still a novice to regular expressions so I know I'm missing something fundamental here... Any help in pointing me in the right direction would be greatly appreciated. I've done a bit of research already but still haven't been able to get this issue resolved.
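Your wildcard quantifiers are all greedy. Each one matches as much text as it possibly can, so the leading [\s\S]* swallows everything up to the last <script> tag and only the last function call is left for the capture group.
Make them all non-greedy (.*? and [\s\S]*?) and it should work.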
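As a rough sketch of the non-greedy approach, the loop below collects every script body into an array instead of using replace(). The names htmlStr, scriptBody and calls are made up for illustration, and the sample string is modeled on the one in the question.

```javascript
// Sample input modeled on the HTML string from the question.
const htmlStr = [
  '<dynamic html>',
  '<script language="javascript">',
  'funcA();',
  '</script>',
  '<a little more dynamic html>',
  '<script language="javascript">',
  'funcB();',
  '</script>'
].join('\n');

// The lazy [\s\S]*? stops at the first closing tag instead of the last one.
const scriptBody = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;

const calls = [];
let match;
// With the g flag, exec() resumes after the previous match on each call.
while ((match = scriptBody.exec(htmlStr)) !== null) {
  calls.push(match[1].trim());
}

console.log(calls);
```

With the sample above this should log an array holding the two calls, funcA(); and funcB();.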
My problem is that I need to capture a script src, but only if there is a script tag before the src.
Here are my regex and the options I tried:
String: <script src="http://example.net"></script>
Regex: /(?:\<script[^]+src=("|'))([^]+)(?="|')/g
Match: <script src="http://example.net
Second option:
String: <script src="http://example.net"></script>
Regex: /(?!\<script[^]+src=("|'))([^]+)(?="|')/g
Match: script src="http://example.net
What I need to get is: http://example.net
I really do appreciate any help.
This is the tool i'm using for testing: http://www.regexr.com/
Thanks,
A regular expression is not the right tool for parsing HTML, but to fix the problem you can call the exec() method in a loop to grab every match and push the contents of the capture group into an array.
var s = '<script src="http://foo.net"></script><script src="http://bar.com"></script>';
var re = /<script[^>]+?src=['"]([^'"]+)['"]/g,
matches = [],
m;
while ((m = re.exec(s)) !== null) {
matches.push(m[1]);
}
console.log(matches) //=> [ 'http://foo.net', 'http://bar.com' ]
Not sure exactly what you're trying to do or where you got that syntax.
If you want the values of the src attribute in all script tags, why not just search for /<script[^>]*\ssrc="([^"]*)"/ and examine the first subexpression match?
The [^]+ syntax is a JavaScript-specific idiom: [^] is an empty negated character class, meaning "anything that is not nothing" (i.e. every character, newlines included), and the + repeats it one or more times. Most other regex flavors do not accept it.
If you want to match all the characters until the end of the tag and before the attribute you want, you need to use [^>]+? (as you can see) with a lazy quantifier.
For the second ugly [^], since it is between quotes, you only need to replace it with [^"'] that excludes quotes.
The result you need is not the whole match but the content of the capture group.
<script[^>]+?src=["']([^"']+)["']
Here's a start for you:
/<script src=\"(.*)(?=\")/g
Retrieve the value of the first capturing group returned by this expression.
Here is the regexr.com result:
String: <script src="http://example.net"></script>
Regex: /(?:<script src=")([^"]+)/g
group#1: http://example.net
And here is the example javascript code:
s = '<script src="http://example.net"></script>';
url = s.split(/(?:<script src=")([^"]+)/g)[1];
JavaScript did not support lookbehind assertions when this was written (they were added to the language in ES2018), so, AFAIK, you can't both match only the URL and check that there is a script tag before it. As an alternative to lookbehind assertions, this is the fastest and easiest solution that I know.
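For engines that implement ES2018 or later, a lookbehind makes it possible to match only the URL while still requiring the script tag in front of it. This is a minimal sketch under that assumption; the sample string and the name srcInScript are illustrative only.

```javascript
// Assumes an ES2018+ engine; older engines throw a SyntaxError on (?<=...).
const s = '<script src="http://foo.net"></script>' +
          "<script async src='http://bar.com'></script>";

// The lookbehind checks for "<script ... src=" plus an opening quote
// without making it part of the match itself.
const srcInScript = /(?<=<script\b[^>]*\ssrc=["'])[^"']+/g;

// match() with the g flag returns every whole match, or null when there is none.
const urls = s.match(srcInScript) || [];

console.log(urls);
```

A src attribute on any other tag, such as an img, is not picked up, because the lookbehind fails there.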
ok i do have this following data in my div
<div id="mydiv">
<!--
what is your present
<code>alert("this is my present");</code>
where?
<code>alert("here at my left hand");</code>
oh thank you! i love you!! hehe
<code>alert("welcome my honey ^^");</code>
-->
</div>
What I need to do is get all the scripts inside the <code> blocks and the HTML text nodes, without removing the HTML comments. It's homework given by my professor, and I can't modify that div block.
I need to use regular expressions for this and this is what i did
var block = $.trim($("div#mydiv").html()).replace("<!--","").replace("-->","");
var htmlRegex = new RegExp(""); //I don't know what to do here
var codeRegex = new RegExp("^<code(*n)</code>$","igm");
var code = codeRegex.exec(block);
var html = "";
it really doesn't work... please don't give the exact answer.. please teach me.. thank you
I need to have the following blocks for the variable code
alert("this is my present");
alert("here at my left hand");
alert("welcome my honey ^^");
and this is the blocks i need for variable html
what is your present
where?
oh thank you! i love you!! hehe
my question is what is the regex pattern to get the results above?
Parsing HTML with a regular expression is not something you should do.
I'm sure your professor thinks he/she was really clever and that there's no way to access the DOM API and can wave a banner around and justify some minor corner-case for using regex to parse the DOM and that sometimes it's okay.
Well, no, it isn't. If you have complex code in there, what happens? Your regex breaks, and perhaps becomes a security exploit if this is ever in production.
So, here:
http://jsfiddle.net/zfp6D/
Walk the dom, get the nodeType 8 (comment) text value out of the node.
Invoke the HTML parser (that thing that browsers use to parse HTML, rather than regex, why you wouldn't use the HTML parser to parse HTML is totally beyond me, it's like saying "Yeah, I could nail in this nail with a hammer, but I think I'm going to just stomp on the nail with my foot until it goes in").
Find all the CODE elements in the newly parsed HTML.
Log them to console, or whatever you want to do with them.
First of all, you should be aware that because HTML is not a regular language, you cannot do generic parsing using regular expressions that will work for all valid inputs (generic nesting in particular cannot be expressed with regular expressions). Many parsers do use regular expressions to match individual tokens, but other algorithms need to be built around them.
However, for a fixed input such as this, it's just a case of working through the structure you have (though it's still often easier to use different parsing methods than just regular expressions).
First, let's get all the code:
var code = '', match = [];
var regex = new RegExp("<code>(.*?)</code>", "g");
while (match = regex.exec(content)) {
code += match[1] + "\n";
}
I assume content contains the content of the div that you've already extracted. Here the "g" flag says this is for "global" matching, so we can reuse the regex to find every match. The brackets indicate a capturing group, . means any character, * means repeated 0 or more times, and ? means "non-greedy" (see what happens without it to see what it does).
Now we can do a similar thing to get all the other bits, but this time the regex is slightly more complicated:
new RegExp("(<!--|</code>)(.*?)(-->|<code>)", "g")
Here | means "or". So this matches all the bits that start with either "start comment" or "end code" and end with "end comment" or "start code". Note also that we now have 3 sets of brackets, so the part we want to extract is match[2] (the second set).
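Putting the two patterns together might look like the sketch below. One assumption differs from the patterns above: [\s\S]*? is used instead of .*? because . does not match line breaks, and the text between the <code> blocks in the sample sits on its own lines. The helper name collectGroup and the block variable are hypothetical.

```javascript
// The comment body from the question, as a plain string.
const block = [
  '<!--',
  'what is your present',
  '<code>alert("this is my present");</code>',
  'where?',
  '<code>alert("here at my left hand");</code>',
  'oh thank you! i love you!! hehe',
  '<code>alert("welcome my honey ^^");</code>',
  '-->'
].join('\n');

// Hypothetical helper: run a global regex over text and collect one
// capture group from every match, skipping whitespace-only pieces.
function collectGroup(regex, text, groupIndex) {
  const found = [];
  let match;
  while ((match = regex.exec(text)) !== null) {
    const piece = match[groupIndex].trim();
    if (piece !== '') found.push(piece);
  }
  return found;
}

// Group 1 is the text between <code> and </code>.
const code = collectGroup(/<code>([\s\S]*?)<\/code>/g, block, 1);

// Group 2 is the text between a "start comment"/"end code" marker
// and the next "end comment"/"start code" marker.
const html = collectGroup(/(<!--|<\/code>)([\s\S]*?)(-->|<code>)/g, block, 2);

console.log(code);
console.log(html);
```

The empty-piece check is there because the last </code> is followed only by a line break before the closing comment marker.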
You're doing a lot of unnecessary stuff. .html() gives you the inner contents as a string. You should be able to use regEx to grab exactly what you need from there. Also, try to stick with regEx literals (e.g. /^regexstring$/). You have to escape escape characters using new RegExp which gets really messy. You generally only want to use new RegExp when you need to put a string var into a regEx.
The match function of strings accepts regEx and returns a collection of every match when you add the global flag (e.g. /^regexstring$/g <-- note the 'g'). I would do something like this:
var block = $('#mydiv').html(), //you can set multiple vars in one statement w/commas
matches = block.match(/<code>[^<]*<\/code>/g);
//[^<]* <-- 0 or more characters that aren't '<' - google 'negative character class'
matches = matches.join('_') //lazy way of avoiding a loop - join into a string with a safe character
.replace(/<\/*code>/g,'') //\/* 0 or more forward slashes
.split('_');//return the matches string back to array
//Now do what you want with matches. Eval (ew) or append in a script tag (ew).
//You have no control over the 'ew'. I just prefer data to scripts in strings
I am trying to write a regexp that removes file paths from links and images.
href="path/path/file" to href="file"
href="/file" to href="file"
src="/path/file" to src="file"
and so on...
I thought that I had it working, but it messes up if there are two paths in the string it is working on. I think my expression is too greedy. It finds the very last file in the entire string.
This is my code that shows the expression messing up on the test input:
<script type="text/javascript" src="/javascripts/jquery.js"></script>
<script type="text/javascript">
$(document).ready(function(){
var s = '<img src="/one/two/keep.this">';
var t = s.replace(/(src|href)=("|').*\/(.*)\2/gi,"$1=$2$3$2");
alert(t);
});
</script>
It gives the output:
The correct output should be:
<img src="keep.this">
Thanks for any tips!
It doesn't have to be a regular expression (assuming / delimiters):
var fileName = url.split('/').pop(); //pop takes the last element
I would suggest running separate regex replacements, one for a links and another for img. That is easier and clearer, and thus more maintainable.
This seems to work in case anyone else has the problem:
var t = s.replace(/(src|href)=('|")([^ \2]*\/)*\/?([^ \2]*)\2/gi,"$1=$2$4$2");
Try adding ? to make the * quantifiers non-greedy. You want them to stop matching when they encounter the ending quote character. The greedy versions will barrel right on past the ending quote if there's another quote later in the string, finding the longest possible match; the non-greedy ones will find the shortest possible match.
/(src|href)=("|').*?\/([^/]*?)\2/gi
Also I changed the second .* to [^/]* to allow the first .* to still match the full path now that it's non-greedy.
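Another option, sketched here as an assumption rather than a drop-in fix, is to let the regex find only the quoted attribute value and do the path stripping in a replacer function with split('/').pop(). The lazy .*? together with the \2 backreference keeps each match inside its own pair of quotes. The function name stripPaths is invented for this example.

```javascript
// Hypothetical helper: keep only the file name in src/href attribute values.
function stripPaths(markup) {
  // Group 1: attribute name, group 2: the quote character, group 3: the value.
  return markup.replace(/\b(src|href)=(["'])(.*?)\2/gi, (whole, attr, quote, url) => {
    // pop() returns the last segment, i.e. whatever follows the final slash.
    const fileName = url.split('/').pop();
    return attr + '=' + quote + fileName + quote;
  });
}

const sample = '<a href="path/path/file"><img src="/one/two/keep.this"></a>';
console.log(stripPaths(sample));
```

Because every attribute is matched separately, a second path later in the string no longer affects the first one.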
I am hoping that this will have a pretty quick and simple answer. I am using regular-expressions.info to help me get the right regular expression to turn URL-encoded, ISO-8859-1 pound sign ("%A3"), into a URL-encoded UTF-8 pound sign ("%C2%A3").
In other words I just want to swap %A3 with %C2%A3, when the %A3 is not already prefixed with %C2.
So I would have thought the following would work:
Regular Expression: (?!(\%C2))\%A3
Replace With: %C2%A3
But it doesn't and I can't figure out why!
I assume my syntax is just slightly wrong, but I can't figure it out! Any ideas?
FYI - I know that the following will work (and have used this as a workaround in the meantime), but really want to understand why the former doesn't work.
Regular Expression: ([^\%C2])\%A3
Replace With: $1%C2%A3
TIA!
Why not just replace ((%C2)?%A3) with %C2%A3, making the prefix an optional part of the match? It means that you're "replacing" text with itself even when it's already right, but I don't foresee a performance issue.
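In JavaScript, the optional-prefix idea could be sketched like this. The function name toUtf8Pound is made up, and a non-capturing group is used because the prefix is not needed in the replacement.

```javascript
// Hypothetical helper: normalise %A3 and %C2%A3 to %C2%A3 in one pass.
function toUtf8Pound(encoded) {
  // (?:%C2)? swallows an existing prefix, so it is never doubled.
  return encoded.replace(/(?:%C2)?%A3/g, '%C2%A3');
}

console.log(toUtf8Pound('price=%A3100'));
console.log(toUtf8Pound('price=%C2%A3100'));
```

Both calls should print the same string, since an already-correct sequence is simply replaced with itself.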
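Unfortunately, the (?!...) syntax is negative lookahead, so (?!(\%C2))\%A3 only asserts that the text starting at %A3 is not %C2, which is always true. What you want is negative lookbehind, (?<!...). To the best of my knowledge, JavaScript did not support lookbehind until ES2018.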
What you could do is go forward with the replacement anyway, and end up with %C2%C2%A3 strings, but these could easily be converted in a second pass to the desired %C2%A3.
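The two-pass idea can be sketched as below, next to a lookbehind version for comparison. The lookbehind line assumes an ES2018 or later engine, and the input string is invented for the example.

```javascript
const input = 'a=%A3&b=%C2%A3';

// Pass 1 prefixes every %A3; pass 2 collapses the doubled prefix again.
const twoPass = input
  .replace(/%A3/g, '%C2%A3')
  .replace(/%C2%C2%A3/g, '%C2%A3');

// Assumes ES2018+: only touch a %A3 that is not already preceded by %C2.
const lookbehind = input.replace(/(?<!%C2)%A3/g, '%C2%A3');

console.log(twoPass);
console.log(lookbehind);
```

One caveat with the two-pass version: it also collapses any %C2%C2%A3 that was already present in the input before the first pass.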
You could replace
(^.?.?|(?!%C2)...)%A3
with
$1%C2%A3
I would suggest you use the functional form of Javascript String.replace (see the section "Specifying a function as a parameter"). This lets you put arbitrary logic, including state if necessary, into a regexp-matching session. For your case, I'd use a simpler regexp that matches a superset of what you want, then in the function call you can test whether it meets your exact criteria, and if it doesn't then just return the matched string as is.
The only problem with this approach is that if you have overlapping potential matches, you have the possibility of missing the second match, since there's no way to return a value to tell the replace() method that it isn't really a match after all.
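A small sketch of the function form for this case: the regex matches a superset (every %A3, with a %C2 prefix when one is present), and the callback decides whether to leave the match alone. The name fixPoundSigns is hypothetical.

```javascript
// Hypothetical helper using the function form of String.prototype.replace.
function fixPoundSigns(encoded) {
  return encoded.replace(/(%C2)?%A3/gi, (match, prefix) => {
    // prefix is undefined when the optional group did not take part in the match.
    // Already prefixed: return the matched text unchanged.
    return prefix ? match : '%C2%A3';
  });
}

console.log(fixPoundSigns('cost=%A35&old=%C2%A37'));
```

Returning the matched text unchanged is how the callback signals that a candidate was not a real match after all.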
I am trying to write some JavaScript RegEx to replace user-entered tags with real HTML tags, so [b] will become <b> and so forth. The RegEx I am using looks like so:
var exptags = /\[(b|u|i|s|center|code){1}]((.){1,}?)\[\/(\1){1}]/ig;
with the following JavaScript
s.replace(exptags,"<$1>$2</$1>");
this works fine for single nested tags, for example:
[b]hello[/b] [u]world[/u]
but if the tags are nested inside each other it will only match the outer tags, for example
[b]foo [u]to the[/u] bar[/b]
This will only match the b tags. How can I fix this? Should I just loop until the starting string is the same as the outcome? I have a feeling that the ((.){1,}?) pattern is wrong also.
Thanks
The easiest solution would be to replace all the tags, whether they are closed or not, and let .innerHTML work out whether they are matched. It will be much more resilient that way.
var tagreg = /\[(\/?)(b|u|i|s|center|code)]/ig
div.innerHTML="[b][i]helloworld[/b]".replace(tagreg, "<$1$2>") //no closing i
//div.innerHTML == "<b><i>helloworld</i></b>"
AFAIK you can't express recursion with regular expressions.
You can however do that with .NET's System.Text.RegularExpressions using balanced matching. See more here: http://blogs.msdn.com/bclteam/archive/2005/03/15/396452.aspx
If you're using .NET you can probably implement what you need with a callback.
If not, you may have to roll your own little javascript parser.
Then again, if you can afford to hit the server you can use the full parser. :)
What do you need this for, anyway? If it is for anything other than a preview I highly recommend doing the processing server-side.
You could just repeatedly apply the regexp until it no longer matches. That would do odd things like "[b][b]foo[/b][/b]" => "<b>[b]foo</b>[/b]" => "<b><b>foo</b></b>", but as far as I can see the end result will still be a sensible string with matching (though not necessarily properly nested) tags.
Or if you want to do it 'right', just write a simple recursive descent parser. Though people might expect "[b]foo[u]bar[/b]baz[/u]" to work, which is tricky to recognise with a parser.
The reason the nested block doesn't get replaced is because the match, for [b], places the position after [/b]. Thus, everything that ((.){1,}?) matches is then ignored.
It is possible to write a recursive parser server-side; Perl uses qr// and Ruby probably has something similar.
Though, you don't necessarily need true recursion. You can use a relatively simple loop to handle the string equivalently:
var s = '[b]hello[/b] [u]world[/u] [b]foo [u]to the[/u] bar[/b]';
var exptags = /\[(b|u|i|s|center|code){1}]((.){1,}?)\[\/(\1){1}]/ig;
while (s.match(exptags)) {
s = s.replace(exptags, "<$1>$2</$1>");
}
document.writeln('<div>' + s + '</div>'); // after
In this case, it'll make 2 passes:
0: [b]hello[/b] [u]world[/u] [b]foo [u]to the[/u] bar[/b]
1: <b>hello</b> <u>world</u> <b>foo [u]to the[/u] bar</b>
2: <b>hello</b> <u>world</u> <b>foo <u>to the</u> bar</b>
Also, a few suggestions for cleaning up the RegEx:
var exptags = /\[(b|u|i|s|center|code)\](.+?)\[\/(\1)\]/ig;
{1} is assumed when no other count specifiers exist
{1,} can be shortened to +
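Combining the cleaned-up pattern with the "loop until nothing changes" idea from the question might look like this sketch. The function name bbToHtml is invented for the example.

```javascript
// Hypothetical helper: convert [tag]...[/tag] pairs, innermost pairs
// being picked up on later passes.
function bbToHtml(text) {
  const tagPair = /\[(b|u|i|s|center|code)\](.+?)\[\/\1\]/gi;
  let previous;
  let current = text;
  do {
    previous = current;
    // Each pass converts the pairs that are currently visible to the regex.
    current = current.replace(tagPair, '<$1>$2</$1>');
  } while (current !== previous); // stop once a pass changes nothing
  return current;
}

console.log(bbToHtml('[b]foo [u]to the[/u] bar[/b]'));
```

Each pass removes at least one bracketed pair, so the loop ends after a pass that leaves the string untouched.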
Agree with Richard Szalay, but his regex didn't get quoted right:
var exptags = /\[(b|u|i|s|center|code)](.*)\[\/\1]/ig;
is cleaner. Note that I also change .+? to .*. There are two problems with .+?:
you won't match [u][/u], since there isn't at least one character between them (+)
a non-greedy match won't deal as nicely with the same tag nested inside itself (?)
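The trade-off between the variants can be checked with a small sketch. The helper convert and the sample strings are made up, and the third pattern, (.*?), is an assumption added for comparison rather than something proposed above.

```javascript
// Hypothetical helper: apply a tag-pair regex until the string stops changing.
function convert(text, tagPair) {
  let previous;
  let current = text;
  do {
    previous = current;
    current = current.replace(tagPair, '<$1>$2</$1>');
  } while (current !== previous);
  return current;
}

const lazyPlus   = /\[(b|u|i|s|center|code)\](.+?)\[\/\1\]/gi;
const greedyStar = /\[(b|u|i|s|center|code)\](.*)\[\/\1\]/gi;
const lazyStar   = /\[(b|u|i|s|center|code)\](.*?)\[\/\1\]/gi;

const empty = '[u][/u]';
const siblings = '[b]x[/b] and [b]y[/b]';

console.log(convert(empty, lazyPlus));      // needs one character, so no match
console.log(convert(empty, greedyStar));    // empty pair is converted
console.log(convert(siblings, lazyPlus));   // each pair converted separately
console.log(convert(siblings, greedyStar)); // first [b] pairs with the last [/b]
console.log(convert(siblings, lazyStar));   // each pair converted separately
```

In this sketch the greedy pattern pairs the first [b] with the last [/b] and leaves the inner tags unconverted, so (.*?) may be the safer compromise when the same tag can appear more than once in a string.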
Yes, you will have to loop. Alternatively since your tags looks so much like HTML ones you could replace [b] for <b> and [/b] for </b> separately. (.){1,}? is the same as (.*?) - that is, any symbols, least possible sequence length.
Updated: Thanks to MrP, (.){1,}? is (.)+?, my bad.
How about:
tagreg=/\[(.?)?(b|u|i|s|center|code)\]/gi;
"[b][i]helloworld[/i][/b]".replace(tagreg, "<$1$2>");
"[b]helloworld[/b]".replace(tagreg, "<$1$2>");
For me the above produces:
<b><i>helloworld</i></b>
<b>helloworld</b>
This appears to do what you want, and has the advantage of needing only a single pass.
Disclaimer: I don't code often in JS, so if I made any mistakes please feel free to point them out :-)
You are right about the inner pattern being troublesome.
((.){1,}?)
That repeats a capturing group around a single character at least once, and then captures the whole run again in an outer group. Every character inside your tag passes through the inner capture group.
You are also capturing your closing element name when you don't need it, and using {1} when that is implied. Below is a cleaned-up version:
/\[(b|u|i|s|center|code)](.+?)\[\/\1]/ig
Not sure about the other problem.