Confused with Regex JS pattern - javascript

OK, I have the following data in my div:
<div id="mydiv">
<!--
what is your present
<code>alert("this is my present");</code>
where?
<code>alert("here at my left hand");</code>
oh thank you! i love you!! hehe
<code>alert("welcome my honey ^^");</code>
-->
</div>
What I need to do is get all the scripts inside the <code> blocks, plus the HTML text nodes, without removing the HTML comment they sit in. It's homework given by my professor, and I can't modify that div block.
I need to use regular expressions for this, and this is what I did:
var block = $.trim($("div#mydiv").html()).replace("<!--","").replace("-->","");
var htmlRegex = new RegExp(""); //I don't know what to do here
var codeRegex = new RegExp("^<code(*n)</code>$","igm");
var code = codeRegex.exec(block);
var html = "";
It really doesn't work... please don't give the exact answer, please teach me. Thank you.
I need the following blocks for the variable code:
alert("this is my present");
alert("here at my left hand");
alert("welcome my honey ^^");
and these are the blocks I need for the variable html:
what is your present
where?
oh thank you! i love you!! hehe
My question is: what regex pattern gets the results above?

Parsing HTML with a regular expression is not something you should do.
I'm sure your professor thinks he/she was really clever: since the content sits inside a comment, there's supposedly no way to reach it through the DOM API, and that gets waved around as a minor corner case where using regex to parse the DOM is okay.
Well, no, it isn't. If you have complex code in there, what happens? Your regex breaks, and it perhaps becomes a security exploit if this is ever in production.
So, here:
http://jsfiddle.net/zfp6D/
Walk the DOM and get the text value out of the nodeType 8 (comment) node.
Invoke the HTML parser (that thing that browsers use to parse HTML, rather than regex, why you wouldn't use the HTML parser to parse HTML is totally beyond me, it's like saying "Yeah, I could nail in this nail with a hammer, but I think I'm going to just stomp on the nail with my foot until it goes in").
Find all the CODE elements in the newly parsed HTML.
Log them to console, or whatever you want to do with them.
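A minimal sketch of the first step above (collecting the text of comment nodes), assuming only the standard childNodes, nodeType and nodeValue properties. The function name collectCommentText is made up for illustration and is not taken from the linked fiddle.

```javascript
// Node.COMMENT_NODE has the numeric value 8 in the DOM.
const COMMENT_NODE = 8;

// Recursively gather the text of every comment node below `root`.
// `root` can be any DOM node (or any object shaped like one).
function collectCommentText(root) {
  const found = [];
  for (const child of Array.from(root.childNodes || [])) {
    if (child.nodeType === COMMENT_NODE) {
      found.push(child.nodeValue);
    } else {
      // Not a comment: look inside it for nested comments.
      found.push(...collectCommentText(child));
    }
  }
  return found;
}
```

In a browser this would be called as collectCommentText(document.getElementById("mydiv")). The returned text can then be assigned to the innerHTML of a detached element, and getElementsByTagName("code") on that element gives the CODE elements.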

First of all, you should be aware that because HTML is not a regular language, you cannot do generic parsing using regular expressions that will work for all valid inputs (arbitrary nesting in particular cannot be expressed with regular expressions). Many parsers do use regular expressions to match individual tokens, but other algorithms need to be built around them.
However, for a fixed input such as this, it's just a case of working through the structure you have (though it's still often easier to use different parsing methods than just regular expressions).
First, let's get all the code:
var code = '', match = [];
var regex = new RegExp("<code>(.*?)</code>", "g");
while (match = regex.exec(content)) {
code += match[1] + "\n";
}
I assume content contains the content of the div that you've already extracted. Here the "g" flag says this is for "global" matching, so we can reuse the regex to find every match. The brackets indicate a capturing group, . means any character except a line break, * means repeated 0 or more times, and ? means "non-greedy" (see what happens without it to see what it does).
Now we can do a similar thing to get all the other bits, but this time the regex is slightly more complicated:
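new RegExp("(<!--|</code>)([\\s\\S]*?)(-->|<code>)", "g")
Here | means "or". So this matches all the bits that start with either "start comment" or "end code" and end with "end comment" or "start code". [\s\S] is used instead of . because the text sits on its own lines and . does not match line breaks (the backslashes are doubled because the pattern is written as a string). Note also that we now have 3 sets of brackets, so the part we want to extract is match[2] (the second set).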
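Putting both patterns to work on the sample from the question might look like the sketch below. A few things here are assumptions rather than part of the answer: the helper name splitCodeAndText, the use of [\s\S] so a match can run across line breaks, and non-capturing groups (?:...) around the delimiters so the wanted text is always match[1].

```javascript
// Split the comment body into <code> contents and the text between them.
function splitCodeAndText(content) {
  const codeRegex = /<code>([\s\S]*?)<\/code>/g;
  const textRegex = /(?:<!--|<\/code>)([\s\S]*?)(?:-->|<code>)/g;
  const code = [];
  const html = [];
  let match;
  while ((match = codeRegex.exec(content)) !== null) {
    code.push(match[1]);
  }
  while ((match = textRegex.exec(content)) !== null) {
    const text = match[1].trim();
    if (text !== "") html.push(text); // skip the empty gap before "-->"
  }
  return { code, html };
}

const sample = [
  "<!--",
  "what is your present",
  '<code>alert("this is my present");</code>',
  "where?",
  '<code>alert("here at my left hand");</code>',
  "oh thank you! i love you!! hehe",
  '<code>alert("welcome my honey ^^");</code>',
  "-->",
].join("\n");

const result = splitCodeAndText(sample);
console.log(result.code.join("\n")); // the three alert(...) lines
console.log(result.html.join("\n")); // the three lines of plain text
```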

You're doing a lot of unnecessary stuff. .html() gives you the inner contents as a string, and you should be able to use a regex to grab exactly what you need from there. Also, try to stick with regex literals (e.g. /^regexstring$/). With new RegExp the pattern is a string, so every backslash has to be escaped with another backslash, which gets really messy. You generally only want new RegExp when you need to put a string variable into a regex.
The match function of strings accepts a regex and returns an array of every match when you add the global flag (e.g. /^regexstring$/g <-- note the 'g'). I would do something like this:
var block = $('#mydiv').html(), //you can set multiple vars in one statement w/commas
matches = block.match(/<code>[^<]*<\/code>/g) || []; //match returns null when nothing is found
//[^<]* <-- 0 or more characters that aren't '<' - google 'negated character class'
matches = matches.join('_') //lazy way of avoiding a loop - join with a character the code doesn't contain
.replace(/<\/*code>/g,'') //\/* 0 or more forward slashes
.split('_');//turn the joined string back into an array and keep the result
//Now do what you want with matches. Eval (ew) or append in a script tag (ew).
//You have no control over the 'ew'. I just prefer data to scripts in strings
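To illustrate the point about literals versus new RegExp, here is a small sketch. The variable names and the sample string are made up for the example; only the <code> pattern comes from the answer above.

```javascript
// The same pattern written both ways: the literal needs "\/" for the slash,
// while the string form has no special meaning for "/".
const fromLiteral = /<code>[^<]*<\/code>/g;
const fromString = new RegExp("<code>[^<]*</code>", "g");

// A backslash escape such as \d must be doubled inside a string pattern,
// because "\d" in an ordinary string is just the letter "d".
const digitsLiteral = /\d+/;
const digitsString = new RegExp("\\d+");
const digitsWrong = new RegExp("\d+"); // actually the pattern d+

const html = "<code>a();</code> text <code>b();</code>";

// match() with the g flag returns every match (or null when there is none).
const blocks = (html.match(fromLiteral) || []).map((m) => m.replace(/<\/?code>/g, ""));
console.log(blocks); // [ 'a();', 'b();' ]
console.log(digitsString.test("123"), digitsWrong.test("123")); // true false
```

Using map here instead of the join/split trick avoids having to pick a separator character that never occurs in the code.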

Related

How to append string after matching field with regex

I want to append a word after the <body> tag. It should not modify or replace anything else, just append a word. I have done something like this. Is it valid? Will the empty parentheses of the second capture group match everything?
.replace(/(<body[^>]*>)()/, `$1${my_variable}$2`)
The second capture group, designed to capture nothing, will match "nothing" - it will form a match immediately after your closed body tag. There's nothing wrong with doing this for the regex, though you might want to be wary of using [^>]* - this negated character class will gladly match across lines and grab as much input as it can. Handy for matching multi-line tags, but often very dangerous.
Also, if you're on linux and for some reason have > symbols in filenames (which is valid!) your regex will break horribly, as shown here.
That being said, valid regex or not, it's usually a bad idea to use regex with html, since HTML isn't a regular language. Also, you could accidentally summon Cthulhu.
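let page = "<html><body>Some info</body></html>";
page = page.replace("<body>", `<body>${my_variable}`); // strings are immutable, so keep the result
or
page = page.replace(/<body>/i, `<body>${my_variable}`); // the i flag covers <body>, <BODY> and mixed case
If you're in the browser, you can also work with document.querySelector("body").innerHTML.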
Also depending on which framework you're using there are better ways to accomplish this.
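A small sketch of the regex approach, under the assumption that the page is available as a string. The helper name appendAfterBodyTag is made up. It uses a replacer function rather than a replacement string, because in a replacement string sequences such as $1 or $& inside the inserted text would be interpreted instead of inserted literally.

```javascript
// Insert `addition` right after the opening <body ...> tag.
// The replacer function receives the matched tag and returns the new text,
// so "$" sequences in `addition` are kept as they are.
function appendAfterBodyTag(page, addition) {
  return page.replace(/<body\b[^>]*>/i, (openTag) => openTag + addition);
}

const page = '<html><body class="main">Some info</body></html>';
console.log(appendAfterBodyTag(page, "Hello $1 "));
// <html><body class="main">Hello $1 Some info</body></html>

// The empty group from the question matches the empty string, so "$2" adds nothing:
console.log("<body>x".replace(/(<body[^>]*>)()/, "$1[word]$2"));
// <body>[word]x
```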

match text between two html custom tags but not other custom tags

I have something like the following;-
<--customMarker>Test1<--/customMarker>
<--customMarker key='myKEY'>Test2<--/customMarker>
<--customMarker>Test3 <--customInnerMarker>Test4<--/customInnerMarker> <--/customMarker>
I need to be able to replace text between the customMarker tags, I tried the following;-
str.replace(/<--customMarker>(.*?)<--\/customMarker>/g, 'item Replaced')
which works ok. I would like to also ignore custom inner tags and not match or replace them with text.
Also I need a separate expression to extract the value of the attribute key='myKEY' from the tag with Test2.
Many thanks
EDIT
Actually, I am trying to find things between comment tags, but the comment tags were not displaying correctly, so I had to remove the '!'. There's a unique situation that requires comment tags. In any case, if anyone knows enough regex to help, that would be great. Thank you.
In the end, I did something like the following, in case anyone else needs it. Note: word about town is that using regex with HTML tags is not ideal, so do your own research and make up your own mind. For me, it had to be done this way, mostly because I wanted to, but also because it simplified the job in this instance:
var retVal = str.replace(/<--customMarker>(.*?)<--\/customMarker>/g, function(token, match){
//question 2: Also I need a separate expression to extract the value of the attribute key='myKEY' from the tag with Test2.
//answer (this has to run before the return below, otherwise it is never reached):
var attrPattern = /\w+\s*=\s*(["']).*?\1/g; //accepts single or double quotes around the value
var attrMatches = token.match(attrPattern);//returns a list of attributes as name/value pairs in an array (null if there are none)
//question 1: I would like to also ignore custom inner tags and not match or replace them with text.
//answer:
var replacePattern = /<--customInnerMarker*?(.*?)<--\/customInnerMarker-->/g;
//remove inner tags from match
match = $.trim(match.replace(replacePattern, ''));
//replace and return what is left with a required value
return token.replace(match, objParams[match]);
})
Can't you use <customMarker> instead? Then you can just use getElementsByTagName('customMarker') and get the inner text and child elements from it.
A regex merely matches an item. Once you have said match, it is up to you what you do with it. This is part of the problem most people have with using regular expressions, they try and combine the three different steps. The regex match is just the first step.
What you are asking for will not be possible with a single regex. You're going to need a mini state machine if you want to use regular expressions. That is, a logic wrapper around the matches such that it moves through each logical portion.
I would advise you to look in the standard API for a prebuilt engine to parse HTML, rather than rolling your own. If you do need to roll your own, read the flex manual to get a basic understanding of how regular expressions work and of the state machines you build with them. The best example would be the section on matching multiline C comments.
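The "match first, then process" split described above can be sketched as follows. This assumes the markers look exactly as written in the question (with the '!' removed) and that markers of the same name are never nested. The function name parseMarkers and the three patterns are illustrative, not the asker's final code.

```javascript
const input = [
  "<--customMarker>Test1<--/customMarker>",
  "<--customMarker key='myKEY'>Test2<--/customMarker>",
  "<--customMarker>Test3 <--customInnerMarker>Test4<--/customInnerMarker> <--/customMarker>",
].join("\n");

// Step 1: find each outer marker, capturing its attribute text and its body.
const outerPattern = /<--customMarker((?:\s+[^>]*)?)>([\s\S]*?)<--\/customMarker>/g;
// Step 2a: inner markers (and their text) are stripped from the body.
const innerPattern = /<--customInnerMarker>[\s\S]*?<--\/customInnerMarker>/g;
// Step 2b: the key attribute, in single or double quotes.
const keyPattern = /\bkey\s*=\s*(['"])(.*?)\1/;

function parseMarkers(text) {
  const results = [];
  for (const m of text.matchAll(outerPattern)) {
    const attrs = m[1];
    const body = m[2].replace(innerPattern, "").trim();
    const keyMatch = attrs.match(keyPattern);
    results.push({ key: keyMatch ? keyMatch[2] : null, text: body });
  }
  return results;
}

console.log(parseMarkers(input));
// Three entries: Test1 (no key), Test2 (key "myKEY"), Test3 (no key, Test4 removed)
```

Each step stays a plain match. The logic that decides what to do with a match lives in ordinary code around it.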

Trying to remove trailing text

I have the following code. I want to extract the last text (hello64) from it.
<span class="qnNum" id="qn">4</span><span>.</span> hello64 ?*
I used the code below but it removes all the integers
questionText = questionText.replace(/<span\b.*?>/ig, "");
questionText=questionText.replace(/<\/span>/ig, "");
questionText = questionText.replace(/\d+/g,"");
questionText = questionText.replace("*","");
questionText = questionText.replace(". ","");
I want to remove the first integer, and need to keep the rest of the integers.
It's the third line .replace(/\d+/g,"") which is replacing the integers. If you want to keep the integers, then don't replace \d+, because that matches one or more digits.
You could achieve most of that all on one line, by the way - there's no need to have multiple replaces there:
var questionText = questionText.replace(/((<span\b.*?>)|(<\/span>)|(\d+))/ig, "");
That would do the same as the first three lines of your code. (Of course, you'd need to drop the |(\d+), as per the first part of the answer, if you didn't want to get rid of the digits.)
[EDIT]
Re your comment that you want to replace the first integer but not the subsequent ones:
The regex string to do this would depend very heavily on what the possible input looks like. The problem is that you've given us a bit of random HTML code; we don't know from that whether you're expecting it to always be in this precise format (ie a couple of spans with contents, followed by a bit at the end to keep). I'll assume that this is the case.
In this case, a much simpler regex for the whole thing would be to replace everything within <span>...</span> with blank:
var questionText = questionText.replace(/(<span\b.*?>.*?<\/span>)/ig, "");
This will eliminate the whole of the <span> tags plus their contents, but leave anything outside of them alone.
In the case of your example this would provide the desired effect, but as I say, it's hard to know if this will work for you in all cases without knowing more about your expected input.
In general it's considered difficult to parse arbitrary HTML code with regex. Regex is a contraction of "Regular Expressions", which is a way of saying that they are good at handling strings which have 'regular' syntax. Arbitrary HTML is not a 'regular' syntax due to its unlimited possible levels of nesting. What I'm trying to say here is that if you have anything more complex than the simple HTML snippets you've supplied, then you may be better off using an HTML parser to extract your data.
The following matches the complete string and puts the part after the last </span>, up to the next word boundary \b, into capturing group 1. You then just need to replace the match with group 1, i.e. $1.
searched_string = string.replace(/^.*<\/span>\s*([A-Za-z0-9]+)\b.*$/, "$1");
The captured word can consist of [A-Za-z0-9]. If you want to have anything else there just add it into that group.
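For comparison, here are the two suggestions from the answers run against the string from the question. The variable names are made up for this sketch.

```javascript
const questionText = '<span class="qnNum" id="qn">4</span><span>.</span> hello64 ?*';

// Approach 1: drop every <span>...</span> pair together with its contents.
// What follows the spans is kept untouched, including the leading space and " ?*".
const withoutSpans = questionText.replace(/<span\b.*?>.*?<\/span>/ig, "");

// Approach 2: match the whole string and keep only the first word
// after the last </span>, via capture group 1.
const lastWord = questionText.replace(/^.*<\/span>\s*([A-Za-z0-9]+)\b.*$/, "$1");

console.log(JSON.stringify(withoutSpans)); // " hello64 ?*"
console.log(JSON.stringify(lastWord));     // "hello64"
```

The digits in hello64 survive in both cases because neither pattern touches \d outside the spans.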

Can someone tell me the purpose of the second capture group in the jQuery rts regular expression?

In Jeff Roberson's jQuery Regular Expressions Review he proposes changing the rts regular expression in jQuery's ajax.js from /(\?|&)_=.*?(&|$)/ to /([?&])_=[^&\r\n]*(&?)/. In both versions, what is the purpose of the second capture group? The code does a replacement of the current random timestamp with a new random timestamp:
var ts = jQuery.now();
// try replacing _= if it is there
var ret = s.url.replace(rts, "$1_=" + ts + "$2");
Doesn't it only replace what it matches? I am thinking this does the same:
var ret = s.url.replace(/([?&])_=[^&\r\n]*/, "$1_=" + ts);
Can someone explain the purpose of the second capture group?
It's to pick up the next delimiter in the query string on the URL, so that it still works properly as a query string. Thus if the url is
http://foo.bar/what/ever?blah=blah&_=12345&zebra=banana
then the second group picks up the "&" before "zebra".
That's an awesome blog post by the way and everybody should read it.
edit — now that I think about it, I'm not sure why it's necessary to bother with replacing that second delimiter. In the "fixed" expression, that greedy * will pick up the whole parameter value and stop at the delimiter (or the end of the string) anyway.
I think you're right. It was needed in the original because matching the ampersand or end-of-string was how the .*? knew when to stop. In Jeff's version that's no longer necessary.
As the author of the article I can't tell you the reason for the second capture group. My intent with the article was to take existing regexes and simply make them more efficient - i.e. they should all match the same text - just do it faster. Unfortunately I did not have time to delve deeply into the code to see exactly how each and every one of them was being used. I assumed that the capture group for this one was there for a reason so I did not mess with it.
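The behaviour discussed in this thread can be checked with a short sketch. The three patterns are the ones quoted in the question. The helper name refreshTimestamp and the fixed timestamp value are made up for the example.

```javascript
const originalRts = /(\?|&)_=.*?(&|$)/;
const revisedRts = /([?&])_=[^&\r\n]*(&?)/;
const simplifiedRts = /([?&])_=[^&\r\n]*/; // second group dropped

function refreshTimestamp(url, ts) {
  return {
    original: url.replace(originalRts, "$1_=" + ts + "$2"),
    revised: url.replace(revisedRts, "$1_=" + ts + "$2"),
    simplified: url.replace(simplifiedRts, "$1_=" + ts),
  };
}

const url = "http://foo.bar/what/ever?blah=blah&_=12345&zebra=banana";
console.log(refreshTimestamp(url, 99999));
// all three give http://foo.bar/what/ever?blah=blah&_=99999&zebra=banana

// Dropping the trailing group from the ORIGINAL pattern is what breaks:
// the lazy .*? then matches nothing and the old value is left behind.
const lazyWithoutAnchor = /(\?|&)_=.*?/;
console.log(url.replace(lazyWithoutAnchor, "$1_=" + 99999));
// http://foo.bar/what/ever?blah=blah&_=9999912345&zebra=banana
```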

Processing Javascript RegEx submatches

I am trying to write some JavaScript regex to replace user-entered tags with real HTML tags, so [b] will become <b> and so forth. The regex I am using looks like this:
var exptags = /\[(b|u|i|s|center|code){1}]((.){1,}?)\[\/(\1){1}]/ig;
with the following JavaScript
s.replace(exptags,"<$1>$2</$1>");
This works fine for tags that are not nested, for example:
[b]hello[/b] [u]world[/u]
but if the tags are nested inside each other it will only match the outer tags, for example
[b]foo [u]to the[/u] bar[/b]
This will only match the b tags. How can I fix this? Should I just loop until the string stops changing? I have a feeling that the ((.){1,}?) pattern is wrong as well.
Thanks
The easiest solution would be to replace all the tags, whether they are closed or not, and let .innerHTML work out whether they are matched. It will be much more resilient that way.
var tagreg = /\[(\/?)(b|u|i|s|center|code)]/ig
div.innerHTML="[b][i]helloworld[/b]".replace(tagreg, "<$1$2>") //no closing i
//div.innerHTML=="<b><i>helloworld</i></b>"
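The string part of this approach can be tried without a browser. The sketch below only covers the replacement step. The function name bracketTagsToHtml is made up, and repairing unbalanced tags is still left to whatever HTML parser the result is handed to.

```javascript
// One pass: every known opening or closing bracket tag becomes an HTML tag.
// Group 1 is the optional "/", group 2 is the tag name.
const tagPattern = /\[(\/?)(b|u|i|s|center|code)\]/gi;

function bracketTagsToHtml(text) {
  return text.replace(tagPattern, "<$1$2>");
}

console.log(bracketTagsToHtml("[b]foo [u]to the[/u] bar[/b]"));
// <b>foo <u>to the</u> bar</b>
console.log(bracketTagsToHtml("[b][i]helloworld[/b]"));
// <b><i>helloworld</b>   (still unbalanced at this point)
```

Because opening and closing tags are replaced independently, nesting depth does not matter and no loop is needed.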
AFAIK you can't express recursion with regular expressions.
You can however do that with .NET's System.Text.RegularExpressions using balanced matching. See more here: http://blogs.msdn.com/bclteam/archive/2005/03/15/396452.aspx
If you're using .NET you can probably implement what you need with a callback.
If not, you may have to roll your own little javascript parser.
Then again, if you can afford to hit the server you can use the full parser. :)
What do you need this for, anyway? If it is for anything other than a preview I highly recommend doing the processing server-side.
You could just repeatedly apply the regexp until it no longer matches. That would do odd things like "[b][b]foo[/b][/b]" => "<b>[b]foo</b>[/b]" => "<b><b>foo</b></b>", but as far as I can see the end result will still be a sensible string with matching (though not necessarily properly nested) tags.
Or if you want to do it 'right', just write a simple recursive descent parser. Though people might expect "[b]foo[u]bar[/b]baz[/u]" to work, which is tricky to recognise with a parser.
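The repeated-application idea can be sketched as a loop that stops once a pass changes nothing. The function name bbcodeToHtml is made up, and the pattern is a tidied form of the one in the question.

```javascript
// Replace matched [tag]...[/tag] pairs, one nesting level per pass,
// until a pass leaves the string unchanged.
function bbcodeToHtml(input) {
  const pairPattern = /\[(b|u|i|s|center|code)\](.+?)\[\/\1\]/gi;
  let previous;
  let current = input;
  do {
    previous = current;
    current = previous.replace(pairPattern, "<$1>$2</$1>");
  } while (current !== previous);
  return current;
}

console.log(bbcodeToHtml("[b]foo [u]to the[/u] bar[/b]"));
// <b>foo <u>to the</u> bar</b>
console.log(bbcodeToHtml("[b][b]foo[/b][/b]"));
// <b><b>foo</b></b>
```

The loop always terminates, since every pass that changes the string removes at least one bracket pair.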
The reason the nested block doesn't get replaced is because the match, for [b], places the position after [/b]. Thus, everything that ((.){1,}?) matches is then ignored.
It is possible to write a recursive parser in server-side -- Perl uses qr// and Ruby probably has something similar.
Though, you don't necessarily need true recursion. You can use a relatively simple loop to handle the string equivalently:
var s = '[b]hello[/b] [u]world[/u] [b]foo [u]to the[/u] bar[/b]';
var exptags = /\[(b|u|i|s|center|code){1}]((.){1,}?)\[\/(\1){1}]/ig;
while (s.match(exptags)) {
s = s.replace(exptags, "<$1>$2</$1>");
}
document.writeln('<div>' + s + '</div>'); // after
In this case, it'll make 2 passes:
0: [b]hello[/b] [u]world[/u] [b]foo [u]to the[/u] bar[/b]
1: <b>hello</b> <u>world</u> <b>foo [u]to the[/u] bar</b>
2: <b>hello</b> <u>world</u> <b>foo <u>to the</u> bar</b>
Also, a few suggestions for cleaning up the RegEx:
var exptags = /\[(b|u|i|s|center|code)\](.+?)\[\/(\1)\]/ig;
{1} is assumed when no other count specifiers exist
{1,} can be shortened to +
Agree with Richard Szalay, but his regex didn't get quoted right:
var exptags = /\[(b|u|i|s|center|code)](.*)\[\/\1]/ig;
is cleaner. Note that I also change .+? to .*. There are two problems with .+?:
you won't match [u][/u], since there isn't at least one character between them (+)
a non-greedy match won't deal as nicely with the same tag nested inside itself (?)
Yes, you will have to loop. Alternatively since your tags looks so much like HTML ones you could replace [b] for <b> and [/b] for </b> separately. (.){1,}? is the same as (.*?) - that is, any symbols, least possible sequence length.
Updated: Thanks to MrP, (.){1,}? is (.)+?, my bad.
How about:
tagreg=/\[(.?)?(b|u|i|s|center|code)\]/gi;
"[b][i]helloworld[/i][/b]".replace(tagreg, "<$1$2>");
"[b]helloworld[/b]".replace(tagreg, "<$1$2>");
For me the above produces:
<b><i>helloworld</i></b>
<b>helloworld</b>
This appears to do what you want, and has the advantage of needing only a single pass.
Disclaimer: I don't code often in JS, so if I made any mistakes please feel free to point them out :-)
You are right about the inner pattern being troublesome.
((.){1,}?)
That captures a single character in the inner group, repeats it at least once, and then captures the whole run in the outer group. The inner group is overwritten on every repetition, so it ends up holding only the last character inside your tag.
You are also capturing your closing element name when you don't need it and are using {1} when that is implied. Below is a cleaned-up version:
/\[(b|u|i|s|center|code)](.+?)\[\/\1]/ig
Not sure about the other problem.
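To see what the extra groups actually capture, the original and the cleaned-up patterns can be compared on a small input. The variable names below are made up for the sketch, and the g flag is left off so that match() returns the capture groups.

```javascript
const originalPattern = /\[(b|u|i|s|center|code){1}]((.){1,}?)\[\/(\1){1}]/i;
const cleanedPattern = /\[(b|u|i|s|center|code)](.+?)\[\/\1]/i;

const fromOriginal = "[b]hey[/b]".match(originalPattern);
// Groups: 1 = tag name, 2 = whole content, 3 = last repetition of (.), 4 = closing name
console.log(fromOriginal.slice(1)); // [ 'b', 'hey', 'y', 'b' ]

const fromCleaned = "[b]hey[/b]".match(cleanedPattern);
// Only the two groups the replacement needs remain.
console.log(fromCleaned.slice(1)); // [ 'b', 'hey' ]
```

Both patterns match the same text, so the replacement "<$1>$2</$1>" gives the same result with either one.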
