Regular expression and xml

Regular expression and xml - javascript

Sometimes I have to work with XML that is not well-formed (it has no root element). So I take the first node name (in this case "error") and build a regex pattern: "</error>$". The problem is that it reports a match for the following string (where </error> is at the end of a line):
<error>0</error>
<roles>
<role rid="12" title="User" description="Hello world"></role>
<role rid="11" title="Admin" description="Hello world2"></role></roles>
After looking at some references I tried </error>\z and </error>\Z, but neither works.
Please help me find a solution.
P.S. If there is a better solution, I'd be really happy to hear it. The target environment is JavaScript.

If the only thing that makes your XML not well-formed is that it's missing root, then the fix is simple – just add some root element and then parse that and work with it as normal XML.
xml = '<root>' + xml + '</root>';
You really shouldn't try to parse XML with regular expressions.

You are correct. Unfortunately, JavaScript does not support the \A and \Z anchors. Consider testing only the substring of the last 10 characters against your regular expression (which would probably be more efficient).
Otherwise, try this expression, which matches if there is any non-whitespace after the </error> end tag.
[\s\S]*</error>(?=\s*\S)
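As a rough illustration of how the end-of-input check can be written in JavaScript, the sketch below compares $ with and without the m flag and uses (?![\s\S]) as a stand-in for \z. The sample string and variable names are made up for the example.

```javascript
// Shortened sample input: </error> ends the first line, not the whole string.
const xml = '<error>0</error>\n<roles>\n<role rid="12"></role></roles>';

// Without the m flag, $ only matches at the very end of the input.
const endsWithError = /<\/error>\s*$/.test(xml);

// With the m flag, $ also matches before every line break.
const lineEndsWithError = /<\/error>$/m.test(xml);

// (?![\s\S]) means "no character of any kind follows", which behaves like \z.
const strictEnd = /<\/error>(?![\s\S])/.test(xml);

console.log(endsWithError, lineEndsWithError, strictEnd);
```

If the pattern in the question really matched, the m flag was probably set; dropping it may be enough.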

Related

RegExp works in JS and PHP but not in Java

I have a regexp to extract an id and a label out of HTML source code. It can be found HERE.
As you can see, it works fine and it's fast, but when I try this regexp in Java with the same source code it 1. takes forever and 2. only matches one string (from the first a to the last a is one match).
I tried it with the MULTILINE flag on and off, but it made no difference. I don't understand how a regexp can work everywhere but in Java. Any ideas?
private static final String COURSE_REGEX = "<a class=\"list-group-item list-group-item-action \" href=\"https:\\/\\/moodle-hs-ulm\\.de\\/course\\/view\\.php\\?id=([0-9]*)\"(?:.*\\s){7}<span class=\"media-body \">([^<]*)<\\/span>";
Pattern pattern = Pattern.compile(COURSE_REGEX, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(sourceCode);
List<String> courses = new ArrayList<>();
while(matcher.find() && matcher.groupCount() == 2){
courses.add(matcher.group(1) + "(" + matcher.group(2) + ")");
}

Your regex is running into catastrophic backtracking because of the gargantuan number of possible permutations the subexpression (?:.*\s){7} needs to check (because the . can also match spaces). Java aborts the match attempt after a certain number of steps (not sure how many, certainly > 1.000.000). PHP or JS may not be so cautious.
If you simplify that part of your regex to .*?, you do get the matches:
"(?s)<a class=\"list-group-item list-group-item-action \" href=\"https://moodle-hs-ulm\\.de/course/view\\.php\\?id=([0-9]*)\".*?<span class=\"media-body \">([^<]*)</span>"
Note that you need the DOTALL flag ((?s), so . may match a newline) instead of the MULTILINE flag which changes the behavior of ^ and $ anchors (none of which your regex is using).
Also note that you don't need to escape slashes in a Java regex.
This solution is not very robust because .*? is rather unspecific. I suppose your previous attempt of (?:.*\\s){7} may have been designed to match no more than 7 lines of text? In that case, you could use (?:(?!</a>).)* instead to ensure that you don't cross over into the next <a> tag. That's one of the dangers of parsing HTML with regex :)
Finally, greetings from a staff member of the faculty of Informatics at your university :)
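For comparison, here is a minimal JavaScript sketch of the (?:(?!</a>).)* idea from the answer above. The markup, the example.test host and the class name are invented for the illustration; [\s\S] is used instead of . so that the pattern also crosses line breaks.

```javascript
// Made-up markup with two links, each containing a label span.
const html = [
  '<a class="item" href="https://example.test/course/view.php?id=12">',
  '  <span class="media-body ">Algebra</span>',
  '</a>',
  '<a class="item" href="https://example.test/course/view.php?id=34">',
  '  <span class="media-body ">Biology</span>',
  '</a>'
].join('\n');

// (?:(?!<\/a>)[\s\S])*? consumes one character at a time, but only while
// the text ahead is not "</a>", so a match can never run into the next link.
const courseRegex = /<a class="item" href="https:\/\/example\.test\/course\/view\.php\?id=([0-9]+)"(?:(?!<\/a>)[\s\S])*?<span class="media-body ">([^<]*)<\/span>/g;

const courses = [];
let m;
while ((m = courseRegex.exec(html)) !== null) {
  courses.push(m[1] + '(' + m[2] + ')');
}
console.log(courses);
```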

convert js regex into python regex

I'm working on a part of a project that replaces http URLs with https URLs where possible.
The problem is that the regular expressions for that are written for the JavaScript regex parser, but I'm using them inside Python. To be compatible, I would like to rewrite each regex during parsing into a valid Python regex.
As an example, I am given this replacement pattern:
https://$1wikimediafoundation.org/
and I would like a pattern like this:
https://\1wikimediafoundation.org/
My problem is that I don't know how to do that (converting $ into \).
This code doesn't work:
'https://$1wikimediafoundation.org/'.replace('$', '\')
It generates the following error:
SyntaxError: EOL while scanning string literal
This code runs without error:
'https://$1wikimediafoundation.org/'.replace('$', '\\')
but generates the wrong output:
'https://\\1wikimediafoundation.org/'

Actually it works:
>>> 'https://$1wikimediafoundation.org/'.replace('$', '\\')
'https://\\1wikimediafoundation.org/'
>>> print 'https://$1wikimediafoundation.org/'.replace('$', '\\')
https://\1wikimediafoundation.org/
When you evaluate 'https://$1wikimediafoundation.org/'.replace('$', '\\') in the interactive interpreter, it shows the __repr__ (representation) of the resulting string, in which special characters are escaped. The string itself contains a single backslash.
By printing it, you get the __str__, the readable version. (See this answer on __str__ vs __repr__)

try this:
'https://$1wikimediafoundation.org/'.replace('$', '\\')
Note that a raw string does not help here: r'\' is itself a syntax error, because a raw string literal cannot end in a single backslash. The backslash has to be written as '\\'.

You can test your regex at https://regex101.com/ and then port it to Python.
Additionally, to replace a matched group you can use the re.sub function along these lines:
re.sub(r"'([^']*)'", r'{\1}', col)
which replaces
'Protein_Expectation_Value_Log(e)', 'Protein_Intensity_Log(I)'
with
{Protein_Expectation_Value_Log(e)}, {Protein_Intensity_Log(I)}
You can find more details here.

Note that $& in a JavaScript replacement pattern should be converted to \g<0>, since \0 in a Python replacement string is the NUL character (\x00), not the whole match.
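The conversion rules mentioned in these answers could be collected into one helper. The sketch below is written in JavaScript (the source language of the patterns) and is only an illustration: toPythonReplacement is a hypothetical name, it emits \g<n> rather than \1 so that a following digit cannot be misread, and it ignores less common tokens such as $`, $' and named groups.

```javascript
// Hypothetical helper: rewrite a JavaScript replacement string so that it
// can be passed to Python's re.sub as the replacement argument.
function toPythonReplacement(jsReplacement) {
  return jsReplacement.replace(/\\|\$(\$|&|\d{1,2})/g, (whole, token) => {
    if (whole === '\\') return '\\\\';   // a literal backslash must be doubled for Python
    if (token === '$') return '$';       // $$ is a literal dollar sign in JavaScript
    if (token === '&') return '\\g<0>';  // $& is the whole match
    const group = Number(token);
    // $0 is not a group reference in JavaScript, so it is left untouched.
    return group === 0 ? whole : '\\g<' + group + '>';
  });
}

console.log(toPythonReplacement('https://$1wikimediafoundation.org/'));
```

A replacer function is used instead of a replacement string so that the returned text is inserted literally, without another round of $ processing.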

Confused with Regex JS pattern

OK, I have the following data in my div:
<div id="mydiv">
<!--
what is your present
<code>alert("this is my present");</code>
where?
<code>alert("here at my left hand");</code>
oh thank you! i love you!! hehe
<code>alert("welcome my honey ^^");</code>
-->
</div>
What I need to do is get all the scripts inside the <code> blocks and the text nodes between them, without removing the HTML comment from the page. It's homework given by my professor and I can't modify that div block.
I need to use regular expressions for this, and this is what I did:
var block = $.trim($("div#mydiv").html()).replace("<!--","").replace("-->","");
var htmlRegex = new RegExp(""); //I don't know what to do here
var codeRegex = new RegExp("^<code(*n)</code>$","igm");
var code = codeRegex.exec(block);
var html = "";
It really doesn't work. Please don't just give the exact answer; please teach me. Thank you.
I need to have the following blocks for the variable code:
alert("this is my present");
alert("here at my left hand");
alert("welcome my honey ^^");
and these are the blocks I need for the variable html:
what is your present
where?
oh thank you! i love you!! hehe
My question is: what regex pattern gives the results above?

Parsing HTML with a regular expression is not something you should do.
I'm sure your professor thinks he/she was really clever and that there's no way to access the DOM API and can wave a banner around and justify some minor corner-case for using regex to parse the DOM and that sometimes it's okay.
Well, no, it isn't. If you have complex code in there, what happens? Your regex breaks, and perhaps becomes a security exploit if this is ever in production.
So, here:
http://jsfiddle.net/zfp6D/
Walk the dom, get the nodeType 8 (comment) text value out of the node.
Invoke the HTML parser (that thing that browsers use to parse HTML, rather than regex, why you wouldn't use the HTML parser to parse HTML is totally beyond me, it's like saying "Yeah, I could nail in this nail with a hammer, but I think I'm going to just stomp on the nail with my foot until it goes in").
Find all the CODE elements in the newly parsed HTML.
Log them to console, or whatever you want to do with them.

First of all, you should be aware that because HTML is not a regular language, you cannot do generic parsing using regular expressions that will work for all valid inputs (generic nesting in particular cannot be expressed with regular expressions). Many parsers do use regular expressions to match individual tokens, but other algorithms need to be built around them.
However, for a fixed input such as this, it's just a case of working through the structure you have (though it's still often easier to use different parsing methods than just regular expressions).
First, let's get all the code:
var code = '', match = [];
var regex = new RegExp("<code>(.*?)</code>", "g");
while (match = regex.exec(content)) {
code += match[1] + "\n";
}
I assume content contains the content of the div that you've already extracted. Here the "g" flag says this is for "global" matching, so we can reuse the regex to find every match. The brackets indicate a capturing group, . means any character except a line break, * means repeated 0 or more times, and ? means "non-greedy" (see what happens without it to see what it does).
Now we can do a similar thing to get all the other bits, but this time the regex is slightly more complicated:
new RegExp("(<!--|</code>)([\\s\\S]*?)(-->|<code>)", "g")
Here | means "or". So this matches all the bits that start with either "start comment" or "end code" and end with "end comment" or "start code". [\s\S] is used in place of . because . does not match line breaks, and in your input the text sits on its own lines. Note also that we now have 3 sets of brackets, so the part we want to extract is match[2] (the second set).
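Putting the two patterns together, a self-contained sketch might look like the following. It assumes content holds the comment text from the question as a plain string; codeBlocks and textBlocks are names made up for the example, and a lookahead replaces the third group so that the closing delimiter is not consumed.

```javascript
// The comment text from the question, as a plain string.
const content = [
  '<!--',
  'what is your present',
  '<code>alert("this is my present");</code>',
  'where?',
  '<code>alert("here at my left hand");</code>',
  'oh thank you! i love you!! hehe',
  '<code>alert("welcome my honey ^^");</code>',
  '-->'
].join('\n');

// Everything between <code> and </code>.
const codeBlocks = [];
const codeRegex = /<code>([\s\S]*?)<\/code>/g;
let match;
while ((match = codeRegex.exec(content)) !== null) {
  codeBlocks.push(match[1]);
}

// Everything between "<!--" or "</code>" and the next "<code>" or "-->".
const textBlocks = [];
const textRegex = /(?:<!--|<\/code>)([\s\S]*?)(?=-->|<code>)/g;
while ((match = textRegex.exec(content)) !== null) {
  const text = match[1].trim();
  if (text) textBlocks.push(text); // skip the whitespace-only gap before -->
}

console.log(codeBlocks, textBlocks);
```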

You're doing a lot of unnecessary stuff. .html() gives you the inner contents as a string. You should be able to use regEx to grab exactly what you need from there. Also, try to stick with regEx literals (e.g. /^regexstring$/). You have to escape escape characters using new RegExp which gets really messy. You generally only want to use new RegExp when you need to put a string var into a regEx.
The match function of strings accepts regEx and returns a collection of every match when you add the global flag (e.g. /^regexstring$/g <-- note the 'g'). I would do something like this:
var block = $('#mydiv').html(), //you can set multiple vars in one statement w/commas
matches = block.match(/<code>[^<]*<\/code>/g);
//[^<]* <-- 0 or more characters that aren't '<' - google 'negated character class'
matches = matches.join('_') //lazy way of avoiding a loop - join into a string with a safe character
.replace(/<\/*code>/g,'') //\/* 0 or more forward slashes
.split('_');//turn the joined string back into an array and keep the result
//Now do what you want with matches. Eval (ew) or append in a script tag (ew).
//You have no control over the 'ew'. I just prefer data to scripts in strings

How to add special characters like & > in XML file using JavaScript

I am generating XML using Javascript. It works fine if there are no special characters in the XML. Otherwise, it will generate this message: "invalid xml".
I tried to replace some special characters, like:
xmlData=xmlData.replaceAll(">","&gt;");
xmlData=xmlData.replaceAll("&","&amp;");
//but it doesn't work.
For example:
<category label='ARR Builders & Developers'>
Thanks.

Consider generating the XML using DOM methods. For example:
var c = document.createElement("category");
c.setAttribute("label", "ARR Builders & Developers");
var s = new XMLSerializer().serializeToString(c);
s; // => "<category label=\"ARR Builders &amp; Developers\"></category>"
This strategy should avoid the XML entity escaping problems you mention but might have some cross-browser issues.

This will do the replacement in JavaScript:
xml = xml.replace(/</g, "&lt;");
xml = xml.replace(/>/g, "&gt;");
This uses regular expression literals to replace all less than and greater than symbols with their escaped equivalent.

JavaScript comes with a powerful replace() method for string objects.
In general - and basic - terms, it works this way:
var myString = yourString.replace([regular expression or simple string], [replacement string]);
The first argument to .replace() method is the portion of the original string that you wish to replace. It can be represented by either a plain string object (even literal) or a regular expression.
The regular expression is obviously the most powerful way to select a substring.
The second argument is the string object (even literal) that you want to provide as a replacement.
In your case, the replacement operation should look as follows:
xmlData=xmlData.replace(/&/g,"&amp;");
xmlData=xmlData.replace(/>/g,"&gt;");
//this time it should work.
Notice that the first replacement operation handles the ampersand: if you replaced it later, you would corrupt the entities that the other replacements had just produced, such as "&gt;" (which would become "&amp;gt;").
In addition, pay attention to the regex 'g' flag: with it the replacement takes place throughout your text, not only on the first match.
I used regular expressions here; a plain string would also work as the first argument, but it replaces only the first occurrence.
You can find a complete reference for String.replace() here.
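As a small sketch of the same idea, the five characters that XML treats specially can be escaped in a single pass, which sidesteps the ordering problem altogether. escapeXml and XML_ENTITIES are hypothetical names chosen for this example.

```javascript
// The five predefined XML entities.
const XML_ENTITIES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&apos;'
};

// Escape a value before placing it in XML text or an attribute.
function escapeXml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => XML_ENTITIES[ch]);
}

const label = escapeXml('ARR Builders & Developers');
const xmlData = "<category label='" + label + "'></category>";
console.log(xmlData);
```

The escaping should be applied to each value as it is inserted, not to the finished document, since the latter would also escape the markup itself.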

Processing Javascript RegEx submatches

I am trying to write a JavaScript regex to replace user-entered tags with real HTML tags, so [b] becomes <b> and so forth. The regex I am using looks like this:
var exptags = /\[(b|u|i|s|center|code){1}]((.){1,}?)\[\/(\1){1}]/ig;
with the following JavaScript
s.replace(exptags,"<$1>$2</$1>");
This works fine for tags that are not nested, for example:
[b]hello[/b] [u]world[/u]
But if the tags are nested inside each other, it only matches the outer tags, for example:
[b]foo [u]to the[/u] bar[/b]
This only matches the b tags. How can I fix this? Should I just loop until the result is the same as the starting string? I also have a feeling that the ((.){1,}?) pattern is wrong.
Thanks

The easiest solution would be to replace all the tags, whether they are closed or not, and let .innerHTML work out whether they match. It will be much more resilient that way.
var tagreg = /\[(\/?)(b|u|i|s|center|code)]/ig
div.innerHTML="[b][i]helloworld[/b]".replace(tagreg, "<$1$2>") //no closing i
//div.innerHTML=="<b><i>helloworld</i></b>"

AFAIK you can't express recursion with regular expressions.
You can however do that with .NET's System.Text.RegularExpressions using balanced matching. See more here: http://blogs.msdn.com/bclteam/archive/2005/03/15/396452.aspx
If you're using .NET you can probably implement what you need with a callback.
If not, you may have to roll your own little javascript parser.
Then again, if you can afford to hit the server you can use the full parser. :)
What do you need this for, anyway? If it is for anything other than a preview I highly recommend doing the processing server-side.

You could just repeatedly apply the regexp until it no longer matches. That would do odd things like "[b][b]foo[/b][/b]" => "<b>[b]foo</b>[/b]" => "<b><b>foo</b></b>", but as far as I can see the end result will still be a sensible string with matching (though not necessarily properly nested) tags.
Or if you want to do it 'right', just write a simple recursive descent parser. Though people might expect "[b]foo[u]bar[/b]baz[/u]" to work, which is tricky to recognise with a parser.

The reason the nested block doesn't get replaced is that the match for [b] leaves the search position after [/b]. Thus, everything that ((.){1,}?) matched is skipped rather than searched again.
It is possible to write a recursive parser on the server side -- Perl uses qr// and Ruby probably has something similar.
You don't necessarily need true recursion, though. You can use a relatively simple loop to handle the string equivalently:
var s = '[b]hello[/b] [u]world[/u] [b]foo [u]to the[/u] bar[/b]';
var exptags = /\[(b|u|i|s|center|code){1}]((.){1,}?)\[\/(\1){1}]/ig;
while (s.match(exptags)) {
s = s.replace(exptags, "<$1>$2</$1>");
}
document.writeln('<div>' + s + '</div>'); // after
In this case, it'll make 2 passes:
0: [b]hello[/b] [u]world[/u] [b]foo [u]to the[/u] bar[/b]
1: <b>hello</b> <u>world</u> <b>foo [u]to the[/u] bar</b>
2: <b>hello</b> <u>world</u> <b>foo <u>to the</u> bar</b>
Also, a few suggestions for cleaning up the RegEx:
var exptags = /\[(b|u|i|s|center|code)\](.+?)\[\/(\1)\]/ig;
{1} is assumed when no other count specifiers exist
{1,} can be shortened to +
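The loop and the cleaned-up pattern can be combined into one function that repeats the replacement until the string stops changing, which is the approach the question itself suggests. bbcodeToHtml is a hypothetical name, and [\s\S] is used instead of . on the assumption that tag contents may span several lines.

```javascript
// Repeatedly replace matched [tag]...[/tag] pairs until nothing changes.
function bbcodeToHtml(input) {
  const tagPattern = /\[(b|u|i|s|center|code)\]([\s\S]+?)\[\/\1\]/gi;
  let output = input;
  let previous;
  do {
    previous = output;
    // Each pass converts the outermost pairs it can see; pairs nested
    // inside them are picked up on the next pass.
    output = output.replace(tagPattern, '<$1>$2</$1>');
  } while (output !== previous);
  return output;
}

console.log(bbcodeToHtml('[b]foo [u]to the[/u] bar[/b]'));
```

Comparing the result with the previous string avoids running the regex twice per pass, as the while (s.match(...)) form does.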

Agree with Richard Szalay, but his regex didn't get quoted right:
var exptags = /\[(b|u|i|s|center|code)](.*)\[\/\1]/ig;
is cleaner. Note that I also changed .+? to .*. There are two problems with .+?:
you won't match [u][/u], since there isn't at least one character between them (+)
a non-greedy match won't deal as nicely with the same tag nested inside itself (?)

Yes, you will have to loop. Alternatively, since your tags look so much like HTML ones, you could replace [b] with <b> and [/b] with </b> separately. (.){1,}? is the same as (.*?) - that is, any symbols, least possible sequence length.
Updated: Thanks to MrP, (.){1,}? is (.)+?, my bad.

How about:
tagreg=/\[(.?)?(b|u|i|s|center|code)\]/gi;
"[b][i]helloworld[/i][/b]".replace(tagreg, "<$1$2>");
"[b]helloworld[/b]".replace(tagreg, "<$1$2>");
For me the above produces:
<b><i>helloworld</i></b>
<b>helloworld</b>
This appears to do what you want, and has the advantage of needing only a single pass.
Disclaimer: I don't code often in JS, so if I made any mistakes please feel free to point them out :-)
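One thing to watch in the pattern above: (.?)? accepts any single character before the tag name, so [xb] would be turned into <xb>. A minimal sketch that restricts it to an optional slash follows; convertTags is a made-up name, and like the original it does not check that tags are balanced.

```javascript
// Single-pass conversion: opening and closing tags are rewritten independently.
const tagreg = /\[(\/?)(b|u|i|s|center|code)\]/gi;

function convertTags(input) {
  // $1 is either "/" or empty, $2 is the tag name.
  return input.replace(tagreg, '<$1$2>');
}

console.log(convertTags('[b][i]helloworld[/i][/b]'));
```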

You are right about the inner pattern being troublesome.
((.){1,}?)
That captures a single character in the inner group, repeats it at least once, and then captures the whole run in the outer group. So every character inside your tag passes through a capturing group that you never use.
You are also capturing your closing element name when you don't need it, and using {1} when that is implied. Below is a cleaned-up version:
/\[(b|u|i|s|center|code)](.+?)\[\/\1]/ig
Not sure about the other problem.
