I noticed that the jQuery parseJSON basically does a simple regex "check":
parseJSON: function( data ) {
    if ( typeof data !== "string" || !data ) {
        return null;
    }

    // Make sure leading/trailing whitespace is removed (IE can't handle it)
    data = jQuery.trim( data );

    // Make sure the incoming data is actual JSON
    // Logic borrowed from http://json.org/json2.js
    if ( /^[\],:{}\s]*$/.test(data.replace(/\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4})/g, "#")
        .replace(/"[^"\\\n\r]*"|true|false|null|-?\d+(?:\.\d*)?(?:[eE][+\-]?\d+)?/g, "]")
        .replace(/(?:^|:|,)(?:\s*\[)+/g, "")) ) {

        // Try to use the native JSON parser first
        return window.JSON && window.JSON.parse ?
            window.JSON.parse( data ) :
            (new Function("return " + data))();

    } else {
        jQuery.error( "Invalid JSON: " + data );
    }
},
If it passes that "check" and the browser is modern, the native JSON parser is used. Otherwise, I assume that for a browser like IE6 a new Function is created and immediately invoked to return the object.
Question #1: Since this is just a simple regex test, isn't this prone to some sort of obscure edge-case exploit? Shouldn't we really be using a full blown parser, for the browsers that don't support native JSON parsing at least?
Question #2: How much "safer" is (new Function(" return " + data ))() as opposed to eval("(" + text + ")")?
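For reference, here is a minimal side-by-side sketch of the two styles being compared (my own illustration, not jQuery code):

// Both evaluate the string as code; neither is safe for untrusted input.
var text = '{"a": 1}';

var viaEval = eval("(" + text + ")");                  // parentheses force an expression, not a block
var viaFunction = (new Function("return " + text))(); // runs in global scope, not the caller's scope

console.log(viaEval.a === 1 && viaFunction.a === 1);   // true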
As mentioned in the comments there, jQuery's JSON parser "borrows" the logic that tests whether the JSON string is valid straight from json2.js. This makes it "as safe" as the most common non-native implementation, which is rather strict anyway:
// In the second stage, we run the text against regular expressions that look
// for non-JSON patterns. We are especially concerned with '()' and 'new'
// because they can cause invocation, and '=' because it can cause mutation.
// But just to be safe, we want to reject all unexpected forms.
// We split the second stage into 4 regexp operations in order to work around
// crippling inefficiencies in IE's and Safari's regexp engines. First we
// replace the JSON backslash pairs with '#' (a non-JSON character). Second, we
// replace all simple value tokens with ']' characters. Third, we delete all
// open brackets that follow a colon or comma or that begin the text. Finally,
// we look to see that the remaining characters are only whitespace or ']' or
// ',' or ':' or '{' or '}'. If that is so, then the text is safe for eval.
if (/^[\],:{}\s]*$/.
        test(text.replace(/\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4})/g, '#').
        replace(/"[^"\\\n\r]*"|true|false|null|-?\d+(?:\.\d*)?(?:[eE][+\-]?\d+)?/g, ']').
        replace(/(?:^|:|,)(?:\s*\[)+/g, ''))) {
What I don't understand is why jQuery runs the regular-expression replaces before checking for a native implementation, which would enforce correct JSON grammar anyway. It seems like it would speed things up to only do this when a native implementation isn't available.
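A sketch of that reordering (my own rearrangement of the code above, not a tested patch):

function parseJSONReordered( data ) {
    if ( typeof data !== "string" || !data ) {
        return null;
    }
    data = jQuery.trim( data );

    // Prefer the native parser: it enforces the JSON grammar itself,
    // which makes the sanitizing regexes below redundant work.
    if ( window.JSON && window.JSON.parse ) {
        return window.JSON.parse( data );
    }

    // Only sanitize when falling back to code evaluation.
    if ( /^[\],:{}\s]*$/.test(data.replace(/\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4})/g, "#")
        .replace(/"[^"\\\n\r]*"|true|false|null|-?\d+(?:\.\d*)?(?:[eE][+\-]?\d+)?/g, "]")
        .replace(/(?:^|:|,)(?:\s*\[)+/g, "")) ) {
        return (new Function("return " + data))();
    }

    jQuery.error( "Invalid JSON: " + data );
}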
Question 2 is answered very well by bobince in another question:
It's not really a big difference, but the feeling is that eval is ‘worse’ than new Function. Not in terms of security — they're both equally useless in the face of untrusted input, but then hopefully your webapp is not returning untrusted JSON strings — but in terms of language-level weirdness, and hence resistance to optimisation.
Check out Nick Craver's answer there too for a direct quote from John Resig.
The JSON.parse method is the safest. This is defined when you include json2.js from http://www.json.org/js.html and used automatically by parseJSON/getJSON. It parses instead of executing the JSON markup.
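A quick usage sketch (assuming json2.js is included so older browsers get a real parser too):

// jQuery.parseJSON delegates to JSON.parse when it is available,
// so the input is parsed rather than executed.
var obj = jQuery.parseJSON('{"name": "value"}');
console.log(obj.name); // "value"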
Related
I am working on a Node.js application that needs to handle JSON strings and work with the objects.
Mostly all is well and JSON.parse(myString) is all I need.
The application also gets data from third parties. One of which seems to be developed with Python.
My application repeatedly chokes on boolean values, since they come capitalized.
Example:
var jsonStr = "{'external_id': 123, 'description': 'Run #2944', 'test_ok': False}";
try {
    var jsonObj = JSON.parse(jsonStr);
} catch (err) {
    console.error('Whoops! Could not parse, error: ' + err.message);
}
Notice the test_ok parameter: all is good when it follows the JavaScript way of a lower-case false boolean, but the capitalized boolean does not parse.
Of course I could try and replace capitalized boolean values via a string replace, but I am afraid to alter things that should not get altered.
Is there an alternative to JSON.parse that is a little more forgiving?
I don't mean to be rude, but according to json.org it's invalid JSON. That means you'll have to run a hack where you identify the stringified boolean "True" and convert it to "true" without affecting a string that is, let's say, "True dat!".
First of all, I would not recommend using the code below; this is just to demonstrate how to convert your input string into valid JSON. There were two problems: one is the boolean False, and the other is the single quotes around property names and values, which JSON requires to be double quotes.
I don't believe having to convert a string into a valid JSON is a good choice. If you have no alternative, meaning you don't have access to the code generating this string, then the code below is still not a good choice because it will have issues if you have embedded quotes in the string values. i.e. you would need different string replace logic.
Keep all this in mind before using the code.
var jsonStr = "{'external_id': 123, 'description': 'Run #2944', 'test_ok': False}";
try {
    // Replace every capitalized False with false, then single with double quotes.
    jsonStr = jsonStr.replace(/:\s*False/g, ':false').replace(/'/g, '"');
    var jsonObj = JSON.parse(jsonStr);
    console.log(jsonObj);
} catch (err) {
    console.error('Whoops! Could not parse, error: ' + err.message);
}
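A slightly more defensive variant of the same hack (my own sketch, sharing the embedded-quote caveat): anchor the literals to value position after a colon, so a string value such as "True dat!" is not rewritten. It can still misfire if a string value itself contains a colon followed by a bare literal.

var jsonStr = "{'external_id': 123, 'description': 'Run #2944', 'test_ok': False}";

var fixed = jsonStr
    .replace(/'/g, '"')                  // single -> double quotes (still breaks on embedded quotes)
    .replace(/:\s*True\b/g, ': true')    // Python True  -> JSON true
    .replace(/:\s*False\b/g, ': false')  // Python False -> JSON false
    .replace(/:\s*None\b/g, ': null');   // Python None  -> JSON null

console.log(JSON.parse(fixed).test_ok); // false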
Short description:
Is there a javascript JSON-converter out there that is able to preserve Dates and does not use eval?
Example:
var obj1 = { someInt: 1, someDate: new Date(1388361600000) };
var obj2 = parseJSON(toJSON(obj1));
//obj2.someDate should now be of type Date and not String
//(like in ordinary json-parsers).
Long description:
I think most people working with JSON already had the problem of how to transmit a Date:
var obj = { someInt: 1, someDate: new Date(1388361600000) }
When converting this to JSON and back, the date suddenly became a String:
JSON.parse(JSON.stringify(obj))
== { someInt: 1, someDate: "2013-12-30T00:00:00.000Z" }
This is a huge disadvantage since you cannot easily submit a Date using JSON. There is always some post-processing necessary (and you need to know where to look for the dates).
Then Microsoft found a loophole in the specification of JSON and - by convention - encodes a date as follows:
{"someInt":1,"someDate":"\/Date(1388361600000)\/"}
The brilliance in this is that there is now a definitive way to tell a String from a Date inside a valid JSON-string: an encoded String will never contain the substring "\/" (a backslash followed by a slash, not to be confused with an escaped slash). Thus a parser that knows this convention can safely create the Date-object.
If a parser does not know this convention, the date will just be parsed to the harmless and readable String "/Date(1388361600000)/".
The huge drawback is that there seems to be no parser that can read this without using eval. Microsoft proposes the following way to read this:
var obj = eval("(" + s.replace(/\"\\\/Date\((\d+)\)\\\/\"/g, function (match, time) {
    return "new Date(" + time + ")";
}) + ")");
This works like a charm: You never have to care about Dates in JSON anymore. But it uses the very unsafe eval-method.
Do you know any ready-to-use-parser that achieves the same result without using eval?
EDIT
There was some confusion in the comments about the advantages of the tweaked encoding.
I set up a jsFiddle that should make the intentions clear: http://jsfiddle.net/AJheH/
I disagree with adeno's comment that JSON is a notation for strings and cannot represent objects. JSON is a notation for compound data types in the form of serialized objects, albeit that the primitive types can only be integer, float, string or bool. (Update: if you've ever had to deal with spaghetti-coded XML, then you'll appreciate that maybe this is a good thing too!)
Presumably Hungarian notation has lost favour with Microsoft if they now think that creating a non-standard notation incorporating the data type to describe a type is a better idea.
Of itself 'eval' is not evil - it makes solving some problems a lot easier - but it's very difficult to implement good security while using it. Indeed it's disabled by default with a Content Security Policy.
IMHO it boils down to storing the date as 1388361600000 or "2013-12-30T00:00:00.000Z". IMHO the latter has significantly more semantic value: taken out of context it is clearly a date+time, while the former could be just about anything. Both can be parsed by the ECMAScript Date object without resorting to eval. Yes, this does require code to process the data, but what can you do with any sort of data without parsing it? The only time I can see this as being an advantage is with a schemaless database, but in fairness this is a BIG problem.
The issue is the following line of code. Take a look at the parseWithDate function in the example linked below: add the script to the page and change the line as shown and it will work.
http://www.asp.net/ajaxlibrary/jquery_webforms_serialize_dates_to_json.ashx
var parsed1 = JSON.parse(s1); // changed to below
var parsed1 = JSON.parseWithDate(s1);
Updated jsFiddle that works http://jsfiddle.net/GLb67/1/
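If you'd rather not pull in that library, JSON.parse's reviver callback can do the same job without eval. A minimal sketch assuming the "\/Date(ms)\/" convention described above (parseWithDates and msDateRe are my own names):

// JSON.parse unescapes \/ to /, so the parsed string value reads "/Date(1388361600000)/".
var msDateRe = /^\/Date\((-?\d+)\)\/$/;

function parseWithDates(json) {
    return JSON.parse(json, function (key, value) {
        var match = typeof value === "string" && msDateRe.exec(value);
        return match ? new Date(+match[1]) : value; // revive matching strings as Dates
    });
}

var obj = parseWithDates('{"someInt":1,"someDate":"\\/Date(1388361600000)\\/"}');
console.log(obj.someDate instanceof Date); // true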
I have a site that uses AJAX, and I ran into a problem.
The server returns a JSON string something like this: {a:"\x48\x65\x6C\x6C\x6F"}.
Then in xx.responseText we have the string '{a:"\x48\x65\x6C\x6C\x6F"}'.
But if I create the JavaScript string "\x48\x65\x6C\x6C\x6F" directly, I get "Hello" and not the hex!
Is it possible get in xx.responseText "real" text from HEX (automatically, without .replace())?
If the output is at all regular (predictable), .replace() is probably the simplest.
var escapeSequences = xx.responseText.replace(/^\{a:/, '').replace(/\}$/, '');
console.log(escapeSequences === "\"\\x48\\x65\\x6C\\x6C\\x6F\""); // true
Or, if a string literal that's equivalent in value (but may not otherwise be identical) is sufficient, you could parse (see below) and then stringify() an individual property.
console.log(JSON.stringify(data.a) === "\"Hello\""); // true
Otherwise, you'll likely need to run responseText through a lexer to tokenize it and retrieve the literal from that. JavaScript doesn't include an option for this separate from parsing/evaluating, so you'll need to find a library for this.
"Lexer written in JavaScript?" may be a good place to start for that.
To parse it:
Since it appears to be a string of code, you'll likely have to use eval().
var data = eval('(' + xx.responseText + ')');
console.log(data.a); // Hello
Note: The parentheses make sure {...} is evaluated as an Object literal rather than as a block.
Also, I'd suggest looking into alternatives to code for communicating data like this.
A common option is JSON, which takes its syntax from JavaScript, but uses a rather strict subset. It doesn't allow functions or other potentially problematic code to be included.
var data = JSON.parse(xx.responseText);
console.log(data.a); // Hello
Visiting JSON.org, you should be able to find a reference or library for the choice of server-side language to output JSON.
{ "a": "Hello" }
Why not just let the parser do its job and handle the \x escape sequences, and then just convert the string back to hex again afterwards, e.g.
function charToHex(c) {
    var hex = c.charCodeAt(0).toString(16);
    return (hex.length === 2) ? hex : '0' + hex;
}
"Hello".replace(/./g, charToHex); // gives "48656c6c6f"
Sometimes I see code like this in a page's HTML source:
if (JSON.stringify(["\u2028\u2029"]) === '["\u2028\u2029"]') JSON.stringify = function (a) {
    var b = /\u2028/g,
        c = /\u2029/g;
    return function (d, e, f) {
        var g = a.call(this, d, e, f);
        if (g) {
            if (-1 < g.indexOf('\u2028')) g = g.replace(b, '\\u2028');
            if (-1 < g.indexOf('\u2029')) g = g.replace(c, '\\u2029');
        }
        return g;
    };
}(JSON.stringify);
What is the problem with JSON.stringify(["\u2028\u2029"]) that it needs to be checked ?
Additional info: both JSON.stringify(["\u2028\u2029"]) and the literal '["\u2028\u2029"]' evaluate to a string containing the raw, invisible separator characters, so each value displays as "[" and "]" broken across two lines.
I thought it might be a security feature. FileFormat.info for 2028 and 2029 have a banner stating
Do not use this character in domain names. Browsers are blacklisting it because of the potential for phishing.
But it turns out that the line and paragraph separators \u2028 and \u2029 respectively are treated as a new line in ES5 JavaScript.
From http://www.thespanner.co.uk/2011/07/25/the-json-specification-is-now-wrong/
\u2028 and \u2029 characters that can break entire JSON feeds since the string will contain a new line and the JavaScript parser will bail out
So you are seeing a patch for JSON.stringify. Also see Node.js JavaScript-stringify
Edit: Yes, modern browsers' built-in JSON object should take care of this correctly. I can't find any links to the actual source to back this up though. The Chromium code search doesn't mention any bugs that would warrant adding this workaround manually. It looks like Firefox 3.5 was the first version to have native JSON support, not entirely bug-free though. IE8 supports it too. So it is likely a now unnecessary patch, assuming browsers have implemented the specification correctly.
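The same fix can be written as a standalone helper instead of a monkey-patch. A minimal sketch (toSafeJson is a hypothetical name, not from the original page):

// Escape the two characters that ES5 treats as line terminators,
// so the result is safe to embed in a <script> block or parse as JavaScript.
function toSafeJson(value) {
    return JSON.stringify(value)
        .replace(/\u2028/g, '\\u2028')
        .replace(/\u2029/g, '\\u2029');
}

toSafeJson(["\u2028"]); // '["\u2028"]' with the escape intact, no raw separator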
After reading both answers, here is a simple visual explanation.
Doing this
alert(JSON.stringify({"a":"sddd\u2028sssss"})) // can cause problems
alerts the string broken across two lines, because the raw U+2028 character is rendered as a line break (screenshot omitted).
Changing the troublemaker to something else (for example, from \u to \1u) alerts a single unbroken line instead.
Now, let's invoke the function from my original Q and try alert(JSON.stringify({"a":"sddd\u2028sssss"})) again.
Result: the separator stays visible as the six characters \u2028,
and now everybody's happy.
\u2028 and \u2029 are invisible Unicode line and paragraph separator characters. Natively, the JSON.stringify method converts these escape codes to the actual characters (as JavaScript automatically does in strings), resulting in a string that displays as "[" and "]" on separate lines. The code you have provided does not let JSON.stringify emit the raw characters and preserves their \uXXXX representation in the output string, i.e. returning '["\u2028\u2029"]' with the escapes intact.
I have a bbcode -> html converter that responds to the change event in a textarea. Currently, this is done using a series of regular expressions, and there are a number of pathological cases. I've always wanted to sharpen the pencil on this grammar, but didn't want to get into yak shaving. But... recently I became aware of pegjs, which seems a pretty complete implementation of PEG parser generation. I have most of the grammar specified, but am now left wondering whether this is an appropriate use of a full-blown parser.
My specific questions are:
As my application relies on translating what I can to HTML and leaving the rest as raw text, does implementing bbcode using a parser that can fail on a syntax error make sense? For example: [url=/foo/bar]click me![/url] would certainly be expected to succeed once the closing bracket on the close tag is entered. But what would the user see in the meantime? With regex, I can just ignore non-matching stuff and treat it as normal text for preview purposes. With a formal grammar, I don't know whether this is possible because I am relying on creating the HTML from a parse tree and what fails a parse is ... what?
I am unclear where the transformations should be done. In a formal lex/yacc-based parser, I would have header files and symbols that denoted the node type. In pegjs, I get nested arrays with the node text. I can emit the translated code as an action of the pegjs generated parser, but it seems like a code smell to combine a parser and an emitter. However, if I call PEG.parse.parse(), I get back something like this:
[
    [
        "[",
        "img",
        "",
        [
            "/",
            "f",
            "o",
            "o",
            "/",
            "b",
            "a",
            "r"
        ],
        "",
        "]"
    ],
    [
        "[/",
        "img",
        "]"
    ]
]
given a grammar like:
document
    = (open_tag / close_tag / new_line / text)*

open_tag
    = ("[" tag_name "="? tag_data? tag_attributes? "]")

close_tag
    = ("[/" tag_name "]")

text
    = non_tag+

non_tag
    = [^\n\[\]]

new_line
    = ("\r\n" / "\n")
I'm abbreviating the grammar, of course, but you get the idea. So, if you notice, there is no contextual information in the array of arrays that tells me what kind of node I have, and I'm left to do the string comparisons again even though the parser has already done this. I expect it's possible to define callbacks and use actions to run them during a parse, but there is scant information available on the Web about how one might do that.
Am I barking up the wrong tree? Should I fall back to regex scanning and forget about parsing?
Thanks
First question (grammar for incomplete texts):
You can add
incomplete_tag = ("[" tag_name "="? tag_data? tag_attributes?)
// the closing bracket is omitted ---^
after open_tag and change document to include an incomplete tag at the end, as sketched below. The trick is that you provide the parser with all the productions it needs to always succeed, but the valid ones come first. You can then ignore incomplete_tag during the live preview.
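For instance, the amended document rule might read (a sketch reusing the question's rule names):

document
    = (open_tag / close_tag / new_line / text)* incomplete_tag?

// incomplete_tag only matches at the very end, after every valid production has had its chance
incomplete_tag
    = ("[" tag_name "="? tag_data? tag_attributes?)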
Second question (how to include actions):
You write so-called actions after expressions. An action is JavaScript code enclosed in braces and is allowed after any pegjs expression, i.e. also in the middle of a production!
In practice actions like { return result.join("") } are almost always necessary because pegjs splits into single characters. Also complicated nested arrays can be returned. Therefore I usually write helper functions in the pegjs initializer at the head of the grammar to keep actions small. If you choose the function names carefully the action is self-documenting.
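To make that concrete, here is a minimal sketch of actions that return labeled nodes (the node helper and the property names are my own invention):

{
    // Initializer: helpers defined here are visible to every action.
    function node(type, props) {
        props.type = type;
        return props;
    }
}

open_tag
    = "[" name:tag_name data:("=" d:tag_data { return d; })? "]"
        { return node("open_tag", { name: name, data: data }); }

text
    = chars:non_tag+ { return node("text", { value: chars.join("") }); }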
For an example see PEG for Python style indentation. Disclaimer: this is an answer of mine.
Regarding your first question, I have to say that a live preview is going to be difficult. The problems you pointed out, that the parser won't understand the input is "work in progress", are correct. Peg.js tells you at which point the error is, so maybe you could take that info, go a few words back, and parse again; or, if an end tag is missing, try adding it at the end.
The second part of your question is easier, but your grammar won't look so nice afterwards. Basically what you do is put callbacks on every rule, so for example
text
    = text:non_tag+ {
        // we captured the text in an array and can manipulate it now
        return text.join("");
    }
At the moment you have to write these callbacks inline in your grammar. I'm doing a lot of this stuff at work right now, so I might make a pull request to peg.js to fix that. But I'm not sure when I'll find the time to do this.
Try something like this replacement rule. You're on the right track; you just have to tell it to assemble the results.
text
= result:non_tag+ { return result.join(''); }