I have a Sencha Touch app. One of my stores uses an AJAX proxy and a JSON reader. Some of the strings in the JSON returned from my Sinatra app occasionally contain this character:
http://www.fileformat.info/info/unicode/char/2028/index.htm
Although it's invisible, the character occurs twice in the second string here, between the period and the ending quote:
"description": "Each of the levels requires logic, skill, and brute force to crush the enemy.
"
Try copy and pasting "Each of the levels requires logic, skill, and brute force to crush the enemy.
" into your javascript console! It won't be parsed as a string, and fails with SyntaxError: Unexpected token ILLEGAL.
This causes the JSON response to fail. I've been stuck on this for a long time! Any suggestions?
The only reliable way to fix this is server-side. Make sure your JSON generator emits those characters escaped, e.g. as \u2028.
In my experience, it's easiest to simply encode your JSON in plain ASCII which will always work. The downside is that it's less efficient as non-ASCII characters will take up more space, so depending on the frequency of those, you may not want that trade-off...
The documentation for Perl's JSON::XS has a good explanation of the problem and advice for how to fix it in Perl:
http://search.cpan.org/perldoc?JSON::XS#JSON_and_ECMAscript
Conceptually, you are only allowed to send strings from the server that are also valid JavaScript literals, which means escaping them appropriately.
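As a minimal sketch of that idea, you can post-process the serializer's output (shown here in JavaScript, assuming you control the code that produces the JSON):

// Post-process JSON.stringify output so the result is also a valid
// JavaScript literal: U+2028/U+2029 are legal inside JSON strings but
// not inside (pre-ES2019) JavaScript string literals.
function toSafeJson(value) {
    return JSON.stringify(value)
        .replace(/\u2028/g, '\\u2028')
        .replace(/\u2029/g, '\\u2029');
}

toSafeJson({ description: "ends with the separator\u2028" });
// -> '{"description":"ends with the separator\u2028"}', with the escape
//    spelled out as six plain ASCII characters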
If you want to fix this issue on the client, you need an extra workaround step (this only seems to work in Firefox):
var a = escape("Each of the levels requires logic, skill, and brute force to crush the enemy.");
alert(unescape(a));
But that discussion is moot, because you must escape on the server.
Avoid using eval to parse JSON.
Use JSON.parse or https://github.com/douglascrockford/JSON-js.
JSON.parse('{}'); // where {} is your JSON String
I am thinking of implementing a new project that has an import/export feature. First, I will have an array of around 45 objects. The object structure is simple, like this:
{"id": "someId", "quantity": 3}
So, in order to make it exportable, I will have to turn the whole array of these objects into one single string first. For this part, I think I will use JSON.stringify(). After that, I want to make the string as short as possible for users to work with (copy the string and paste it to share with other users, who import it back to get the original array). I know this part is not strictly necessary, but I really want to make it as short as possible. Hence the question: how do I convert an array of objects to the shortest possible string?
Any techniques such as Encoding, Encryption, or Hashing are acceptable as long as it is reversible to get the original data.
By "shortest possible", I mean any solution that produces a shorter string than plain stringification. I will accept the one that gives the shortest string to import.
I tried text minification, but it gives almost the same result as the original text. I also tried encryption, but it still gives a relatively long result.
Note: The string for import (that comes from export) can be human-readable or unreadable. It does not matter.
Deleting the optional JSON space after the colon and comma is a no-brainer. Let's assume you have already minified in that way.
xz compression is generally helpful.
Perhaps you know some strings that are very likely to appear repeatedly in the input doc. That might include:
"id":
"quantity":
Construct a prefix document which mentions such terms. The sender will compress prefix + doc, strip the initial unchanging bytes, and send the rest. The receiver will accept those bytes via TCP, prepend the unchanging bytes, and decompress.
Why does this improve compression ratios? Lempel-Ziv and related schemes maintain a dictionary and transmit integer indexes into that dictionary to indicate common words. A word can be fairly long, even longer than "quantity", and the longer it is, the greater the savings. If sender and receiver both know beforehand a set of words that belong in the dictionary, we can avoid sending the raw text of those words.
Your Chrome browser already compresses web headers in this way each time you do a Google search.
Finally, you might want to base64 encode the compressed output.
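Here is a sketch of the whole scheme in Node, using zlib rather than xz because zlib exposes a preset-dictionary API that does the compress-prefix-and-strip step for you (the dictionary contents and function names are illustrative):

const zlib = require('zlib');

// Terms both sides agree on in advance; plays the role of the prefix document.
const dictionary = Buffer.from('"id":"quantity":');

function exportString(objects) {
    const json = JSON.stringify(objects); // already minified: no optional spaces
    const compressed = zlib.deflateRawSync(json, { dictionary });
    return compressed.toString('base64'); // copy-paste friendly
}

function importString(str) {
    const compressed = Buffer.from(str, 'base64');
    const json = zlib.inflateRawSync(compressed, { dictionary });
    return JSON.parse(json.toString('utf8'));
}

importString(exportString([{ id: 'someId', quantity: 3 }]));
// -> [ { id: 'someId', quantity: 3 } ]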
Alternatively, ignore compression and use a database instead, in the way that tinyurl.com has been doing for quite a long time:
1. Set serial to 1.
2. Accept a new object, or set of objects.
3. Ask the DB if you've seen this input before.
4. If not, store it under a brand-new serial ID.
5. Send the matching ID to the receiving end. It can query the central database when it gets a new ID, and it can cache such results for future use.
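A minimal sketch of that scheme, with an in-memory Map standing in for the central database (illustrative only; a real deployment would use a shared persistent store):

const byId = new Map();      // serial ID -> JSON payload
const byPayload = new Map(); // JSON payload -> serial ID
let serial = 1;

function exportId(objects) {
    const payload = JSON.stringify(objects);
    if (byPayload.has(payload)) return byPayload.get(payload); // seen before
    const id = String(serial++);
    byId.set(id, payload);
    byPayload.set(payload, id);
    return id;
}

function importId(id) {
    // The receiver would query the central DB here, then cache the result.
    return JSON.parse(byId.get(id));
}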
You might opt for a simple CSV export. The export string becomes, if you use the pipe separator, something like:
id|quantity\nsomeId|3\notherId|8
which is the equivalent of
[{"id":"someId","quantity":3},{"id":"otherId","quantity":8}]
This approach removes the redundant id and quantity keys from each record, along with the unnecessary double quotes.
The downside is that your records should all have the same data structure, but that is generally the case.
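A sketch of the round trip, assuming every record has the same keys and that no value contains a pipe or a newline:

function toCsv(records) {
    const keys = Object.keys(records[0]);
    const rows = records.map(r => keys.map(k => r[k]).join('|'));
    return [keys.join('|'), ...rows].join('\n');
}

function fromCsv(text) {
    const [header, ...rows] = text.split('\n');
    const keys = header.split('|');
    return rows.map(row => {
        const values = row.split('|');
        return Object.fromEntries(keys.map((k, i) => {
            const v = values[i];
            // Restore numbers that the flat format turned into text.
            return [k, /^\d+$/.test(v) ? Number(v) : v];
        }));
    });
}

fromCsv(toCsv([{ id: 'someId', quantity: 3 }, { id: 'otherId', quantity: 8 }]));
// -> the original array of objects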
When implementing Server-Sent Events on your application server, you can terminate a message and have it sent by ending it with two line breaks: \n\n, as demonstrated on this documentation page.
So, what if you're receiving user input and forwarding it to all interested parties (as is typical in a chat application)? Could a malicious user not insert two line breaks in their payload to terminate the message early? Even more, could they not then set special fields such as the id and retry fields, now that they have access to the first characters of a line?
It seems that the only alternative is to scan their entire payload and replace instances of \n with something like \ndata:, so that their entire message payload keeps its position in the data tag.
However, is this not very inefficient? Scanning the entire message payload for each message and then potentially doing replacements means not only reading each payload in full, but also reallocating in the case of malicious input.
Or is there an alternative? I'm currently trying to decide between WebSockets and SSE, as they are quite similar, and this issue is making me lean more towards WebSockets, because it feels as if they would be more efficient if they can avoid this potential vulnerability.
Edit: To clarify, I'm mostly ignorant as to whether there is a way around having to scan each message in its entirety for \n\n. And if not, do WebSockets have the same issue of needing to scan each message in its entirety? Because if they do, then no matter. But if not, that seems to be a point in favor of WebSockets over SSE.
It shouldn't be necessary to scan the payload if you're encoding the user data correctly. With JSON it is safe to use the "data" field in server-sent events, because JSON encoding escapes newline and control characters by default, as the RFC says:
The representation of strings is similar to conventions used in the C
family of programming languages. A string begins and ends with
quotation marks. All Unicode characters may be placed within the
quotation marks, except for the characters that must be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F).
https://www.rfc-editor.org/rfc/rfc7159#page-8
The important thing is that nobody sneaks in a newline character, but this isn't new to server-sent events: headers are separated by a single newline and can be tampered with too (if not correctly encoded). See https://www.owasp.org/index.php/HTTP_Response_Splitting
Here's an example of a server-sent events application with JSON encoding:
https://repl.it/#BlackEspresso/PointedWelloffCircles
You shouldn't be able to tamper with the data field, even though newline characters are allowed.
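To illustrate with a minimal Node sketch (the function and field names are made up for the example):

// JSON.stringify escapes a literal newline as the two characters \n,
// so the serialized payload can never contain a real line break, and
// can never terminate the SSE message early or start an id:/retry: line.
function sendEvent(res, userMessage) {
    res.write('data: ' + JSON.stringify({ text: userMessage }) + '\n\n');
}

JSON.stringify({ text: 'evil\n\nid: 1\nretry: 0' });
// -> '{"text":"evil\\n\\nid: 1\\nretry: 0"}'  (one line, harmless)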
Encoding shouldn't stop you from using Server-Sent Events, but there are major differences between WebSockets and SSE. For a comparison, see this answer: https://stackoverflow.com/a/5326159/1749420
Unless I'm missing something obvious, sanitizing input is a common thing in web development.
Since the source that you shared explicitly mentioned a PHP example, I just did some research and lookie here:
https://www.php.net/manual/en/filter.filters.sanitize.php
FILTER_SANITIZE_SPECIAL_CHARS
HTML-escape '"<>& and characters with ASCII value less than 32,
optionally strip or encode other special characters.
and:
'\n' = 10 = 0x0A = line feed
So I'm not sure I understand why you would assume that converting certain input to character entities would necessarily be a bad thing.
Preventing users from abusing the system with unwanted input is what sanitization is for.
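For a JavaScript-side analog of that PHP filter (a hypothetical helper, not a drop-in replacement), the same idea looks like this:

// HTML-escape '"<>& and every control character below ASCII 32,
// mirroring what FILTER_SANITIZE_SPECIAL_CHARS does.
function sanitizeSpecialChars(s) {
    return s.replace(/['"<>&\x00-\x1f]/g,
        c => '&#' + c.charCodeAt(0) + ';');
}

sanitizeSpecialChars('line\nbreak'); // -> 'line&#10;break'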
I've been having some trouble parsing JSON that is received over a WebSocket (original question - Parse JSON received with WebSocket results in error). The JSON string itself is valid (tested with several JSON validators), but JSON.parse throws an exception. I am trying to figure out what exactly it cannot parse, but the only thing I'm getting is "SyntaxError: Unexpected token ILLEGAL"; it doesn't say where the exact position of the failed token is. Is there any way of extracting such information?
Update: If I copy-paste that JSON string into a static file (e.g. "data.json") and then retrieve it and parse it with the same function (JSON.parse), it works fine.
So I'm assuming there's something tricky going on. I thought of the newline symbol (maybe there was \n instead of \r\n or vice versa), but completely removing all the line breaks didn't help. I would think that it may very well be an encoding problem, but the data is received via WebSocket, and according to the documentation it's a UTF-8 string.
2nd Update: It works just fine if I use "json_parse" from here: https://github.com/douglascrockford/JSON-js/blob/master/json_parse.js
Does that mean this is a bug in the "JSON.parse" implementation used by Chrome, or what?
Thank you.
You could copy an implementation of JSON.parse() from somewhere (like out of jQuery), change its name so you can call it directly, tweak the implementation so that it never detects the built-in parser and always uses the JS version, and then change your code to use that parser. Then trace through it in a JavaScript debugger until you find what it doesn't like.
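Since you already have Crockford's json_parse working, a shortcut: if I remember its error shape correctly, it throws an object carrying the failing offset, which you can use to print the context around the bad character (rawString here is a placeholder for your WebSocket data):

try {
    json_parse(rawString);
} catch (e) {
    // e.at should be the character offset where parsing failed.
    console.log(e.message, 'at offset', e.at);
    console.log(JSON.stringify(rawString.slice(Math.max(0, e.at - 20), e.at + 20)));
}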
One thing to check is if you have quotes and slashes within your JSON string. If yes, they need to be escaped:
{
"key": "i've \"quotes\" in me",
"key2": "and \\slashes too"
}
Also, JSONLint gives you the exact location of the error.
As per JSON.org, you cannot have unescaped quotes or backslashes in your strings, so you need to escape them.
I think you don't need to call JSON.parse:
JSON.parse({"key": "whatever"}) // Syntax Error ILLEGAL
because it's already an object. I would also be curious to see the result of the following code:
eval("(" + json + ")");
Or
JSON.parse(decodeURIComponent(json));
Can't tell much with these details, but one possibility is that your validator is doing non-strict parsing while your JavaScript engine is doing strict parsing...
I have an AJAX call that currently returns raw HTML that I inject into the page.
Now my issue is that in some situations I need to return a count value along with the raw HTML, and if that count is > 10, I need to fire another jQuery operation to make something visible.
The best approach here would be to return json then right? So I can do:
jsonReturned.counter
jsonReturned.html
Do you agree?
Also, out of curiosity more than anything: is JSON any more expensive performance-wise? It is just a simple object with properties, but I'm just asking.
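For reference, I imagine consuming it roughly like this (the endpoint and selectors are placeholders):

$.getJSON('/my-endpoint', function (jsonReturned) {
    $('#target').html(jsonReturned.html);
    if (jsonReturned.counter > 10) {
        $('#extra').show(); // the extra jQuery operation
    }
});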
This question invites some opinion, but in mine, there is no efficiency concern with returning JSON instead of raw HTML. As you stated, you can easily return multiple values without the need for extra parsing (I'll often have a status, a message, and data, for example).
I haven't run any numbers, but I can tell you I've used JSON via AJAX in very heavy traffic (millions of requests) situations with no efficiency concerns.
I agree, JSON is the way to go. There is no doubt that it is a performance hit; the question is whether it is a negligible one. My opinion is that it is. JavaScript is pretty fast in the browser these days. You should be OK.
JSON will likely be more compact than HTML, since bracket/quote pairs are always going to be terser than the smallest possible tag combos: {}/[]/"" vs. <a></a> is 2 chars vs. 7, which is a win in my book. However, if your data requires huge amounts of escaping with \, then JSON would be a net loss and could double the size of any given string.
Sources of additional overhead include
Download size and load time of a JSON parser for older browsers
JSON parse time.
Larger download size for the HTML content since the JSON string has to contain the HTML string plus quotes at least.
The only way to know whether these are significant to your app is to measure them. None of them is obviously large, and none of them is more than O(n).
I'm currently trying out websockets, creating a client in JavaScript and a server in Python.
I'm stuck on a simple problem, though: when I send something from the client to the server it always contains a special ending character, but I don't know how to remove it.
I've tried data[:-1] thinking that would get rid of it, but it didn't.
With the character my JSON code won't validate.
This is what I send through JavaScript:
ws.send('{"test":"test"}');
This is what I get in python:
{"test":"test"}�
I thought the ending character was \xff
The expression data[:-1] produces a copy of data missing the last character. It doesn't modify the data variable. To do that, you have to assign back to data, like so:
data = data[:-1]
My suspicion is that the "special ending character" is a bug somewhere, either in your code or in how you're using the APIs. Network code does not generally introduce random characters into the data stream. Good luck!