Handling unicode in the http response xml

I'm writing a Google Chrome extension that builds upon myanimelist.net REST api. Sometimes the XMLHttpRequest response text contains unicode.
For example:
<title>Onegai My Melody Sukkiri&acirc;�&ordf;</title>
If I create a HTML node from the text it looks like this:
Onegai My Melody Sukkiriâ�ª
The actual title, however, is this:
Onegai My Melody Sukkiri♪
Why is my text not correctly rendered and how can I fix it?
Update
Code: background.html
I think these are the crucial parts:
function htmlDecode(input){
    var e = document.createElement('div');
    e.innerHTML = input;
    return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}
function xmlDecode(input){
    var result = input;
    result = result.replace(/</g, "&lt;");
    result = result.replace(/>/g, "&gt;");
    result = result.replace(/\n/g, "&#10;");
    return htmlDecode(result);
}
Further:
var parser = new DOMParser();
var xmlText = response.value;
var doc = parser.parseFromString(xmlDecode(xmlText), "text/xml");

<title>Onegai My Melody Sukkiri&acirc;�&ordf;</title>
Oh dear! Not only is that the wrong text, it's not even well-formed XML. acirc and ordf are HTML entities which are not predefined in XML, and then there's an invalid UTF-8 sequence (one high byte, presumably originally 0x99) between them.
The problem is that myanimelist are generating their output ‘XML’ (but “if it ain't well-formed, it ain't XML”) using the PHP function htmlentities(). This tries to HTML-escape not only the potentially-sensitive-in-HTML characters <&"', but also all non-ASCII characters.
This generates the wrong characters because PHP defaults to treating the input to htmlentities() as ISO-8859-1 instead of UTF-8 which is the encoding they're actually using. But it was the wrong thing to begin with because the HTML entity set doesn't exist in XML. What they really wanted to use was htmlspecialchars(), which leaves the non-ASCII characters alone, only escaping the really sensitive ones. Because those are the same ones that are sensitive in XML, htmlspecialchars() works just as well for XML as HTML.
htmlentities() is almost always the Wrong Thing; htmlspecialchars() should typically be used instead. The one place you might want to encode non-ASCII characters to entity references would be when you're targeting pure ASCII output. But even then htmlentities() fails because it doesn't make character references (&#...;) for the characters that don't have a predefined entity name. Pretty useless.
Anyway, you can't really recover the mangled data from this. The � represents a byte sequence that was UTF-8-undecodable to the XMLHttpRequest, so that information is irretrievably lost. You will have to persuade myanimelist to fix their broken XML output as per the above couple of paragraphs before you can go any further.
Also they should be returning it as Content-Type: text/xml not text/html as at the moment. Then you could pick up the responseXML directly from the XMLHttpRequest object instead of messing about with DOMParsers.
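To see where the â and ª come from, here is a minimal sketch (not from the original answer; the variable names are illustrative). It assumes the title ended in U+266A, encodes that character as UTF-8, and then misreads each byte as one ISO-8859-1 character, which is the misinterpretation described above.

```javascript
// Assumption: the original title ended in U+266A (EIGHTH NOTE).
const note = "\u266a";

// UTF-8 encodes U+266A as three bytes.
const utf8Bytes = new TextEncoder().encode(note);
const hex = Array.from(utf8Bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");

// Misread each byte as one ISO-8859-1 character.
const misread = Array.from(utf8Bytes, (b) => String.fromCharCode(b)).join("");

// The misreading can be undone only while all three bytes survive.
const recovered = new TextDecoder("utf-8").decode(
  Uint8Array.from(misread, (ch) => ch.charCodeAt(0))
);

console.log(hex);                               // e2 99 aa
console.log(misread === "\u00e2\u0099\u00aa");  // true
console.log(recovered === note);                // true
```

U+00E2 is â (acirc) and U+00AA is ª (ordf). The middle byte 0x99 has no named entity and is not valid UTF-8 on its own, so once the receiver replaces it with �, the last step is no longer possible.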

So, I've come across something similar to what's going on here at work, and I did a bit more research to confirm my hypothesis.
If you take a look at the returned value you posted above, you'll notice the telltale entity "â". 99% of the time when you see this entity, it means you have a character encoding issue (typically UTF-8 bytes are being interpreted as ISO-8859-1).
The first thing I would test for is to force a character encoding in the API return. (It's a long shot, but you could look)
Second, I'd try to force a character encoding onto the data returned (I know there's a .htaccess override, but I don't know what's allowed in Chrome extensions so you'll have to research that).
What I believe is going on is that when you create the node with the data, you don't have a character encoding set on the document, and browsers (typically, in my experience) default to ISO-8859-1. So, check to make sure it's not your document that's the problem.
Finally, if you can't find the source of the character encoding problem (or can't prevent it), you'll have to write a conversion table to replace the malformed values you're getting with the ones you want { JS' "replace" should be fine (http://www.w3schools.com/jsref/jsref_replace.asp) }.
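A conversion table along these lines could be sketched as follows. The single table entry and the function name are illustrative assumptions, and the approach only helps while the mis-decoded characters are still distinguishable; a sequence that has already collapsed to � cannot be mapped back reliably.

```javascript
// Hypothetical table: mis-decoded sequence -> intended character.
const conversionTable = new Map([
  ["\u00e2\u0099\u00aa", "\u266a"], // UTF-8 bytes of U+266A read as ISO-8859-1
]);

function applyConversionTable(text, table) {
  let result = text;
  for (const [garbled, intended] of table) {
    // split/join replaces every occurrence without any regex escaping.
    result = result.split(garbled).join(intended);
  }
  return result;
}

const fixed = applyConversionTable(
  "Onegai My Melody Sukkiri\u00e2\u0099\u00aa",
  conversionTable
);
console.log(fixed === "Onegai My Melody Sukkiri\u266a"); // true
```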

You can't fix an encoding issue with a simple search and replace, because the damaged values are multi-byte Unicode sequences, not single characters typed on a keyboard.
Your data must be stored on the server in UTF-8 format if you are planning on retrieving it via AJAX. This problem is probably due to someone pasting in characters from MS-Word, which uses a completely different encoding scheme (ISO-8859).
If you can't fix the data at the source, there is not much you can do on the client.
For more details, see: UTF-8 vs. Unicode

Related

How to fix an invalid random string to make it JSON valid

In Javascript, I need to "fix" a string, supposed to be JSON valid but may not be. The string has the following format (the unknown part is marked with "<INVALID_CHARS>"):
[
{ "key_1": "ok_data", "key_2": "something_valid <INVALID_CHARS>"},
{ "key_1": "ok_data", "key_2": "some_valid_value"}
]
"INVALID_CHARS" are chars which make the JSON.parse() function fail.
The errors are always localized on the "key_2" property of this array elements.
Note that these chars come from random binary data, and can thus be anything.
I would like to find the simplest solution, or at least one which is the least prone to errors.
I thought of replacing invalid characters, but there is also a problem with single backslash chars followed by a non special char, throwing an error too, or quote chars.
And I probably did not think of all the possible errors.
Thank you.

JSON is not allowed to contain arbitrary binary data; it must be a sequence of valid Unicode codepoints. (Usually these are transmitted in UTF-8 encoding, but regardless, arbitrary binary data is not possible.) So if you want to include arbitrary binary data you'll need to figure out how to unambiguously encode it for transmission. If you don't encode it in some way, then you won't be able to reliably distinguish a byte which happens to have the same code as " from the " which terminates the string.
There are a number of possible encodings you might use for which standard libraries exist in most languages. One of the most commonly used is base-64.
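As a rough illustration of the base-64 approach (the helper names and sample bytes are made up for this sketch), the binary part is encoded before it goes into the JSON string and decoded again after parsing. It assumes an environment that provides btoa and atob, such as a browser or a recent Node.js.

```javascript
// Encode raw bytes as base-64 text that is always safe inside a JSON string.
function bytesToBase64(bytes) {
  let binary = "";
  for (const b of bytes) {
    binary += String.fromCharCode(b);
  }
  return btoa(binary);
}

// Decode base-64 text back into the original bytes.
function base64ToBytes(b64) {
  return Uint8Array.from(atob(b64), (ch) => ch.charCodeAt(0));
}

// Sample "random" bytes, including a quote (0x22) and a backslash (0x5c).
const raw = Uint8Array.from([0x22, 0x5c, 0x00, 0xff, 0x0a]);

const json = JSON.stringify([{ key_1: "ok_data", key_2: bytesToBase64(raw) }]);
const roundTripped = base64ToBytes(JSON.parse(json)[0].key_2);

console.log(json); // [{"key_1":"ok_data","key_2":"IlwA/wo="}]
console.log(roundTripped.every((b, i) => b === raw[i])); // true
```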

It would help to clarify the problem, as you seem to describe a wide range of issues here. If the problem is parsing the structure above, you just need to check its syntactic integrity. For example, this structure parses fine (a template literal is used so that the string can span several lines):
let var1 = JSON.parse(`[
    {
        "key_1": "ok_data",
        "key_2": "something_valid <INVALID_CHARS>"
    },
    {
        "key_1": "ok_data",
        "key_2": "some_valid_value"
    }
]`);
If you need to carry <INVALID_CHARS> as binary data inside JSON, encoding it as base64 is the most reliable way. I suspect the problem is also architectural, though: you need to build the value of key_2 from a valid part and an invalid part. I would suggest splitting key_2 into two substrings separated by " ", as in "key_2": "something_valid <INVALID_CHARS>", where the second part can be omitted.
Alternatively, use separate fields, one for the valid string and one for the invalid data, like this: "key_2_1": "something_valid", "key_2_2": <INVALID_CHARS>
Another option, if it's possible, is to transfer the binary data as multipart form data.

How to encode and decode all special characters in javascript or jquery?

I want to be able to encode and decode all the following characters using javascript or jquery...
~!##$%^&*()_+|}{:"?><,./';[]\=-`
I tried to encode them using this...
var cT = encodeURI(oM); // oM holds the special characters
cT = cT.replace(/[!"#$%&'()*+,.\/:;<=>?#[\\\]^`{|}~]/g, "\\\\$&");
Which does encode them, or escape them rather, but then I am trying to do the reverse with this...
decodeURIComponent(data.convo.replace(/\+/g, ' '));
But, it's not coming out in any way desired.
I've built a chat plugin for jquery, but the script crashes if someone enters a special character. I want the special characters to get encoded, then when they get pulled out of the data base, they should be decoded. I tried using urldecode in PHP before the data is returned to the ajax request but it's coming out horribly wrong.
I would think that there exists some function to encode and decode all special characters.
Oh, one caveat for this is that I'm wrapping each message with html elements, so I think the decoding needs to be done server side, before the message is wrapped, or be able to know when to ignore valid html tags and decode the other characters that are just what the user wanted to type.
Am I encoding/escaping them wrong to begin with?
Is that why the results are horrible?

This is pretty simple in javascript
//Note that i have escaped the " in the string - this means it still gets processed
var exampleInput = "Hello there h4x0r ~!##$%^&*()_+|}{:\"?><,./';[]\=-`";
var encodedInput = encodeURI(exampleInput);
var decodedInput = decodeURI(encodedInput);
console.log(exampleInput);
console.log(encodedInput);
console.log(decodedInput);
Just encode and decode the input. If something is still breaking in your script, it means a later step is processing characters that you have not stripped or escaped. It's hard to give a more precise answer because, as you can see, encoding and decoding with the URI functions does not crash anything; only improper processing of the content afterwards would cause issues.
When you output the content in HTML you should be encoding the HTML entities.
See the thread "Encode html entities in javascript" if you need to encode the content for safe display inside HTML.
An additional reference on how html entities work can be found here: W3 Schools - HTML Entities and W3 Schools - HTML Symbols
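The two layers can be sketched separately. This is an illustrative example rather than code from the thread: encodeURIComponent and decodeURIComponent handle transport, and a small hypothetical escapeHtml helper handles display inside the HTML wrapper.

```javascript
// Characters that must be escaped before text is placed into HTML.
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Hypothetical helper: escape text for display inside an HTML element.
function escapeHtml(text) {
  return text.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

const message = "~!##$%^&*()_+|}{:\"?><,./';[]\\=-`";

// Transport layer: encode for the request, decode on arrival.
const sent = encodeURIComponent(message);
const received = decodeURIComponent(sent);

// Display layer: escape only when wrapping the message in markup.
const html = "<p>" + escapeHtml(received) + "</p>";

console.log(received === message); // true
console.log(html);
```

Storing the decoded message and escaping it only at output time keeps the two concerns from interfering with each other.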

How to insert arbitrary JSON in HTML's script tag

I would like to store a JSON's contents in a HTML document's source, inside a script tag.
The content of that JSON does depend on user submitted input, thus great care is needed to sanitise that string for XSS.
I've read two concept here on SO.
1. Replace all occurrences of the </script tag into <\/script, or replace all </ into <\/ server side.
Code wise it looks like the following (using Python and jinja2 for the example):
// view
data = {
'test': 'asdas</script><b>as\'da</b><b>as"da</b>',
}
context_dict = {
'data_json': json.dumps(data, ensure_ascii=False).replace('</script', r'<\/script'),
}
// template
<script>
var data_json = {{ data_json | safe }};
</script>
// js
access it simply as window.data_json object
2. Encode the data as a HTML entity encoded JSON string, and unescape + parse it in client side. Unescape is from this answer: https://stackoverflow.com/a/34064434/518169
// view
context_dict = {
'data_json': json.dumps(data, ensure_ascii=False),
}
// template
<script>
var data_json = '{{ data_json }}'; // encoded into HTML entities, like &lt; &gt; &amp;
</script>
// js
function htmlDecode(input) {
var doc = new DOMParser().parseFromString(input, "text/html");
return doc.documentElement.textContent;
}
var decoded = htmlDecode(window.data_json);
var data_json = JSON.parse(decoded);
This method doesn't work because \" in a script source becomes " in a JS variable. It also creates a much bigger HTML document and is not really human readable, so I'd go with the first one if it doesn't mean a huge security risk.
Is there any security risk in using the first version? Is it enough to sanitise a JSON encoded string with .replace('</script', r'<\/script')?
Reference on SO:
Best way to store JSON in an HTML attribute?
Why split the <script> tag when writing it with document.write()?
Script tag in JavaScript string
Sanitize <script> element contents
Escape </ in script tag contents
Some great external resources about this issue:
Flask's tojson filter's implementation source
Rail's json_escape method's help and source
A 5 year long discussion in Django ticket and proposed code

Here's how I dealt with the relatively minor part of this issue, the encoding problem with storing JSON in a script element. The short answer is you have to escape either < or / as together they terminate the script element -- even inside a JSON string literal. You can't HTML-encode entities for a script element. You could JavaScript-backslash-escape the slash. I preferred to JavaScript-hex-escape the less-than angle-bracket as \u003C.
.replace('<', r'\u003C')
I ran into this problem trying to pass the json from oembed results. Some of them contain script close tags (without mentioning Twitter by name).
json_for_script = json.dumps(data).replace('<', r'\u003C');
This turns data = {'test': 'foo </script> bar'}; into
'{"test": "foo \\u003C/script> bar"}'
which is valid JSON that won't terminate a script element.
I got the idea from this little gem inside the Jinja template engine. It's what's run when you use the {{data|tojson}} filter.
def htmlsafe_json_dumps(obj, dumper=None, **kwargs):
    """Works exactly like :func:`dumps` but is safe for use in ``<script>``
    tags. It accepts the same arguments and returns a JSON string. Note that
    this is available in templates through the ``|tojson`` filter which will
    also mark the result as safe. Due to how this function escapes certain
    characters this is safe even if used outside of ``<script>`` tags.

    The following characters are escaped in strings:

    - ``<``
    - ``>``
    - ``&``
    - ``'``

    This makes it safe to embed such strings in any place in HTML with the
    notable exception of double quoted attributes. In that case single
    quote your attributes or HTML escape it in addition.
    """
    if dumper is None:
        dumper = json.dumps
    rv = dumper(obj, **kwargs) \
        .replace(u'<', u'\\u003c') \
        .replace(u'>', u'\\u003e') \
        .replace(u'&', u'\\u0026') \
        .replace(u"'", u'\\u0027')
    return Markup(rv)
(You could use \x3C instead of \u003C and that would work in a script element because it's valid JavaScript. But might as well stick to valid JSON.)
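For a JavaScript server or build step, the same idea might look like the sketch below. The function name is hypothetical; it mirrors the replacements in the Jinja helper quoted above rather than any library API.

```javascript
// Serialize a value as JSON that cannot terminate a <script> element.
function jsonForScript(value) {
  return JSON.stringify(value)
    .replace(/</g, "\\u003c")
    .replace(/>/g, "\\u003e")
    .replace(/&/g, "\\u0026")
    .replace(/'/g, "\\u0027");
}

const data = { test: "foo </script> bar" };
const embedded = jsonForScript(data);

console.log(embedded);                                // {"test":"foo \u003c/script\u003e bar"}
console.log(JSON.parse(embedded).test === data.test); // true
```

Because the replacements only produce \uXXXX escapes, the output is still valid JSON and parses back to the original value.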

First of all, your paranoia is well founded.
an HTML-parser could be tricked by a closing script tag (better assume by any closing tag)
a JS-parser could be tricked by backslashes and quotes (with a really bad encoder)
Yes, it would be much "safer" to encode all characters that could confuse the different parsers involved. Keeping it human-readable might be contradicting your security paradigm.
Note: The result of JSON string encoding should be canonical and, of course, not broken, as in parsable. JSON is a subset of JS and can thus be parsed as JS without any risk. So all you have to do is make sure the HTML parser instance that extracts the JS code is not tricked by your user data.
So the real pitfall is the nesting of both parsers. Actually, I would urge you to put something like that into a separate request. That way you would avoid that scenario completely.
Given all the parsing styles and error corrections such a parser might apply, other tags (opening or closing) could achieve a similar feat, that is, suggest to the parser that the script tag has ended implicitly.
So it is advisable to encode the slash and both tag brackets (/, <, >), not just the closing script tag, using whatever reversible method you choose, as long as it does not confuse the HTML parser:
Best choice would be base64 (but you want more readable)
HTMLentities will do, although confusing humans :)
Doing your own escaping will work as well, just escape the individual characters rather than the </script fragment
In conclusion, yes, it's probably best with a few changes, but please note that you will be one step away from "safe" already, by trying something like this in the first place, instead of loading the JSON via XHR or at least using a rigorous string encoding like base64.
P.S.: If you can learn from other people's code encoding the strings that's nice, but you should not resort to "libraries" or other people's functions if they don't do exactly what you need.
So rather write and thoroughly test your own (de/en)coder and know that this pitfall has been sealed.

How to handle possibly HTML encoded values in javascript

I have a situation where I'm not sure if the input I get is HTML encoded or not. How do I handle this? I also have jQuery available.
function someFunction(userInput){
    $someJqueryElement.text(userInput);
}
// userInput "<script>" returns "&lt;script&gt;", which is fine
// userInput "&lt;script&gt;" returns "&amp;lt;script&amp;gt;", which is bad
I could avoid escaping ampersands (&), but what are the risks in that? Any help is very much appreciated!
Important note: This user input is not in my control. It returns from a external service, and it is possible for someone to tamper with it and avoid the html escaping provided by that service itself.

You really need to make sure you avoid these situations as it introduces really difficult conditions to predict.
Try adding an additional variable input to the function.
function someFunction(userInput, isEncoded){
    //Add some conditional logic based on isEncoded
    $someJqueryElement.text(userInput);
}
If you look at products like fckEditor, you can choose to edit source or use the rich text editor. This prevents the need for automatic encoding detection.
If you still insist on automatically detecting HTML-encoded characters, I would recommend using indexOf to check whether certain key sequences exist.
str.indexOf('&lt;') !== -1
The example above will detect the encoded < character (&lt;).
~~~New text added after edit below this line.~~~
Finally, I would suggest looking at this answer. They suggest using the decode function and detecting lengths.
var string = "Your encoded & decoded string here"
function decode(str){
    return decodeURIComponent(str).replace(/&lt;/g,'<').replace(/&gt;/g,'>');
}
if(string.length == decode(string).length){
    // The string does not contain any encoded html.
}else{
    // The string contains encoded html.
}
Again, this still has the problem of a user faking out the process by entering those specially encoded characters, but that is what html encoding is. So it would be proper to assume html encoding as soon as one of these character sequences comes up.

You must always correctly encode untrusted input before concatenating it into a structured language like HTML.
Otherwise, you'll enable injection attacks like XSS.
If the input is supposed to contain HTML formatting, you should use a sanitizer library to strip all potentially unsafe tags & attributes.
You can also use the regex /<|>|&(?![a-z]+;)/ to check whether a string has any non-encoded characters; however, you cannot distinguish a string that has been encoded from an unencoded string that talks about encoding.
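A minimal sketch of that check, with a hypothetical function name, shows both what the regex catches and where it cannot help:

```javascript
// Matches a raw <, a raw >, or an & that does not start a named entity.
const UNENCODED = /<|>|&(?![a-z]+;)/;

function hasUnencodedChars(text) {
  return UNENCODED.test(text);
}

console.log(hasUnencodedChars("<script>"));        // true
console.log(hasUnencodedChars("&lt;script&gt;"));  // false
console.log(hasUnencodedChars("Bob & Alice"));     // true
console.log(hasUnencodedChars("Bob &amp; Alice")); // false
console.log(hasUnencodedChars("&#39;"));           // true: numeric references are not recognized
```

The fourth call returns false whether the string is encoded HTML or plain text that happens to mention an entity, which is the ambiguity described above.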

How to avoid double HTML escaping text?

In my application, there are times when some text may or may not be html escaped (depending on where the data came from). I want to ensure the non-escaped text gets escaped, but the already escaped text doesn't get escaped again.
How do people typically solve this?

You can't tell from the data.
For example:
Bob &amp; Alice
… could be "The HTML representation of Bob & Alice" or it could also be "The plain text representation of Bob &amp; Alice" (e.g. from an HTML tutorial).
Since you say:
depending on where the data came from
… keep track of where it comes from, and make sure you know if a source provides trusted HTML or plain text.
If you don't know, then how you handle it will depend on the context. The safe option would be to assume it is always plain text and thus always encode it. That will protect you from scripting injection attacks.
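One way to keep track of where a string came from is to tag it at the boundary and escape only untagged plain text. The wrapper names below are hypothetical and the sketch is only one possible design, not a standard API.

```javascript
// Escape the characters that are significant in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical wrappers that record what kind of string a source produced.
const plainText = (value) => ({ kind: "text", value });
const escapedHtml = (value) => ({ kind: "html", value });

// Escape exactly once: only values still marked as plain text.
function toHtml(item) {
  return item.kind === "html" ? item.value : escapeHtml(item.value);
}

console.log(toHtml(plainText("Bob & Alice")));       // Bob &amp; Alice
console.log(toHtml(escapedHtml("Bob &amp; Alice"))); // Bob &amp; Alice
console.log(toHtml(plainText("Bob &amp; Alice")));   // Bob &amp;amp; Alice
```

The last line is the intended result for plain text that literally contains "&amp;", for example a sentence from an HTML tutorial.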

One way is to unescape the string and compare it to the original. If it is the same, the original is unescaped data, otherwise it is escaped data.
var str = '<data>';
// Escape unescaped data
if (unescape(str) === str) {
    str = escape(str);
}

Unescape text before escaping it.
