I'm using PHP's json_encode() to convert an array to JSON, which is then echoed and read by a JavaScript AJAX request.
The problem is that the echoed text contains Unicode escape sequences, and JavaScript's JSON.parse() doesn't turn them into the string I expect.
An example array value is "2\u00000\u00001\u00000\u0000-\u00001\u00000\u0000-\u00000\u00001", which should be "2010-10-01".
JSON.parse() only gives me "2".
Can anyone help me with this issue?
Example:
var resArray = JSON.parse(this.responseText);
for (var x = 0; x < resArray.length; x++) {
    var twt = resArray[x];
    alert(twt.date);
    break;
}
You have NUL characters (character code zero) in the string. It's actually "2_0_1_0_-_1_0_-_0_1", where _ represents the NUL characters.
The Unicode character escape is part of the JSON standard, so the parser handles it correctly. However, the result is still a string with NUL characters in it, so when you try to use the string in JavaScript, the behaviour depends on what the browser does with the NUL characters.
You can try this in some different browsers:
alert('as\u0000df');
Internet Explorer will display only as.
Firefox will display asdf; the NUL character is simply not shown.
The best solution would be to remove the NUL characters before you convert the data to JSON.
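If the server side cannot be fixed right away, the same idea can be sketched on the JavaScript side. This is only an illustration, assuming the value is plain ASCII with stray NUL characters; stripNuls is a hypothetical helper name, not a library function.

```javascript
// JSON text as it arrives from the server (String.raw keeps the backslashes literal).
const responseText = String.raw`"2\u00000\u00001\u00000\u0000-\u00001\u00000\u0000-\u00000\u00001"`;

// JSON.parse handles the \u0000 escapes; the resulting string still contains the NULs.
const parsed = JSON.parse(responseText);

// Hypothetical helper: drop every NUL character from a string.
function stripNuls(s) {
  return s.replace(/\u0000/g, "");
}

console.log(parsed.length);      // 19 (10 visible characters + 9 NULs)
console.log(stripNuls(parsed));  // 2010-10-01
```

Stripping NULs like this is only safe for ASCII data; it does not address why the NULs are there in the first place.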
To add to what Guffa said:
When you have alternating zero bytes, what has almost certainly happened is that you've read a UTF-16 data source without converting it to an ASCII-compatible encoding such as UTF-8. Whilst you can throw away the nulls, this will mangle the string if it contains any characters outside of ASCII range. (Not an issue for date strings of course, but it may affect any other strings you're reading from the same source.)
Check where your PHP code is reading the 2010-10-01 string from, and either convert it on the fly using eg iconv('utf-16le', 'utf-8', $string), or change the source to use a more reasonable encoding. If it's a text file, for example, save it in a text editor using ‘UTF-8 without BOM’, and not ‘Unicode’, which is a highly misleading name Windows text editors use to mean UTF-16LE.
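The answer's fix belongs in PHP (iconv), but the effect of decoding UTF-16LE properly can be illustrated in JavaScript with the standard TextDecoder API. This is a sketch with made-up sample bytes; decodeUtf16le is a hypothetical helper name.

```javascript
// Bytes of "2010" stored as UTF-16LE: each ASCII character is followed by a zero byte.
const utf16Bytes = new Uint8Array([0x32, 0x00, 0x30, 0x00, 0x31, 0x00, 0x30, 0x00]);

// Hypothetical helper: decode the bytes as UTF-16LE instead of one character per byte.
function decodeUtf16le(bytes) {
  return new TextDecoder("utf-16le").decode(bytes);
}

console.log(decodeUtf16le(utf16Bytes)); // 2010

// A non-ASCII character has no zero byte to throw away: U+20AC is stored as AC 20.
console.log(decodeUtf16le(new Uint8Array([0xac, 0x20]))); // €
```

The second call shows why simply discarding zero bytes is not a general fix: for characters outside ASCII, both bytes carry information.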
From the textarea I take the shortcode %91samurai id="19"%93, which should be [samurai id="19"]:
var not_decoded_content = jQuery('[data-module_type="et_pb_text_forms_00132547"]')
.find('#et_pb_et_pb_text_form_content').html();
But when I try to decode the %91 and %93
self.content = decodeURI(not_decoded_content);
I get the error:
Uncaught URIError: URI malformed
How can I solve this problem?
The encodings are invalid. If you can't fix whatever system produces them so that it correctly emits %5B and %5D, your only option is to do the replacement yourself: replace every %91 with '[' (character code 91 in decimal) and every %93 with ']' (character code 93 in decimal).
Note that JavaScript's String replace() with a plain string pattern only replaces the first occurrence. If you need to replace all occurrences, use a loop (while the string still contains the pattern, replace it) or a regular expression with the global flag.
A final note: I am used to using decodeURIComponent(...). If you can get the producing system to emit %5B and %5D correctly and you still get that error, try decodeURIComponent(...) instead of decodeURI(...).
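A minimal sketch of the "replace all occurrences" approach, using regular expressions with the global flag. The helper name fixBracketEscapes is made up for this example.

```javascript
// Hypothetical helper: map the mis-encoded %91 / %93 sequences back to square brackets.
// The g flag makes replace() act on every occurrence, not just the first one.
function fixBracketEscapes(s) {
  return s.replace(/%91/g, "[").replace(/%93/g, "]");
}

console.log(fixBracketEscapes('%91samurai id="19"%93')); // [samurai id="19"]
```

This only undoes the two known bad sequences; any other percent-escapes in the string are left untouched.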
The string you're trying to decode is not a URI. Use decodeURIComponent() instead.
UPDATE
Hmm, that's not actually the issue, the issues are the %91 and %93.
encodeURI('[]')
gives %5B%5D, so it looks like whatever encoded this string used the decimal rather than the hexadecimal character codes.
Decimal 91 = hex 5b
Decimal 93 = hex 5d
Trying again with the hex values
decodeURI('%5bsamurai id="19"%5d') == '[samurai id="19"]'
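The decimal/hex mix-up and the resulting error can be checked with a short sketch. tryDecode is a hypothetical wrapper written for this illustration; it returns null instead of letting the URIError propagate.

```javascript
// Decimal character codes converted to hexadecimal.
console.log((91).toString(16)); // 5b
console.log((93).toString(16)); // 5d

// Hypothetical wrapper: decoded string, or null when the input is not
// valid percent-encoded UTF-8.
function tryDecode(s) {
  try {
    return decodeURIComponent(s);
  } catch (e) {
    if (e instanceof URIError) return null;
    throw e;
  }
}

console.log(tryDecode('%5bsamurai id="19"%5d')); // [samurai id="19"]
console.log(tryDecode('%91samurai id="19"%93')); // null
```

%91 fails because the byte 0x91 cannot start a UTF-8 sequence, so the decoder has no character to produce.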
I know this is not the solution you want to see, but can you try using "%E2%80%98" for %91 and "%E2%80%9C" for %93 ?
The %91 and %93 are part of control characters which html does not like to decode (for reasons beyond me). Simply put, they're not your ordinary ASCII characters for HTML to play around with.
I am going nuts trying to achieve this in JavaScript.
First I will describe the scenario, and then I'll put my code, Python version, which I can't seem to translate into JavaScript.
I have a web page running on a server. I have no access to it whatsoever, so the only way I have to achieve basic functionality is using JavaScript.
The web page is used to compare information. The information is stored in CSV format, which I use to create HTML tables on the fly using AJAX calls. To avoid having that information readily available to users, who could otherwise view the source and 'steal' it, I came across a range of solutions, like encoding in Base64 (I know this is considered 'security by obscurity' and is bad practice, but I have no other choice here).
Base64 is very easy to use in this case, but I lose all the non-ASCII characters from UTF-8 (like á é í ó ú ñ etc), which are part of my language (Spanish).
So here comes the preferred solution, which works like a charm in Python: using bitwise XOR. What could I achieve using this method:
If someone figures out the url of the CSV file, it wouldn't be so easy to read the text without basic programming knowledge to de-encode it.
I can easily program the source database to export the data, run the XOR function, upload those files to the server and then have them decoded on the fly too.
It is in that last step that I cannot achieve what I want.
Here is my Python script:
To encode:
b = bytearray(open('file.csv', 'rb').read())
for i in range(len(b)):
    b[i] ^= 0x71
open('file.out', 'wb').write(b)
To decode:
b = bytearray(open('file.out', 'rb').read())
for i in range(len(b)):
    b[i] ^= 0x71
I need to achieve that small decoding function in JS.
Thank you all in advance for your time.
Base64
It isn't true that base64 makes you lose non-ASCII characters like ñ or á. Why should it? Base64 can encode any binary data, and encoded text is nothing else than binary data.
So encoding involves two steps:
A text encoding (such as UTF-8) converts your text to bytes, and the base64 encoding turns that into an ASCII string.
Decoding works the same, but backwards (reverse order of the two corresponding decoding functions).
This is how text encoding for UTF-8 works in JavaScript:
function encode_utf8(s) {
    return unescape(encodeURIComponent(s));
}
function decode_utf8(s) {
    return decodeURIComponent(escape(s));
}
I got this from here. Please note that I'm not a JS crack at all, and there might be more convenient methods now that I couldn't find.
Let's try this:
const s = 'Se bañó todo el día.';
const b = encode_utf8(s);          // text encoding
const a = btoa(b);                 // base64 encoding
console.log(a);                    // prints U2UgYmHDscOzIHRvZG8gZWwgZMOtYS4=
const d = decode_utf8(atob(a));    // decode base64, then UTF-8
console.log(d);                    // prints Se bañó todo el día.
No character lost here.
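For comparison, here is a sketch of the same round trip using the standard TextEncoder and TextDecoder APIs instead of escape()/unescape(). The function names toBase64Utf8 and fromBase64Utf8 are made up for this example, and it assumes an environment where btoa and atob are available as globals.

```javascript
// Text -> UTF-8 bytes -> "binary string" (one char per byte) -> base64.
function toBase64Utf8(text) {
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return btoa(binary);
}

// base64 -> "binary string" -> UTF-8 bytes -> text.
function fromBase64Utf8(b64) {
  const binary = atob(b64);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder("utf-8").decode(bytes);
}

const encoded = toBase64Utf8("Se bañó todo el día.");
console.log(encoded);                 // U2UgYmHDscOzIHRvZG8gZWwgZMOtYS4=
console.log(fromBase64Utf8(encoded)); // Se bañó todo el día.
```

The two-step structure is the same as above: a text encoding step and a base64 step, reversed in the opposite order when decoding.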
XOR method
If you still want to do the XOR thing, you can decode as follows:
convert the UTF8-encoded string to an array of code points with Array.from()
XOR-decode with the ^ operator (or ^= assignment)
convert the result to a string with String.fromCodePoint()
decode the string with decode_utf8()
I'm not providing code for this, though.
Especially the third step might be a bit cumbersome, and I'm not sure if it's worth the pain.
After all, your users can just inspect the JS code to find out how the data are "encrypted", be it base64 or the XOR method.
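For completeness, a rough sketch of the XOR decoding. It deliberately uses a different route from the steps listed above: it works on a Uint8Array of raw bytes (for example from an XMLHttpRequest with responseType set to "arraybuffer") and decodes with TextDecoder rather than going through strings. The name xorDecode and the sample data are assumptions made for illustration; 0x71 is the key from the question's Python script.

```javascript
// Hypothetical helper: XOR every byte with the key, then decode the result as UTF-8.
function xorDecode(bytes, key) {
  const out = new Uint8Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) {
    out[i] = bytes[i] ^ key;
  }
  return new TextDecoder("utf-8").decode(out);
}

// Round trip on sample data: encode to UTF-8, scramble, then decode again.
const plainBytes = new TextEncoder().encode("día;año\n");
const scrambled = plainBytes.map((b) => b ^ 0x71);
console.log(xorDecode(scrambled, 0x71)); // prints the original text
```

Working on bytes avoids the cumbersome string conversions, but it does not change the point made above: anyone reading the script can see the key.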
Note
If you come from a Python background, be aware that there is no distinction like Python's str and bytes type. Both input and output of the {en,de}code_utf8() functions are always strings, same type. When you encode a string, you just get back another string where every codepoint is below 256, and it might be longer than the input string.
Encoding a string with German umlauts like ä,ü,ö,ß with Javascript encodeURI() causes a weird bug after decoding it in PHP with rawurldecode(). Although the string seems to be correctly decoded it isn't. See below example screenshots from my IDE
Also, strlen() of the string decoded with rawurldecode() reports more characters than it really has!
Problems occur when I need to process the decoded string, for example if I want to replace the German characters ä, ü, ö with ae, ue and oe. This can be seen in the example provided here.
I have also made a PHP fiddle where this whole weirdness can be seen.
What I've tried so far:
- utf8_decode
- iconv
- and also first two suggestions from here
This is a Unicode equivalence issue, and it looks like your IDE doesn't handle multibyte strings very well.
In unicode you can represent Ü with either:
the single unicode codepoint (U+00DC) or %C3%9C in utf8
or use a capital U (U+0055) with a modifier (U+0308) or %55%CC%88 in utf8
Your GWT string uses the latter method called NFD while your one from PHP uses the first method called NFC. That's why your GWT string is 3 characters longer even though they are both valid encodings of logically identical unicode strings. Your problem is that they are not identical byte for byte in PHP.
More details about utf-8 normalisation.
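The two representations can be compared directly in JavaScript, which exposes Unicode normalisation through String.prototype.normalize(). This is an illustrative sketch on the client side; replaceUmlauts is a hypothetical helper, and the fix in PHP would follow the same idea.

```javascript
const nfc = "\u00dc";   // Ü as a single code point (NFC)
const nfd = "U\u0308";  // U followed by a combining diaeresis (NFD)

console.log(nfc === nfd);                  // false
console.log(nfc.length, nfd.length);       // 1 2
console.log(nfd.normalize("NFC") === nfc); // true

// Hypothetical helper: normalise to NFC first so the replacements match
// regardless of which form the input arrived in.
function replaceUmlauts(s) {
  return s
    .normalize("NFC")
    .replace(/\u00e4/g, "ae")
    .replace(/\u00f6/g, "oe")
    .replace(/\u00fc/g, "ue");
}

console.log(replaceUmlauts("Mu\u0308ller")); // Mueller
```

Without the normalize() call, the NFD input would keep its combining mark and the replacement for ü would not match.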
If you want to do preg replacements on the strings you need to normalise them to the same form first. From your example I can see your IDE is using NFC since it's the PHP string that works. So I suggest normalising to NFC form in PHP (the default), then doing the preg_replace.
http://php.net/manual/en/normalizer.normalize.php
function cleanImageName($name)
{
$name = Normalizer::normalize( $name, Normalizer::FORM_C );
$clean = preg_replace(
Otherwise you have to do something like this which is based on this article.
Using getJSON to retrieve some data which I am utf8 encoding on the server-side end...
"title":"new movie \u0091The Tree of Life\u0092 on day 6"
The page that it is displayed on uses charset ISO-8859-1, and I am doing this...
$.getJSON('index.php', { q: q }, function(data){
    for (var i = 0; i < data.length; i++) {
        alert(data[i].title + "\n" + utf8_decode(data[i].title));
    }
});
The utf8_decode function comes from here.
The problem is that I am still seeing the magic squares for both versions...
new movie The Tree of Life on day 6
new movie ᔨe Tree of Life⠯n day 6
This leads me to believe that perhaps the character is of neither encoding. However it works if I paste the string onto a page and set the charset to either UTF8 or ISO-8859-1 :-/
Any help would be great!
There is no need to escape or decode any characters in data transmitted in JSON. It's done automatically. It is also independent of the page's encoding. You can easily transmit and display the euro sign (\u20ac) with your code even though ISO-8859-1 does not contain the euro sign.
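Your problem is the characters \u0091 and \u0092. They are not printable characters: U+0091 and U+0092 are C1 control characters (named "private use one" and "private use two"), so there is no glyph for the browser to display.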
It rather looks as if you in fact have data that originally used the Windows-1250 character set but was not properly translated to Unicode/JSON. In Windows-1250, these two characters are typographic single quotes.
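The proper fix is to convert the data from its Windows code page to Unicode on the server. As a stopgap, a client-side sketch could map the two stray control characters to the typographic quotes they were meant to be. The mapping table and the name fixC1Quotes are assumptions made for this illustration.

```javascript
// Assumed mapping: the two C1 control characters seen in the data,
// mapped to the left and right single quotation marks.
const C1_TO_QUOTES = { "\u0091": "\u2018", "\u0092": "\u2019" };

// Hypothetical helper: replace every stray U+0091 / U+0092 in a string.
function fixC1Quotes(s) {
  return s.replace(/[\u0091\u0092]/g, (ch) => C1_TO_QUOTES[ch]);
}

// JSON.parse already turns the \u escapes into characters; no extra decoding is needed.
const data = JSON.parse(
  String.raw`{"title":"new movie \u0091The Tree of Life\u0092 on day 6"}`
);

console.log(fixC1Quotes(data.title)); // new movie ‘The Tree of Life’ on day 6
```

Other characters from the same code page range would need their own entries in the table, which is why converting at the source is preferable.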
Did you try without utf8_decode?
If the characters in your string exist in ISO-8859-1, this will just work, as Javascript decodes the \u0091 in the encoding of the page.
I'm writing a Google Chrome extension that builds upon myanimelist.net REST api. Sometimes the XMLHttpRequest response text contains unicode.
For example:
<title>Onegai My Melody Sukkiri&acirc;�&ordf;</title>
If I create an HTML node from the text it looks like this:
Onegai My Melody Sukkiriâ�ª
The actual title, however, is this:
Onegai My Melody Sukkiri♪
Why is my text not correctly rendered and how can I fix it?
Update
Code: background.html
I think these are the crucial parts:
function htmlDecode(input){
    var e = document.createElement('div');
    e.innerHTML = input;
    return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}
function xmlDecode(input){
    var result = input;
    result = result.replace(/&lt;/g, "<");
    result = result.replace(/&gt;/g, ">");
    result = result.replace(/\n/g, "&#13;");
    return htmlDecode(result);
}
Further:
var parser = new DOMParser();
var xmlText = response.value;
var doc = parser.parseFromString(xmlDecode(xmlText), "text/xml");
<title>Onegai My Melody Sukkiri&acirc;�&ordf;</title>
Oh dear! Not only is that the wrong text, it's not even well-formed XML. acirc and ordf are HTML entities which are not predefined in XML, and then there's an invalid UTF-8 sequence (one high byte, presumably originally 0x99) between them.
The problem is that myanimelist are generating their output ‘XML’ (but “if it ain't well-formed, it ain't XML”) using the PHP function htmlentities(). This tries to HTML-escape not only the potentially-sensitive-in-HTML characters <&"', but also all non-ASCII characters.
This generates the wrong characters because PHP defaults to treating the input to htmlentities() as ISO-8859-1 instead of UTF-8 which is the encoding they're actually using. But it was the wrong thing to begin with because the HTML entity set doesn't exist in XML. What they really wanted to use was htmlspecialchars(), which leaves the non-ASCII characters alone, only escaping the really sensitive ones. Because those are the same ones that are sensitive in XML, htmlspecialchars() works just as well for XML as HTML.
htmlentities() is almost always the Wrong Thing; htmlspecialchars() should typically be used instead. The one place you might want to encode non-ASCII bytes to entity references would be when you're targeting pure ASCII output. But even then htmlentities() fails because it doesn't make character references (&#...;) for the characters that don't have a predefined entity name. Pretty useless.
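To illustrate the "escape only the sensitive characters" behaviour in the language of the question, here is a small JavaScript sketch. escapeXml is a hypothetical function written for this example; it is similar in spirit to htmlspecialchars() but is not a port of it.

```javascript
// Hypothetical helper: escape only the characters that are special in XML/HTML
// and leave all non-ASCII characters alone.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")   // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

console.log(escapeXml('Sukkiri♪ <b> & "x"'));
// Sukkiri♪ &lt;b&gt; &amp; &quot;x&quot;
```

The ♪ passes through unchanged, which is the point: as long as the document is served in a Unicode encoding, non-ASCII characters need no entity references.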
Anyway, you can't really recover the mangled data from this. The � represents a byte sequence that was UTF-8-undecodable to the XMLHttpRequest, so that information is irretrievably lost. You will have to persuade myanimelist to fix their broken XML output as per the above couple of paragraphs before you can go any further.
Also they should be returning it as Content-Type: text/xml not text/html as at the moment. Then you could pick up the responseXML directly from the XMLHttpRequest object instead of messing about with DOMParsers.
So, I've come across something similar to what's going on here at work, and I did a bit more research to confirm my hypothesis.
If you take a look at the returned value you posted above, you'll notice the telltale entity "&acirc;". 99% of the time when you see this entity, it means you have a character encoding issue (typically UTF-8 bytes being interpreted as ISO-8859-1).
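The effect can be reproduced with a short sketch that takes the UTF-8 bytes of ♪ and reads each byte as one ISO-8859-1 character. utf8AsLatin1 is a made-up helper name for this illustration.

```javascript
// Hypothetical helper: encode text as UTF-8, then misread every byte
// as a separate ISO-8859-1 character.
function utf8AsLatin1(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// ♪ is U+266A, which is the three bytes E2 99 AA in UTF-8.
const mangled = utf8AsLatin1("\u266a");

console.log(mangled.length);                            // 3
console.log(mangled[0], mangled[2]);                    // â ª
console.log(mangled.charCodeAt(1).toString(16));        // 99 (a control character)
```

The first and last bytes become â and ª, which have named HTML entities, while the middle byte 0x99 has no printable ISO-8859-1 character.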
The first thing I would test for is to force a character encoding in the API return. (It's a long shot, but you could look)
Second, I'd try to force a character encoding onto the data returned (I know there's a .htaccess override, but I don't know what's allowed in Chrome extensions so you'll have to research that).
What I believe is going on is that when you create the node with the data, you don't have a character encoding set on the document, and browsers (typically, in my experience) default to ISO-8859-1. So, check to make sure it's not your document that's the problem.
Finally, if you can't find the source of the character encoding problem (or can't prevent it), you'll have to write a conversion table to replace the malformed values you're getting with the ones you want { JS' "replace" should be fine (http://www.w3schools.com/jsref/jsref_replace.asp) }.
You can't just use a simple search and replace to fix encoding issue since they are unicode, not characters typed on a keyboard.
Your data must be stored on the server in UTF-8 format if you are planning on retrieving it via AJAX. This problem is probably due to someone pasting in characters from MS-Word which use a completely different encoding scheme (ISO-8859).
If you can't fix the data, you're kinda screwed.
For more details, see: UTF-8 vs. Unicode