Javascript encodeURI() vs. PHP rawurldecode() and special characters - javascript

Encoding a string with German umlauts like ä,ü,ö,ß with Javascript encodeURI() causes a weird bug after decoding it in PHP with rawurldecode(). Although the string seems to be correctly decoded it isn't. See below example screenshots from my IDE
Also the strlen() of the - with rawurldecode() - decoded string gives more characters than it really has!
Problems occur when I need to process the decoded string, for example if I want to replace the German characters ä,ü,ö with ae, ue and oe. This can be seen in the example provided here.
I have also made an PHP fiddle where this whole weirdness can be seen.
What I've tried so far:
- utf8_decode
- iconv
- and also first two suggestions from here

This is a Unicode equivalence issue and it looks like your IDE doesnt handle multibyte strings very well.
In unicode you can represent Ü with either:
the single unicode codepoint (U+00DC) or %C3%9C in utf8
or use a capital U (U+0055) with a modifier (U+0308) or %55%CC%88 in utf8
Your GWT string uses the latter method called NFD while your one from PHP uses the first method called NFC. That's why your GWT string is 3 characters longer even though they are both valid encodings of logically identical unicode strings. Your problem is that they are not identical byte for byte in PHP.
More details about utf-8 normalisation.
If you want to do preg replacements on the strings you need to normalise them to the same form first. From your example I can see your IDE is using NFC since it's the PHP string that works. So I suggest normalising to NFC form in PHP (the default), then doing the preg_replace.
http://php.net/manual/en/normalizer.normalize.php
function cleanImageName($name)
{
$name = Normalizer::normalize( $name, Normalizer::FORM_C );
$clean = preg_replace(
Otherwise you have to do something like this which is based on this article.

Related

UTF8 String not allowing charAt() or substring to pull out specific characters

In my code I'm trying to isolate out the first character of a variable, it is the UTF8 symbol: 🌈
The code to outputs are as follows:
Code:
console.log(login_name);
console.log(login_name.charAt(0));
console.log(login_name.substring(0,1));
Output:
🌈 ✨✨✨UTF8MB4
�
�
Obviously, I want .charAt() to print 🌈 and not �. Any known oddities with utf8mb4 that I'm missing? My main problem is I don't know how to word this specific problem.
Also if I swap the rainbow for/ target the ✨, it functions as it should and prints properly.
JavaScript can't handle Unicode properly. charAt() operates on code units instead of code points.
Luckily JavaScript has workarounds. To get the characters in a string instead of UTF-16/UCS-2 code units you need to call Array.from(yourstring), which will get you an array of characters. From there on you can get the first element in the usual way.
let characters = Array.from(login_name);
console.log(characters.shift());

JavaScript not decoding parameter

So from the textarea I take the shortcode %91samurai id="19"%93 it should be [samurai id="19"]:
var not_decoded_content = jQuery('[data-module_type="et_pb_text_forms_00132547"]')
.find('#et_pb_et_pb_text_form_content').html();
But when I try to decode the %91 and %93
self.content = decodeURI(not_decoded_content);
I get the error:
Uncaught URIError: URI malformed
How can i solve this problem?
The encodings are invalid. If you can't fix the whatever-system-produces-them to correctly produce %5B and %5D, then your only option is to do a replacement yourself: replace all %91 with character 91 which is '[', then replace all %93 with character 93 which is ']'.
Note that javascript String Replace as-is won't do "Replace all occurrences". If you need that, then create a loop (while it contains(...) do a replace), or search the internet for javascript replace all, you should find plenty results.
And a final note, I am used to using decodeURIComponent(...). If you can make the whatever-system-produces-them to correctly produce %5B and %5D, and you still get that error, then try using decodeURIComponent(...) instead of decodeURI(...).
The string you're trying to decode is not a URI. Use decodeURIComponent() instead.
UPDATE
Hmm, that's not actually the issue, the issues are the %91 and %93.
encodeURI('[]')
gives %5b%5d, it looks like whatever has encoded this string has used the decimal rather than hexadecimal value.
Decimal 91 = hex 5b
Decimal 93 = hex 5d
Trying again with the hex values
decodeURI('%5bsamurai id="19"%5d') == '[samurai id="19"]'
I know this is not the solution you want to see, but can you try using "%E2%80%98" for %91 and "%E2%80%9C" for %93 ?
The %91 and %93 are part of control characters which html does not like to decode (for reasons beyond me). Simply put, they're not your ordinary ASCII characters for HTML to play around with.

Reconstructing a XORed UTF-8 encoded text in javascript, client side, no Node

I am driving nuts trying to achieve this in JavaScript.
First I will describe the scenario, and then I'll put my code, Python version, which I can't seem to translate into JavaScript.
I have a web page running on a server. I have no access to it whatsoever, so the only way I have to achieve basic functionality is using JavaScript.
The web page is used to compare information. The information is stored in CSV format, which I use to create HTML tables on the fly by using AJAX calls. For the sake of not having that information quickly available to users, enabling them to print the source code and 'stealing it', I came across a range of solutions, like encoding in Base64 (I know this is considered 'security by obscurity' and it's a bad practice, but I have no other choice here).
Base64 it's very easy to use in this case, but I lose all the special characters from UTF-8 (like á é í ó ú ñ etc), which are part of my language (Spanish).
So here comes the preferred solution, which works like a charm in Python: using bitwise XOR. What could I achieve using this method:
If someone figures out the url of the CSV file, it wouldn't be so easy to read the text without basic programming knowledge to de-encode it.
I can easily program the source database to export the data and then run the XORing fuction, upload those files to the server and then having them de-encoded on the fly too.
Is in that last step where I can not achieve what I want.
Here is my Python script:
To encode:
b = bytearray(open('file.csv', 'rb').read())
for i in range(len(b)):
b[i] ^= 0x71
open('file.out', 'wb').write(b)
To decode:
b = bytearray(open('file.out', 'rb').read())
for i in range(len(b)):
b[i] ^= 0x71
I need to achieve that small decoding function in JS.
Thank you all in advance for your time.
Base64
It isn't true that base64 makes you lose non-ASCII characters like ñ or á. Why should it? Base64 can encode any binary data, and encoded text is nothing else than binary data.
So encoding involves two steps:
A text encoding (such as UTF-8) converts your text to bytes, and the base64 encoding turns that into an ASCII string.
Decoding works the same, but backwards (reverse order of the two corresponding decoding functions).
This is how text encoding for UTF-8 works in JavaScript:
function encode_utf8(s) {
return unescape(encodeURIComponent(s));
}
function decode_utf8(s) {
return decodeURIComponent(escape(s));
}
I got this from here. Please note that I'm not a JS crack at all, and there might be more convenient methods now that I couldn't find.
Let's try this:
s = 'Se bañó todo el día.';
b = encode_utf8(s); # text encoding
a = btoa(b); # base64 encoding
console.log(a); # prints U2UgYmHDscOzIHRvZG8gZWwgZMOtYS4=
d = decode_utf8(atob(a)); # decode base64, then UTF-8
console.log(d); # prints Se bañó todo el día.
No character lost here.
XOR method
If you still want to do the XOR thing, you can decode as follows:
convert the UTF8-encoded string to an array of code points with Array.from()
XOR-decode with the ^ operator (or ^= assignment)
convert the result to a string with String.fromCodePoint()
decode the string with decode_utf8()
I'm not providing code for this, though.
Especially the third step might be a bit cumbersome, and I'm not sure if it's worth the pain.
After all, your users can just inspect the JS code to find out how the data are "encrypted", be it base64 or the XOR method.
Note
If you come from a Python background, be aware that there is no distinction like Python's str and bytes type. Both input and output of the {en,de}code_utf8() functions are always strings, same type. When you encode a string, you just get back another string where every codepoint is below 256, and it might be longer than the input string.

Javascript string comparison fails when comparing unicode characters

I want to compare two strings in JavaScript that are the same, and yet the equality operator == returns false. One string contains a special character (eg. the danish å).
JavaScript code:
var filenameFromJS = "Designhåndbog.pdf";
var filenameFromServer = "Designhåndbog.pdf";
print(filenameFromJS == filenameFromServer); // This prints false why?
The solution
What worked for me is unicode normalization as slevithan pointed out.
I forked my original jsfiddle to make a version using the normalization lib suggested by slevithan. Link: http://jsfiddle.net/GWZ8j/1/.
Unlike what some other people here have said, this has nothing to do with encodings. Rather, your two strings use different code points to render the same visual characters.
To solve this correctly, you need to perform Unicode normalization on the two strings before comparing them. Unforunately, JavaScript doesn't have this functionality built in. Here is a JavaScript library that can perform the normalization for you: https://github.com/walling/unorm
The JavaScript equality operator == will appear to be failing under the following circumstances. In all cases it is programmer error. Not a bug in JavaScript.
The two strings do not contain the same number and sequence of characters.
There is whitespace or newlines before, within or after one string. Use a trim() operator on both and look closely at both strings.
Surprise typecasting. The programmer is comparing datatypes that are incompatible.
There are unicode characters which look identical to other unicode characters but in fact are different unicode characters.
UTF-8 is a complex thing. The charset has two different codes for characters such as á, é etc. As you already see in the URL encoded version, the HEX bytes of which the character is made differ for both versions.
See this answer for more information.
I had this same problem.
Adding
<meta charset="UTF-8">
to the HTML file fixed the issue.
In my case the templating engine was baking a json string into the HTML file. This string was in unicode.
While the template was also a unicode file, the JS engine was treating the string I wrote into the template as a latin-1 encoded string, until I added the meta tag.
I was comparing the typed in string to one of the JSON objects items (location.title == "Mühle")
Let the browser normalize unicode for you. This approach worked for me:
function normalizeUnicode(s) {
let div = $('<div style="display: none"></div>').html(s).appendTo('body');
let res = div.html();
div.remove();
return res;
}
normalizeUnicode(unicodeVal1) == normalizeUnicode(unicodeVal2)

javascript json - problem decoding ajax json array from php

I'm using php's json_encode() to convert an array to json which then echo's it and is read from a javascript ajax request.
The problem is the echo'd text has unicode characters which the javascript json parse() function doesn't convert to.
Example array value is "2\u00000\u00001\u00000\u0000-\u00001\u00000\u0000-\u00000\u00001" which is "2010-10-01".
Json.parse() only gives me "2".
Anyone help me with this issue?
Example:
var resArray = JSON.parse(this.responseText);
for(var x=0; x < resArray.length; x++) {
var twt = resArray[x];
alert(twt.date);
break;
}
You have NUL characters (character code zero) in the string. It's actually "2_0_1_0_-_1_0_-_0_1", where _ represents the NUL characters.
The unicode character escape is actually part of the JSON standard, so the parser should handle that correctly. However, the result is still a string will NUL characters in it, so when you try to use the string in Javascript the behaviour will depend on what the browser does with the NUL characters.
You can try this in some different browsers:
alert('as\u0000df');
Internet Explorer will display only as
Firefox will display asdf but the NUL character doesn't display.
The best solution would be to remove the NUL characters before you convert the data to JSON.
To add to what Guffa said:
When you have alternating zero bytes, what has almost certainly happened is that you've read a UTF-16 data source without converting it to an ASCII-compatible encoding such as UTF-8. Whilst you can throw away the nulls, this will mangle the string if it contains any characters outside of ASCII range. (Not an issue for date strings of course, but it may affect any other strings you're reading from the same source.)
Check where your PHP code is reading the 2010-10-01 string from, and either convert it on the fly using eg iconv('utf-16le', 'utf-8', $string), or change the source to use a more reasonable encoding. If it's a text file, for example, save it in a text editor using ‘UTF-8 without BOM’, and not ‘Unicode’, which is a highly misleading name Windows text editors use to mean UTF-16LE.

Categories