Javascript processing Cyrillic input

When I get a JSON feed from a Cyrillic site, the data arrives as \ufffd escapes instead of Cyrillic characters.
(example feed: http://jsonduit.com/v1/f/l/7sg?cb=getJsonP_1284131679846_0)
So when I set the feed's HTML as the content of the input, I get weird boxes instead of characters.
I tried to unescape the input, but that doesn't work either.
How do I convert the feed back to Cyrillic?
(By the way, the source page encoding is set to UTF-8.)

decodeURIComponent("stringToDecodeToCyrillic")
Example:
decodeURIComponent("%D0%90%D0%BB%D0%B5%D0%BA%D1%81%D0%B5%D0%B9") === "Алексей"
See also: Fastest way to encode cyrillic letters for url
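A quick round trip may make the relationship clearer. This is a minimal sketch that assumes the feed text is percent-encoded UTF-8; safeDecode is a hypothetical helper name, not part of any library. decodeURIComponent throws a URIError on malformed input, so the sketch falls back to returning the text unchanged.

```javascript
// Hypothetical helper: decode percent-encoded UTF-8, falling back to the
// raw text when the input is not valid percent-encoding.
function safeDecode(text) {
  try {
    return decodeURIComponent(text);
  } catch (err) {
    if (err instanceof URIError) return text;
    throw err;
  }
}

const encoded = encodeURIComponent("Алексей");
console.log(encoded);             // %D0%90%D0%BB%D0%B5%D0%BA%D1%81%D0%B5%D0%B9
console.log(safeDecode(encoded)); // Алексей
console.log(safeDecode("%D0"));   // %D0 (truncated sequence, returned unchanged)
```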

It seems you are receiving a UTF-8 string. Use the following object to decode it:
UTF8 = {
encode: function(s){
for(var c, i = -1, l = (s = s.split("")).length, o = String.fromCharCode; ++i < l;
s[i] = (c = s[i].charCodeAt(0)) >= 127 ? o(0xc0 | (c >>> 6)) + o(0x80 | (c & 0x3f)) : s[i]
);
return s.join("");
},
decode: function(s){
for(var a, b, i = -1, l = (s = s.split("")).length, o = String.fromCharCode, c = "charCodeAt"; ++i < l;
((a = s[i][c](0)) & 0x80) &&
(s[i] = (a & 0xfc) == 0xc0 && ((b = s[i + 1][c](0)) & 0xc0) == 0x80 ?
o(((a & 0x03) << 6) + (b & 0x3f)) : o(128), s[++i] = "")
);
return s.join("");
}
};
Usage:
var newString = UTF8.decode( yourString );
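Note that the decode function above only accepts lead bytes 0xC0 to 0xC3, so it can only produce code points up to U+00FF; Cyrillic letters (lead bytes 0xD0 and 0xD1) come out as character 128. As an alternative sketch, assuming the string holds one UTF-8 byte per character and that the standard TextDecoder API is available (modern browsers and Node.js), the bytes can be decoded as follows. decodeUtf8ByteString is a hypothetical name.

```javascript
// Assumption: each character of byteString carries exactly one UTF-8 byte
// in its low 8 bits (a so-called "binary string").
function decodeUtf8ByteString(byteString) {
  const bytes = Uint8Array.from(byteString, (ch) => ch.charCodeAt(0) & 0xff);
  return new TextDecoder("utf-8").decode(bytes);
}

// The UTF-8 bytes of "Алексей", one byte per character.
const raw = "\xD0\x90\xD0\xBB\xD0\xB5\xD0\xBA\xD1\x81\xD0\xB5\xD0\xB9";
console.log(decodeUtf8ByteString(raw)); // Алексей
```

If the feed already contains literal \ufffd characters, the original bytes were lost before the data reached the script, and no client-side decoding can bring them back.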

Related

How to understand Ternary JavaScript expression?

I'm not good with JS yet, especially with the "?" operator,
and I'm trying to understand the following code.
Could it be written in a more readable way?
If I don't want to use this operator, what would the code look like?
JavaScript code:
function(t) {
for (var e, r = t.length, n = "", i = 0, s = 0, a = 0; i < r; )
(s = t.charCodeAt(i)) < 128 ? (n += String.fromCharCode(s),
i++) : s > 191 && s < 224 ? (a = t.charCodeAt(i + 1),
n += String.fromCharCode((31 & s) << 6 | 63 & a),
i += 2) : (a = t.charCodeAt(i + 1),
e = t.charCodeAt(i + 2),
n += String.fromCharCode((15 & s) << 12 | (63 & a) << 6 | 63 & e),
i += 3);
return n
}

It seems like this is the same as:
function(t) {
for (var e, r = t.length, n = "", i = 0, s = 0, a = 0; i < r; )
if((s = t.charCodeAt(i)) < 128) {
n += String.fromCharCode(s);
i++;
} else if(s > 191 && s < 224) {
a = t.charCodeAt(i + 1);
n += String.fromCharCode((31 & s) << 6 | 63 & a);
i += 2;
} else {
a = t.charCodeAt(i + 1);
e = t.charCodeAt(i + 2);
n += String.fromCharCode((15 & s) << 12 | (63 & a) << 6 |
63 & e);
i += 3;
}
return n
}

This is an overly complicated expression involving multiple ternary operators.
I think it should be simplified.
A ternary operator behaves like an if, but it is an expression that returns one of two values depending on the first operand.
For example:
operand ? valueIfTrue : valueIfFalse is a ternary expression that returns valueIfTrue if operand is "truthy" and valueIfFalse if operand is "falsy".
You can substitute any expression for valueIfTrue and valueIfFalse, which is how you end up with really complicated, sometimes unnecessarily complex, expressions.
As an example of how expressions get complicated, consider again: operand ? valueIfTrue : valueIfFalse
If we replace valueIfTrue with another ternary expression, e.g. myOtherOperand ? myOtherIfTrue : myOtherIfFalse, the original expression becomes:
operand ? myOtherOperand ? myOtherIfTrue : myOtherIfFalse: valueIfFalse
This is not a nice way to write it. It can be improved just by adding parentheses:
operand ? (myOtherOperand ? myOtherIfTrue : myOtherIfFalse) : valueIfFalse
Formatting improves it further:
operand
? myOtherOperand
? myOtherIfTrue // if both operand and myOtherOperand are true
: myOtherIfFalse // if operand is true and myOtherOperand is false
: valueIfFalse // this will be returned if operand is false
This shows that formatting is essential for understanding code. But of course the first step is to have simple code. Anyway, here is how I would format the code in the question so it is easier to understand:
function myFunction(t) {
for (var e, r = t.length, n = "", i = 0, s = 0, a = 0; i < r; ) {
(s = t.charCodeAt(i)) < 128
? (n += String.fromCharCode(s), i++)
: s > 191 && s < 224
? (a = t.charCodeAt(i + 1), n += String.fromCharCode((31 & s) << 6 | 63 & a), i += 2)
: (a = t.charCodeAt(i + 1),
e = t.charCodeAt(i + 2),
n += String.fromCharCode((15 & s) << 12 | (63 & a) << 6 | 63 & e),
i += 3); // end of ternary operators
}
return n;
}
Now it is clearer, and we can see statements separated by commas inside the branches of the two ternary operators. The comma operator evaluates several expressions in sequence within a single expression, e.g. (n += String.fromCharCode(s), i++) appends a character to n and also increments i. In this case, it is better to move those out of the ternary and into a normal if statement like this:
function myFunction(t) {
for (var e, r = t.length, n = "", i = 0, s = 0, a = 0; i < r;) {
const firstCheck = (s = t.charCodeAt(i)) < 128;
const secondCheck = s > 191 && s < 224;
if (firstCheck) {
n += String.fromCharCode(s);
i++;
} else if (secondCheck) {
// This is originally: (a = t.charCodeAt(i + 1), n += String.fromCharCode((31 & s) << 6 | 63 & a), i += 2);
a = t.charCodeAt(i + 1);
n += String.fromCharCode((31 & s) << 6 | 63 & a);
i += 2;
} else {
// this is originally:
// (a = t.charCodeAt(i + 1),
// e = t.charCodeAt(i + 2),
// n += String.fromCharCode((15 & s) << 12 | (63 & a) << 6 | 63 & e),
// i += 3);
a = t.charCodeAt(i + 1);
e = t.charCodeAt(i + 2);
n += String.fromCharCode((15 & s) << 12 | (63 & a) << 6 | 63 & e);
i += 3;
}
}
return n;
}
So basically, break it down and take it step by step until you understand it; once you understand it, you can change it.
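To check that a ternary chain and its if/else rewrite really agree, the branch selection can be isolated and compared over every possible input. The sketch below does that for the byte-length test used in the question; sequenceLengthTernary and sequenceLengthIf are hypothetical names introduced for illustration.

```javascript
// Nested ternary: picks how many bytes the sequence starting with s occupies.
function sequenceLengthTernary(s) {
  return s < 128 ? 1 : s > 191 && s < 224 ? 2 : 3;
}

// The same decision written as plain if statements.
function sequenceLengthIf(s) {
  if (s < 128) {
    return 1;
  }
  if (s > 191 && s < 224) {
    return 2;
  }
  return 3;
}

// Compare both versions for every byte value.
let allEqual = true;
for (let s = 0; s < 256; s++) {
  if (sequenceLengthTernary(s) !== sequenceLengthIf(s)) {
    allEqual = false;
  }
}
console.log(allEqual); // true
```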

Encode and decode skipping the characters

I am trying to store some strings in the database in encoded form, but when I retrieve them the string is malformed.
Here is my code example, where you can easily see that the string passed to encode is not the same as the one returned by decode. Why is this happening?
Is there any other library that can help me with encoding and decoding? Any suggestions would be helpful.
var Base64 = {
_keyStr : "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=",
encode : function(e) {
var t = "";
var n, r, i, s, o, u, a;
var f = 0;
e = Base64._utf8_encode(e);
while (f < e.length) {
n = e.charCodeAt(f++);
r = e.charCodeAt(f++);
i = e.charCodeAt(f++);
s = n >> 2;
o = (n & 3) << 4 | r >> 4;
u = (r & 15) << 2 | i >> 6;
a = i & 63;
if (isNaN(r)) {
u = a = 64
} else if (isNaN(i)) {
a = 64
}
t = t + this._keyStr.charAt(s) + this._keyStr.charAt(o)
+ this._keyStr.charAt(u) + this._keyStr.charAt(a)
}
return t
},
decode : function(e) {
var t = "";
var n, r, i;
var s, o, u, a;
var f = 0;
e = e.replace(/[^A-Za-z0-9+/=]/g, "");
while (f < e.length) {
s = this._keyStr.indexOf(e.charAt(f++));
o = this._keyStr.indexOf(e.charAt(f++));
u = this._keyStr.indexOf(e.charAt(f++));
a = this._keyStr.indexOf(e.charAt(f++));
n = s << 2 | o >> 4;
r = (o & 15) << 4 | u >> 2;
i = (u & 3) << 6 | a;
t = t + String.fromCharCode(n);
if (u != 64) {
t = t + String.fromCharCode(r)
}
if (a != 64) {
t = t + String.fromCharCode(i)
}
}
t = Base64._utf8_decode(t);
return t
},
_utf8_encode : function(e) {
e = e.replace(/rn/g, "n");
var t = "";
for (var n = 0; n < e.length; n++) {
var r = e.charCodeAt(n);
if (r < 128) {
t += String.fromCharCode(r)
} else if (r > 127 && r < 2048) {
t += String.fromCharCode(r >> 6 | 192);
t += String.fromCharCode(r & 63 | 128)
} else {
t += String.fromCharCode(r >> 12 | 224);
t += String.fromCharCode(r >> 6 & 63 | 128);
t += String.fromCharCode(r & 63 | 128)
}
}
return t
},
_utf8_decode : function(e) {
var t = "";
var n = 0;
var r = c1 = c2 = 0;
while (n < e.length) {
r = e.charCodeAt(n);
if (r < 128) {
t += String.fromCharCode(r);
n++
} else if (r > 191 && r < 224) {
c2 = e.charCodeAt(n + 1);
t += String.fromCharCode((r & 31) << 6 | c2 & 63);
n += 2
} else {
c2 = e.charCodeAt(n + 1);
c3 = e.charCodeAt(n + 2);
t += String.fromCharCode((r & 15) << 12 | (c2 & 63) << 6 | c3
& 63);
n += 3
}
}
return t
}
}
var str = "background:url(/drona-courses/player_assets/skin_0/DRONA_default_skinRightCorner.png) ;"
var encoded = Base64.encode(str);
//console.log(encoded);
var decoded = Base64.decode(encoded);
console.log(str,"......Input");
console.log(decoded,".....Output");

Base64 Encoding in common browsers
In JavaScript there are two functions for decoding and encoding Base64 strings, respectively:
atob()
btoa()
The atob() function decodes a string of data which has been encoded using base-64 encoding. Conversely, the btoa() function creates a base-64 encoded ASCII string from a "string" of binary data.

Use atob and btoa.
const foo = "bar"
const encodedFoo = btoa(foo)
const decodedFoo = atob(encodedFoo)
console.log(encodedFoo)
console.log(decodedFoo)
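One caveat: btoa() only accepts characters with code points up to 0xFF and throws on anything else, so text should be converted to UTF-8 bytes first. Below is a minimal sketch, assuming btoa, atob, TextEncoder and TextDecoder are available as globals (modern browsers and recent Node.js); encodeBase64Utf8 and decodeBase64Utf8 are hypothetical names. As a side note, the _utf8_encode in the question calls e.replace(/rn/g, "n"), which as written drops the "r" from every literal "rn" (as in "Corner"); that alone would explain the missing character.

```javascript
// Hypothetical helper: text -> UTF-8 bytes -> binary string -> Base64.
function encodeBase64Utf8(text) {
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return btoa(binary);
}

// Hypothetical helper: Base64 -> binary string -> UTF-8 bytes -> text.
function decodeBase64Utf8(base64) {
  const binary = atob(base64);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder("utf-8").decode(bytes);
}

const css = "background:url(/drona-courses/player_assets/skin_0/DRONA_default_skinRightCorner.png) ;";
console.log(decodeBase64Utf8(encodeBase64Utf8(css)) === css); // true
console.log(decodeBase64Utf8(encodeBase64Utf8("Алексей")));   // Алексей
```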

Convert html-input to proper encoding

I have an HTML form with one input field. The input is pasted from the clipboard from other programs. Sometimes the copied text is not UTF-8 but ANSI (tested with Notepad++). Then umlauts like ü come through as Ã¼.
As I don't want to change the encoding of the clipboard text every time (e.g. with Notepad++), I would like to do this directly in JavaScript when parsing and splitting the input text.
Is there a way to do this without implementing my own function (which would be my next step, for the most common umlauts)?

Borrowing this from the internet:
//+ Jonas Raoni Soares Silva
//# http://jsfromhell.com/geral/utf-8 [rev. #1]
var UTF8 = {
encode: function(s){
for(var c, i = -1, l = (s = s.split("")).length, o = String.fromCharCode; ++i < l;
s[i] = (c = s[i].charCodeAt(0)) >= 127 ? o(0xc0 | (c >>> 6)) + o(0x80 | (c & 0x3f)) : s[i]
);
return s.join("");
},
decode: function(s){
for(var a, b, i = -1, l = (s = s.split("")).length, o = String.fromCharCode, c = "charCodeAt"; ++i < l;
((a = s[i][c](0)) & 0x80) &&
(s[i] = (a & 0xfc) == 0xc0 && ((b = s[i + 1][c](0)) & 0xc0) == 0x80 ?
o(((a & 0x03) << 6) + (b & 0x3f)) : o(128), s[++i] = "")
);
return s.join("");
}
};
You can then add your input:
<input type="text" id="test">
Then listen for the paste event and, after a few milliseconds (otherwise .val() still returns ""), replace the entire value of the input with the decoded one:
$('#test').on('paste', function(e) {
var controller = $(this);
setTimeout(function(){
controller.val(UTF8.decode(controller.val()));
},10);
});
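As an alternative to the timeout, the pasted text can be read from the event itself. This is a sketch under assumptions: it uses the standard clipboardData.getData("text/plain") of the native paste event (no jQuery), replaces the whole field value just like the snippet above rather than inserting at the caret, and takes the decoder as a parameter. makePasteHandler is a hypothetical name; UTF8.decode refers to the object defined above.

```javascript
// Hypothetical factory: builds a paste listener that decodes the pasted text.
function makePasteHandler(decode) {
  return function onPaste(event) {
    const pasted = event.clipboardData.getData("text/plain");
    event.preventDefault(); // stop the browser from inserting the raw text
    event.target.value = decode(pasted);
  };
}

// Attach only when a DOM is present; "test" is the id of the input above.
if (typeof document !== "undefined") {
  const input = document.getElementById("test");
  if (input) {
    input.addEventListener("paste", makePasteHandler(UTF8.decode));
  }
}
```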
Codepen:
http://codepen.io/anon/pen/GgYZeb
Please note that it only listens for the paste event. You can add other events if you need them.

Counterpart to Python's chr() in JavaScript

The JavaScript method String.fromCharCode() behaves equivalently to Python's unichr() in the following sense:
print unichr(213) # prints Õ on the console
console.log(String.fromCharCode(213)); // prints Õ on the console as well
For my purposes, however, I need a JavaScript equivalent to the Python function chr(). Is there such a JavaScript function or a way to make String.fromCharCode() behave like chr()?
That is, I need something in JavaScript that mimics
print chr(213) # prints � on the console

So it turns out you just want to work with raw bytes in Node.js; there's a module for that. If you are a real wizard, you can get this to work with JavaScript strings alone, but it's harder and far less efficient.
// Buffer.alloc replaces the deprecated new Buffer(size) constructor
var b = Buffer.alloc(1);
b[0] = 213;
console.log(b.toString()); //�
b = Buffer.alloc(3);
b[0] = 0xE2;
b[1] = 0x98;
b[2] = 0x85;
console.log(b.toString()); //★
print chr(213) # prints � on the console
So this prints a raw byte (0xD5), which is interpreted as UTF-8 (most likely); it is not a valid UTF-8 byte sequence and is therefore displayed as the replacement character (�).
The interpretation as UTF-8 is not relevant here; you most likely just want raw bytes.
To create raw bytes in JavaScript you could use Uint8Array.
var a = new Uint8Array(1);
a[0] = 213;
You could optionally then interpret the raw bytes as utf-8:
console.log( utf8decode(a)); // "�"
//Not recommended for production use ;D
//Doesn't handle > BMP to keep the answer shorter
function utf8decode(uint8array) {
var codePoints = [],
i = 0,
byte, codePoint, len = uint8array.length;
for (i = 0; i < len; ++i) {
byte = uint8array[i];
if ((byte & 0xF8) === 0xF0 && len > i + 3) {
codePoint = ((byte & 0x7) << 18) | ((uint8array[++i] & 0x3F) << 12) | ((uint8array[++i] & 0x3F) << 6) | (uint8array[++i] & 0x3F);
if (!(0xFFFF < codePoint && codePoint <= 0x10FFFF)) {
codePoints.push(0xFFFD, 0xFFFD, 0xFFFD, 0xFFFD);
} else {
codePoints.push(codePoint);
}
} else if ((byte & 0xF0) === 0xE0 && len > i + 2) {
codePoint = ((byte & 0xF) << 12) | ((uint8array[++i] & 0x3F) << 6) | (uint8array[++i] & 0x3F);
if (!(0x7FF < codePoint && codePoint <= 0xFFFF)) {
codePoints.push(0xFFFD, 0xFFFD, 0xFFFD);
} else {
codePoints.push(codePoint);
}
} else if ((byte & 0xE0) === 0xC0 && len > i + 1) {
codePoint = ((byte & 0x1F) << 6) | ((uint8array[++i] & 0x3F));
if (!(0x7F < codePoint && codePoint <= 0x7FF)) {
codePoints.push(0xFFFD, 0xFFFD);
} else {
codePoints.push(codePoint);
}
} else if ((byte & 0x80) === 0x00) {
codePoints.push(byte & 0x7F);
} else {
codePoints.push(0xFFFD);
}
}
return String.fromCharCode.apply(String, codePoints);
}
What you are most likely trying to do has nothing to do with trying to interpret the bytes as utf8 though.
Another example:
//UTF-8 For the black star U+2605 ★:
var a = new Uint8Array(3);
a[0] = 0xE2;
a[1] = 0x98;
a[2] = 0x85;
utf8decode(a) === String.fromCharCode(0x2605) //True
utf8decode(a) // ★
In python 2.7 (Ubuntu):
print chr(0xE2) + chr(0x98) + chr(0x85)
#prints ★
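In environments that provide the standard TextDecoder API (modern browsers and Node.js), the hand-written utf8decode above is not needed for this. A minimal sketch, assuming TextDecoder is available as a global: by default it substitutes U+FFFD for invalid input instead of throwing.

```javascript
const decoder = new TextDecoder("utf-8");

// A lone 0xD5 byte is not valid UTF-8, so it decodes to U+FFFD.
const invalid = decoder.decode(new Uint8Array([213]));

// The three bytes of the black star U+2605 decode to a single character.
const star = decoder.decode(new Uint8Array([0xE2, 0x98, 0x85]));

console.log(invalid === "\uFFFD"); // true
console.log(star === "\u2605");    // true
```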

If you want this "Questionmark in a box" for every number that is not in the standard ASCII table, how about this little function?
function chr(c) {
return (c < 0 || c > 126) ? '�' : String.fromCharCode(c);
}

Strange use of "for" cycle in Javascript, please explain

I found this strange JavaScript that I cannot understand. The for loop has a strange syntax (many parameters). Can you explain how it is intended to work? Thanks.
decode: function(s){
for(var a, b, i = -1, l = (s = s.split("")).length, o = String.fromCharCode, c = "charCodeAt"; ++i < l;
((a = s[i][c](0)) & 0x80) &&
(s[i] = (a & 0xfc) == 0xc0 && ((b = s[i + 1][c](0)) & 0xc0) == 0x80 ?
o(((a & 0x03) << 6) + (b & 0x3f)) : o(128), s[++i] = "")
);
return s.join("");
}

That's an ordinary for loop, but with a very long var statement in the first part.
It's just like
var a, b, c;
Also the iterator statement in the for loop contains a lot of operations instead of the loop actually having a body.
Either this function was written by a terrible programmer with no regard for readable code, or it has been intentionally minified and obfuscated.

Interesting function. It decodes two-byte UTF-8 sequences stored one byte per character, but only for code points up to U+00FF, so it is kind of esoteric. Here's the breakdown:
s = s.split(""); // work on an array of characters so elements can be replaced
for (var i = 0; i < s.length; i++) {
var a = s[i].charCodeAt(0);
if (a & 0x80) { // (a >= 128) if extended ascii
var specialA = (a & 0xfc) === 0xc0; // a IS [À, Á, Â or Ã] essentially [192, 193, 194, 195]
var b = specialA ? s[i + 1].charCodeAt(0) : 0; // only read when needed, as in the original
var specialB = (b & 0xc0) === 0x80; // b >= 128 & b <= 191 eg. b is not a special Latin Ascii Letter
if (specialA && specialB) {
var txA = (a & 0x03) << 6; // [0, 64, 128, 192]
var txB = b & 0x3f; // 0 - 63
s[i] = String.fromCharCode(txA + txB);
} else {
s[i] = String.fromCharCode(128);
}
s[++i] = ""; // the original blanks the next element in both cases
}
}
Hope this helps. Either way, I found the decoding interesting; it reminds me of reading raw assembler, lol -ck

The different parts of the for loop are all there, divided by the semicolons (;).
The var part:
var a, b, i = -1, l = (s = s.split("")).length, o = String.fromCharCode, c = "charCodeAt";
The check part:
++i < l;
The update part:
((a = s[i][c](0)) & 0x80) &&
(s[i] = (a & 0xfc) == 0xc0 && ((b = s[i + 1][c](0)) & 0xc0) == 0x80 ?
o(((a & 0x03) << 6) + (b & 0x3f)) : o(128), s[++i] = "")
After the for() statement comes a ; right away, meaning that the loop doesn't have a body, but all the statements in the var, check, and update parts will still be executed until the check is no longer true.
Looks like someone didn't want their code to be readable. Where did you find it, anyway?
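The same pattern can be seen on a much smaller scale. The sketch below is an illustration only (parts, joinedCompact and joinedReadable are hypothetical names): the first loop has no body and does its work in the update part, exactly like the function in the question, and the second is the conventional equivalent.

```javascript
const parts = ["a", "b", "c"];

// Compact style: the increment lives in the check, the work in the update
// part, and the trailing ; means the loop has an empty body.
let joinedCompact = "";
for (let i = -1, len = parts.length; ++i < len; joinedCompact += parts[i]);

// Conventional style: same result, with the work moved into the body.
let joinedReadable = "";
for (let i = 0; i < parts.length; i++) {
  joinedReadable += parts[i];
}

console.log(joinedCompact);  // abc
console.log(joinedReadable); // abc
```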

Breaking the loop down into a more readable one:
rearranged loop parameters
changed (...)&&(...) with an if(...){(...)}
changed l to len
moved s = s.split(...) outside the len
.
var a, b, s = s.split(""), o = String.fromCharCode, c = "charCodeAt";
for(var i = -1, len = s.length; ++i < len;){
if((a = s[i][c](0)) & 0x80){
(s[i] = (a & 0xfc) == 0xc0 && ((b = s[i + 1][c](0)) & 0xc0) == 0x80 ? o(((a & 0x03) << 6) + (b & 0x3f)) : o(128), s[++i] = "");
}
}
changed i initial value and how/where it increases
moved a = s[i][c](0) outside
.
var a, b, s = s.split(""), o = String.fromCharCode, c = "charCodeAt";
for(var i = 0, len = s.length; i < len; i++){
a = s[i][c](0);
if(a & 0x80){
// the ternary result is what gets assigned to s[i]
s[i] = (a & 0xfc) == 0xc0 && ((b = s[i + 1][c](0)) & 0xc0) == 0x80 ? o(((a & 0x03) << 6) + (b & 0x3f)) : o(128);
s[++i] = "";
}
}
created tmp to make things easier to read
stored the ternary operation result in tmp, using an if/else
assigned tmp to s[i], then cleared the next element with s[++i] = ""
put the new loop back into your example
.
decode: function(s){
var tmp, a, b, o = String.fromCharCode, c = "charCodeAt";
s = s.split("");
for(var i = 0, len = s.length; i < len; i++){
a = s[i][c](0);
if(a & 0x80){
if((a & 0xfc) == 0xc0 && ((b = s[i + 1][c](0)) & 0xc0) == 0x80){
tmp = o(((a & 0x03) << 6) + (b & 0x3f));
}else{
tmp = o(128);
}
s[i] = tmp; // replace the lead byte with the decoded character
s[++i] = ""; // and blank the byte that followed it
}
}
return s.join("");
}
Final result /\
