In the JavaScript world,
I learnt that the JavaScript source code charset is usually UTF-8 (but not always).
I learnt that the JavaScript (execution) charset is UTF-16.
How do I interpret these two terms?
Note: The answer can be given language-agnostically, for example by taking another language such as Java.
Pretty much all source code is written in UTF-8, or should be. Since source code is mostly English, using ASCII-compatible characters, and UTF-8 is most efficient in that character range, there is a great advantage. In any case, it has become the de facto standard.
JavaScript was developed before the rest of the world settled on UTF-8, so it follows the Java practice of using UTF-16 for all strings, which was pretty forward-thinking at the time. This means that all strings, whether coded in the source or obtained some other way, will be (re-)encoded in UTF-16.
For the most part this is unimportant. Source code is for humans and the execution character set is for machines. However, the difference does have two minor consequences:
JavaScript strings may waste a lot of space if your strings are largely in the ASCII range (which they would be in English, or even in other languages that use spaces).
Like UTF-8, UTF-16 is also variable width, though most characters in most languages fit within the usual 2 bytes; however, JavaScript may miscalculate the length of a string if some of the characters extend to 4 bytes, as the example below shows.
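For example (a minimal sketch; any character outside the Basic Multilingual Plane would do):
// U+1D11E MUSICAL SYMBOL G CLEF needs two UTF-16 code units ("4 bytes"),
// so .length counts it twice.
const s = "a𝄞b";
console.log(s.length);      // 4 code units
console.log([...s].length); // 3 code points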
Apart from the question of which encoding better suits a particular human language, there is no other advantage of one over the other. If JavaScript were developed more recently, it would probably have used UTF-8 encoding for strings.
I want to determine via JavaScript which Unicode code points of a JavaScript string are printable and which are not. Accessing the code points themselves via codePointAt() works.
In C/C++ I would then use iswgraph() with a Unicode locale. iswgraph() is a built-in function in C/C++ which checks whether the given wide character has a graphical representation or not.
https://www.geeksforgeeks.org/iswgraph-in-c-c-with-examples/
What would be a JavaScript substitute, a method or function that can be called from within a browser or Node.js and that delivers a similar indication?
Unlike unicode_utils.py, we can compress the Unicode data and reduce it to 1% of the 1114111 entries. Unicode CONTROL, FORMAT, and other categories can be used. The Dogelog runtime currently gives.
The meaning is:
General Category = 15 = Cc (CONTROL)
The encoding of General Category is from Java, not from .NET.
The two built-ins and the Unicode data are realized in JavaScript. How this slim Unicode code point classifier was built is documented here; the write-up also contains the source of the compressor:
Preview: Unicode general categories Prolog API. (Jekejeke)
https://gist.github.com/jburse/03c9b3cec528288d5c29c3d8bbb986c2#gistcomment-3808667
Preview: Unicode numeric values Prolog API. (Jekejeke)
https://gist.github.com/jburse/03c9b3cec528288d5c29c3d8bbb986c2#gistcomment-3809517
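As a rough substitute without such a classifier, one possible sketch (assuming an engine with Unicode property escapes in regular expressions, ES2018+; this is not the Dogelog classifier described above) treats a code point as graphical if it falls in neither an Other category (Cc, Cf, Cs, Co, Cn) nor a separator category (Zs, Zl, Zp):
// Rough approximation of iswgraph(): reject control, format, surrogate,
// private-use and unassigned code points as well as separators.
function isGraph(codePoint) {
  const ch = String.fromCodePoint(codePoint);
  return !/[\p{C}\p{Z}]/u.test(ch);
}
console.log(isGraph(0x41)); // true  (LATIN CAPITAL LETTER A)
console.log(isGraph(0x20)); // false (SPACE)
console.log(isGraph(0x07)); // false (BEL, a control character)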
Is there any standard on how to write Unicode characters in JavaScript/JSON?
For instance, is there any difference between \u011b and \u011B? Most web examples use the second format. There is also an option for characters up to U+00FF to be written in a short format like \xe1. Which format is preferable (standard)? Is it good practice to mix these formats, and what about performance?
For the first question: both versions are valid. It is more a coding convention; prefer whatever convention is already used in your files/project, then check your community (the convention used by programs you heavily rely on and what they prefer), and only as a last option choose one yourself. But in any case, keep it consistent.
Personally, I prefer neither of them for code: UTF-8 is so widely used and browsers should understand it, so I would put the right character in directly (as a character, not as an escape sequence). If the code point is important, I would add it in a comment. It is expected that all developers and tools will have UTF-8-capable editors.
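As a quick illustration of the equivalence (U+011B is just the character from the question):
// All three spellings produce the same one-character string.
const a = "\u011b"; // lowercase hex digits
const b = "\u011B"; // uppercase hex digits
const c = "ě";      // the literal character in a UTF-8 encoded source file
console.log(a === b, b === c); // true true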
JavaScript uses UCS-2, the precursor of UTF-16, in that it treats Unicode code points as just 16 bits wide (so some emoji take two "characters").
The byte format should not be used for text: it hides the meaning. There are exceptions, e.g. checking which encoding you get from the user, or whether you have a BOM (but then only for such signatures). For other, binary cases it is OK to use \x1e-style escapes, e.g. for key identification.
Note: you should really follow one set of coding guidelines. Google for it and you will find many, e.g. this one from Google (which is maybe too much): https://google.github.io/styleguide/jsguide.html
I have found this sentence while reading one of the JavaScript books:
JavaScript programs are written using the Unicode character set
What I don't understand is: how do JavaScript files make sure that whatever I write in a .js file is in the Unicode character set?
Does that mean that whenever I type using the keyboard on my computer, it always uses Unicode? How does it work?
This means that the language definition employs the Unicode charset. In particular, it usually means that string literals can include Unicode characters, and it may also mean that identifiers can include some Unicode characters too (I don't know JavaScript, but this is allowed in the Haskell language, for example).
Now, a JavaScript implementation can choose any way to map the bytes in a .js file to its internal Unicode representation. It may assume that all .js files are written in UTF-8, or in 7-bit ASCII, or anything else. You need to consult the implementation manual to find out.
And yes, you need to know that any file consists of bytes, not characters. How the characters you type in the editor are converted to bytes stored in the file is up to your editor (it usually offers a choice between local 8-bit encodings, UTF-8 and sometimes UTF-16). How the bytes stored in the file are converted back to characters is up to your language implementation (in this case, the JavaScript one).
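To make the bytes-versus-characters point concrete, here is a small sketch assuming an environment with TextEncoder/TextDecoder (modern browsers and Node.js):
// The character U+011B stored as UTF-8 is two bytes; decoding those bytes
// yields the character again. Which bytes end up in a file is the editor's
// choice; turning them back into characters is the implementation's job.
const bytes = new TextEncoder().encode("ě");
console.log(bytes);                                  // two bytes: 196, 155 (0xC4 0x9B)
console.log(new TextDecoder("utf-8").decode(bytes)); // "ě"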
I am setting up a little website and would like to make it international. All the content will be stored in an external XML file in different languages and parsed into the HTML via JavaScript.
Now the problem is, there are also German umlauts, Russian, Chinese and Japanese symbols, and also right-to-left languages like Arabic and Farsi.
What would be the best way/solution? Is there an "international encoding" which can display all languages properly? Or is there any other solution you would suggest?
Thanks in advance!
All of the Unicode transformation formats (UTF-8, UTF-16, UTF-32) can encode all Unicode characters. You pick which one to use based on size: if most of your text is in Western scripts, probably UTF-8, as it will use only one byte for most characters, but 2, 3 or 4 if needed. If you're encoding Far East scripts, you'll probably want one of the other transformations.
The fundamental thing here is that it's all Unicode; the transformations are just different ways of representing the same characters.
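As a rough illustration of the size trade-off (this sketch assumes Node.js, since Buffer is Node-specific):
// Byte cost of the same text in two encodings.
const western = "Hello, world";
const cjk = "こんにちは世界";
console.log(Buffer.byteLength(western, "utf8"), Buffer.byteLength(western, "utf16le")); // 12 24
console.log(Buffer.byteLength(cjk, "utf8"), Buffer.byteLength(cjk, "utf16le"));         // 21 14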
The co-founder of Stack Overflow had a good article on this topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Regardless of what encoding you use for your document, note that if you're processing these strings in JavaScript, JavaScript strings are UTF-16 (except that invalid values are tolerated), even if the document is in UTF-8 or UTF-32. This means that, for instance, each of those emojis people are so excited about these days looks like two "characters" to JavaScript, because it takes two words of UTF-16 to represent. Like 😎, for instance:
console.log("😎".length); // 2
So you'll need to be careful not to split up the two halves of characters that are encoded in two words of UTF-16.
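One way to avoid that (a small sketch; the string content is arbitrary) is to iterate by code point, e.g. with the spread operator or for...of, both of which respect surrogate pairs:
const s = "I 😎 Unicode";
console.log(s.length);                    // 12 UTF-16 code units
console.log([...s].length);               // 11 code points
console.log(s.slice(0, 3));               // "I \ud83d" (a lone high surrogate)
console.log([...s].slice(0, 3).join("")); // "I 😎"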
The normal (and recommended) solution for multi-lingual sites is to use UTF-8. It can deal with any characters that have been assigned Unicode code points, with a couple of caveats:
Unicode is a versioned standard, and different JavaScript implementations may support different Unicode versions.
If your text includes characters outside of the Unicode Basic Multilingual Plane (BMP), then you need to do your text processing (in JavaScript) in a way that is Unicode aware. For instance, if you use the JavaScript String class, you need to take proper account of surrogate pairs when doing text manipulation.
(A JavaScript String is actually encoded as UTF-16. It has methods that allow you to manipulate it as Unicode code points, but methods/attributes such as substring and length use code unit rather than code point indexing. If you are not careful, you can end up splitting a string between the low and high parts of a surrogate pair, and the result will be something that cannot be displayed properly. This only affects code points in the higher planes ... but that includes the new emoji code points.)
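A small sketch of the code unit vs. code point distinction (the party popper emoji is just an arbitrary astral code point):
const emoji = "🎉"; // U+1F389
console.log(emoji.length);                      // 2 code units
console.log(emoji.charCodeAt(0).toString(16));  // "d83c" (the high surrogate)
console.log(emoji.codePointAt(0).toString(16)); // "1f389" (the actual code point)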
Is it safe to write JavaScript source code (to be executed in the browser) which includes UTF-8 character literals?
For example, I would like to use an ellipsis literal in a string, as such:
var foo = "Oops… Something went wrong";
Do "modern" browsers support this? Is there a published browser support matrix somewhere?
JavaScript is by specification a Unicode language, so Unicode characters in strings should be safe. You can use hex escapes (\u8E24) as an alternative. Make sure your script files are served with proper content type headers.
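For the ellipsis example above, the escape form and the literal are interchangeable:
var foo = "Oops\u2026 Something went wrong";
console.log(foo === "Oops… Something went wrong"); // true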
Note that characters beyond one- and two-byte sequences are problematic, and that JavaScript regular expressions are terrible with characters beyond the first codepage. (Well maybe not "terrible", but primitive at best.)
You can also use Unicode letters, Unicode combining marks, and Unicode connector punctuation characters in identifiers, in case you want to impress your friends. Thus
var wavy﹏line = "wow";
is perfectly good JavaScript (but good luck with your bug report if you find a browser where it doesn't work).
Read all about it in the spec, or use it to fall asleep at night :)