I am setting up a small website and would like to make it international. All the content will be stored in an external XML file in different languages and parsed into the HTML via JavaScript.
Now the problem is, there are also German umlauts; Russian, Chinese, and Japanese characters; and right-to-left languages like Arabic and Farsi.
What would be the best way/solution? Is there an "international encoding" which can display all languages properly? Or is there any other solution you would suggest?
Thanks in advance!
All of the Unicode transformations (UTF-8, UTF-16, UTF-32) can encode all Unicode characters. You pick which one to use based on size: if most of your text is in Western scripts, probably UTF-8, as it will use only one byte for most characters, but 2, 3, or 4 if needed. If you're encoding mostly East Asian scripts, you'll probably want one of the other transformations.
The fundamental thing here is that it's all Unicode; the transformations are just different ways of representing the same characters.
The co-founder of Stack Overflow had a good article on this topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Regardless of what encoding you use for your document, note that if you're doing processing of these strings in JavaScript, JavaScript strings are sequences of UTF-16 code units (with the caveat that unpaired surrogates are tolerated). (Even if the document is in UTF-8 or UTF-32.) This means that, for instance, each of those emojis people are so excited about these days looks like two "characters" to JavaScript, because it takes two UTF-16 code units to represent. Like 😎, for instance:
console.log("😎".length); // 2
So you'll need to be careful not to split up the two halves of characters that are encoded as two UTF-16 code units.
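As a minimal sketch of what can go wrong (the sample string is just for illustration), slicing by code-unit index can cut a surrogate pair in half, while iterating by code points keeps the emoji intact:
const s = "😎 ok";
console.log(s.length);      // 5: the emoji alone counts as two UTF-16 code units
console.log(s.slice(0, 1)); // "\ud83d": an unpaired high surrogate, which cannot display properly
console.log([...s].length); // 4: spreading a string iterates by code points
console.log([...s][0]);     // "😎": both halves of the pair stay together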
The normal (and recommended) solution for multi-lingual sites is to use UTF-8. That can deal with any characters that have been assigned Unicode codepoints, with a couple of caveats:
Unicode is a versioned standard, and different Javascript implementations may support different Unicode versions.
If your text includes characters outside of the Unicode Basic Multilingual Plane (BMP), then you need to do your text processing (in Javascript) in a way that is Unicode aware. For instance, if you use the Javascript String class you need to take proper account of surrogate pairs when doing text manipulation.
(A Javascript String is actually encoded as UTF-16. While it has methods that allow you to manipulate it as Unicode codepoints, methods and attributes such as substring and length use code-unit rather than codepoint indexing. If you are not careful, you can end up splitting a string between the low and high parts of a surrogate pair. The result will be something that cannot be displayed properly. This only affects codepoints in higher planes ... but that includes the new emoji codepoints.)
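To make the difference concrete, here is a small sketch (the emoji is just one example of a codepoint outside the BMP):
const face = "😎"; // U+1F60E, outside the Basic Multilingual Plane
console.log(face.length);                      // 2: length counts UTF-16 code units
console.log(face.charCodeAt(0).toString(16));  // "d83d": just the high surrogate
console.log(face.codePointAt(0).toString(16)); // "1f60e": the actual codepoint
console.log(Array.from(face).length);          // 1: codepoint-aware counting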
Related
Is there any standard on how to write Unicode characters in JavaScript/JSON?
For instance, is there any difference between \u011b and \u011B? Most web examples use the second format. Also, there is an option for Latin-1 characters to be written in a short format like \xe1. Which format is preferable (standard)? Is it good practice to mix these formats together, and what about performance?
For the first question: both versions are valid. It is more a matter of coding convention: prefer whatever convention is already used in your files/project, then check your community (the conventions used by programs you rely heavily on), and only as a last option choose one yourself. But in any case, keep it consistent.
Personally I prefer neither of them in code: UTF-8 is so widely used and browsers should understand it, so I would put the correct character directly in the source (as a character, not as an escape sequence). If the codepoint is important, I would add it in a comment. It is reasonable to expect that all developers and tools have UTF-8-capable editors.
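A quick sketch illustrating both points (the particular characters, ě and á, are just examples): the case of the hex digits is irrelevant, and the escaped and literal forms compare equal:
console.log("\u011b" === "\u011B"); // true: both are "ě"
console.log("\xe1" === "\u00e1");   // true: the short form is just another spelling
console.log("\u011b" === "ě");      // true: the literal character works too in a UTF-8 source file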
Javascript uses UCS-2, the precursor of UTF-16, which treats Unicode codepoints as just 16 bits long (so some emoji take two "characters").
The byte format should not be used for text: it hides the meaning. There are exceptions, e.g. checking which encoding you got from a user, or handling a BOM (but then just for signatures). For other binary cases it is OK to use \x escapes such as \x1e, e.g. for key identification.
Note: you should really follow one set of coding guidelines. Search for them and you will find many, e.g. this one from Google (which is maybe too much): https://google.github.io/styleguide/jsguide.html
I'm currently using Visual Studio Community and learning JavaScript/HTML5. I came across a problem where I'm trying to store an HTML character code in a variable using JavaScript; however, when I try to display it, it shows the actual code instead of the intended character.
<script type="text/javascript">
x = "幫";
function ranAlert(){
alert(x);
}
</script>
<form>
<input type="button" value="me" onclick="ranAlert()" />
</form>
The output when I click on the button is the literal text "&#24171;". But how would I get it to display the actual intended character "幫"?
For your character, you need to use JavaScript notation and not HTML, because you're not using the string for any HTML markup; you're just passing it to alert():
alert("\u5e6b");
The JavaScript notation uses hex, so you'd need to translate 24171 to hex notation. You can do that in your browser console:
(24171).toString(16)
gives 5e6b.
Since this character is in the Basic Multilingual Plane and fits in a single UTF-16 code unit, a simpler way to deal with the problem is to simply use the character directly in the string:
alert("幫");
For some Unicode codepoints, representation in JavaScript requires a pair of JavaScript characters, because JavaScript strings use UTF-16 encoding and some characters require more than 16 bits to represent. The extremely helpful website FileFormat.org has resources for Unicode, including ready-to-use representations for every character in several popular languages including JavaScript.
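For example (a small sketch; the emoji is just one such codepoint), U+1F60E can be written as a surrogate-pair escape, as an ES2015 codepoint escape, or built at runtime:
alert("\uD83D\uDE0E");                // surrogate-pair escape
alert("\u{1F60E}");                   // codepoint escape (ES2015+)
alert(String.fromCodePoint(0x1F60E)); // the same character, built at runtime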
If you want to use that codepoint in CSS, say as the value of a content: property, you escape slightly differently. The CSS encoding scheme allows 6-hex-digit values:
.my-glyph { content: "\005E6B"; }
Finally, note that the term "character" is not entirely accurate, because Unicode reflects the fact that many writing systems compose glyphs from several parts. I'm not very strict about terminology in general, so don't follow my example; but it's important to keep in mind that what looks like a single glyph (what an alphabet user would call a "character") may actually be made up of several separate codepoints.
In the JavaScript world,
I learnt that the JavaScript source code charset is usually UTF-8 (but not always).
I learnt that the JavaScript (execution) charset is UTF-16.
How do I interpret these two terms?
Note: the answer can be given language-agnostically, e.g. by taking another language like Java.
Pretty much all source code is written in UTF-8, or should be. Since source code is mostly English, using ASCII-compatible characters, and UTF-8 is most efficient in that character range, there is a great advantage. In any case, it has become the de facto standard.
JavaScript was developed before the rest of the world settled on UTF-8, so it follows the Java practice of using UTF-16 for all strings, which was pretty forward-thinking at the time. This means that all strings, whether coded in the source or obtained some other way, will be (re-)encoded in UTF-16.
For the most part it’s unimportant: source code is for humans and the execution character set is for machines. However, the difference does have two minor consequences:
JavaScript strings may waste a lot of space if your text is largely in the ASCII range (which it would be in English, and even other languages use ASCII spaces and punctuation).
Like UTF-8, UTF-16 is also variable-width, though most characters in most languages fit within the usual 2 bytes; however, JavaScript may miscalculate the length of a string if some of the characters extend to 4 bytes (see the sketch below).
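To illustrate both points, a small sketch (this assumes a Node.js environment for the Buffer calls; the sample strings are arbitrary):
const text = "Hello, world";
console.log(Buffer.byteLength(text, "utf8"));    // 12 bytes
console.log(Buffer.byteLength(text, "utf16le")); // 24 bytes, roughly double for ASCII-range text
console.log("𝒳".length); // 2: a single visible character outside the BMP counts as two code units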
Apart from questions of which encoding better suits a particular human language, there is no other advantage of one over the other. If JavaScript were developed more recently, it would probably have used UTF-8 encoding for strings.
Is it safe to write JavaScript source code (to be executed in the browser) which includes UTF-8 character literals?
For example, I would like to use an ellipsis literal in a string, like this:
var foo = "Oops… Something went wrong";
Do "modern" browsers support this? Is there a published browser support matrix somewhere?
JavaScript is by specification a Unicode language, so Unicode characters in strings should be safe. You can use hex escapes (\u8E24) as an alternative. Make sure your script files are served with proper content type headers.
Note that characters beyond the Basic Multilingual Plane (those that need more than one 16-bit code unit) are problematic, and that JavaScript regular expressions are terrible with them. (Well, maybe not "terrible", but primitive at best.)
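As a small illustration of that limitation (note that the u flag shown here is an ES2015 addition, so older engines will not have it):
console.log(/^.$/.test("😎"));  // false: without the u flag, "." sees two separate code units
console.log(/^.$/u.test("😎")); // true: with the u flag, "." matches the whole codepoint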
You can also use Unicode letters, Unicode combining marks, and Unicode connector punctuation characters in identifiers, in case you want to impress your friends. Thus
var wavy﹏line = "wow";
is perfectly good JavaScript (but good luck with your bug report if you find a browser where it doesn't work).
Read all about it in the spec, or use it to fall asleep at night :)
Is there any way to exclude all symbols, punctuation, block elements, geometric shapes, and dingbats such as these:
✁ ✂ ✃ ✄ ✆ ✇ ✈ ✉ ✌ ✍ ✎ ✏ ✐ ✑ ✒ ✓ ✔ ✕ ⟻ ⟼ ⟽ ⟾ ⟿ ⟻ ⟼ ⟽ ⟾ ⟿ ▚ ▛ ▜ ▝ ▞ ▟
without writing all of them down in the regular expression pattern, while allowing all other normal language characters, such as Chinese, Arabic, etc., like these:
文化中国 الجزيرة نت
?
I'm building a JavaScript validation function, and my real problem is that I can't use:
[a-zA-Z0-9]
because this excludes a lot of languages too, not just the symbols.
The Unicode standard divides up all the possible characters into code charts. Each code chart contains related characters. If you want to exclude (or include) only certain classes of characters, you will have to make a suitable list of exclusions (or inclusions). Unicode is big, so this might be a lot of work.
Not really.
JavaScript doesn't support Unicode Character Properties. The closest you'll get is excluding ranges by Unicode code point as Greg Hewgill suggested.
For example, to match the run of symbol blocks from Arrows through Block Elements:
/[\u2190-\u259F]/
This depends on your regex dialect. Unfortunately, probably most existing JavaScript engines don't support Unicode character classes.
In regex engines such as the one in (recent) Perl or .Net, Unicode character classes can be referenced.
\p{L}: any kind of letter from any language.
\p{N}: any number symbol from any language (including, as I recall, the Indian and Arabic and CJK number glyphs).
Because Unicode supports composed and decomposed glyphs, you may run into certain complexities: namely, if only decomposed forms exist, you might accidentally exclude some diacritic marks with your matching pattern, and you may need to explicitly allow characters of the type Mark. You can mitigate this somewhat by using, if I recall correctly, a string that has been normalized with NFKC normalization (which only helps for characters that have a composed form). In environments that support Unicode well, there is usually a function that lets you normalize Unicode strings fairly easily (true in Java and .Net, at least).
Edited to add: If you've started down this path, or have considered it, in order to regain some sanity, you may want to experiment with the Unicode Plugin for XRegExp (which will require you to take a dependency on XRegExp).
JavaScript regular expressions do not have native Unicode support. An alternative is to validate (or sanitize) the string at the server side, or to use a non-native regex library. While I've never used it, XRegExp is such a library, and it has a Unicode Plugin.
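For instance, a hedged sketch of what that could look like with XRegExp and its Unicode add-ons loaded (the property classes and the validation rule here are just an illustration, not a drop-in solution):
var onlyWordCharacters = XRegExp("^[\\p{L}\\p{M}\\p{N}\\s]+$"); // letters, marks, digits, whitespace from any script
console.log(onlyWordCharacters.test("文化中国 الجزيرة نت")); // true
console.log(onlyWordCharacters.test("✂ ▚ ⟿"));              // false: symbols are rejected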
Take a look at the Unicode Planes. You probably want to exclude everything but planes 0 and 2. After that, it gets ugly as you'll have to exclude a lot of plane 0 on a case-by-case basis.