I'm currently using Visual Studio Community and learning JavaScript/HTML5. I came across a problem where I'm trying to store a character code (an HTML entity) in a variable using JavaScript; however, when I try to display it, it shows the actual code instead of the intended character.
<script type="text/javascript">
x = "&#24171;";
function ranAlert(){
alert(x);
}
</script>
<form>
<input type="button" value="me" onclick="ranAlert()" />
</form>
When I click on the button, the output shows "&#24171;". But how would I get it to display the actual intended character "幫"?
For your character, you need to use JavaScript notation and not HTML, because you're not using the string for any HTML markup; you're just passing it to alert():
alert("\u5e6b");
The JavaScript notation uses hex, so you'd need to translate 24171 to hex notation. You can do that in your browser console:
(24171).toString(16)
gives 5e6b.
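If you'd rather do the conversion in code than by hand, a small sketch using the standard String.fromCodePoint (or the older String.fromCharCode) might look like this:
var code = 24171;                         // the decimal number from the HTML entity
console.log(code.toString(16));           // "5e6b"
console.log(String.fromCodePoint(code));  // "幫"
console.log("\u5e6b" === String.fromCodePoint(code)); // true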
Since this character fits in a single UTF-16 code unit (it is in the Basic Multilingual Plane), a simpler way to deal with the problem is to simply use the character directly in the string:
alert("幫");
For some Unicode codepoints, representation in JavaScript requires a pair of JavaScript characters, because JavaScript strings use UTF-16 encoding and some characters require more than 16 bits to represent. The extremely helpful website FileFormat.org has resources for Unicode, including ready-to-use representations for every character in several popular languages including JavaScript.
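For example, U+1D11E (the musical G clef), chosen purely as an illustration, needs two \u escapes (a surrogate pair), or the newer ES2015 \u{...} form:
alert("\uD834\uDD1E");  // surrogate pair for U+1D11E
alert("\u{1D11E}");     // ES2015 code point escape -- the same character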
If you want to use that codepoint in CSS, say as the value of a content: property, you escape slightly differently. The CSS encoding scheme allows 6-hex-digit values:
.my-glyph { content: "\005E6B"; }
Finally, note that using the term "character" is not entirely accurate, because Unicode reflects the fact that many writing systems involve glyphs composed from several parts. I'm not very strict on terminology in general, so don't follow my example; it's important to keep in mind that what looks like a single glyph (what an alphabet user would call a "character") may actually be made up of several separate codepoints.
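A quick sketch of that point: a precomposed character and a base letter plus a combining accent look the same on screen, but they are different strings until you normalize them.
console.log("é".length);                          // 1 -- precomposed U+00E9
console.log("e\u0301".length);                    // 2 -- "e" plus a combining acute accent
console.log("é" === "e\u0301");                   // false
console.log("é" === "e\u0301".normalize("NFC"));  // true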
Related
Is there any standard on how to write Unicode characters in JavaScript/JSON?
For instance, is there any difference between \u011b and \u011B? Most web examples use the second format. Also, there is an option to write some characters in a short format like \xe1. Which format is preferable (standard)? Is it good practice to mix these formats together, and what about performance?
For the first question: both versions are valid. It is more a matter of coding convention; you should prefer whatever convention is already used in your files/project. Then check your community (the convention used by other programs you rely on heavily), and as a last option you can choose one yourself. But in any case, keep it consistent.
Personally, I prefer neither of them for code: UTF-8 is so widely used and browsers understand it, so I would put the right character in directly (as a character, not as an escape sequence). If the codepoint is important, I would add it in a comment. It is expected that all developers and tools can handle UTF-8.
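A tiny sketch of both points -- the two escape spellings are identical, and the literal character compares equal to them:
console.log("\u011b" === "\u011B");  // true -- the hex digits name the same codepoint
console.log("\u011b" === "ě");       // true -- the literal character is the same string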
JavaScript uses UCS-2, the precursor of UTF-16, treating Unicode code points as just 16 bits long (so some emoji take up two characters).
The byte format should not be used for text: it hides the meaning. There are exceptions, e.g. checking which encoding you received from the user, or whether there is a BOM (but only for such signatures). For other binary cases it is fine to use \x escapes such as \x1e, e.g. for key identification.
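As a hedged illustration of that exception, a hypothetical stripBom helper (not part of any library) that checks for the U+FEFF signature:
function stripBom(text) {
  // a leading U+FEFF is a byte-order mark / signature, not content
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}
console.log(stripBom("\uFEFFhello"));  // "hello"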
Note: you should really follow one set of coding guidelines. Search for them and you will find many, e.g. this one from Google (which may be more than you need): https://google.github.io/styleguide/jsguide.html
I am setting up a little website and would like to make it international. All the content will be stored in an external XML file in different languages and parsed into the HTML via JavaScript.
Now the problem is, there are also German umlauts, Russian, Chinese and Japanese characters, and also right-to-left languages like Arabic and Farsi.
What would be the best way/solution? Is there an "international encoding" which can display all languages properly? Or is there any other solution you would suggest?
Thanks in advance!
All of the Unicode transformations (UTF-8, UTF-16, UTF-32) can encode all Unicode characters. You pick which one you want to use based on size: if most of your text is in Western scripts, probably UTF-8, as it will use only one byte for most characters, but 2, 3, or 4 if needed. If you're encoding East Asian scripts, you'll probably want one of the other transformations.
The fundamental thing here is that it's all Unicode; the transformations are just different ways of representing the same characters.
The co-founder of Stack Overflow had a good article on this topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Regardless of what encoding you use for your document, note that if you're doing processing of these strings in JavaScript, JavaScript strings are UTF-16 (except that invalid values are tolerated). (Even if the document is in UTF-8 or UTF-32.) This means that, for instance, each of those emojis people are so excited about these days looks like two "characters" to JavaScript, because it takes two words of UTF-16 to represent. Like 😎, for instance:
console.log("😎".length); // 2
So you'll need to be careful not to split up the two halves of characters that are encoded in two words of UTF-16.
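One sketch of how to do that with standard language features: for...of and Array.from iterate by code point, so a surrogate pair stays together.
console.log("😎".length);              // 2 -- UTF-16 code units
console.log(Array.from("😎").length);  // 1 -- code points
for (const ch of "a😎b") {
  console.log(ch);                     // "a", "😎", "b" -- the emoji is not split
}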
The normal (and recommended) solution for multi-lingual sites is to use UTF-8. That can deal with any characters that have been assigned Unicode codepoints, with a couple of caveats:
Unicode is a versioned standard, and different JavaScript implementations may support different Unicode versions.
If your text includes characters outside of the Unicode Basic Multilingual Plane (BMP), then you need to do your text processing (in JavaScript) in a way that is Unicode-aware. For instance, if you use the JavaScript String class you need to take proper account of surrogate pairs when doing text manipulation.
(A JavaScript String is actually encoded as UTF-16. It has methods that allow you to manipulate it as Unicode codepoints, but methods and attributes such as substring and length use code-unit rather than codepoint indexing. If you are not careful, you can end up splitting a string between the low and high parts of a surrogate pair. The result will be something that cannot be displayed properly. This only affects codepoints in the higher planes ... but that includes the new emoji codepoints.)
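A small sketch of the difference, using an emoji (any codepoint outside the BMP would do):
const s = "a😎";
console.log(s.length);                      // 3 code units, but only 2 codepoints
console.log(s.charCodeAt(1).toString(16));  // "d83d" -- just the high surrogate
console.log(s.codePointAt(1).toString(16)); // "1f60e" -- the whole emoji codepoint
console.log(s.substring(0, 2));             // "a" plus a lone surrogate: not displayable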
In the JavaScript world,
I learnt that the JavaScript source code charset is usually UTF-8 (but not always).
I learnt that the JavaScript (execution) charset is UTF-16.
How do I interpret these two terms?
Note: The answer can be given language-agnostically, e.g. by taking another language like Java.
Pretty much all source code is written in UTF-8, or should be. Since source code is mostly English, using ASCII-compatible characters, and UTF-8 is most efficient in that character range, there is a great advantage. In any case, it has become the de facto standard.
JavaScript was developed before the rest of the world settled on UTF-8, so it follows the Java practice of using UTF-16 for all strings, which was pretty forward-thinking at the time. This means that all strings, whether coded in the source or obtained some other way, will be (re-)encoded in UTF-16.
For the most part it’s unimportant. Source code is for humans and the execution character set is for machines. However, this does have two minor consequences:
JavaScript strings may waste a lot of space if your strings are largely in the ASCII range (which they would be in English, or even in other languages that use spaces).
Like UTF-8, UTF-16 is also variable-width, though most characters in most languages fit within the normal 2 bytes; however, JavaScript may miscalculate the length of a string if some of the characters extend to 4 bytes.
Apart from questions of which encoding better suits a particular human language, there is no other advantage of one over the other. If JavaScript were developed more recently, it would probably have used UTF-8 encoding for strings.
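A quick sketch of the source-versus-execution distinction: however the source file is encoded, the engine ends up with the same sequence of UTF-16 code units.
console.log("é" === "\u00e9");  // true -- literal character and escape give the same string
console.log("é".length);        // 1 -- one 16-bit code unit
console.log("𝄞".length);        // 2 -- one codepoint, but two UTF-16 code units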
Is there a way to use patterns like "\p{L}" in JavaScript, natively?
(I suppose that is Perl-compatible syntax.)
I'm interested firstly in Firefox support, and possibly WebKit.
No, \p{..} is not supported natively by any of the big browsers. However, it does work in JavaScript if you use the XRegExp library and its Unicode plugins.
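A sketch of what that looks like, assuming XRegExp and its Unicode Base addon are loaded on the page:
var lettersOnly = XRegExp("^\\p{L}+$");   // \p{L} = any Unicode letter
console.log(lettersOnly.test("café"));    // true
console.log(lettersOnly.test("abc123"));  // false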
Unfortunately, no. You can only specify a set of characters in the usual syntax, writing characters and ranges in brackets, but this becomes awkward since e.g. letters are scattered all around the Unicode space, with other characters between them.
There’s an inefficient workaround: fetch the UnicodeData.txt file from the Unicode site, put its content inside your JavaScript code as data, and parse it. And then you could have the data e.g. in an array of objects containing the Unicode properties, such as gc (General Category), which tells you whether the character is a letter or not. But even then, you would just have the data handy for simple testing, not as something you can use as a constituent of a regexp.
In theory, you could use the data to construct a regexp... but it would be rather large.
No, JavaScript has a slightly different syntax. To match Unicode characters you have to use selectors like \uXXXX. However, in practice, if your page and files are in UTF-8, putting non-ASCII characters directly in a range like [абвг] works too.
http://www.javascriptkit.com/jsref/regexp.shtml
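As a sketch of that workaround, a hand-written range covers simple cases; the Cyrillic block is used here purely as an example:
var cyrillic = /[\u0400-\u04FF]+/;
console.log(cyrillic.test("абвг"));  // true
console.log(cyrillic.test("abcd"));  // false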
The library found here:
http://inimino.org/~inimino/blog/javascript_cset
seems to work for me and is fairly small and independent of other libraries.
I have an HTML text box with which users may enter text. I would like to ensure all text entered in the box is either encoded in UTF-8 or converted to UTF-8 when a user finishes typing. Furthermore, I don't quite understand how the various UTF encodings are chosen when text is entered into a text box.
Generally I'm curious about the following:
How does a browser determine which encodings to use when a user is typing into a text box?
How can JavaScript determine the encoding of a string value in an HTML text box?
Can I force the browser to only use UTF-8 encoding?
How can I convert text from arbitrary encodings to UTF-8? I assume there is a JavaScript library for this?
Edit:
Removed some questions unnecessary to my goals.
This tutorial helped me understand JavaScript character codes better, but is buggy and does not actually translate character codes to UTF-8 in all cases.
http://www.webtoolkit.info/javascript-base64.html
How does a browser determine which encodings to use when a user is typing into a text box?
It uses the encoding the page was decoded as by default. According to the spec, you should be able to override this with the accept-charset attribute of the <form> element, but IE is buggy, so you shouldn't rely on this (I've seen several different sources describe several different bugs, and I don't have all the relevant versions of IE in front of me to test, so I'll leave it at that).
How can JavaScript determine the encoding of a string value in an HTML text box?
All strings in JavaScript are encoded in UTF-16. The browser will map everything into UTF-16 for JavaScript, and from UTF-16 into whatever the page is encoded in.
UTF-16 is an encoding that grew out of UCS-2. Originally, it was thought that 65,536 code points would be enough for all of Unicode, and so a 16-bit character encoding would be sufficient. It turned out that this is not the case, and so the character set was expanded to 1,114,112 code points. In order to maintain backwards compatibility, a few unused ranges of the 16-bit character set were set aside for surrogate pairs, in which two 16-bit code units are used to encode a single character. Read up on UTF-16 and UCS-2 on Wikipedia for details.
The upshot is that when you have a string str in JavaScript, str.length does not give you the number of characters, it gives you the number of code units, where two code units may be used to encode a single character, if that character is not within the Basic Multilingual Plane. For instance, "abc".length gives you 3, but "𐤀𐤁𐤂".length gives you 6; and "𐤀𐤁𐤂".substring(0,1) gives what looks like an empty string, since half of a surrogate pair cannot be displayed, but the string still contains that invalid character (I will not guarantee this works cross-browser; I believe it is acceptable to drop broken characters). To get a valid character, you must use "𐤀𐤁𐤂".substring(0,2).
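The same examples in code form, plus codePointAt for reading a full character:
console.log("abc".length);            // 3
console.log("𐤀𐤁𐤂".length);            // 6 -- three characters, six code units
console.log("𐤀𐤁𐤂".substring(0, 1));   // half of a surrogate pair: not displayable
console.log("𐤀𐤁𐤂".substring(0, 2));   // "𐤀" -- a complete character
console.log("𐤀𐤁𐤂".codePointAt(0).toString(16)); // "10900"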
Can I force the browser to only use UTF-8 encoding?
The best way to do this is to deliver your page in UTF-8. Ensure that your web server is sending the appropriate Content-type: text/html; charset=UTF-8 headers. You may also want to embed a <meta charset="UTF-8"> element in your <head> element, for cases in which the Content-Type does not get set properly (such as if your page is loaded off of the local disk).
How can I convert text from arbitrary encodings to UTF-8? I assume there is a JavaScript library for this?
There isn't much need in JavaScript to encode text in particular encodings. If you are simply writing to the DOM, or reading or filling in form controls, you should just use JavaScript strings, which are treated as sequences of UTF-16 code units. XMLHttpRequest, when used to send(data) via POST, will use UTF-8 (if you pass it a document with a different encoding declared in the <?xml ...> declaration, it may or may not convert that to UTF-8, so for compatibility you generally shouldn't use anything other than UTF-8).
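For instance, a minimal sketch of a POST with XMLHttpRequest (the /submit URL is hypothetical); the string lives as UTF-16 in memory and goes over the wire as UTF-8:
var xhr = new XMLHttpRequest();
xhr.open("POST", "/submit");  // hypothetical endpoint
xhr.setRequestHeader("Content-Type", "text/plain;charset=UTF-8");
xhr.send("幫 café 😎");        // serialised as UTF-8 bytes by the browser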
I would like to ensure all text entered in the box is either encoded in UTF-8
Text in an HTML DOM, including input fields, has no intrinsic byte encoding; it is stored as Unicode characters (specifically, at the DOM and ECMAScript standard level, UTF-16 code units; in the rare case you use characters outside the Basic Multilingual Plane it is possible to see the difference, e.g. '𝅘𝅥𝅯'.length is 2).
It is only when the form is sent that the text is serialised into bytes using a particular encoding, by default the same encoding as was used to parse the page. So you should serve the page containing your form as UTF-8 (via the Content-Type header charset parameter and/or the equivalent <meta> tag).
Whilst in principle there is an override for this in the accept-charset attribute of the <form> element, it doesn't work correctly (and is actively harmful in many cases) in IE. So avoid that one.
There are no explicit encoding-handling functions available in JavaScript itself. You can hack together a Unicode-to-UTF-8-bytes encoder by chaining unescape(encodeURIComponent(str)) (and similarly the other way round with the inverse function), but that's about it.
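A sketch of that hack in both directions (the helper names toUtf8Bytes/fromUtf8Bytes are just for illustration):
function toUtf8Bytes(str) {
  // each "character" of the result holds one UTF-8 byte value (0-255)
  return unescape(encodeURIComponent(str));
}
function fromUtf8Bytes(bytes) {
  return decodeURIComponent(escape(bytes));
}
var bytes = toUtf8Bytes("é");
console.log(bytes.length);          // 2 -- the UTF-8 encoding of "é" is two bytes
console.log(fromUtf8Bytes(bytes));  // "é"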
The text in a text box is not encoded in any way; it is "text", an abstract series of characters. In almost every contemporary application, that text is expressed as a sequence of Unicode code points, which are integers mapped to particular abstract characters. Text doesn't get "encoded" until it is turned into a sequence of bytes, as when submitting the form. At that time, the encoding is determined by the encoding of the HTML page in which the form appears, or by the accept-charset attribute of the form element.