I need to do something like this:
Have a variable of some type.
Run in a loop and assign all the possible ASCII characters to this variable and print them, one by one.
Is something similar possible for UNICODE also?
I'm not sure how exactly you want to print, but this will console.log the printable ASCII characters:
for(var i=32;i<127;++i) console.log(String.fromCharCode(i));
You can use document.write instead if that's your intention. And since JavaScript strings are Unicode, the same approach should work for Unicode characters as well, I believe.
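For instance, a minimal sketch of the same loop over a Unicode range (the Cyrillic block is an arbitrary choice here; String.fromCharCode accepts any BMP code point):
// Print the Cyrillic capital letters А (U+0410) through Я (U+042F), one per line.
for (var i = 0x0410; i <= 0x042F; ++i) console.log(String.fromCharCode(i));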
Others have shown how to print the printable ASCII characters. It is possible to print all other ASCII characters too, though they are control characters with system-dependent effects (often no effect). To create a string containing all ASCII characters, you could do this:
var s = '';
for (var i = 0; i <= 127; i++) s += String.fromCharCode(i);
Unicode is much more tricky, because the Unicode coding space, from 0 to 0x10FFFF, contains a large number of unassigned code points as well as code points designated as noncharacters. There are also Private Use code points, which may be used to denote characters by “private agreement” but have no generally assigned meaning. Moreover, many Unicode characters are nonspacing, i.e. meant to combine with the preceding character (e.g., turning “a” into “â”), so you can’t visually print them in a row. There is no simple way in JavaScript to determine, from an integer, the class of the corresponding code point – you might need to read the UnicodeData.txt file, parse it, and use the information there to classify code points.
Finally, there is the programming issue that the JavaScript concept of character corresponds to a 16-bit code unit (not a code point), and any Unicode code point larger than 0xFFFF needs to be represented using two code units (so-called surrogates). If you are using JavaScript in the context of an HTML document and you want to print characters in the HTML content, then the simplest way is to use character references like &#x10400; (which denotes the Unicode character at code point 10400 hexadecimal) and assign the string to the innerHTML property of an element.
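A minimal sketch of that approach (the element id "out" is an assumption made for illustration, not something from the page above):
// Writing a character reference into innerHTML lets the browser resolve it.
document.getElementById('out').innerHTML = '&#x10400;'; // U+10400 DESERET CAPITAL LETTER LONG I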
If you need to write ranges of Unicode characters, you might take a look at the Full Unicode Input utility that I recently wrote. Its source code illustrates some ways of dealing with Unicode characters in JavaScript.
Some of the ASCII characters are non-printable, but to get the characters from 32 (space) to 126 (~), for example, you would use:
var s = '';
for (var i = 32; i <= 126; i++) s += String.fromCharCode(i);
The Unicode character set has more than 110,000 different characters (see Unicode), but a normal font doesn't contain all of them, so you can't display them all anyway. You would have to specify which parts of the character space you are interested in.
By default, String.prototype.normalize() uses "NFC" as its argument. NFC replaces multiple characters with a single one.
MDN
You can specify "NFC" to get the composed canonical form, in which multiple code points are replaced with single code points where possible.
And here's an example from MDN. It works.
let str = '\u006E\u0303';
str = str.normalize();
console.log(`${str}: ${str.length}`);
But then I decided to try this method with other characters. For example:
let str = '\u0057\u0303';
str = str.normalize();
console.log(`${str}: ${str.length}`);
What's wrong in the second example? Why doesn't it work?
It doesn't replace multiple characters; it replaces multiple code points, and only where possible.
ñ, being a character used in Spanish, has its own code point in Unicode, U+00D1, so you can just say ñ instead of "Take an n and then put a ~ on top of it".
W̃, being a representation of a phonetic sound, doesn't have its own code point. It is a character used comparatively rarely, so it hasn't been given precious space in the more efficient parts of Unicode. The only way you can have one is to say "Take a W and then put a ~ on top of it".
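A quick sketch confirming both directions in any modern engine (the variable name is mine):
let composed = '\u006E\u0303'.normalize('NFC'); // "ñ", a single code point (U+00F1)
console.log(composed.length);                        // 1
console.log(composed.normalize('NFD').length);       // 2: decomposed back to n + U+0303
console.log('\u0057\u0303'.normalize('NFC').length); // 2: no precomposed W̃ exists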
For the API I'm working on, I have to map bytes to characters by these rules:
Bytes 0x20 to 0x24 inclusive are mapped to the corresponding characters U+0020 to U+0024 inclusive.
Bytes 0x26 to 0x7E inclusive are mapped to the corresponding characters U+0026 to U+007E inclusive.
Other bytes are mapped to the three character sequence consisting of a percent character followed by two uppercase hex characters, e.g. byte 0 maps to “%00” and byte 0xAB maps to “%AB”.
That's for encoding; I have to make a function for decoding also.
Is this maybe some existing encoding? I googled it, but couldn't find anything.
I know U+ is for Unicode.
Can I just map it like:
if (bytesArray[i] == 0x21)
{
    bytesArray[i] = U+0021;
}
?
Since you are translating values in a string, you can use String.replace. The only thing that actually needs replacing is the %XX notation; other characters can stay unchanged.
In this case you can pass a user-defined function as the second argument of replace:
decodedArray = byteArray.replace(/%([0-9A-F][0-9A-F])/g,
    function (match, p1)
    {
        return String.fromCharCode(parseInt(p1, 16));
    });
This is a safe operation because, per the specification, a literal % character may not appear in the encoded string (in case you're wondering: that is why code 0x25 gets singled out in the otherwise printable plain ASCII range). Even if a stray % appears in the string, it will only be replaced if it is actually followed by two valid uppercase hex digits.
Encoding is similarly straightforward:
encodedArray = originalString.replace(/[^ -$&-~]/g,
    function (match)
    {
        if (match.charCodeAt(0) < 16)
            return '%0' + match.charCodeAt(0).toString(16).toUpperCase();
        return '%' + match.charCodeAt(0).toString(16).toUpperCase();
    });
Note that the 'match' regex is the exact ASCII representation of your to-be-translated code list; it is negated because you want to find all characters outside that range.
This works better than a mapping, because a mapping would require you to loop over each character manually and inspect it to decide whether or not it needs translating. That would work reasonably well for encoding (you'd need a list of all codes, each with the one- or three-character string to which it encodes), but for decoding you would need to look two characters ahead, test whether they are valid (!), and skip over them if so. Compared to that, the replace solution ought to be much faster.
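For reference, a minimal sketch that wraps both snippets as functions (the names encode and decode are mine, not part of any API) and round-trips a sample string:
function encode(s) {
    return s.replace(/[^ -$&-~]/g, function (match) {
        var hex = match.charCodeAt(0).toString(16).toUpperCase();
        return hex.length < 2 ? '%0' + hex : '%' + hex;
    });
}
function decode(s) {
    return s.replace(/%([0-9A-F][0-9A-F])/g, function (match, p1) {
        return String.fromCharCode(parseInt(p1, 16));
    });
}
console.log(encode('A%B\n'));                     // "A%25B%0A"
console.log(decode(encode('A%B\n')) === 'A%B\n'); // true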
I've been trying to set the content of a text input dynamically using JS. The problem I encountered is that I cannot get the browser to render the special symbols rather than the literal characters, so for example
document.getElementById("textField").value = "&nbsp;";
Instead of displaying a space it displays &nbsp;, anybody got any idea?
Thanks a lot
It seems that you want to enter special characters like NO-BREAK SPACE in a JavaScript string literal. You can do that directly, provided that the character encoding of the file containing JavaScript code is properly declared, as it should be anyway:
document.getElementById("textField").value = ' ';
Here the character between apostrophes is the real NO-BREAK SPACE character. In rendering, it is usually indistinguishable from normal SPACE, but it has different effects. Similarly you can write e.g.
document.getElementById("textField").value = 'Ω';
using the Greek letter capital omega directly.
If you do not know how to enter such characters (e.g., via Windows CharMap program) or if you cannot control character encoding issues, you can use JavaScript Unicode escape notations for characters, e.g.
document.getElementById("textField").value = '\u00A0'; // no-break space
or
document.getElementById("textField").value = '\u03A9'; // capital omega
For the small set of characters with Unicode numbers less than 0x100, you can alternatively use \x escapes, e.g. '\xA0' instead of '\u00A0'. (But if you didn’t know this, it is better to learn to use the universal \u escape instead.)
&nbsp; is an HTML entity, and you can't put an HTML entity in a text field like that.
Try using a Unicode escape, like this:
document.getElementById("textField").value = '\xA0';
What about using jQuery and this:
$("#textField").html('&nbsp;').text()
Or in more general:
$(element).html(encodedString).text()
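For instance, a variant that decodes an entity string via a throwaway element and assigns the result to the field (the <div> wrapper is my addition, not from the answer above):
// Let jQuery parse the entity in a detached element, then read back plain text.
document.getElementById("textField").value = $('<div>').html('&nbsp;').text();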
document.getElementById("textField").value = " ";
you should use " " (a literal no-break space character) instead of "&nbsp;"
For characters in the Basic Multilingual Plane, we can use '\uxxxx' to escape them. For example, you can use /[\u4e00-\u9fff]/ to match a common Chinese character (0x4e00-0x9fff is the range of CJK Unified Ideographs).
But for characters outside the Basic Multilingual Plane, the codes are bigger than 0xffff, so you can't use the '\uxxxx' format to escape them, because '\u20000' means the character '\u2000' followed by the character '0', not a character whose code is 0x20000.
How can I escape characters outside the Basic Multilingual Plane? Using those characters directly is not a good idea, because they can't be shown in most fonts.
Characters outside the BMP are not recognized directly by Javascript -- they're represented internally as UTF-16 surrogate pairs. For instance, the character you mentioned, U+20000 (currently allocated to "CJK Unified Ideographs Ext. B") is represented as the surrogate pair U+D840 U+DC00. As a Javascript string, this would simply be "\uD840\uDC00". (Note that s.length is 2 for this string, even though it displays as a single character.)
Wikipedia has details on the encoding scheme used.
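A small sketch of the arithmetic behind that pair (this is the standard UTF-16 encoding scheme; the function name is mine):
// Convert a supplementary code point (> 0xFFFF) to its UTF-16 surrogate pair.
function toSurrogatePair(cp) {
    var offset = cp - 0x10000;
    var high = 0xD800 + (offset >> 10);   // lead surrogate
    var low  = 0xDC00 + (offset & 0x3FF); // trail surrogate
    return [high.toString(16).toUpperCase(), low.toString(16).toUpperCase()];
}
console.log(toSurrogatePair(0x20000)); // ["D840", "DC00"]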
You can use a pair of escaped surrogate code points, as described in #duskwuff’s answer. You can use my Full Unicode input utility to get the notations (button “Show \u”), or use the Fileformat.info character search to find them out (item “C/C++/Java source code”, because JavaScript uses the same notation here).
Alternatively, you can enter the characters directly: “You can enter non-BMP characters as such into string literals in your JavaScript code, whether in a separate file or as embedded in HTML. Naturally, you need suitable Unicode support in the editor you use. But JavaScript implementations need not support non-BMP characters in program source. They may, and modern browser implementations generally do.” (Going Global with JavaScript and Globalize.js, p. 177) There are some caveats like properly declaring the character encoding.
Font support is a different issue, but when working with characters, you generally want to see them at some point anyway, at least in testing. So you more or less need some font(s) that cover the characters. The Fileformat.info pages also contain links to browser support info, such as (U+20000) Font Support – a good starting point, though not quite complete. For example, U+20000 '𠀀' is also supported in SimSun-ExtB.
Interesting problem.
Now that we have ES6, we can do this:
let newSpeak = '\u{1F4A9}'
Note that internally it's still UTF-16 with surrogate pairs:
newSpeak.length === 2 // "wrong"
[...newSpeak].length === 1
newSpeak === '\uD83D\uDCA9'
Unicode is huge.
Also, it's not just the literals:
newSpeak.charCodeAt(0) === 0xD83D // "wrong"
newSpeak.codePointAt(0) === 0x1F4A9
String.fromCharCode(0x1F4A9) !== newSpeak
String.fromCodePoint(0x1F4A9) === newSpeak
for (let i = 0; i < newSpeak.length; i++) console.log(newSpeak[i]) // "wrong"
for (let c of newSpeak) console.log(c)
[...'🏃🚚'].map(c => `__${c}`).join('') === "__🏃__🚚"
I 💩 handling Unicode.
I have an ASP Classic page with SHIFT_JIS charset. The meta tag under the page's head section is like this:
<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">
My page has a text box (txtName) that should only allow 200 characters. I have a Javascript function that validates the character length, which is called on the onclick() event of my Submit button.
if (document.frmPage.txtName.value.length > 200) {
    alert("You have exceeded the maximum length of 200.");
    return false;
}
The problem is, Javascript is not getting the correct length of Japanese characters encoded in SHIFT_JIS. For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding that Javascript uses by default. Some characters like ケ take 2 or 3 characters when in SHIFT_JIS.
If I will only depend on the length provided by Javascript, long Japanese characters would pass the page validation and it will try to save on the database, which will then fail because of the 200 maximum length of the DB column.
The browser that I'm using is Internet Explorer. Is there a way to get the SHIFT_JIS length of the Japanese character using Javascript? Is it possible to convert from Unicode to SHIFT_JIS using Javascript? How?
Thanks for the help!
For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding
Let's be clear: 测, U+6D4B (Han Character 'measure, estimate, conjecture') is a single character. When you encode it to a particular encoding like Shift-JIS, it may very well become multiple bytes.
In general JavaScript doesn't make encoding tables available so you can't find out how many bytes a character will take up. If you really need to, you have to carry around enough data to work it out yourself. For example, if you assume that the input contains only characters that are valid in Shift-JIS, this function would work out how many bytes are needed by keeping a list of all the characters that are a single byte, and assuming every other character takes two bytes:
function getShiftJISByteLength(s) {
    // ASCII plus the halfwidth katakana block (U+FF61-U+FF9F) are the
    // single-byte characters in Shift-JIS; assume everything else is two bytes.
    return s.replace(/[^\x00-\x80\uFF61-\uFF9F]/g, 'xx').length;
}
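For instance (assuming the sketch above), halfwidth katakana count as one byte and fullwidth katakana as two:
getShiftJISByteLength('abcｱｲｳ'); // 6: three ASCII bytes plus three halfwidth katakana
getShiftJISByteLength('カナ');   // 4: two fullwidth characters at two bytes each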
However, there are no 8-byte sequences in Shift-JIS, and the character 测 is not available in Shift-JIS at all. (It's a Chinese character not used in Japan.)
Why you might be thinking it constitutes an 8-byte sequence is this: when a browser can't submit a character in a form, because it does not exist in the target charset, it replaces it with an HTML character reference: in this case &#27979;, which is indeed 8 characters long. This is a lossy mangling: you can't tell whether the user typed the literal text &#27979; or the character 测. And if you are displaying the submitted content &#27979; as 测, then that means you are forgetting to HTML-encode your output, which probably means your application is highly vulnerable to cross-site scripting.
The only sensible answer is to use UTF-8 instead of Shift-JIS. UTF-8 can happily encode 测, or any other character, without having to resort to broken HTML character references. If you need to store content limited by encoded byte length in your database, there is a sneaky hack you can use to get the number of UTF-8 bytes in a string:
function getUTF8ByteLength(s) {
    return unescape(encodeURIComponent(s)).length;
}
although probably it would be better to store native Unicode strings in the database so that the length limit refers to actual characters and not bytes in some encoding.
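A quick illustration of the hack (the byte counts follow standard UTF-8):
getUTF8ByteLength('abc'); // 3: one byte per ASCII character
getUTF8ByteLength('测');  // 3: U+6D4B encodes to three bytes in UTF-8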
You are getting confused between characters and bytes. 测 is ONE character, however you look at it. In UTF-16 (which is what Javascript uses), it's two BYTES. In Shift_JIS, 8 bytes, apparently. But in both cases, it's ONE character. So what you are trying to do is limit the text length to 200 BYTES. Since Javascript is using UTF-16 (UCS-2, really) you can get its byte length by multiplying the string length by 2, but that doesn't help you with Shift_JIS. Then again, you should probably consider switching to Unicode anyway, if you're working with Javascript...
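A sketch of that multiplication, for completeness (the function name is mine; it works because every UTF-16 code unit, including each half of a surrogate pair, is two bytes):
function getUTF16ByteLength(s) {
    return s.length * 2; // s.length counts 16-bit code units, not characters
}
getUTF16ByteLength('测'); // 2: one BMP character, one code unit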