I have an ASP Classic page with SHIFT_JIS charset. The meta tag under the page's head section is like this:
<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">
My page has a text box (txtName) that should only allow 200 characters. I have a Javascript function that validates the character length, called from the onclick event of my Submit button:
if (document.frmPage.txtName.value.length > 200) {
    alert("You have exceeded the maximum length of 200.");
    return false;
}
The problem is, Javascript is not getting the correct length of Japanese characters encoded in SHIFT_JIS. For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding that Javascript uses by default. Some characters like ケ take 2 or 3 characters in SHIFT_JIS.
If I only depend on the length reported by Javascript, long Japanese input will pass the page validation and the page will try to save it to the database, which will then fail because of the 200 maximum length of the DB column.
The browser that I'm using is Internet Explorer. Is there a way to get the SHIFT_JIS length of the Japanese character using Javascript? Is it possible to convert from Unicode to SHIFT_JIS using Javascript? How?
Thanks for the help!
For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding
Let's be clear: 测, U+6D4B (Han Character 'measure, estimate, conjecture') is a single character. When you encode it to a particular encoding like Shift-JIS, it may very well become multiple bytes.
In general JavaScript doesn't make encoding tables available so you can't find out how many bytes a character will take up. If you really need to, you have to carry around enough data to work it out yourself. For example, if you assume that the input contains only characters that are valid in Shift-JIS, this function would work out how many bytes are needed by keeping a list of all the characters that are a single byte, and assuming every other character takes two bytes:
function getShiftJISByteLength(s) {
    // Characters in the class (ASCII plus the half-width katakana at U+FF61-U+FF9F)
    // take one byte in Shift-JIS; everything else is assumed to take two.
    return s.replace(/[^\x00-\x80｡｢｣､･ｦｧｨｩｪｫｬｭｮｯｰｱｲｳｴｵｶｷｸｹｺｻｼｽｾｿﾀﾁﾂﾃﾄﾅﾆﾇﾈﾉﾊﾋﾌﾍﾎﾏﾐﾑﾒﾓﾔﾕﾖﾗﾘﾙﾚﾛﾜﾝﾞﾟ]/g, 'xx').length;
}
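For example, you could plug it into the validation from the question (same frmPage/txtName names as above):
if (getShiftJISByteLength(document.frmPage.txtName.value) > 200) {
    alert("You have exceeded the maximum length of 200.");
    return false;
}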
However, there are no 8-byte sequences in Shift-JIS, and the character 测 is not available in Shift-JIS at all. (It's a Chinese character not used in Japan.)
Why you might be thinking it constitutes an 8-byte sequence is this: when a browser can't submit a character in a form, because it does not exist in the target charset, it replaces it with an HTML character reference: in this case, &#27979;. This is a lossy mangling: you can't tell whether the user typed the literal text "&#27979;" or the character 测. And if you are displaying the submitted content &#27979; as 测, then that means you are forgetting to HTML-encode your output, which probably means your application is highly vulnerable to cross-site scripting.
The only sensible answer is to use UTF-8 instead of Shift-JIS. UTF-8 can happily encode 测, or any other character, without having to resort to broken HTML character references. If you need to store content limited by encoded byte length in your database, there is a sneaky hack you can use to get the number of UTF-8 bytes in a string:
function getUTF8ByteLength(s) {
return unescape(encodeURIComponent(s)).length;
}
although probably it would be better to store native Unicode strings in the database so that the length limit refers to actual characters and not bytes in some encoding.
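As an aside, on a modern browser - not the Internet Explorer mentioned in the question - the TextEncoder API gives the same UTF-8 byte count without the escape/unescape trick; a minimal sketch:
function getUTF8ByteLengthModern(s) {
    return new TextEncoder().encode(s).length; // TextEncoder always encodes to UTF-8
}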
You are getting confused between characters and bytes. 测 is ONE character, however you look at it. In UTF-16 (which is what Javascript uses), it's two BYTES. In Shift_JIS, 8 bytes, apparently. But in both cases, it's ONE character. So what you are trying to do is limit the text length to 200 BYTES. Since Javascript is using UTF-16 (UCS-2, really) you can get its byte length by multiplying the string length by 2, but that doesn't help you with Shift_JIS. Then again, you should probably consider switching to Unicode anyway, if you're working with Javascript...
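As a trivial illustration of that, using the form field from the question:
// Two bytes per UTF-16 code unit
var utf16ByteLength = document.frmPage.txtName.value.length * 2;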
Related
A user can copy and paste into a textarea html input and sometimes is pasting invalid UTF-8 characters, for example, a copy and paste from an RTF file that contains tabs.
How can I check if a string is a valid UTF-8?
Exposition
I think you misunderstand what "UTF-8 characters" means; UTF-8 is an encoding of Unicode which can represent any character, glyph, and grapheme that is defined in the (ever-growing) Unicode standard. There are far fewer Unicode code points than there are possible UTF-8 byte sequences, so the only "invalid UTF-8 characters" are UTF-8 byte sequences that don't map to any Unicode code point, but I assume this is not what you're referring to.
for example, a copy and paste from an RTF file that contains tabs.
RTF is a formatting system which works independently of the underlying encoding scheme - you can use RTF with ASCII, UTF-8, UTF-16 and other encodings. With respect to the HTML textboxes in your post, both the <input type="text"> and <textarea> elements in HTML only accept plain text, so any RTF formatting will be automatically stripped when pasted by a user (hence why JS-heavy "rich-edit" and contenteditable components are not uncommon in web applications, though in this answer I assume you're not using a rich-edit component in your page).
Tabs in RTF files are not an RTF feature: they're just normal ASCII-style tab characters, i.e. \t or 0x09, which also appear in Unicode, and thus, can also appear in UTF-8 encoded text; furthermore, it's perfectly valid for web-browsers to allow users to paste those into <input> and <textarea>.
Javascript (ECMAScript) itself is Unicode-native; that is, the ECMAScript specification does require JS engines to use UTF-16 representations in some places, such as in the abstract operation IsStringWellFormedUnicode:
7.2.9 Static Semantics: IsStringWellFormedUnicode
The abstract operation IsStringWellFormedUnicode takes argument string (a String) and returns a Boolean. It interprets string as a sequence of UTF-16 encoded code points, as described in 6.1.4, and determines whether it is a well formed UTF-16 sequence.
...but that part of the specification is intended for JS engine programmers, and not people who write JS for use in browsers - in fact, I'd say it's safe to assume that within a web-browser, any-and-all JS string values will always be valid strings that can always be serialized out to UTF-8 and UTF-16, and also that JS scripts should not be concerned with the actual in-memory encoding of the string's content.
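For what it's worth, very recent engines expose that spec operation directly as String.prototype.isWellFormed(); older browsers won't have it, so treat this as an aside rather than something to rely on:
console.log('abc'.isWellFormed());    // true
console.log('\uD800'.isWellFormed()); // false - a lone surrogate is not well-formed UTF-16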
Your question
So given that your question is written as this:
A user can copy and paste into a textarea html input and sometimes is pasting invalid UTF-8 characters, for example, a copy and paste from an RTF file that contains tabs.
How can I check if a string is a valid UTF-8?
I'm going to interpret it as this:
A user can copy RTF text from a program like WordPad and paste it into a HTML <textarea> or <input type="text"> in a web-browser, and when it's pasted the plaintext representation of the RTF still contains certain characters that my application should not accept such as whitespace like tabs.
How can I detect these unwanted characters and inform the user - or remove those unwanted characters?
...to which my answer is:
I suggest just stripping out unwanted characters using a regular expression that matches non-visible characters (from here: Match non printable/non ascii characters and remove from text):
let textBoxContent = document.getElementById( 'myTextarea' ).value;
textBoxContent = textBoxContent.replace( /[^\x20-\x7E]+/g, '' );
The expression [^\x20-\x7E] matches any character NOT in the codepoint range 0x20 (32, a normal space character ' ') to 0x7E (126, the tilde '~' character); all other characters will be removed, including non-Latin text.
The g switch at the end makes it a global find-and-replace operation; without the g then only the first unwanted character would be removed.
The range 0x20-0x7E works because Unicode's first 128 code points are identical to ASCII, which can be seen here: http://www.asciitable.com/
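If you only want to drop control characters such as tabs (the scenario described in the question) while keeping non-Latin text, a narrower character class is an option; this is a sketch, not a drop-in replacement for the range above:
// Strip C0 control characters (tab, newline, etc.) and DEL, keep everything else
textBoxContent = textBoxContent.replace( /[\x00-\x1F\x7F]+/g, '' );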
Just an idea:
function checkUTF8(text) {
    var utf8Text = text;
    try {
        // Try to convert: treat the code units as Latin-1 bytes and decode them as UTF-8
        utf8Text = decodeURIComponent(escape(text));
        // If the conversion succeeds, text was mis-decoded UTF-8 and has now been repaired
    } catch (e) {
        // console.log(e.message); // URI malformed
        // This exception means text was already a proper (decoded) string
    }
    return utf8Text; // the returned text is always a decoded Unicode string
}
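A quick illustration of what it does - the first argument below is the two UTF-8 bytes of "ä" mis-read as two separate characters, the second is an ordinary string:
console.log(checkUTF8('\u00C3\u00A4')); // "ä" - the conversion succeeds
console.log(checkUTF8('ä'));            // "ä" - escape() gives "%E4", decoding throws, the original is returned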
Encoding a string with German umlauts like ä, ü, ö, ß with Javascript's encodeURI() causes a weird bug after decoding it in PHP with rawurldecode(). Although the string seems to be correctly decoded, it isn't. See the example screenshots from my IDE below.
Also, strlen() on the string decoded with rawurldecode() reports more characters than the string really has!
Problems occur when I need to process the decoded string, for example if I want to replace the German characters ä,ü,ö with ae, ue and oe. This can be seen in the example provided here.
I have also made a PHP fiddle where this whole weirdness can be seen.
What I've tried so far:
- utf8_decode
- iconv
- and also the first two suggestions from here
This is a Unicode equivalence issue, and it looks like your IDE doesn't handle multibyte strings very well.
In Unicode you can represent Ü with either:
the single Unicode code point (U+00DC), which is %C3%9C in UTF-8
or a capital U (U+0055) followed by a combining modifier (U+0308), which is %55%CC%88 in UTF-8
Your JavaScript string uses the latter method, called NFD, while the one from PHP uses the former, called NFC. That's why your JavaScript string is 3 characters longer even though both are valid encodings of logically identical Unicode strings. Your problem is that they are not identical byte for byte in PHP.
More details about utf-8 normalisation.
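You can see the two representations from JavaScript itself (a small sketch; String.prototype.normalize needs a reasonably modern engine):
var nfc = '\u00DC';   // "Ü" as a single precomposed code point (NFC)
var nfd = 'U\u0308';  // "U" followed by a combining diaeresis (NFD)
console.log(nfc.length, nfd.length);        // 1 2 - same text, different lengths
console.log(nfc === nfd);                   // false - not identical code unit for code unit
console.log(nfc === nfd.normalize('NFC'));  // true - identical after normalization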
If you want to do preg replacements on the strings you need to normalise them to the same form first. From your example I can see your IDE is using NFC since it's the PHP string that works. So I suggest normalising to NFC form in PHP (the default), then doing the preg_replace.
http://php.net/manual/en/normalizer.normalize.php
function cleanImageName($name)
{
    // Normalize to NFC, then do the replacements (ä/ö/ü -> ae/oe/ue, per the question)
    $name = Normalizer::normalize( $name, Normalizer::FORM_C );
    $clean = preg_replace(array('/ä/u', '/ö/u', '/ü/u'),
                          array('ae', 'oe', 'ue'), $name);
    return $clean;
}
Otherwise you have to do something like this which is based on this article.
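You could also normalize on the JavaScript side before the value ever reaches PHP, so both ends agree on NFC (a sketch - the input id here is made up, and String.prototype.normalize needs a reasonably modern browser):
// Hypothetical form field: normalize to NFC before encoding and sending
var raw = document.getElementById('imageName').value;
var encoded = encodeURI(raw.normalize('NFC'));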
I need to do something like this:
Have a variable of some type.
Run in a loop and assign all the possible ASCII characters to this variable and print them, one by one.
Is something similar possible for UNICODE also?
I'm not sure how exactly you want to print, but this will console.log the printable ASCII characters:
for(var i=32;i<127;++i) console.log(String.fromCharCode(i));
You can document.write then if that's your intention. And if the environment is unicode, it should work for unicode as well, I believe.
Others have shown how to print the printable Ascii characters. It is possible to print all other Ascii characters, too, though they are control characters with system-dependent effect (often no effect). To create a string containing all Ascii characters, you could do this:
var s = '';
for (var i = 0; i <= 127; i++) s += String.fromCharCode(i);
Unicode is much more tricky, because the Unicode coding space, from 0 to 0x10FFFF, contains a large number of unassigned code points as well as code points designated as noncharacters. There are also Private Use code points, which may be used to denote characters by “private agreement” but have no generally assigned meaning. Moreover, many Unicode characters are nonspacing, i.e. meant to combine with the preceding character (e.g., turning “a” to “â”), so you can’t visually print them in a row. There is no simple way in JavaScript to determine, from an integer, the class of the corresponding code point – you might need to read the UnicodeData.txt file, parse it, and use the information there to classify code points.
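As a partial shortcut, engines that support Unicode property escapes in regular expressions (ES2018 and later, so by no means everywhere) can classify code points without parsing UnicodeData.txt yourself; a sketch:
// True for combining (nonspacing) marks such as U+0302, which turns "a" into "â"
function isCombiningMark(codePoint) {
    return /\p{M}/u.test(String.fromCodePoint(codePoint));
}
console.log(isCombiningMark(0x0302)); // true  (combining circumflex accent)
console.log(isCombiningMark(0x0061)); // false (plain "a")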
Finally, there is the programming issue that the JavaScript concept of character corresponds to a 16-bit code unit (not code point), and any Unicode code point larger than 0xFFFF needs to be represented using two code units (so-called surrogates). If you are using JavaScript in the context of an HTML document and you want to print characters in the HTML content, then the simplest way is to use character references like &#x10400; (which denotes the Unicode character at code point 10400 hexadecimal) and assign the string to the innerHTML property of an element.
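For example, the character mentioned above can be produced either through the character reference or by writing the surrogate pair yourself (the element id here is made up):
// Both approaches produce U+10400 (two UTF-16 code units: 0xD801 0xDC00)
document.getElementById('output').innerHTML = '&#x10400;';
var s = String.fromCharCode(0xD801, 0xDC00); // the same character, built from surrogates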
If you need to write ranges of Unicode characters, you might take a look at the Full Unicode Input utility that I recently wrote. Its source code illustrates some ways of dealing with Unicode characters in JavaScript.
Some of the ASCII characters are non-printable, but to get the characters from 32 (space) to 126 (~), for example, you would use:
var s = '';
for (var i = 32; i <= 126; i++) s += String.fromCharCode(i);
The Unicode character set has more than 110,000 different characters (see Unicode), but a normal font doesn't contain all of them, so you can't display them all anyway. You would have to specify which parts of the character space you are interested in.
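For example, to print just one block at a time, such as Hiragana (U+3041 to U+3096), you could do something along these lines:
var s = '';
for (var i = 0x3041; i <= 0x3096; i++) s += String.fromCharCode(i);
console.log(s);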
The JavaScript length property and substring function do not take non-ASCII characters into account.
I have a function that substrings the user's input to 400 characters if they enter more than 400 characters, e.g.:
function reduceInput(data) {
    var reducedSize = data;
    if (data.length > 400) {
        reducedSize = data.substring(0, 400);
    }
    return reducedSize;
}
However, if non-ASCII characters (double-byte characters) are entered, this does not work; it does not take the character type into account.
I have another function that loops round each character and, if it is non-ASCII, increments a counter and then works out the true count. It works, but it is a bit of a hack.
Is there a more efficient approach to doing this or is there no other alternative?
Thanks
The native character set of JavaScript and web browsers in general is UTF-16. Strings are sequences of UTF-16 code units. There is no concept of "double byte" character encodings.
If you want to calculate how many bytes a String will take up in a particular double-byte encoding, you will need to know what encoding it is and how to encode it yourself; that information will not be accessible to JavaScript natively. So for example with Shift_JIS you will have to know which characters are kana that can be encoded to a single byte, and which take part in double-byte kanji sequences.
There isn't any encoding that stores all the ASCII characters in one byte and all non-ASCII characters in two bytes, so whatever problem you are trying to solve by counting non-ASCII as two, the loop-and-add approach probably isn't the right answer.
In any case, the old-school double-byte encodings are a horrible anachronism to be avoided wherever possible. If you want a space-efficient byte encoding, you want UTF-8. It's easy to calculate the length of a string in UTF-8 bytes because JS has a sneaky built-in UTF-8 encoder you can leverage:
var byten= unescape(encodeURIComponent(chars)).length;
Snipping a string to 400 bytes is somewhat trickier because you want to avoid breaking a multi-byte sequence. You'll get an exception if you try to UTF-8-decode something with a broken sequence at the end, so catch it and try again:
var bytes = unescape(encodeURIComponent(chars)).slice(0, 400); // UTF-8 bytes, possibly cut mid-sequence
while (bytes.length > 0) {
    try {
        chars = decodeURIComponent(escape(bytes)); // only decodes if no sequence is broken
        break;
    } catch (e) {
        bytes = bytes.slice(0, -1); // drop the trailing byte and retry
    }
}
But it's unusual to want to limit input based on number of bytes it will take up in a particular encoding. Straight limit on number of characters is far more typical. What're you trying to do?
A regex can do the job, I think:
var data = /.{0,400}/.exec(originalData)[0];
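One caveat: . does not match line breaks by default, so for multi-line input a class like [\s\S] is safer (this is still a limit on characters, not on bytes):
var data = /[\s\S]{0,400}/.exec(originalData)[0];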
I'm using PHP's json_encode() to convert an array to JSON, which is then echoed and read by a JavaScript AJAX request.
The problem is that the echoed text contains Unicode escape sequences which the JavaScript JSON.parse() function doesn't seem to convert properly.
Example array value is "2\u00000\u00001\u00000\u0000-\u00001\u00000\u0000-\u00000\u00001" which is "2010-10-01".
JSON.parse() only gives me "2".
Anyone help me with this issue?
Example:
var resArray = JSON.parse(this.responseText);
for(var x=0; x < resArray.length; x++) {
var twt = resArray[x];
alert(twt.date);
break;
}
You have NUL characters (character code zero) in the string. It's actually "2_0_1_0_-_1_0_-_0_1", where _ represents the NUL characters.
The Unicode character escape is actually part of the JSON standard, so the parser should handle that correctly. However, the result is still a string with NUL characters in it, so when you try to use the string in Javascript the behaviour will depend on what the browser does with the NUL characters.
You can try this in some different browsers:
alert('as\u0000df');
Internet Explorer will display only "as" (it stops at the NUL character).
Firefox will display "asdf", but the NUL character doesn't display.
The best solution would be to remove the NUL characters before you convert the data to JSON.
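If you can't change the server side straight away, a client-side workaround is to strip the NULs after parsing; note this is only safe for plain-ASCII values such as these dates (see the next answer for why):
var resArray = JSON.parse(this.responseText);
var date = resArray[0].date.replace(/\u0000/g, ''); // "2010-10-01"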
To add to what Guffa said:
When you have alternating zero bytes, what has almost certainly happened is that you've read a UTF-16 data source without converting it to an ASCII-compatible encoding such as UTF-8. Whilst you can throw away the nulls, this will mangle the string if it contains any characters outside the ASCII range. (Not an issue for date strings of course, but it may affect any other strings you're reading from the same source.)
Check where your PHP code is reading the 2010-10-01 string from, and either convert it on the fly using e.g. iconv('utf-16le', 'utf-8', $string), or change the source to use a more reasonable encoding. If it's a text file, for example, save it in a text editor using ‘UTF-8 without BOM’, and not ‘Unicode’, which is a highly misleading name Windows text editors use to mean UTF-16LE.