Adding '.' using jQuery - javascript

The JavaScript length property and substring function do not take non-ASCII characters into account.
I have a function that truncates the user's input to 400 characters if they enter more than 400 characters, e.g.:
function reduceInput(data) {
    if (data.length > 400) {
        return data.substring(0, 400);
    }
    return data;
}
However, if non-ASCII (double-byte) characters are entered, this does not work, because it does not take the character types into account.
I have another function that loops over each character and, if it is non-ASCII, increments a counter, then works out the true count from that. It works, but it is a bit of a hack.
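Roughly, the hack looks like this (a simplified sketch; I treat any code unit above 0x7F as non-ASCII and count it as two):
function reduceInputCounting(data) {
    var count = 0;
    for (var i = 0; i < data.length; i++) {
        // Non-ASCII code units count double against the 400 budget.
        count += data.charCodeAt(i) > 0x7F ? 2 : 1;
        if (count > 400) {
            return data.substring(0, i);
        }
    }
    return data;
}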
Is there a more efficient approach to doing this or is there no other alternative?
Thanks

The native character set of JavaScript and web browsers in general is UTF-16. Strings are sequences of UTF-16 code units. There is no concept of "double byte" character encodings.
If you want to calculate how many bytes a String will take up in a particular double-byte encoding, you will need to know what encoding it is and how to encode it yourself; that information will not be accessible to JavaScript natively. So for example with Shift_JIS you will have to know which characters are kana that can be encoded to a single byte, and which take part in double-byte kanji sequences.
There is no encoding that stores all ASCII code units in one byte and all non-ASCII code units in two bytes, so whatever problem you are trying to solve by counting non-ASCII as two, the loop-and-add approach probably isn't the right answer.
In any case, the old-school double-byte encodings are a horrible anachronism to be avoided wherever possible. If you want a space-efficient byte encoding, you want UTF-8. It's easy to calculate the length of a string in UTF-8 bytes because JS has a sneaky built-in UTF-8 encoder you can leverage:
var byten = unescape(encodeURIComponent(chars)).length; // number of UTF-8 bytes
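If the environment provides the TextEncoder API (newer browsers and Node do; treat this as an assumption about your environment), the same count can be had without the escape trick:
var byten = new TextEncoder().encode(chars).length; // also the UTF-8 byte count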
Snipping a string to 400 bytes is somewhat trickier because you want to avoid breaking a multi-byte sequence. You'll get an exception if you try to UTF-8-decode something with a broken sequence at the end, so catch it and try again:
var bytes = unescape(encodeURIComponent(chars)).slice(0, 400);
while (bytes.length > 0) {
    try {
        chars = decodeURIComponent(escape(bytes));
        break;
    } catch (e) {
        // The slice ended mid-sequence; drop the last byte and retry.
        bytes = bytes.slice(0, -1);
    }
}
But it's unusual to want to limit input based on the number of bytes it will take up in a particular encoding. A straight limit on the number of characters is far more typical. What are you trying to do?

A regex can do the job, I think:
var data = /.{0,400}/.exec(originalData)[0];

Related

Javascript unicode to ASCII

I need to convert the Unicode of the SOH value '\u0001' to ASCII. Why is this not working?
var soh = String.fromCharCode(01);
It returns '\u0001'
Or when I try
var soh = '\u0001'
It returns a smiley face.
How can I get the unicode to become the proper SOH value(a blank unprintable character)
JS has no ASCII strings, they're intrinsically UTF-16.
In a browser you're out of luck. If you're coding for node.js you're lucky!
You can use a buffer to transcode strings into octets and then manipulate the binary data at will. But you won't necessarily get a valid string back out of the buffer once you've messed with it.
Either way you'll have to read more about it here:
https://mathiasbynens.be/notes/javascript-encoding
or here:
https://nodejs.org/api/buffer.html
EDIT: in the comment you say you use node.js, so this is an excerpt from the second link above.
const buf5 = Buffer.from('test');
// Creates a Buffer containing the ASCII bytes [0x74, 0x65, 0x73, 0x74].
To create the SOH character embedded in a common ASCII string, use the common escape sequence \x01 like so:
const bufferWithSOH = Buffer.from("string with \x01 SOH", "ascii");
This should do it. You can then send the bufferWithSOH content to an output stream such as a network, console or file stream.
Node.js documentation will guide you on how to use strings in a Buffer pretty well, just look up the second link above.
To get ASCII you would need an array of bytes: 0x00 0x01. You would need to extract the Unicode code point after the \u, call parseInt on it, then extract the bytes from the Number. JavaScript might not be the best language for this.
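A rough sketch of that approach, assuming the input is the literal six-character text \u0001 rather than the escape sequence itself:
var escapeText = '\\u0001';                              // the characters \u0001
var codePoint = parseInt(escapeText.slice(2), 16);       // 1
var bytes = [(codePoint >> 8) & 0xFF, codePoint & 0xFF]; // [0x00, 0x01]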

How to map binary data to characters in javascript?

For the API I'm working on, I have to map bytes to characters by these rules:
Bytes 0x20 to 0x24 inclusive are mapped to the corresponding characters U+0020 to U+0024 inclusive.
Bytes 0x26 to 0x7E inclusive are mapped to the corresponding characters U+0026 to U+007E inclusive.
Other bytes are mapped to the three character sequence consisting of a percent character followed by two uppercase hex characters, e.g. byte 0 maps to “%00” and byte 0xAB maps to “%AB”.
That's for encoding; I have to make a function for decoding also.
Is this maybe some existing encoding? I googled it, but couldn't find anything.
I know U+ is for Unicode.
Can I just map it like:
if (bytesArray[i] == 0x21)
{
    bytesArray[i] = U+0021;
}
?
Since you are translating values in a string, you can use String.replace. The only thing that actually needs replacing is the %XX notation; other characters can stay unchanged.
In this case you can pass a user-defined function to replace:
decodedArray = byteArray.replace(/%([0-9A-F][0-9A-F])/g,
    function (match, p1) {
        return String.fromCharCode(parseInt(p1, 16));
    });
This is a safe operation because per the specification, a literal % character may not appear in the encoded string (in case you're wondering: that is the code 0x25 that gets singled out in the otherwise printable plain ASCII range). Even if a stray % appears in the string, it will only be replaced if actually followed by two valid capitalized hex bytes.
Encoding is similarly straightforward:
encodedArray = originalString.replace(/[^ -$&-~]/g,
    function (match) {
        if (match.charCodeAt(0) < 16)
            return '%0' + match.charCodeAt(0).toString(16).toUpperCase();
        return '%' + match.charCodeAt(0).toString(16).toUpperCase();
    });
Note the 'match' regex is the exact ASCII representation of your to-be-translated code list; it is negated because you want to find all characters outside that range.
This works better than a mapping, because that would require you to manually loop over each character and inspect it to decide whether or not it needs translating. This would work reasonably well for encoding (you'd need a list of all codes, with either a one- or three-character string to which each encodes) but for decoding you would need to look 2 bytes ahead, test if they are valid (!), and skip over them if so. Compared to that, the replace solution ought to be much faster.
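For what it's worth, a quick round trip of the two snippets above on a made-up input behaves as expected:
var original = 'abc%\u00FF!';  // contains % (0x25) and a value outside the printable range
var encoded = original.replace(/[^ -$&-~]/g, function (match) {
    if (match.charCodeAt(0) < 16)
        return '%0' + match.charCodeAt(0).toString(16).toUpperCase();
    return '%' + match.charCodeAt(0).toString(16).toUpperCase();
});  // 'abc%25%FF!'
var decoded = encoded.replace(/%([0-9A-F][0-9A-F])/g, function (match, p1) {
    return String.fromCharCode(parseInt(p1, 16));
});  // back to 'abc%\u00FF!'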

Encode to alphanumeric in JavaScript

If I have a random string and want to encode it into another string that only contains alphanumeric characters, what is the most efficient way to do it in JavaScript / NodeJS?
Obviously it must be possible to convert the output string back to the original input string when needed.
Thanks!
To encode to an alphanumeric string you should use an alphanumeric encoding. Some popular ones include hexadecimal (base16), base32, base36, base58 and base62. The alternatives to hexadecimal are used because the larger alphabet results in a shorter encoded string. Here's some info:
Hexadecimal is popular because it is very common.
Base32 and Base36 are useful for case-insensitive encodings. Base32 is more human readable because it removes some easy-to-misread letters. Base32 is used in gaming and for license keys.
Base58 and Base62 are useful for case-sensitive encodings. Base58 is also designed to be more human readable by removing some easy-to-misread letters. Base58 is used by Flickr, Bitcoin and others.
In NodeJS hexadecimal encoding is natively supported, and can be done as follows:
// Encode
var hex = Buffer.from(string).toString('hex');
// Decode
var string = Buffer.from(hex, 'hex').toString();
It's important to note that there are different implementations of some of these. For example, Flickr and Bitcoin use different implementations of Base58.
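As a rough illustration only (not the Flickr or Bitcoin flavour), a naive Base62 round trip can be built in NodeJS on top of Buffer and BigInt; note this simple version does not preserve leading zero bytes:
var ALPHABET = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
function toBase62(str) {
    // Interpret the string's bytes as one big number, then emit base-62 digits.
    var hex = Buffer.from(str).toString('hex');
    var n = BigInt('0x' + (hex || '0'));
    var out = '';
    while (n > 0n) {
        out = ALPHABET[Number(n % 62n)] + out;
        n /= 62n;
    }
    return out || '0';
}
function fromBase62(s) {
    // Reverse: accumulate the number, then rebuild the bytes via hex.
    var n = 0n;
    for (var c of s) n = n * 62n + BigInt(ALPHABET.indexOf(c));
    var hex = n.toString(16);
    if (hex.length % 2) hex = '0' + hex;
    return Buffer.from(hex, 'hex').toString();
}
// fromBase62(toBase62('hello')) === 'hello'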
Why not just store the two strings in different variables, so there's no need to convert back?
To extract all alphanumerics you could use a regex like so:
var str = 'dssldjf348902.-/dsfkjl';
var patt = /[^\w\d]*/g;
var str2 = str.replace(patt, '');
str2 becomes dssldjf348902dsfkjl

Printing all ASCII characters in Javascript

I need to do something like this:
Have a variable of some type.
Run in a loop and assign all the possible ASCII characters to this variable and print them, one by one.
Is something similar possible for UNICODE also?
I'm not sure how exactly you want to print, but this will console.log the printable ASCII characters:
for (var i = 32; i < 127; ++i) console.log(String.fromCharCode(i));
You can document.write them if that's your intention. And if the environment handles Unicode, it should work for Unicode as well, I believe.
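For example, to log an arbitrary Unicode range of my choosing, say the Greek lowercase letters:
for (var i = 0x3B1; i <= 0x3C9; ++i) console.log(String.fromCharCode(i)); // α through ω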
Others have shown how to print the printable Ascii characters. It is possible to print all other Ascii characters, too, though they are control characters with system-dependent effect (often no effect). To create a string containing all Ascii characters, you could do this:
var s = '';
for (var i = 0; i <= 127; i++) s += String.fromCharCode(i);
Unicode is much more tricky, because the Unicode coding space, from 0 to 0x10FFFF, contains a large number of unassigned code points as well as code points designated as noncharacters. There are also Private Use code points, which may be used to denote characters by “private agreement” but have no generally assigned meaning. Moreover, many Unicode characters are nonspacing, i.e. meant to combine with the preceding character (e.g., turning “a” to “â”), so you can’t visually print them in a row. There is no simple way in JavaScript to determine, from an integer, the class of the corresponding code point; you might need to read the UnicodeData.txt file, parse it, and use the information there to classify code points.
Finally, there is the programming issue that the JavaScript concept of character corresponds to a 16-bit code unit (not code point), and any Unicode code point larger than 0xFFFF needs to be represented using two code units (so-called surrogates). If you are using JavaScript in the context of an HTML document and you want to print characters in the HTML content, then the simplest way is to use character references like &#x10400; (which denotes the Unicode character at code point 10400 hexadecimal) and assign the string to the innerHTML property of an element.
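In raw JavaScript, a sketch of the standard surrogate math for turning a code point above 0xFFFF into a string:
function codePointToString(cp) {
    if (cp <= 0xFFFF) return String.fromCharCode(cp);
    cp -= 0x10000;
    // High surrogate in D800-DBFF, low surrogate in DC00-DFFF
    return String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
}
// codePointToString(0x10400) === '\uD801\uDC00'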
If you need to write ranges of Unicode characters, you might take a look at the Full Unicode Input utility that I recently wrote. Its source code illustrates some ways of dealing with Unicode characters in JavaScript.
Some of the ASCII characters are non-printable, but to get the characters from 32 (space) to 126 (~), you would use:
var s = '';
for (var i = 32; i <= 126; i++) s += String.fromCharCode(i);
The Unicode character set has more than 110,000 different characters (see Unicode), but a normal font doesn't contain all of them, so you can't display them all anyway. You would have to specify which parts of the character space you are interested in.

How to get the length of Japanese characters in Javascript?

I have an ASP Classic page with SHIFT_JIS charset. The meta tag under the page's head section is like this:
<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">
My page has a text box (txtName) that should only allow 200 characters. I have a Javascript function that validates the character length, which is called on the onclick() event of my Submit button.
if (document.frmPage.txtName.value.length > 200) {
    alert("You have exceeded the maximum length of 200.");
    return false;
}
The problem is, Javascript is not getting the correct length of Japanese character encoded in SHIFT_JIS. For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding that Javascript uses by default. Some characters like ケ have 2 or 3 characters when in SHIFT_JIS.
If I only depend on the length provided by Javascript, overlong Japanese input will pass the page validation, and the save to the database will then fail because of the 200-character maximum length of the DB column.
The browser that I'm using is Internet Explorer. Is there a way to get the SHIFT_JIS length of the Japanese character using Javascript? Is it possible to convert from Unicode to SHIFT_JIS using Javascript? How?
Thanks for the help!
For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding
Let's be clear: 测, U+6D4B (Han Character 'measure, estimate, conjecture') is a single character. When you encode it to a particular encoding like Shift-JIS, it may very well become multiple bytes.
In general JavaScript doesn't make encoding tables available so you can't find out how many bytes a character will take up. If you really need to, you have to carry around enough data to work it out yourself. For example, if you assume that the input contains only characters that are valid in Shift-JIS, this function would work out how many bytes are needed by keeping a list of all the characters that are a single byte, and assuming every other character takes two bytes:
function getShiftJISByteLength(s) {
    return s.replace(/[^\x00-\x80｡｢｣､･ｦｧｨｩｪｫｬｭｮｯｰｱｲｳｴｵｶｷｸｹｺｻｼｽｾｿﾀﾁﾂﾃﾄﾅﾆﾇﾈﾉﾊﾋﾌﾍﾎﾏﾐﾑﾒﾓﾔﾕﾖﾗﾘﾙﾚﾛﾜﾝﾞﾟ]/g, 'xx').length;
}
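For instance, I would expect values like these:
getShiftJISByteLength('abc'); // 3
getShiftJISByteLength('ｱ');   // 1: half-width katakana is a single byte
getShiftJISByteLength('ア');  // 2: full-width katakana takes two bytes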
However, there are no 8-byte sequences in Shift-JIS, and the character 测 is not available in Shift-JIS at all. (It's a Chinese character not used in Japan.)
Why you might be thinking it constitutes an 8-byte sequence is this: when a browser can't submit a character in a form, because it does not exist in the target charset, it replaces it with an HTML character reference: in this case &#27979;. This is a lossy mangling: you can't tell whether the user typed literally &#27979; or 测. And if you are displaying the submitted content &#27979; as 测 then that means you are forgetting to HTML-encode your output, which probably means your application is highly vulnerable to cross-site scripting.
The only sensible answer is to use UTF-8 instead of Shift-JIS. UTF-8 can happily encode 测, or any other character, without having to resort to broken HTML character references. If you need to store content limited by encoded byte length in your database, there is a sneaky hack you can use to get the number of UTF-8 bytes in a string:
function getUTF8ByteLength(s) {
    return unescape(encodeURIComponent(s)).length;
}
although probably it would be better to store native Unicode strings in the database so that the length limit refers to actual characters and not bytes in some encoding.
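For example, I would expect:
getUTF8ByteLength('abc'); // 3
getUTF8ByteLength('ケ');  // 3: U+30B1 takes three bytes in UTF-8
getUTF8ByteLength('测');  // 3: U+6D4B likewise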
You are getting confused between characters and bytes. 测 is ONE character, however you look at it. In UTF-16 (which is what Javascript uses), it's two BYTES. In Shift_JIS, 8 bytes, apparently. But in both cases, it's ONE character. So what you are trying to do is limit the text length to 200 BYTES. Since Javascript is using UTF-16 (UCS-2, really) you can get its byte length by multiplying the string length by 2, but that doesn't help you with Shift_JIS. Then again, you should probably consider switching to Unicode anyway, if you're working with Javascript...
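That is, the UTF-16 byte count alone is just:
var utf16Bytes = document.frmPage.txtName.value.length * 2; // two bytes per UTF-16 code unit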
