Convert Unicode To ASCII in JavaScript

In C#, I have ö as ASCII character 148, but when I convert it like
char convstr = (char)148;
it returns \u0094, which is "”" in Unicode.
If I want the ASCII code back, I just use Strings.Asc("”"), which returns 148.
So how can I get from "”" to 148 in JavaScript?
I tried
"”".charCodeAt(0), but it returns 8221. I think that's the Unicode value.
And if you don't mind, please explain why char convstr = (char)148; returns \u0094 as well. I am also stuck there.
=================================

148 isn't the value for ö in ASCII (i.e. the 7-bit US-ASCII encoding, which only goes up to 127), nor in the commonly-called-ASCII codepages 1252 (Windows Latin 1) and ISO/IEC 8859-1. Codepage 1252 has ” in that position, while the ISO codepage has nothing there. That value is used for ö only in the old DOS codepages, 437 and 865.
Windows, .NET and C# strings are natively Unicode. This page proves it: Stack Overflow is an ASP.NET site. You can convert data in non-Unicode encodings easily, either through the Encoding class, or by specifying the encoding when loading data from streams with a StreamReader.
For example, this converts the byte value 148 to ö using the 437 codepage:
var result=Encoding.GetEncoding(437).GetString(new byte[]{148});
Debug.Assert(result=="ö");
While this returns ”:
var result=Encoding.GetEncoding(1252).GetString(new byte[]{148});
The StreamReader(string, Encoding) overload and its variants can load data from files using the specified encoding, e.g.:
using (var reader = new StreamReader(path, Encoding.GetEncoding(437)))
{
    var line = reader.ReadLine();
    ...
}
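To answer the JavaScript half of the question: JS has no built-in cp1252 encoder, so getting 148 back from "”" means carrying the mapping yourself. A minimal sketch, assuming the input is representable in cp1252; the table hand-codes the 0x80 to 0x9F range, which is the only place cp1252 differs from the first 256 Unicode code points, and the function name is made up:

```javascript
// Unicode code points for cp1252 bytes 0x80-0x9F (0x81, 0x8D, 0x8F, 0x90
// and 0x9D are unassigned in cp1252 and are left out here).
const cp1252High = {
  0x20AC: 0x80, 0x201A: 0x82, 0x0192: 0x83, 0x201E: 0x84, 0x2026: 0x85,
  0x2020: 0x86, 0x2021: 0x87, 0x02C6: 0x88, 0x2030: 0x89, 0x0160: 0x8A,
  0x2039: 0x8B, 0x0152: 0x8C, 0x017D: 0x8E, 0x2018: 0x91, 0x2019: 0x92,
  0x201C: 0x93, 0x201D: 0x94, 0x2022: 0x95, 0x2013: 0x96, 0x2014: 0x97,
  0x02DC: 0x98, 0x2122: 0x99, 0x0161: 0x9A, 0x203A: 0x9B, 0x0153: 0x9C,
  0x017E: 0x9E, 0x0178: 0x9F
};

function toCp1252Byte(ch) {
  const cp = ch.charCodeAt(0);
  // 0x00-0x7F and 0xA0-0xFF are identical in cp1252 and Unicode.
  if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF)) return cp;
  const mapped = cp1252High[cp];
  if (mapped === undefined) throw new Error("not representable in cp1252");
  return mapped;
}

toCp1252Byte("\u201D"); // 148, the Strings.Asc value the question is after
```

This is essentially what Strings.Asc does for you on the C# side when the system codepage is 1252.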

Related

Javascript unicode to ASCII

I need to convert the Unicode of the SOH value '\u0001' to ASCII. Why is this not working?
var soh = String.fromCharCode(01);
It returns '\u0001'
Or when I try
var soh = '\u0001'
It returns a smiley face.
How can I get the Unicode to become the proper SOH value (a blank, unprintable character)?
JS has no ASCII strings, they're intrinsically UTF-16.
In a browser you're out of luck. If you're coding for node.js you're lucky!
You can use a buffer to transcode strings into octets and then manipulate the binary data at will. But you won't get necessarily a valid string back out of the buffer once you've messed with it.
Either way you'll have to read more about it here:
https://mathiasbynens.be/notes/javascript-encoding
or here:
https://nodejs.org/api/buffer.html
EDIT: in the comments you say you use node.js, so this is an excerpt from the second link above.
const buf5 = Buffer.from('test');
// Creates a Buffer containing the bytes [0x74, 0x65, 0x73, 0x74] (ASCII 'test').
To create the SOH character embedded in a common ASCII string, use the escape sequence \x01, like so:
const bufferWithSOH = Buffer.from("string with \x01 SOH", "ascii");
This should do it. You can then send the bufferWithSOH content to an output stream such as a network, console or file stream.
Node.js documentation will guide you on how to use strings in a Buffer pretty well, just look up the second link above.
In ASCII this would be an array of bytes: 0x00 0x01. You would need to extract the Unicode code point after the \u and call parseInt, then extract the bytes from the Number. JavaScript might not be the best language for this.
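A sketch of that last suggestion, assuming the input is the literal six-character text \u0001 (backslash, u, four hex digits) rather than an actual escape; the helper name is invented for illustration:

```javascript
// Turn a literal "\uXXXX" escape written out as text into an array of bytes.
function escapeToBytes(escaped) {
  const hex = escaped.slice(2);        // drop the leading backslash-u
  const codePoint = parseInt(hex, 16); // e.g. 1 for SOH
  // For code points up to 0xFFFF, split into two bytes, big-endian.
  return [(codePoint >> 8) & 0xFF, codePoint & 0xFF];
}

escapeToBytes("\\u0001"); // [0, 1], i.e. 0x00 0x01 as described above
```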

How to transcode six digits emoji code to Unicode by Javascript? (`328054` to `\ue052`)

There are some texts that contain six-digit emoji codes. I need to transcode them to Unicode with JavaScript.
Just like this:
origin: 328054
Unicode: \ue052 (U+E052, the 'dog face' emoji)
How can I transcode this six-digit emoji code to Unicode with JavaScript?
origin: 328054
I have no idea what you mean. If treated as decimal, it is U+50176, which is not a valid Unicode character. If treated as hexadecimal, it lies outside the range of code points Unicode can represent.
Unicode: \ue052 ( U+E052 )
U+E052 is reserved for private use. You don't mean that one. It seems to have been used by SoftBank to encode the Dog Face emoji. Unless you live in Japan, and use their network, it hardly will work for you.
'the dog face' Emoji
is assigned U+1F436: 🐶.
How can I encode this in Javascript?
JavaScript uses UTF-16, and since your code point is higher than U+FFFF, you will need two characters to encode it as a surrogate pair. You can still easily get the string from the code point by using String.fromCodePoint:
var df = String.fromCodePoint(0x1F436);
df.length; // 2
You can get the character codes that you need for escaping from that string using the charCodeAt method:
String.fromCodePoint(0x1F436).charCodeAt(0).toString(16) // d83d
String.fromCodePoint(0x1F436).charCodeAt(1).toString(16) // dc36
So the JS string literal you seem to be after is "\ud83d\udc36".
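The two surrogate values can also be computed by hand from the code point, which shows where d83d and dc36 come from:

```javascript
// Split an astral code point (> 0xFFFF) into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;   // 20-bit value, 0xF436 here
  const high = 0xD800 + (offset >> 10); // top 10 bits -> lead surrogate
  const low = 0xDC00 + (offset & 0x3FF); // low 10 bits -> trail surrogate
  return [high, low];
}

const [hi, lo] = toSurrogatePair(0x1F436);
hi.toString(16); // "d83d"
lo.toString(16); // "dc36"
String.fromCharCode(hi, lo) === String.fromCodePoint(0x1F436); // true
```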

Adding UTF-8 BOM to string/Blob

I need to add a UTF-8 byte-order-mark to generated text data on client side. How do I do that?
Using new Blob(['\xEF\xBB\xBF' + content]) yields '"my data"', of course.
Neither did '\uBBEF\x22BF' work (with '\x22' == '"' being the next character in content).
Is it possible to prepend the UTF-8 BOM in JavaScript to a generated text?
Yes, I really do need the UTF-8 BOM in this case.
Prepend \ufeff to the string. See http://msdn.microsoft.com/en-us/library/ie/2yfce773(v=vs.94).aspx
See discussion between #jeff-fischer and #casey for details on UTF-8 and UTF-16 and the BOM. What actually makes the above work is that the string \ufeff is always used to represent the BOM, regardless of UTF-8 or UTF-16 being used.
See p.36 in The Unicode Standard 5.0, Chapter 2 for a detailed explanation. A quote from that page
The endian order entry for UTF-8 in Table 2-4 is marked N/A because
UTF-8 code units are 8 bits in size, and the usual machine issues of
endian order for larger code units do not apply. The serialized order
of the bytes must not depart from the order defined by the UTF-8
encoding form. Use of a BOM is neither required nor recommended for
UTF-8, but may be encountered in contexts where UTF-8 data is
converted from other encoding forms that use a BOM or where the BOM is
used as a UTF-8 signature.
I had the same issue and this is the solution I came up with:
var blob = new Blob([
    new Uint8Array([0xEF, 0xBB, 0xBF]), // UTF-8 BOM
    "Text",
    // ... remaining data
], { type: "text/plain;charset=utf-8" });
Using Uint8Array prevents the browser from converting those bytes into string (tested on Chrome and Firefox).
You should replace text/plain with your desired MIME type.
I'm editing my original answer. The above answer really demands elaboration as this is a convoluted solution by Node.js.
The short answer is, yes, this code works.
The long answer is, no, FEFF is not the byte order mark for utf-8. Apparently node took some sort of shortcut for writing encodings within files. FEFF is the UTF16 Little Endian encoding as can be seen within the Byte Order Mark wikipedia article and can also be viewed within a binary text editor after having written the file. I've verified this is the case.
http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
Apparently, Node.js uses \ufeff to signify any number of encodings. It takes the \ufeff marker and converts it into the correct byte order mark based on the third options parameter of writeFile, in which you pass the encoding string. Node.js takes that encoding string and converts the fixed \ufeff character into the actual byte order mark for that encoding.
UTF-8 Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf8' }, function(err) {
    /* The actual byte order mark written to the file is EF BB BF */
});
UTF-16 Little Endian Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf16le' }, function(err) {
    /* The actual byte order mark written to the file is FF FE */
});
So, as you can see, \ufeff is simply a marker standing in for any number of resulting encodings. The actual encoding that makes it into the file depends directly on the encoding option specified. The marker used within the string is really irrelevant to what gets written to the file.
I suspect that the reasoning behind this is because they chose not to write byte order marks and the 3 byte mark for UTF-8 isn't easily encoded into the javascript string to be written to disk. So, they used the UTF16LE BOM as a placeholder mark within the string which gets substituted at write-time.
This is my solution:
var blob = new Blob(["\uFEFF" + csv], {
    type: 'text/csv; charset=utf-8'
});
This works for me:
let blob = new Blob(["\ufeff", csv], { type: 'text/csv;charset=utf-8' });
A BOM (byte order mark) might be necessary because some programs need it to detect the correct character encoding.
Example:
When opening a CSV file without a BOM in MS Excel on a system whose default character encoding is Shift_JIS rather than UTF-8, Excel will open it in the default encoding, resulting in garbage characters. Specifying the UTF-8 BOM fixes this.
This fixed it for me. I was getting a BOM with the authorize.net API and Cloudflare Workers:
const data = JSON.parse((await res.text()).trim());

How to get the length of Japanese characters in Javascript?

I have an ASP Classic page with SHIFT_JIS charset. The meta tag under the page's head section is like this:
<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">
My page has a text box (txtName) that should only allow 200 characters. I have a Javascript function that validates the character length, which is called on the onclick() event of my Submit button.
if (document.frmPage.txtName.value.length > 200) {
    alert("You have exceeded the maximum length of 200.");
    return false;
}
The problem is, Javascript is not getting the correct length of Japanese character encoded in SHIFT_JIS. For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding that Javascript uses by default. Some characters like ケ have 2 or 3 characters when in SHIFT_JIS.
If I will only depend on the length provided by Javascript, long Japanese characters would pass the page validation and it will try to save on the database, which will then fail because of the 200 maximum length of the DB column.
The browser that I'm using is Internet Explorer. Is there a way to get the SHIFT_JIS length of the Japanese character using Javascript? Is it possible to convert from Unicode to SHIFT_JIS using Javascript? How?
Thanks for the help!
For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding
Let's be clear: 测, U+6D4B (Han Character 'measure, estimate, conjecture') is a single character. When you encode it to a particular encoding like Shift-JIS, it may very well become multiple bytes.
In general JavaScript doesn't make encoding tables available so you can't find out how many bytes a character will take up. If you really need to, you have to carry around enough data to work it out yourself. For example, if you assume that the input contains only characters that are valid in Shift-JIS, this function would work out how many bytes are needed by keeping a list of all the characters that are a single byte, and assuming every other character takes two bytes:
function getShiftJISByteLength(s) {
return s.replace(/[^\x00-\x80。「」、・ヲァィゥェォャュョッーアイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワン ゙ ゚]/g, 'xx').length;
}
However, there are no 8-byte sequences in Shift-JIS, and the character 测 is not available in Shift-JIS at all. (It's a Chinese character not used in Japan.)
Why you might be thinking it constitutes an 8-byte sequence is this: when a browser can't submit a character in a form, because it does not exist in the target charset, it replaces it with an HTML character reference: in this case &#27979;. This is a lossy mangling: you can't tell whether the user typed literally 测 or &#27979;. And if you are displaying the submitted content &#27979; as 测, then that means you are forgetting to HTML-encode your output, which probably means your application is highly vulnerable to cross-site scripting.
The only sensible answer is to use UTF-8 instead of Shift-JIS. UTF-8 can happily encode 测, or any other character, without having to resort to broken HTML character references. If you need to store content limited by encoded byte length in your database, there is a sneaky hack you can use to get the number of UTF-8 bytes in a string:
function getUTF8ByteLength(s) {
    return unescape(encodeURIComponent(s)).length;
}
although probably it would be better to store native Unicode strings in the database so that the length limit refers to actual characters and not bytes in some encoding.
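In modern environments the same byte count can be obtained without the unescape trick by using TextEncoder, which postdates the answer above; treat this as an alternative sketch rather than a drop-in for old browsers:

```javascript
// UTF-8 byte length using the standard TextEncoder API
// (available in current browsers and in Node.js as a global).
function getUTF8ByteLength(s) {
  return new TextEncoder().encode(s).length;
}

getUTF8ByteLength("abc");    // 3
getUTF8ByteLength("\u6D4B"); // 3, one character but three UTF-8 bytes
```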
You are getting confused between characters and bytes. 测 is ONE character, however you look at it. In UTF-16 (which is what JavaScript uses), it's two BYTES. In Shift_JIS, 8 bytes, apparently. But in both cases, it's ONE character. So what you are trying to do is limit the text length to 200 BYTES. Since JavaScript uses UTF-16 (UCS-2, really) you can get its byte length by multiplying the string length by 2, but that doesn't help you with Shift_JIS. Then again, you should probably consider switching to Unicode anyway, if you're working with JavaScript...

Adding '.' using jQuery

The JavaScript length property and substring function do not take non-ASCII characters into account.
I have a function that truncates the user's input to 400 characters if they enter more than 400 characters.
e.g
function reduceInput(data) {
    if (data.length > 400) {
        return data.substring(0, 400);
    }
    return data;
}
However, if non-ASCII (double-byte) characters are entered, this does not work, because it does not take the character types into account.
I have another function that loops round each character and, if it is non-ASCII, increments a counter and then works out the true count. It works, but it is a bit of a hack.
Is there a more efficient approach to doing this or is there no other alternative?
Thanks
The native character set of JavaScript and web browsers in general is UTF-16. Strings are sequences of UTF-16 code units. There is no concept of "double byte" character encodings.
If you want to calculate how many bytes a String will take up in a particular double-byte encoding, you will need to know what encoding it is and how to encode it yourself; that information will not be accessible to JavaScript natively. So for example with Shift_JIS you will have to know which characters are kana that can be encoded to a single byte, and which take part in double-byte kanji sequences.
There is not any encoding that stores all code units that represent ASCII in one byte and all code units other than ASCII in two bytes, so whatever question you are trying to solve by counting non-ASCII as two, the loop-and-add probably isn't the right answer.
In any case, the old-school double-byte encodings are a horrible anachronism to be avoided wherever possible. If you want a space-efficient byte encoding, you want UTF-8. It's easy to calculate the length of a string in UTF-8 bytes because JS has a sneaky built-in UTF-8 encoder you can leverage:
var byten= unescape(encodeURIComponent(chars)).length;
Snipping a string to 400 bytes is somewhat trickier because you want to avoid breaking a multi-byte sequence. You'll get an exception if you try to UTF-8-decode something with a broken sequence at the end, so catch it and try again:
var bytes = unescape(encodeURIComponent(chars)).slice(0, 400);
while (bytes.length > 0) {
    try {
        chars = decodeURIComponent(escape(bytes));
        break;
    } catch (e) {
        bytes = bytes.slice(0, -1);
    }
}
But it's unusual to want to limit input based on number of bytes it will take up in a particular encoding. Straight limit on number of characters is far more typical. What're you trying to do?
A regex can do the job, I think:
var data = /.{0,400}/.exec(originalData)[0];
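One caveat that applies to both the substring and the regex approaches: they count UTF-16 code units, so a cut at 400 can land in the middle of a surrogate pair. A sketch that backs off one unit in that case (the function name is made up):

```javascript
// Truncate to at most `max` UTF-16 code units without splitting a surrogate pair.
function safeTruncate(s, max) {
  if (s.length <= max) return s;
  var cut = max;
  var code = s.charCodeAt(cut - 1);
  // A lead surrogate (0xD800-0xDBFF) at the cut point means we would
  // orphan half of an astral character; drop it too.
  if (code >= 0xD800 && code <= 0xDBFF) cut -= 1;
  return s.substring(0, cut);
}

safeTruncate("ab\uD83D\uDC36", 3); // "ab", avoids a lone lead surrogate
```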
