Javascript unicode to ASCII

Javascript unicode to ASCII - javascript

Need to convert the unicode of the SOH value '\u0001' to Ascii. Why is this not working?
var soh = String.fromCharCode(01);
It returns '\u0001'
Or when I try
var soh = '\u0001'
It returns a smiley face.
How can I get the unicode to become the proper SOH value(a blank unprintable character)

JS has no ASCII strings, they're intrinsically UTF-16.
In a browser you're out of luck. If you're coding for node.js you're lucky!
You can use a buffer to transcode strings into octets and then manipulate the binary data at will. But you won't get necessarily a valid string back out of the buffer once you've messed with it.
Either way you'll have to read more about it here:
https://mathiasbynens.be/notes/javascript-encoding
or here:
https://nodejs.org/api/buffer.html
EDIT: in the comment you say you use node.js, so this is an excerpt from the second link above.
const buf5 = Buffer.from('test');
// Creates a Buffer containing ASCII bytes [74, 65, 73, 74].
To create the SOH character embedded in a common ASCII string use the common escape sequence\x01 like so:
const bufferWithSOH = Buffer.from("string with \x01 SOH", "ascii");
This should do it. You can then send the bufferWithSOH content to an output stream such as a network, console or file stream.
Node.js documentation will guide you on how to use strings in a Buffer pretty well, just look up the second link above.

To ascii would be would be an array of bytes: 0x00 0x01 You would need to extract the unicode code point after the \u and call parseInt, then extract the bytes from the Number. JavaScript might not be the best language for this.

Related

JavaScript not decoding parameter

So from the textarea I take the shortcode %91samurai id="19"%93 it should be [samurai id="19"]:
var not_decoded_content = jQuery('[data-module_type="et_pb_text_forms_00132547"]')
.find('#et_pb_et_pb_text_form_content').html();
But when I try to decode the %91 and %93
self.content = decodeURI(not_decoded_content);
I get the error:
Uncaught URIError: URI malformed
How can i solve this problem?

The encodings are invalid. If you can't fix the whatever-system-produces-them to correctly produce %5B and %5D, then your only option is to do a replacement yourself: replace all %91 with character 91 which is '[', then replace all %93 with character 93 which is ']'.
Note that javascript String Replace as-is won't do "Replace all occurrences". If you need that, then create a loop (while it contains(...) do a replace), or search the internet for javascript replace all, you should find plenty results.
And a final note, I am used to using decodeURIComponent(...). If you can make the whatever-system-produces-them to correctly produce %5B and %5D, and you still get that error, then try using decodeURIComponent(...) instead of decodeURI(...).

The string you're trying to decode is not a URI. Use decodeURIComponent() instead.
UPDATE
Hmm, that's not actually the issue, the issues are the %91 and %93.
encodeURI('[]')
gives %5b%5d, it looks like whatever has encoded this string has used the decimal rather than hexadecimal value.
Decimal 91 = hex 5b
Decimal 93 = hex 5d
Trying again with the hex values
decodeURI('%5bsamurai id="19"%5d') == '[samurai id="19"]'

I know this is not the solution you want to see, but can you try using "%E2%80%98" for %91 and "%E2%80%9C" for %93 ?
The %91 and %93 are part of control characters which html does not like to decode (for reasons beyond me). Simply put, they're not your ordinary ASCII characters for HTML to play around with.

Encode to alphanumeric in JavaScript

If I have a random string and want to encode it into another string that only contains alphanumeric characters, what is the most efficient way to do it in JavaScript / NodeJS?
Obviously it must be possible to convert the output string back to the original input string when needed.
Thanks!

To encode to an alphanumeric string you should use an alphanumeric encoding. Some popular ones include hexadecimal (base16), base32, base36, base58 and base62. The alternatives to hexadecimal are used because the larger alphabet results in a shorter encoded string. Here's some info:
Hexadecimal is popular because it is very common.
Base32 and Base36 are useful for case-insensitive encodings. Base32 is more human readable because it removes some easy-to-misread letters. Base32 is used in gaming and for license keys.
Base58 and Base62 are useful for case-sensitive encodings. Base58 is also designed to be more human readable by removing some easy-to-misread letters. Base58 is used by Flickr, Bitcoin and others.
In NodeJS hexadecimal encoding is natively supported, and can be done as follows:
// Encode
var hex = new Buffer(string).toString('hex');
// Decode
var string = new Buffer(hex, 'hex').toString();
It's important to note that there are different implementations of some of these. For example, Flickr and Bitcoin use different implementations of Base58.

Why not just store the 2 strings in different variables so no need to convert back?
To extract all alphanumerics you could use the regex function like so:
var str='dssldjf348902.-/dsfkjl';
var patt=/[^\w\d]*/g;
var str2 = str.replace(patt,'');
str2 becomes dssldjf348902dsfkjl

Adding UTF-8 BOM to string/Blob

I need to add a UTF-8 byte-order-mark to generated text data on client side. How do I do that?
Using new Blob(['\xEF\xBB\xBF' + content]) yields 'ï»¿"my data"', of course.
Neither did '\uBBEF\x22BF' work (with '\x22' == '"' being the next character in content).
Is it possible to prepend the UTF-8 BOM in JavaScript to a generated text?
Yes, I really do need the UTF-8 BOM in this case.

Prepend \ufeff to the string. See http://msdn.microsoft.com/en-us/library/ie/2yfce773(v=vs.94).aspx
See discussion between #jeff-fischer and #casey for details on UTF-8 and UTF-16 and the BOM. What actually makes the above work is that the string \ufeff is always used to represent the BOM, regardless of UTF-8 or UTF-16 being used.
See p.36 in The Unicode Standard 5.0, Chapter 2 for a detailed explanation. A quote from that page
The endian order entry for UTF-8 in Table 2-4 is marked N/A because
UTF-8 code units are 8 bits in size, and the usual machine issues of
endian order for larger code units do not apply. The serialized order
of the bytes must not depart from the order defined by the UTF- 8
encoding form. Use of a BOM is neither required nor recommended for
UTF-8, but may be encountered in contexts where UTF-8 data is
converted from other encoding forms that use a BOM or where the BOM is
used as a UTF-8 signature.

I had the same issue and this is the solution I came up with:
var blob = new Blob([
new Uint8Array([0xEF, 0xBB, 0xBF]), // UTF-8 BOM
"Text",
... // Remaining data
],
{ type: "text/plain;charset=utf-8" });
Using Uint8Array prevents the browser from converting those bytes into string (tested on Chrome and Firefox).
You should replace text/plain with your desired MIME type.

I'm editing my original answer. The above answer really demands elaboration as this is a convoluted solution by Node.js.
The short answer is, yes, this code works.
The long answer is, no, FEFF is not the byte order mark for utf-8. Apparently node took some sort of shortcut for writing encodings within files. FEFF is the UTF16 Little Endian encoding as can be seen within the Byte Order Mark wikipedia article and can also be viewed within a binary text editor after having written the file. I've verified this is the case.
http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
Apparently, Node.JS uses the \ufeff to signify any number of encoding. It takes the \ufeff marker and converts it into the correct byte order mark based on the 3rd options parameter of writeFile. The 3rd parameter you pass in the encoding string. Node.JS takes this encoding string and converts the \ufeff fixed byte encoding into any one of the actual encoding's byte order marks.
UTF-8 Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf8' }, function(err) {
/* The actual byte order mark written to the file is EF BB BF */
}
UTF-16 Little Endian Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf16le' }, function(err) {
/* The actual byte order mark written to the file is FF FE */
}
So, as you can see the \ufeff is simply a marker stating any number of resulting encodings. The actual encoding that makes it into the file is directly dependent the encoding option specified. The marker used within the string is really irrelevant to what gets written to the file.
I suspect that the reasoning behind this is because they chose not to write byte order marks and the 3 byte mark for UTF-8 isn't easily encoded into the javascript string to be written to disk. So, they used the UTF16LE BOM as a placeholder mark within the string which gets substituted at write-time.

This is my solution:
var blob = new Blob(["\uFEFF"+csv], {
type: 'text/csv; charset=utf-18'
});

This works for me:
let blob = new Blob(["\ufeff", csv], { type: 'text/csv;charset=utf-8' });
BOM (Byte Order Marker) might be necessary to use because some programs need it to use the correct character encoding.
Example:
When opening a csv file without a BOM in a system with a default character encoding of Shift_JIS instead of UTF-8 in MS Excel, it will open it in default encoding. This will result in garbage characters. If you specify the BOM for UTF-8, it will fix it.

This fixes it for me. was getting a BOM with authorize.net api and cloudflare workers:
const data = JSON.parse((await res.text()).trim());

Using Crockford's base 32 for IDs in URLs?

I'd like to write some IDs for use in URLs in Crockford's base32. I'm using the base32 npm module.
So, for example, if the user types in http://domain/page/4A2A I'd like it to map to the same underlying ID as http://domain/page/4a2a
This is because I want human-friendly URLs, where the user doesn't have to worry about the difference between upper- and lower-case letters, or between "l" and "1" - they just get the page they expect.
But I'm struggling to implement this, basically because I'm too dim to understand how encoding works. First I tried:
var encoded1 = base32.encode('4a2a');
var encoded2 = base32.encode('4A2A');
console.log(encoded1, encoded2);
But they map to different underlying IDs:
6hgk4r8 6h0k4g8
OK, so maybe I need to use decode?
var encoded1 = base32.decode('4a2a');
var encoded2 = base32.decode('4A2A');
console.log(encoded1, encoded2);
No, that just gives me empty strings:
" "
What am I doing wrong, and how can I get 4A2A and 4A2A to map to the same thing?

For an incoming request, you'll want to decode the URL fragment. When you create URLs, you will take your identifier and encode it. So, given a URL http://domain/page/dnwnyub46m50, you will take that fragment and decode it. Example:
#> echo 'dnwnyub46m50'| base32 -d
my_id5
The library you linked to is case-insensitive, so you get the same result this way:
echo 'DNWNYUB46M50'| base32 -d
my_id5
When dealing with any encoding scheme (Base-16/32/64), you have two basic operations: encode, which works on a raw stream of bits/bytes, and decode which takes an encoded set of bytes and returns the original bit/byte stream. The Wikipedia page on Base32 encoding is a great resource.
When you decode a string, you get raw bytes: it may be that those bytes are not compatible with ASCII, UTF-8, or some other encoding which you are trying to work with. This is why your decoded examples look like spaces: the tools you are using do not recognize the resulting bytes as valid characters.
How you go about encoding identifiers depends on how your identifiers are generated. You didn't say how you were generating the underlying identifiers, so I can't make any assumptions about how you should handle the raw bytes that come out of the decoder, nor about the content of the raw bytes being passed into the encoder.
It's also important to mention that the library you linked to is not compatible with Crockford's Base32 encoding. The library excludes I, L, O, S, while Crockford's encoding excludes I, L, O, U. This would be a problem if you were trying to interoperate with another system that used a different library. If no one besides you will ever need to decode your URL fragments, then interoperability doesn't matter.

The source of your confusion is that a base64 or base32 are methods of representing numbers- whereas you are attempting in your examples to encode or decode text strings.
Encoding and decoding text strings as base32 is done by first converting the string into a large number. In your first examples, where you are encoding "4a2a" and "4A2A", those are strings with two different numeric values, that consequently translate to encoded base32 numbers with two different values, 6hgk4r8 6h0k4g8
when you "decode" 4a2a and 4A2A you say you get empty strings. However this is not true, the strings are not empty, they contain what the decoded number looks like, when interpreted as a string. Which is to say, it looks like nothing because 4a2a produces an unprintable character. It's invisible. What you want is to feed the encoder numbers, not strings.

JavaScript has
parseInt(num, 32)
and
num.toString(32)
built in in a way that's compatible with Java and across JavaScript versions.

javascript json - problem decoding ajax json array from php

I'm using php's json_encode() to convert an array to json which then echo's it and is read from a javascript ajax request.
The problem is the echo'd text has unicode characters which the javascript json parse() function doesn't convert to.
Example array value is "2\u00000\u00001\u00000\u0000-\u00001\u00000\u0000-\u00000\u00001" which is "2010-10-01".
Json.parse() only gives me "2".
Anyone help me with this issue?
Example:
var resArray = JSON.parse(this.responseText);
for(var x=0; x < resArray.length; x++) {
var twt = resArray[x];
alert(twt.date);
break;
}

You have NUL characters (character code zero) in the string. It's actually "2_0_1_0_-_1_0_-_0_1", where _ represents the NUL characters.
The unicode character escape is actually part of the JSON standard, so the parser should handle that correctly. However, the result is still a string will NUL characters in it, so when you try to use the string in Javascript the behaviour will depend on what the browser does with the NUL characters.
You can try this in some different browsers:
alert('as\u0000df');
Internet Explorer will display only as
Firefox will display asdf but the NUL character doesn't display.
The best solution would be to remove the NUL characters before you convert the data to JSON.

To add to what Guffa said:
When you have alternating zero bytes, what has almost certainly happened is that you've read a UTF-16 data source without converting it to an ASCII-compatible encoding such as UTF-8. Whilst you can throw away the nulls, this will mangle the string if it contains any characters outside of ASCII range. (Not an issue for date strings of course, but it may affect any other strings you're reading from the same source.)
Check where your PHP code is reading the 2010-10-01 string from, and either convert it on the fly using eg iconv('utf-16le', 'utf-8', $string), or change the source to use a more reasonable encoding. If it's a text file, for example, save it in a text editor using ‘UTF-8 without BOM’, and not ‘Unicode’, which is a highly misleading name Windows text editors use to mean UTF-16LE.

We Keep Coding

JavaScript is the programming language of the Web.