What kind of binary data cannot be stringified to JSON?


I've read everywhere that JSON cannot encode binary data, so I wrote this simple test to check if that's actually true.
function test(elem) {
  let reader = new FileReader();
  reader.onload = () => {
    let json = JSON.stringify(reader.result);
    let isCorrect = JSON.parse(json) === reader.result;
    alert('JSON stringification correct: ' + isCorrect);
  };
  reader.readAsBinaryString(elem.files[0]);
}
Choose a binary file: <br>
<input type=file onchange="test(this)">
You have to choose a binary file from your computer and the test function will read that file as a binary string, then it will JSON.stringify that string and then parse it back and compare it with the original binary string.
I have tried with lots and lots of binary files (.exe files mostly), and I just can't find a single file that cannot be JSON-ified.
Can you give an example of something that cannot be converted to a JSON string?

I think there is a misunderstanding here.
First of all, what do you mean by "a JSON string"? Do you mean the result of JSON.stringify() or the string data type in a JSON document? Let's look at the latter, because I think this is what the statement "cannot contain binary data" is about.
If you look at the spec, a JSON string cannot contain every possible character unescaped. In particular, control characters (U+0000 through U+001F), the double quote, and the backslash must be escaped. This means a JSON string cannot contain arbitrary (binary) data directly. However, you can use an escape sequence (\u) to represent these characters, which is a type of encoding. JSON.stringify() does this for you automatically.
For example:
s = String.fromCodePoint(65,0,66); // A "binary" string, 'A', 0x00, 'B'
JSON.stringify(s); // "A\u0000B";
JSON.parse() knows about these escape sequences as well and will restore the binary data.
So a JSON string data type can encode binary data but it cannot contain all binary data directly, without encoding.
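To see this for every byte value at once, here is a minimal sketch. It assumes the data is a "binary string" like the one readAsBinaryString() returns, that is, one character per byte with code units 0 to 255; the variable names are just for illustration.

```javascript
// Build a "binary string" holding every byte value 0x00..0xFF once,
// the same shape of data that readAsBinaryString() produces.
const bytes = Array.from({ length: 256 }, (_, i) => i);
const binaryString = String.fromCharCode(...bytes);

// Control characters, the double quote, and the backslash get escaped here.
const json = JSON.stringify(binaryString);
const restored = JSON.parse(json);

console.log(restored === binaryString);        // true
console.log(json.length > binaryString.length); // true: escaping adds characters
```

This is why the test in the question never fails: the escaping is lossless, it just makes the text longer.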
Some additional notes:
Handling binary data correctly in JavaScript (and many other languages) can be difficult. String data types were not designed for binary data. For example, you have to know the encoding that is used to store the String data internally.
Usually, binary data is not encoded using escape sequences but using more efficient encoding schemes such as Base64.
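As a rough illustration of the Base64 approach, the following Node.js sketch puts raw bytes into a JSON document as Base64 text and gets them back. The field name "payload" and the sample bytes are made up for this example.

```javascript
// Node.js sketch: carry raw bytes inside a JSON document as Base64 text.
// The field name "payload" is an arbitrary choice for this example.
const original = Buffer.from([0x00, 0xff, 0x10, 0x80, 0x7f]);

const doc = JSON.stringify({ payload: original.toString('base64') });
console.log(doc); // {"payload":"AP8QgH8="}

// Decoding reverses the two steps: parse the JSON, then decode the Base64.
const decoded = Buffer.from(JSON.parse(doc).payload, 'base64');
console.log(decoded.equals(original)); // true
```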

Related

Obtaining same Base64 encoded output of encrypted string in Java and JavaScript

After getting byte array encryptedMessageInBytes from AES encryption function call cipher.doFinal in Java, I convert the byte array to base64 like this:
String encryptedMessageInBase64 = new String(Base64.getEncoder().encode(encryptedMessageInBytes));
In JavaScript, I simply do .toString() to the output I get from CryptoJS.AES.encrypt method and I get exact same base64 string. i.e.
var encryptedMessageInBase64 = CryptoJS.AES.encrypt("Message", "Secret Passphrase").toString();
It gives the same Base64 string as the Java code.
However, one of the Java source files does it like this:
String encryptedMessageInBase64 = Base64.getUrlEncoder().encodeToString(encryptedMessageInBytes);
What should I do in JavaScript to obtain the same Base64 string?

You mention that one of the Java source files does this:
String encryptedMessageInBase64 = Base64.getUrlEncoder().encodeToString(encryptedMessageInBytes);
Here they have used the URL-safe variant of Base64 rather than standard Base64. It differs only in replacing the + and / characters with - and _ respectively. The URL-safe variant is meant for encoded strings that are sent inside a URL, where + and / have special meanings, so they are replaced with characters that are safe there. Java's getUrlEncoder() keeps the = padding, so in JavaScript it is enough to apply the same two replacements to the standard Base64 string.
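The replacement described above can be sketched in JavaScript like this. The helper name toUrlSafeBase64 is hypothetical, and the input is assumed to be a standard Base64 string such as the one the question gets from .toString().

```javascript
// Hypothetical helper: convert standard Base64 text to the URL-safe alphabet.
// Padding ("=") is left in place; strip it only if the receiver expects that.
function toUrlSafeBase64(base64) {
  return base64.replace(/\+/g, '-').replace(/\//g, '_');
}

console.log(toUrlSafeBase64('a+b/c+d/ef==')); // a-b_c-d_ef==
```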

How to remove illegal characters from nodejs buffer?

I get a Base64-encoded string of a CSV file from the frontend. In the backend I am converting the Base64 string to binary and then trying to convert it to a JSON object.
var csvDcaData = Buffer.from(source, 'base64').toString('binary'); // convert base64 to binary
The problem is that the UI is sending some illegal characters in one of the fields, which are not visible to the user in the plain CSV. The characters "ï»¿" show up in one of the CSV fields.
I want to remove these kinds of characters from the Base64 data, but I am not able to recognize them in the buffer; they only appear after conversion.
Is it possible in any way to detect such characters in the buffer?

The source is sending you a message. The message consists of metadata and text. The first few bytes of the message are identifiable as metadata because they are the Byte-Order Mark (BOM) encoded in UTF-8. That strongly suggests that the text is encoded in UTF-8. Nonetheless, to read the text you should find out from the sender which encoding is used.
Yes, the BOM "characters" should be stripped off when wanting to deal only in the text. They are not characters in the sense that they are not part of the text. (Though, if you decode the bytes as UTF-8, it matches the codepoint U+FEFF.)
So, though perhaps esoteric, the message does not contain illegal characters but actually has useful metadata.
Also, given that you are not stripping off the BOM, the fact that you are seeing "ï»¿" instead of "" (U+FEFF ZERO WIDTH NO-BREAK SPACE) means that you are not using UTF-8 to decode the text. That could result in data loss. There is no text but encoded text. You always have to know and use the correct encoding.
Now, source is a JavaScript string (which, by-the-way, uses the UTF-16 encoding of Unicode). The content of the string is a message encoded in Base64. The message is a sequence of bytes which are the UTF-8 encoding of a BOM and text. You want the text in a JavaScript string. (And the text happens to be some form of CSV. For that, you'll need to know the line ending, delimiter, and text-qualifier.) There is a lot for you and the sender to talk about. Perhaps the sender has documented all this.
const stripBom = require('strip-bom');
const original = "¡You win one million ₹! Now you can get a real 🚲";
const base64String = Buffer.from("\u{FEFF}" + original, "utf-8").toString("base64");
console.log(base64String);
const decodedString =
stripBom(Buffer.from(base64String, "base64").toString("utf-8"));
console.log(decodedString);
console.log(original === decodedString);
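If you would rather detect the BOM on the bytes themselves, without an extra package, something along these lines should work. It is only a sketch: the helper name decodeBase64Utf8 is made up, and it assumes the sender really does use UTF-8, as the BOM suggests.

```javascript
// The UTF-8 encoding of U+FEFF is the three bytes EF BB BF.
const UTF8_BOM = Buffer.from([0xef, 0xbb, 0xbf]);

// Hypothetical helper: decode Base64 to text, skipping a leading UTF-8 BOM.
function decodeBase64Utf8(base64) {
  const bytes = Buffer.from(base64, 'base64');
  const hasBom = bytes.length >= 3 && bytes.subarray(0, 3).equals(UTF8_BOM);
  return (hasBom ? bytes.subarray(3) : bytes).toString('utf8');
}

const withBom = Buffer.from('\u{FEFF}id,name\n1,Zoë', 'utf8').toString('base64');
console.log(JSON.stringify(decodeBase64Utf8(withBom))); // "id,name\n1,Zoë"
```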

create binary payload in node-red

New to node-red and JavaScript. I need to use the TCP input to connect to a relay controller for status. I'm using a function node to generate a two-byte request that will flow to the TCP input node and on to the controller, but I don't know how to format this in JavaScript. I can set
msg.payload = "hello";
to send a string, but I need to send 2 bytes: 0xEF 0xAA. In C# I would just create the string
msg.payload = "\xEF\xAA";
or something. How do I do this in JavaScript/node-red?

Binary payloads are Node.js Buffer objects, so they can be created like this:
msg.payload = new Buffer([0xEF,0xAA]); // deprecated in current Node.js; Buffer.from([0xEF, 0xAA]) is the replacement

As of today (nodered 0.17.5), this can be achieved as follows (see the documentation):
msg.payload = Buffer.from("\xEF\xAA", "latin1")
or
msg.payload = Buffer.from('hello world', 'ascii');
The second argument specifies the encoding. It matters here: with the default 'utf8', "\xEF\xAA" would be encoded as four bytes (C3 AF C2 AA) rather than the two bytes EF AA.
The character encodings currently supported by Node.js include:
'ascii' - for 7-bit ASCII data only. This encoding is fast and will strip the high bit if set.
'utf8' - Multibyte encoded Unicode characters. Many web pages and other document formats use UTF-8.
'utf16le' - 2 or 4 bytes, little-endian encoded Unicode characters. Surrogate pairs (U+10000 to U+10FFFF) are supported.
'ucs2' - Alias of 'utf16le'.
'base64' - Base64 encoding. When creating a Buffer from a string, this encoding will also correctly accept "URL and Filename Safe Alphabet" as specified in RFC4648, Section 5.
'latin1' - A way of encoding the Buffer into a one-byte encoded string (as defined by the IANA in RFC1345, page 63, to be the Latin-1 supplement block and C0/C1 control codes).
'binary' - Alias for 'latin1'.
'hex' - Encode each byte as two hexadecimal characters.
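To check which bytes a given encoding actually produces, you can print the buffers. This is a small Node.js sketch using the two bytes from the question:

```javascript
// The same two-character string yields different bytes depending on the encoding.
const s = '\xEF\xAA';

console.log(Buffer.from(s));             // <Buffer c3 af c2 aa> (default is 'utf8')
console.log(Buffer.from(s, 'latin1'));   // <Buffer ef aa>

// Two ways to get exactly EF AA without going through a string's code points.
console.log(Buffer.from('EFAA', 'hex')); // <Buffer ef aa>
console.log(Buffer.from([0xef, 0xaa]));  // <Buffer ef aa>
```

For a fixed byte sequence the array form is probably the least error-prone, since no text encoding is involved at all.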

Convert binary data to string in node

The code below is returning data in binary format, how can I convert this to string?
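const fs = require('fs');
fs.readFile('C:/test.prn', function (err, data) {
  if (err) throw err;
  const bufferString = data.toString();
  const bufferStringSplit = bufferString.split('\n');
  // Log inside the callback: the data is not available until the read completes.
  console.log(bufferStringSplit);
});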
output
&b27WPML ? ?????§201501081339&b16WPML ? ?????? *o5W? ?&l6H&l0S*r4800S&l-1M*o5W
? ??&l0E&l0L&u600D&l74A*o5W?? :*o5W?? :*o-3M&l-2H&l0O*o5W?? *o7 ?*g20W?? ??X?X ???X?X
?,??????????%]?? ?M???/????r????WWW???Y???~???$???///?9???DDD?N??Y???0v0w0v0w0v0w0v145w??T????!??###??????????'''?d??????????EEE?hhh??????????????
?'''?d??????EEE?hhh???=??5???-}???#????%???s?????? ?+???¦??

That is most likely happening because your .prn file is binary, that is, it does not contain plain text in an encoding such as ASCII, UTF-8 or ISO-8859-1. You need to convert it either within your JS code or with an external tool. Alternatively, you can read and handle it as binary, but you won't be operating on "normal" strings then.

A *.prn is most likely a printer file (http://filext.com/file-extension/PRN), thus it is binary and cannot be shown as a string.
You either need to process the file as binary or convert it to a string in an encoding of your choice.
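As a rough sketch of both options in Node.js: the sample bytes below are made up and only stand in for the start of such a file; in practice `data` would be the Buffer that fs.readFile passes to its callback when no encoding is given.

```javascript
// Made-up sample bytes standing in for the start of a printer file.
const data = Buffer.from([0x1b, 0x26, 0x62, 0x32, 0x37, 0x57, 0x50, 0x4d, 0x4c, 0x00, 0xa7]);

// Option 1: treat it as binary and inspect it as hex.
console.log(data.toString('hex')); // 1b2662323757504d4c00a7

// Option 2: pick a one-byte encoding so every byte maps to exactly one character.
const asLatin1 = data.toString('latin1');
console.log(JSON.stringify(asLatin1)); // "\u001b&b27WPML\u0000§"

// 'latin1' round-trips; the default 'utf8' does not, because 0xA7 alone is not valid UTF-8.
console.log(Buffer.from(asLatin1, 'latin1').equals(data));      // true
console.log(Buffer.from(data.toString(), 'utf8').equals(data)); // false
```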

Using Crockford's base 32 for IDs in URLs?

I'd like to write some IDs for use in URLs in Crockford's base32. I'm using the base32 npm module.
So, for example, if the user types in http://domain/page/4A2A I'd like it to map to the same underlying ID as http://domain/page/4a2a
This is because I want human-friendly URLs, where the user doesn't have to worry about the difference between upper- and lower-case letters, or between "l" and "1" - they just get the page they expect.
But I'm struggling to implement this, basically because I'm too dim to understand how encoding works. First I tried:
var encoded1 = base32.encode('4a2a');
var encoded2 = base32.encode('4A2A');
console.log(encoded1, encoded2);
But they map to different underlying IDs:
6hgk4r8 6h0k4g8
OK, so maybe I need to use decode?
var encoded1 = base32.decode('4a2a');
var encoded2 = base32.decode('4A2A');
console.log(encoded1, encoded2);
No, that just gives me empty strings:
" "
What am I doing wrong, and how can I get 4a2a and 4A2A to map to the same thing?

For an incoming request, you'll want to decode the URL fragment. When you create URLs, you will take your identifier and encode it. So, given a URL http://domain/page/dnwnyub46m50, you will take that fragment and decode it. Example:
#> echo 'dnwnyub46m50'| base32 -d
my_id5
The library you linked to is case-insensitive, so you get the same result this way:
#> echo 'DNWNYUB46M50'| base32 -d
my_id5
When dealing with any encoding scheme (Base-16/32/64), you have two basic operations: encode, which works on a raw stream of bits/bytes, and decode which takes an encoded set of bytes and returns the original bit/byte stream. The Wikipedia page on Base32 encoding is a great resource.
When you decode a string, you get raw bytes: it may be that those bytes are not compatible with ASCII, UTF-8, or some other encoding which you are trying to work with. This is why your decoded examples look like spaces: the tools you are using do not recognize the resulting bytes as valid characters.
How you go about encoding identifiers depends on how your identifiers are generated. You didn't say how you were generating the underlying identifiers, so I can't make any assumptions about how you should handle the raw bytes that come out of the decoder, nor about the content of the raw bytes being passed into the encoder.
It's also important to mention that the library you linked to is not compatible with Crockford's Base32 encoding. The library excludes I, L, O, S, while Crockford's encoding excludes I, L, O, U. This would be a problem if you were trying to interoperate with another system that used a different library. If no one besides you will ever need to decode your URL fragments, then interoperability doesn't matter.

The source of your confusion is that Base64 and Base32 are methods of representing numbers, whereas in your examples you are attempting to encode or decode text strings.
Encoding and decoding text strings as Base32 is done by first converting the string into a large number. In your first example, where you encode "4a2a" and "4A2A", those are strings with two different numeric values, which consequently translate to two different Base32 numbers: 6hgk4r8 and 6h0k4g8.
When you "decode" 4a2a and 4A2A, you say you get empty strings. However, this is not true: the strings are not empty, they contain what the decoded number looks like when interpreted as a string. Which is to say, it looks like nothing because 4a2a produces an unprintable character. It's invisible. What you want is to feed the encoder numbers, not strings.
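Here is a minimal sketch of what feeding the encoder numbers could look like, using Crockford's alphabet directly instead of the npm module. It assumes your underlying IDs are non-negative integers; the function names encodeId and decodeId are made up for this example.

```javascript
// Crockford's Base32 symbols: digits plus letters, without I, L, O, U.
const ALPHABET = '0123456789ABCDEFGHJKMNPQRSTVWXYZ';

// Hypothetical helper: numeric ID -> URL text.
function encodeId(n) {
  if (!Number.isSafeInteger(n) || n < 0) {
    throw new RangeError('expected a non-negative safe integer');
  }
  let out = '';
  do {
    out = ALPHABET[n % 32] + out;
    n = Math.floor(n / 32);
  } while (n > 0);
  return out;
}

// Hypothetical helper: URL text -> numeric ID.
function decodeId(text) {
  // Case-insensitive, and look-alikes are folded: O -> 0, I and L -> 1.
  const normalized = text.toUpperCase().replace(/O/g, '0').replace(/[IL]/g, '1');
  let n = 0;
  for (const ch of normalized) {
    const digit = ALPHABET.indexOf(ch);
    if (digit === -1) throw new RangeError(`invalid symbol: ${ch}`);
    n = n * 32 + digit;
  }
  return n;
}

console.log(decodeId('4a2a'), decodeId('4A2A')); // 141386 141386
console.log(encodeId(141386));                   // 4A2A
console.log(decodeId('l0') === decodeId('1O'));  // true
```

With this, /page/4a2a and /page/4A2A both resolve to ID 141386, and you would generate links with encodeId so they always come out in one canonical form.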

JavaScript has
parseInt(num, 32)
and
num.toString(32)
built in in a way that's compatible with Java and across JavaScript versions.
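A quick sketch of how the built-ins behave with the IDs from the question. Note that they use the digits 0-9 and a-v, which is not Crockford's alphabet, so look-alike characters such as "l" and "1" are not treated as the same symbol.

```javascript
// parseInt ignores letter case, so both spellings give the same number.
const a = parseInt('4a2a', 32);
const b = parseInt('4A2A', 32);
console.log(a, b, a === b);  // 141386 141386 true

// toString(32) always produces the lowercase form.
console.log(a.toString(32)); // 4a2a

// Letters past "v" are not valid radix-32 digits for the built-ins.
console.log(parseInt('w', 32)); // NaN
```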
