Why hash in Node.js give different result with same character? - javascript

So i was try to hash ¤ character in node js, with this function
crypto.createHash('md5').update('¤', 'ascii').digest('hex')
give md5 hash
f37c6f3896b2c85fbbd01ae32e47b43f
and using Buffer
crypto.createHash('md5').update(new Buffer('¤', 'ascii').toString()).digest('hex')
give result like this:
9b759040321a408a5c7768b4511287a6
I tried to debug Hash.update() to take a look inside but i can't it seems hard compiled.
Why crypto encoding method is different with Buffer? what makes it different?

crypto is encoding the same way as buffers do, so let’s ignore it for now. Here’s a simplification of the issue:
const text = '¤';
const b1 = Buffer.from(text, 'ascii');
const b2 = Buffer.from(b1.toString());
b1 and b2 aren’t the same bytes. b1 is [0xa4], which doesn’t really make much sense as 0xa4 isn’t part of ASCII; Node is using the same code to encode strings as ASCII and Latin-1 here. I don’t know if that’s for compatibility or performance reasons or what, but it seems like a bad idea, results in values for which Buffer.from(s, 'ascii') is different from Buffer.from(Buffer.from(s, 'ascii').toString('ascii'), 'ascii'), and does not appear to be documented anywhere.
In modern versions of Node, the default encoding is UTF-8, so b1.toString() will try to interpret 0xa4 as UTF-8, fail, and produce a replacement character (�) instead, encoded as [0xef, 0xbf, 0xbd]. In non-modern versions of Node, it will do an environment-dependent wrong thing instead of a consistent wrong thing.
You can make your operations give the same result by passing a buffer instead of a UTF-8 encoding of a buffer:
crypto.createHash('md5').update(new Buffer('¤', 'ascii')).digest('hex')
(note how .toString() is removed)
but correct code, able to hash any sequence of Unicode codepoints, would use UTF-8 instead.
crypto.createHash('md5').update('¤', 'utf8').digest('hex')
crypto.createHash('md5').update(Buffer.from('¤', 'utf8')).digest('hex')

Related

Alphanumeric hash function in plain Javascript

I am looking for a way to covert a given string into an alphanumeric hash. The code will be executed on the client-side and must be entirely in vanilla JS or jQuery at the most.
Is there, however, also a non-cryptographic hash, i.e. just a string of alphanumerics that does not require crypto and Promises? I need both, i.e. a cryptographic as well as a non-cryptographic hash.
The second hash can be an ordinary string of alphanumerics, say 10 characters long. It should be recoverable, i.e. the same hash should be recreated always for a given string. It would be better if this second hash is not generated asynchronously (i.e. using Promises). I intend to use it as a key for a boolean in window.localStorage (for many different strings).
Final answers:
Crypto JS
bcrypt.js
Fast low-collision non-crypto hash in JavaScript for
Files
Modern browsers provide cryptographic algorithms implementation via window.crypto object. You can look at what "modern" means in this case by this link (at the bottom). If you are fine with supported browsers list, then you can reach your goal for example like this:
async function hash(target){
var buffer = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(target));
var chars = Array.prototype.map.call(new Uint8Array(buffer), ch => String.fromCharCode(ch)).join('');
return btoa(chars);
};
It will hash your string (bytes of its utf-8 encoding) with SHA-256 and then convert result to base64.
Note that if you don't need cryptographically strong hash (you didn't clarify the purpose) - then there might be better (faster) alternatives.

Reconstructing a XORed UTF-8 encoded text in javascript, client side, no Node

I am driving nuts trying to achieve this in JavaScript.
First I will describe the scenario, and then I'll put my code, Python version, which I can't seem to translate into JavaScript.
I have a web page running on a server. I have no access to it whatsoever, so the only way I have to achieve basic functionality is using JavaScript.
The web page is used to compare information. The information is stored in CSV format, which I use to create HTML tables on the fly by using AJAX calls. For the sake of not having that information quickly available to users, enabling them to print the source code and 'stealing it', I came across a range of solutions, like encoding in Base64 (I know this is considered 'security by obscurity' and it's a bad practice, but I have no other choice here).
Base64 it's very easy to use in this case, but I lose all the special characters from UTF-8 (like á é í ó ú ñ etc), which are part of my language (Spanish).
So here comes the preferred solution, which works like a charm in Python: using bitwise XOR. What could I achieve using this method:
If someone figures out the url of the CSV file, it wouldn't be so easy to read the text without basic programming knowledge to de-encode it.
I can easily program the source database to export the data and then run the XORing fuction, upload those files to the server and then having them de-encoded on the fly too.
Is in that last step where I can not achieve what I want.
Here is my Python script:
To encode:
b = bytearray(open('file.csv', 'rb').read())
for i in range(len(b)):
b[i] ^= 0x71
open('file.out', 'wb').write(b)
To decode:
b = bytearray(open('file.out', 'rb').read())
for i in range(len(b)):
b[i] ^= 0x71
I need to achieve that small decoding function in JS.
Thank you all in advance for your time.
Base64
It isn't true that base64 makes you lose non-ASCII characters like ñ or á. Why should it? Base64 can encode any binary data, and encoded text is nothing else than binary data.
So encoding involves two steps:
A text encoding (such as UTF-8) converts your text to bytes, and the base64 encoding turns that into an ASCII string.
Decoding works the same, but backwards (reverse order of the two corresponding decoding functions).
This is how text encoding for UTF-8 works in JavaScript:
function encode_utf8(s) {
return unescape(encodeURIComponent(s));
}
function decode_utf8(s) {
return decodeURIComponent(escape(s));
}
I got this from here. Please note that I'm not a JS crack at all, and there might be more convenient methods now that I couldn't find.
Let's try this:
s = 'Se bañó todo el día.';
b = encode_utf8(s); # text encoding
a = btoa(b); # base64 encoding
console.log(a); # prints U2UgYmHDscOzIHRvZG8gZWwgZMOtYS4=
d = decode_utf8(atob(a)); # decode base64, then UTF-8
console.log(d); # prints Se bañó todo el día.
No character lost here.
XOR method
If you still want to do the XOR thing, you can decode as follows:
convert the UTF8-encoded string to an array of code points with Array.from()
XOR-decode with the ^ operator (or ^= assignment)
convert the result to a string with String.fromCodePoint()
decode the string with decode_utf8()
I'm not providing code for this, though.
Especially the third step might be a bit cumbersome, and I'm not sure if it's worth the pain.
After all, your users can just inspect the JS code to find out how the data are "encrypted", be it base64 or the XOR method.
Note
If you come from a Python background, be aware that there is no distinction like Python's str and bytes type. Both input and output of the {en,de}code_utf8() functions are always strings, same type. When you encode a string, you just get back another string where every codepoint is below 256, and it might be longer than the input string.

Escape with Unicode Code Points similar to PHP in Javascript

When using PHP to generate JSON, it encodes higher characters using the \u0123 code-point notation.
(I know this not necessary, but for unnamed reasons I want that.)
I am trying to achieve the same in JavaScript. I searched high and low and found nothing. The encodeUri function does nothing for me (even though many suggested that it would).
Any helpful hints? I hope that I do not have to use some big external library but rather something build-in or a few lines of nice code - this can not be this hard, can it...?!
I have an input string in the form of:
var stringVar = 'Hällö Würld.';
My desired conversion would give me something like:
var resultStringVar = 'H\u00e4ll\u00f6 W\u00fcrld.';
I’ve made a library for just that called jsesc. From its README:
This is a JavaScript library for escaping JavaScript strings while generating the shortest possible valid ASCII-only output. Here’s an online demo.
This can be used to avoid mojibake and other encoding issues, or even to avoid errors when passing JSON-formatted data (which may contain U+2028 LINE SEPARATOR, U+2029 PARAGRAPH SEPARATOR, or lone surrogates) to a JavaScript parser or an UTF-8 encoder, respectively.
For your specific example, use it as follows:
var stringVar = 'Hällö Würld.';
var resultStringVar = jsesc(stringVar, { 'json': true, 'wrap': false });

Using Crockford's base 32 for IDs in URLs?

I'd like to write some IDs for use in URLs in Crockford's base32. I'm using the base32 npm module.
So, for example, if the user types in http://domain/page/4A2A I'd like it to map to the same underlying ID as http://domain/page/4a2a
This is because I want human-friendly URLs, where the user doesn't have to worry about the difference between upper- and lower-case letters, or between "l" and "1" - they just get the page they expect.
But I'm struggling to implement this, basically because I'm too dim to understand how encoding works. First I tried:
var encoded1 = base32.encode('4a2a');
var encoded2 = base32.encode('4A2A');
console.log(encoded1, encoded2);
But they map to different underlying IDs:
6hgk4r8 6h0k4g8
OK, so maybe I need to use decode?
var encoded1 = base32.decode('4a2a');
var encoded2 = base32.decode('4A2A');
console.log(encoded1, encoded2);
No, that just gives me empty strings:
" "
What am I doing wrong, and how can I get 4A2A and 4A2A to map to the same thing?
For an incoming request, you'll want to decode the URL fragment. When you create URLs, you will take your identifier and encode it. So, given a URL http://domain/page/dnwnyub46m50, you will take that fragment and decode it. Example:
#> echo 'dnwnyub46m50'| base32 -d
my_id5
The library you linked to is case-insensitive, so you get the same result this way:
echo 'DNWNYUB46M50'| base32 -d
my_id5
When dealing with any encoding scheme (Base-16/32/64), you have two basic operations: encode, which works on a raw stream of bits/bytes, and decode which takes an encoded set of bytes and returns the original bit/byte stream. The Wikipedia page on Base32 encoding is a great resource.
When you decode a string, you get raw bytes: it may be that those bytes are not compatible with ASCII, UTF-8, or some other encoding which you are trying to work with. This is why your decoded examples look like spaces: the tools you are using do not recognize the resulting bytes as valid characters.
How you go about encoding identifiers depends on how your identifiers are generated. You didn't say how you were generating the underlying identifiers, so I can't make any assumptions about how you should handle the raw bytes that come out of the decoder, nor about the content of the raw bytes being passed into the encoder.
It's also important to mention that the library you linked to is not compatible with Crockford's Base32 encoding. The library excludes I, L, O, S, while Crockford's encoding excludes I, L, O, U. This would be a problem if you were trying to interoperate with another system that used a different library. If no one besides you will ever need to decode your URL fragments, then interoperability doesn't matter.
The source of your confusion is that a base64 or base32 are methods of representing numbers- whereas you are attempting in your examples to encode or decode text strings.
Encoding and decoding text strings as base32 is done by first converting the string into a large number. In your first examples, where you are encoding "4a2a" and "4A2A", those are strings with two different numeric values, that consequently translate to encoded base32 numbers with two different values, 6hgk4r8 6h0k4g8
when you "decode" 4a2a and 4A2A you say you get empty strings. However this is not true, the strings are not empty, they contain what the decoded number looks like, when interpreted as a string. Which is to say, it looks like nothing because 4a2a produces an unprintable character. It's invisible. What you want is to feed the encoder numbers, not strings.
JavaScript has
parseInt(num, 32)
and
num.toString(32)
built in in a way that's compatible with Java and across JavaScript versions.

Handling unicode in the http response xml

I'm writing a Google Chrome extension that builds upon myanimelist.net REST api. Sometimes the XMLHttpRequest response text contains unicode.
For example:
<title>Onegai My Melody Sukkiri�</title>
If I create a HTML node from the text it looks like this:
Onegai My Melody Sukkiri�
The actual title, however, is this:
Onegai My Melody Sukkiri♪
Why is my text not correctly rendered and how can I fix it?
Update
Code: background.html
I think these are the crucial parts:
function htmlDecode(input){
var e = document.createElement('div');
e.innerHTML = input;
return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}
function xmlDecode(input){
var result = input;
result = result.replace(/</g, "<");
result = result.replace(/>/g, ">");
result = result.replace(/\n/g, "
");
return htmlDecode(result);
}
Further:
var parser = new DOMParser();
var xmlText = response.value;
var doc = parser.parseFromString(xmlDecode(xmlText), "text/xml");
<title>Onegai My Melody Sukkiri�</title>
Oh dear! Not only is that the wrong text, it's not even well-formed XML. acirc and ordf are HTML entities which are not predefined in XML, and then there's an invalid UTF-8 sequence (one high byte, presumably originally 0x99) between them.
The problem is that myanimelist are generating their output ‘XML’ (but “if it ain't well-formed, it ain't XML”) using the PHP function htmlentities(). This tries to HTML-escape not only the potentially-sensitive-in-HTML characters <&"', but also all non-ASCII characters.
This generates the wrong characters because PHP defaults to treating the input to htmlentities() as ISO-8859-1 instead of UTF-8 which is the encoding they're actually using. But it was the wrong thing to begin with because the HTML entity set doesn't exist in XML. What they really wanted to use was htmlspecialchars(), which leaves the non-ASCII characters alone, only escaping the really sensitive ones. Because those are the same ones that are sensitive in XML, htmlspecialchars() works just as well for XML as HTML.
htmlentities() is almost always the Wrong Thing; htmlspecialchars() should typically be used instead. The one place you might want to encode non-ASCII bytes to entity references would be when you're targeting pure ASCII output. But even then htmlentities() fails because it doesn't make character references (&#...;) for the characters that don't have a predefined entity names. Pretty useless.
Anyway, you can't really recover the mangled data from this. The � represents a byte sequence that was UTF-8-undecodable to the XMLHttpRequest, so that information is irretrievably lost. You will have to persuade myanimelist to fix their broken XML output as per the above couple of paragraphs before you can go any further.
Also they should be returning it as Content-Type: text/xml not text/html as at the moment. Then you could pick up the responseXML directly from the XMLHttpRequest object instead of messing about with DOMParsers.
So, I've come across something similar to what's going on here at work, and I did a bit more research to confirm my hypothesis.
If you take a look at the returned value you posted above, you'll notice the tell-tell entity "â". 99% of the time when you see this entity, if means you have a character encoding issue (typically UTF-8 characters are being encoded as ISO-8859-1).
The first thing I would test for is to force a character encoding in the API return. (It's a long shot, but you could look)
Second, I'd try to force a character encoding onto the data returned (I know there's a .htaccess override, but I don't know what's allowed in Chrome extensions so you'll have to research that).
What I believe is going on, is when you crate the node with the data, you don't have a character encoding set on the document, and browsers (typically, in my experience) default to ISO-8859-1. So, check to make sure it's not your document that's the problem.
Finally, if you can't find the source (or can't prevent it) of the character encoding, you'll have to write a conversation table to replace the malformed values you're getting with the ones you want { JS' "replace" should be fine (http://www.w3schools.com/jsref/jsref_replace.asp) }.
You can't just use a simple search and replace to fix encoding issue since they are unicode, not characters typed on a keyboard.
Your data must be stored on the server in UTF-8 format if you are planning on retrieving it via AJAX. This problem is probably due to someone pasting in characters from MS-Word which use a completely different encoding scheme (ISO-8859).
If you can't fix the data, you're kinda screwed.
For more details, see: UTF-8 vs. Unicode

Categories