So I was trying to hash the ¤ character in Node.js with this function:
crypto.createHash('md5').update('¤', 'ascii').digest('hex')
which gives this MD5 hash:
f37c6f3896b2c85fbbd01ae32e47b43f
and using a Buffer:
crypto.createHash('md5').update(new Buffer('¤', 'ascii').toString()).digest('hex')
which gives this result:
9b759040321a408a5c7768b4511287a6
I tried to debug Hash.update() to take a look inside, but I can't; it seems to be compiled native code.
Why does crypto encode differently from Buffer? What makes them different?
crypto is encoding the same way as buffers do, so let’s ignore it for now. Here’s a simplification of the issue:
const text = '¤';
const b1 = Buffer.from(text, 'ascii');
const b2 = Buffer.from(b1.toString());
b1 and b2 aren't the same bytes. b1 is [0xa4], which doesn't really make much sense, as 0xa4 isn't part of ASCII; Node is using the same code to encode strings as ASCII and as Latin-1 here. I don't know if that's for compatibility or performance reasons or what, but it seems like a bad idea: it results in values for which Buffer.from(s, 'ascii') is different from Buffer.from(Buffer.from(s, 'ascii').toString('ascii'), 'ascii'), and it does not appear to be documented anywhere.
In modern versions of Node, the default encoding is UTF-8, so b1.toString() will try to interpret 0xa4 as UTF-8, fail, and produce a replacement character (�) instead, encoded as [0xef, 0xbf, 0xbd]. In non-modern versions of Node, it will do an environment-dependent wrong thing instead of a consistent wrong thing.
You can make your operations give the same result by passing a buffer instead of a UTF-8 encoding of a buffer:
crypto.createHash('md5').update(new Buffer('¤', 'ascii')).digest('hex')
(note how .toString() is removed)
but correct code, able to hash any sequence of Unicode code points, would use UTF-8 instead:
crypto.createHash('md5').update('¤', 'utf8').digest('hex')
crypto.createHash('md5').update(Buffer.from('¤', 'utf8')).digest('hex')
What are the allowed characters in both cookie name and value? Are they same as URL or some common subset?
The reason I'm asking is that I've recently hit some strange behavior with cookies that have - in their name, and I'm wondering whether it's something browser-specific or whether my code is faulty.
According to the ancient Netscape cookie_spec the entire NAME=VALUE string is:
a sequence of characters excluding semi-colon, comma and white space.
So - should work, and it does seem to be OK in browsers I've got here; where are you having trouble with it?
By implication of the above:
= is legal to include, but potentially ambiguous. Browsers always split the name and value on the first = symbol in the string, so in practice you can put an = symbol in the VALUE but not the NAME.
What isn't mentioned, because Netscape were terrible at writing specs, but seems to be consistently supported by browsers:
either the NAME or the VALUE may be empty strings
if there is no = symbol in the string at all, browsers treat it as the cookie with the empty-string name, ie Set-Cookie: foo is the same as Set-Cookie: =foo.
when browsers output a cookie with an empty name, they omit the equals sign. So Set-Cookie: =bar begets Cookie: bar.
commas and spaces in names and values do actually seem to work, though spaces around the equals sign are trimmed
control characters (\x00 to \x1F plus \x7F) aren't allowed
What isn't mentioned and browsers are totally inconsistent about, is non-ASCII (Unicode) characters:
in Opera and Google Chrome, they are encoded to Cookie headers with UTF-8;
in IE, the machine's default code page is used (locale-specific and never UTF-8);
Firefox (and other Mozilla-based browsers) use the low byte of each UTF-16 code point on its own (so ISO-8859-1 is OK but anything else is mangled);
Safari simply refuses to send any cookie containing non-ASCII characters.
so in practice you cannot use non-ASCII characters in cookies at all. If you want to use Unicode, control codes or other arbitrary byte sequences, the cookie_spec demands you use an ad-hoc encoding scheme of your own choosing, and suggests URL-encoding (as produced by JavaScript's encodeURIComponent) as a reasonable choice.
In terms of actual standards, there have been a few attempts to codify cookie behaviour but none thus far actually reflect the real world.
RFC 2109 was an attempt to codify and fix the original Netscape cookie_spec. In this standard many more special characters are disallowed, as it uses RFC 2616 tokens (a - is still allowed there), and only the value may be specified in a quoted-string with other characters. No browser ever implemented the limitations, the special handling of quoted strings and escaping, or the new features in this spec.
RFC 2965 was another go at it, tidying up 2109 and adding more features under a ‘version 2 cookies’ scheme. Nobody ever implemented any of that either. This spec has the same token-and-quoted-string limitations as the earlier version and it's just as much a load of nonsense.
RFC 6265 is an HTML5-era attempt to clear up the historical mess. It still doesn't match reality exactly, but it's much better than the earlier attempts: it is at least a proper subset of what browsers support, not introducing any syntax that is supposed to work but doesn't (like the previous quoted-string).
In 6265 the cookie name is still specified as an RFC 2616 token, which means you can pick from the alphanums plus:
!#$%&'*+-.^_`|~
In the cookie value it formally bans the (filtered-by-browsers) control characters and the (inconsistently-implemented) non-ASCII characters. It retains cookie_spec's prohibition on space, comma and semicolon, plus, for compatibility with any poor idiots who actually implemented the earlier RFCs, it also bans backslash and quotes, other than quotes wrapping the whole value (but in that case the quotes are still considered part of the value, not an encoding scheme). So that leaves you with the alphanums plus:
!#$%&'()*+-./:<=>?@[]^_`{|}~
In the real world we are still using the original-and-worst Netscape cookie_spec, so code that consumes cookies should be prepared to encounter pretty much anything, but for code that produces cookies it is advisable to stick with the subset in RFC 6265.
In ASP.NET you can use System.Web.HttpUtility to safely encode the cookie value before writing it to the cookie, and to convert it back to its original form when reading it out.
// Encode
HttpUtility.UrlEncode(cookieData);
// Decode
HttpUtility.UrlDecode(encodedCookieData);
This will stop ampersands and equals signs splitting a value into a bunch of name/value pairs as it is written to a cookie.
I think it's generally browser-specific. To be on the safe side, base64-encode a JSON object and store everything in that. That way you just have to decode it and parse the JSON. The characters used in base64 should play fine with most, if not all, browsers.
Here it is, in as few words as possible. Focus on characters that need no escaping:
For cookies:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!#$%&'()*+-./:<>?@[]^_`{|}~
For URLs:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.-_~!$&'()*+,;=:@
For cookies and URLs (intersection):
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!$&'()*+-.:@_~
That's how you answer.
Note that for cookies, the = has been removed because it is usually used to set the cookie value. For URLs the = was kept. The intersection is obviously without it.
var chars = "abcdefghijklmnopqrstuvwxyz"; chars += chars.toUpperCase() + "0123456789" + "!$&'()*+-.:@_~";
It turns out escaping still occurs and unexpected things happen, especially in a Java cookie environment, where the cookie gets wrapped in double quotes if it contains certain of those characters.
So to be safe, just use A-Za-z1-9. That's what I am going to do.
The newer RFC 6265, published in April 2011, gives this grammar:
cookie-header = "Cookie:" OWS cookie-string OWS
cookie-string = cookie-pair *( ";" SP cookie-pair )
cookie-pair = cookie-name "=" cookie-value
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
; US-ASCII characters excluding CTLs,
; whitespace DQUOTE, comma, semicolon,
; and backslash
If you look at @bobince's answer you'll see that these newer restrictions are stricter.
You cannot put ";" in the value field of a cookie; in most browsers, the value that gets set is only the string up to the ";"...
that's simple:
A <cookie-name> can be any US-ASCII characters except control
characters (CTLs), spaces, or tabs. It also must not contain a
separator character like the following: ( ) < > @ , ; : \ " / [ ] ? =
{ }.
A <cookie-value> can optionally be set in double quotes and any
US-ASCII characters excluding CTLs, whitespace, double quotes, comma,
semicolon, and backslash are allowed. Encoding: Many implementations perform URL encoding on cookie values; however, it is not required per the RFC specification, though it does help satisfy the requirements about which characters are allowed.
Link: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#Directives
There are two versions of the cookie specification:
1. Version 0 cookies, aka Netscape cookies
2. Version 1 cookies, aka RFC 2965 cookies
In version 0, the name and value parts of cookies are sequences of characters excluding semicolon, comma, equals sign, and whitespace, unless enclosed in double quotes.
Version 1 is a lot more complicated; you can check it here.
In this version the specs for the name and value parts are almost the same, except that a name cannot start with a $ sign.
There is another interesting issue with IE and Edge: cookies whose names contain more than one period seem to be silently dropped.
So
This works:
cookie_name_a=valuea
while this will get dropped
cookie.name.a=valuea
One more consideration. I recently implemented a scheme in which some sensitive data posted to a PHP script needed to be converted and returned as an encrypted cookie, one that used all base64 values I thought were guaranteed "safe". So I dutifully encrypted the data items using RC4, ran the output through base64_encode, and happily returned the cookie to the site. Testing seemed to go well until a base64-encoded string contained a "+" symbol. The string was written to the page cookie with no trouble. Using the browser diagnostics I could also verify the cookie was written unchanged. Then when a subsequent page called my PHP and obtained the cookie via the $_COOKIE array, I was stunned to find the string was now missing the "+" sign. Every occurrence of that character was replaced with an ASCII space.
Considering how many similar unresolved complaints I've read describing this scenario since then, often citing numerous references to using base64 to "safely" store arbitrary data in cookies, I thought I'd point out the problem and offer my admittedly kludgy solution.
After you've done whatever encryption you want to do on a piece of data, and then used base64_encode to make it "cookie-safe", run the output string through this...
// From browser to PHP: substitute troublesome chars with
// other cookie-safe chars, or vice versa.
function fix64($inp) {
    $out = $inp;
    for ($i = 0; $i < strlen($inp); $i++) {
        $c = $inp[$i];
        switch ($c) {
            case '+': $c = '*'; break; // '+' definitely won't transfer!
            case '*': $c = '+'; break;
            case '=': $c = ':'; break; // '=' seems like a bad idea too
            case ':': $c = '='; break;
            default: break; // leave the char unchanged ('continue' inside a PHP switch acts like 'break' anyway)
        }
        $out[$i] = $c;
    }
    return $out;
}
Here I'm simply substituting "+" (and, I decided, "=" as well) with other "cookie-safe" characters before returning the encoded value to the page for use as a cookie. Note that the length of the string being processed doesn't change. When the same page (or another page on the site) runs my PHP script again, I'll be able to recover this cookie without missing characters. I just have to remember to pass the cookie back through the same fix64() call I created, and from there I can decode it with the usual base64_decode(), followed by whatever other decryption your scheme uses.
There may be some setting I could make in PHP that allows base64 strings used in cookies to be transferred back to PHP without corruption; the likely culprit is that PHP URL-decodes incoming cookie values when populating $_COOKIE, and a literal "+" URL-decodes to a space. In the meantime, this works. The "+" may be a "legal" cookie value, but if you have any desire to be able to transmit such a string back to PHP (in my case via the $_COOKIE array), I'm suggesting re-processing to remove the offending characters and restore them after recovery. There are plenty of other "cookie-safe" characters to choose from.
If you are using the values later, you'll find that something like path will actually let accented characters through, but it won't actually match the browser path. For that you need to URI-encode it, for example like this:
const encodedPath = encodeURI(myPath);
document.cookie = `use_pwa=true; domain=${location.host}; path=${encodedPath};`
So the "allowed" chars, might be more than what's in the spec. But you should stay within the spec, and use URI-encoded strings to be safe.
Years ago MSIE 5 or 5.5 (and probably both) had some serious issue with a "-" in the HTML block, if you can believe it. Although it's not directly related, ever since then we've stored an MD5 hash (containing letters and numbers only) in the cookie to look up everything else in a server-side database.
I ended up using
cookie_value = encodeURIComponent(my_string);
and
my_string = decodeURIComponent(cookie_value);
That seems to work for all kinds of characters. I had weird issues otherwise, even with characters that weren't semicolons or commas.
I am creating a redirector of sorts in Node.js. I have a few values like
userid // superid
I would like to hash these to prevent users from taking a URL and faking someone else's URL, and also base64-encode the result to minimize the length of the URL created.
http://myurl.com/~hashedtoken
where the un-hashed token could be something like this:
55q322q23
55 = userid
I thought about using the crypto library, like so:
crypto.createHash('md5').update("55q322q23").digest("base64");
which returns: u/mxNJQaSs2HYJ5wirEZOQ==
The problem here is that I have the /, which is not considered web-safe, so I would like to somehow strip the unsafe characters from the base64 alphabet. Any ideas about this, or perhaps a better solution to the problem at hand?
You could use a so-called URL-safe variant of Base64. The most common variant, described in RFC 4648, uses - and _ instead of + and / respectively, and omits padding characters (=).
Most implementations of Base64 support this URL safe variant too, though if yours doesn't, it's easy enough to do manually.
Here's what I used. Comments welcome :-)
The important bit is buffer.toString('base64'), then making that base64 string URL-safe with a couple of replace() calls.
function newId() {
// Get random string with 20 bytes secure randomness
var crypto = require('crypto');
var id = crypto.randomBytes(20).toString('base64');
// Make URL safe
return id.replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}
Based on the implementation here.
Makes a string safe for URLs and the local part of email addresses (before the @).
I have the following code, based on http://nodejs.org/docs/v0.6.9/api/crypto.html#randomBytes:
crypto.randomBytes 32, (ex, buf) ->
user.tokenString = buf.toString("hex")
user.tokenExpires = Date.now() + TOKEN_TIME
next()
I am using this to generate a tokenString for node.js/express user validation.
In some cases the generated tokenString includes the '/' forward-slash character, and this breaks my routes. For example, if the tokenString is '$2a$10$OYJn2r/Ts.guyWqx7iJTwO8cij80m.uIQV9nJgTt18nqu8lT8OqPe', the router can't find /user/activate/$2a$10$OYJn2r and I get a 404 error.
Is there a more direct way to exclude certain characters from being included when generating crypto.randomBytes?
crypto.randomBytes generates random bytes. That has nothing to do with characters; characters are determined by the way we look at the bytes.
For example:
user.tokenString = buf.toString("hex")
would convert the buffer to a string (where two characters represent each byte) in the character range 0-9a-f.
Another (perhaps more suitable) approach is to use a more compact encoding. Base64url is an encoding that provides string output that is URL- and filename-safe:
user.tokenString = base64url(buf)
Here is an NPM package you can use for it.
Other than that, your code seems fine. If you were to call .toString() without specifying "hex", or to specify something like "ascii" instead, it would break just as in your question description.
I'd like to write some IDs for use in URLs in Crockford's base32. I'm using the base32 npm module.
So, for example, if the user types in http://domain/page/4A2A I'd like it to map to the same underlying ID as http://domain/page/4a2a
This is because I want human-friendly URLs, where the user doesn't have to worry about the difference between upper- and lower-case letters, or between "l" and "1" - they just get the page they expect.
But I'm struggling to implement this, basically because I'm too dim to understand how encoding works. First I tried:
var encoded1 = base32.encode('4a2a');
var encoded2 = base32.encode('4A2A');
console.log(encoded1, encoded2);
But they map to different underlying IDs:
6hgk4r8 6h0k4g8
OK, so maybe I need to use decode?
var encoded1 = base32.decode('4a2a');
var encoded2 = base32.decode('4A2A');
console.log(encoded1, encoded2);
No, that just gives me empty strings:
" "
What am I doing wrong, and how can I get 4a2a and 4A2A to map to the same thing?
For an incoming request, you'll want to decode the URL fragment. When you create URLs, you will take your identifier and encode it. So, given a URL http://domain/page/dnwnyub46m50, you will take that fragment and decode it. Example:
#> echo 'dnwnyub46m50'| base32 -d
my_id5
The library you linked to is case-insensitive, so you get the same result this way:
echo 'DNWNYUB46M50'| base32 -d
my_id5
When dealing with any encoding scheme (Base-16/32/64), you have two basic operations: encode, which works on a raw stream of bits/bytes, and decode which takes an encoded set of bytes and returns the original bit/byte stream. The Wikipedia page on Base32 encoding is a great resource.
When you decode a string, you get raw bytes: it may be that those bytes are not compatible with ASCII, UTF-8, or some other encoding which you are trying to work with. This is why your decoded examples look like spaces: the tools you are using do not recognize the resulting bytes as valid characters.
How you go about encoding identifiers depends on how your identifiers are generated. You didn't say how you were generating the underlying identifiers, so I can't make any assumptions about how you should handle the raw bytes that come out of the decoder, nor about the content of the raw bytes being passed into the encoder.
It's also important to mention that the library you linked to is not compatible with Crockford's Base32 encoding. The library excludes I, L, O, S, while Crockford's encoding excludes I, L, O, U. This would be a problem if you were trying to interoperate with another system that used a different library. If no one besides you will ever need to decode your URL fragments, then interoperability doesn't matter.
The source of your confusion is that base64 and base32 are methods of representing numbers, whereas in your examples you are attempting to encode or decode text strings.
Encoding and decoding a text string as base32 is done by first converting the string into a large number. In your first examples, where you encode "4a2a" and "4A2A", those are strings with two different numeric values, which consequently translate to encoded base32 numbers with two different values: 6hgk4r8 and 6h0k4g8.
when you "decode" 4a2a and 4A2A you say you get empty strings. However this is not true, the strings are not empty, they contain what the decoded number looks like, when interpreted as a string. Which is to say, it looks like nothing because 4a2a produces an unprintable character. It's invisible. What you want is to feed the encoder numbers, not strings.
JavaScript has
parseInt(num, 32)
and
num.toString(32)
built in, in a way that's compatible with Java and consistent across JavaScript versions.