I have a simple Notepad-like web application I'm making for fun. When you save a document, the contents of a <textarea> are sent to the server via Ajax and persisted in a database.
Let's just say for shits and giggles that we need to compress the contents of the <textarea> before sending it because we're on a 2800 baud modem.
Are there JavaScript libraries to do this? How well does plain text compress in the first place?
Simple 7-bit compression might work if you're only using the 7-bit ASCII character set. A Google search yielded this: http://www.iamcal.com/png-store/
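A minimal sketch of that bit-packing idea (my own illustration, not taken from that page): feed each 7-bit code into a rolling bit buffer and emit full bytes as they become available.

// Sketch: pack 7-bit ASCII codes into bytes. Assumes every character is in the 0-127 range.
function pack7bit(text) {
  const out = [];
  let buffer = 0; // accumulated bits
  let bits = 0;   // number of bits currently in the buffer
  for (const ch of text) {
    const code = ch.charCodeAt(0);
    if (code > 127) throw new Error("non-ASCII character");
    buffer = (buffer << 7) | code;
    bits += 7;
    while (bits >= 8) {
      bits -= 8;
      out.push((buffer >> bits) & 0xff);
    }
  }
  if (bits > 0) out.push((buffer << (8 - bits)) & 0xff); // flush the remainder
  return new Uint8Array(out);
}

console.log(pack7bit("hello world!").length); // 11 bytes instead of 12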
Or you could use LZW
http://rosettacode.org/wiki/LZW_compression#JavaScript
As far as compression ratio goes, according to Dr. Dobb's:
It is somewhat difficult to characterize the results of any data compression technique. The level of compression achieved varies quite a bit, depending on several factors. LZW compression excels when confronted with data streams that have any type of repeated strings. Because of this, it does extremely well when compressing English text. Compression levels of 50 percent or better can be expected.
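For reference, here is a compact LZW sketch along the lines of the Rosetta Code version linked above (illustrative, not a drop-in library; it assumes 8-bit input characters):

function lzwCompress(text) {
  const dict = new Map();
  for (let i = 0; i < 256; i++) dict.set(String.fromCharCode(i), i);
  let phrase = "";
  const out = [];
  for (const ch of text) {
    const next = phrase + ch;
    if (dict.has(next)) {
      phrase = next;
    } else {
      out.push(dict.get(phrase));
      dict.set(next, dict.size); // new dictionary entry
      phrase = ch;
    }
  }
  if (phrase !== "") out.push(dict.get(phrase));
  return out; // array of dictionary codes
}

function lzwDecompress(codes) {
  const dict = new Map();
  for (let i = 0; i < 256; i++) dict.set(i, String.fromCharCode(i));
  let previous = dict.get(codes[0]);
  let result = previous;
  for (let i = 1; i < codes.length; i++) {
    // A code may refer to the entry created on the previous step.
    const entry = dict.has(codes[i]) ? dict.get(codes[i]) : previous + previous[0];
    result += entry;
    dict.set(dict.size, previous + entry[0]);
    previous = entry;
  }
  return result;
}

const codes = lzwCompress("TOBEORNOTTOBEORTOBEORNOT");
console.log(codes.length);         // fewer codes than input characters
console.log(lzwDecompress(codes)); // "TOBEORNOTTOBEORTOBEORNOT"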
Well, you couldn't use gzip compression. See here: Why can't browser send gzip request?
I suppose you could strip whitespace, but that would prove unsustainable. I'm not sure if this is an itch that needs scratching.
I did find this with a Google search: http://rumkin.com/tools/compression/compress_huff.php That will eventually yield a smaller set of text if the text is large enough; it actually inflates the text if the text is short.
I also found this: http://www.sean.co.uk/a/webdesign/javascript_string_compression.shtm
First, run the LZW compression; this yields compressed data in binary format.
Next, do base-64 encoding on the compressed binary data. This will yield a text version of the compressed data that you can store in your database.
To restore the contents, do the base-64 decode, then the LZW decompression.
There are JavaScript libraries to do both. Just search on "LZW compression JavaScript" and on "base-64 encode JavaScript".
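A rough sketch of that round trip in the browser, assuming LZW helpers like the lzwCompress/lzwDecompress sketched earlier (or any LZW library). Each code is packed into two bytes before base-64 encoding, which assumes the dictionary stays under 65536 entries; btoa with a spread argument will also hit limits on very large inputs, so treat this as illustrative only:

function toStorageString(text) {
  const codes = lzwCompress(text);            // LZW: array of codes
  const bytes = new Uint8Array(codes.length * 2);
  codes.forEach((code, i) => {                // 16-bit big-endian packing
    bytes[2 * i] = code >> 8;
    bytes[2 * i + 1] = code & 0xff;
  });
  return btoa(String.fromCharCode(...bytes)); // base-64: safe to store as text
}

function fromStorageString(encoded) {
  const bytes = atob(encoded);
  const codes = [];
  for (let i = 0; i < bytes.length; i += 2) {
    codes.push((bytes.charCodeAt(i) << 8) | bytes.charCodeAt(i + 1));
  }
  return lzwDecompress(codes);
}

const stored = toStorageString("to be or not to be, that is the question");
console.log(fromStorageString(stored));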
It depends heavily on the algorithm and the text.
I'm making my own compression algorithm here. As of writing it's not done, but it already works extremely well for English plaintext compression: ~50% compression for both small and large messages. It wouldn't be useful to share a code snippet because I'm using experimental dictionary compression, but here's my project: https://github.com/j-stodd/SMOL
I also tried the LZW compression shared by Suirtimed, but it doesn't seem to perform that well: it decreases the character count, but the byte size stays mostly the same. Compressing "aaaaaaaa" with LZW will save you only one byte; my algorithm would save you 5 bytes.
Below is the Base64 representation of my logo icon. It is mostly the character A. I made it in GIMP and then converted it to Base64.
Is there something I could have done differently so that I don't waste so much space? I would assume there is some way to encode A over and over again, instead of writing each one out explicitly.
I know that Base64 adds about 33% on top, but that is not my concern.
In GIMP I saved to .ico and then converted to Base64 using an online tool.
url(data:image/vnd.microsoft.icon;base64,AAABAAEAICAAAAEAIACoEAAAFgAAACgAAAAgAAAAQAAAAAEAIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADd3d0B3d3dQ93d3dUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA3d3d1d3d3UPd3d0BAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA3d3dJd3d3dXd3d3/3d3d/wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADd3d3/3d3d/93d3dXd3d0lAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAN3d3Vbd3d3/3d3d/93d3f/d3d3/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAN3d3f/d3d3/3d3d/93d3f/d3d1WAAAA
.../snip continues like this
Windows icon files contain raw uncompressed bitmap data, and Base64 is just a way of encoding data with a 33% expansion rate.
Depending on what you're wanting to do, there are several solutions:
Use the PNG ICO format: this is a normal multi-image Windows *.ico file, except the bitmap data is stored as PNG instead of a raw bitmap. This is only supported by Windows Vista or later. (PNGs are used for 128x128 and larger icon sizes, but bitmaps are used for all smaller sizes for compatibility and performance reasons.)
Use PNG directly - it doesn't look like you're taking advantage of the multi-image capabilities of the ICO format, is this for a favicon? Note that favicons can be in any bitmap format, not just ICO.
Use your webserver's GZIP compression - assuming you're offering your ICO files over the web then the inefficient storage isn't a concern because most servers, including IIS, come with HTTP Gzip compression support which really shrinks those things down.
Other than that, I/we need more information about what you're wanting to accomplish.
Save it as a 2 color palette GIF file.
Once you know the Base64 value, you could write a loop to make that many As, I suppose. Depending on the length of the loop, it may or may not save file space.
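A minimal run-length sketch of that idea (the format and function names are made up for illustration, and it assumes the input never contains a comma, which holds for Base64):

function rleEncode(text) {
  let out = "";
  let i = 0;
  while (i < text.length) {
    let run = 1;
    while (i + run < text.length && text[i + run] === text[i]) run++;
    out += text[i] + run + ","; // character followed by its run length
    i += run;
  }
  return out;
}

function rleDecode(encoded) {
  return encoded
    .split(",")
    .filter(Boolean)
    .map(part => part[0].repeat(parseInt(part.slice(1), 10)))
    .join("");
}

console.log(rleEncode("AAAAAAB"));            // "A6,B1,"
console.log(rleDecode(rleEncode("AAAAAAB"))); // "AAAAAAB"

Long runs of A collapse to a few characters, but short mixed runs actually get bigger, which is the same trade-off the loop idea has.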
I am a newbie in the web development field.
I see that minification of JavaScript and CSS is widely used to reduce web-page load times. But, undoubtedly, text format data will be longer than binary format, so why do we still use textual JavaScript and CSS?
Is it possible in the future to use binary format for servers to deliver presentational and behavioral definitions?
I think if there is a common standard to deliver these as binary data, then server-side programs will be created to convert text format JS/CSS produced by web designers to binary format, and network traffic will be greatly reduced.
Can anybody give me some ideas about this?
Gzip is pretty widely deployed http://betterexplained.com/articles/how-to-optimize-your-site-with-gzip-compression/
The feasibility is nil. It would require the existence of a universal standard for binary JavaScript and CSS, understood by all browsers, and by a lot of technology that is peripherally concerned with both.
There isn't one.
It's interesting that you didn't mention a binary version of HTML in your question.
A year ago, W3C published EXI, a specification for binary XML. You can use XML to represent HTML documents, so it is already possible to represent HTML in binary in a standards-compliant way (however, browsers have yet to support this).
CSS is a very regular format, so creating a binary format for it wouldn't be hard. (You might be interested in this.) Standardizing that format, on the other hand, would be.
Maybe in the future, people will write all their code in abstraction languages like SLIM and SASS, which will then be compiled to binary XML, allowing browsers to use one very fast and efficient interface to parse both markup and style.
As others have pointed out, little effort is being spent on developing web standards for more efficient data transfer. The consensus at the moment is that binary formats will complicate things (it will no longer be possible to edit the data directly), won't reduce the size much more than gzip*, and that further reduction in size is not necessary, especially since the introduction of fibre-optic.
* gzip is a general-purpose compression program much more widely used than any domain-specific binary format, and so is much more thoroughly tested and supported.
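To get a feel for how much gzip alone buys you, here is a quick measurement sketch in Node.js using the built-in zlib module (app.min.js is just a placeholder filename):

const zlib = require("zlib");
const fs = require("fs");

const source = fs.readFileSync("app.min.js"); // placeholder: any minified JS/CSS file
const gzipped = zlib.gzipSync(source, { level: 9 });

console.log("minified:", source.length, "bytes");
console.log("gzipped: ", gzipped.length, "bytes");
console.log("ratio:   ", (100 * gzipped.length / source.length).toFixed(1) + "%");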
I'm looking for a robust way of creating a zip archive on the fly from information on a given page and making this available for download. Client-side zipping is a must since my script runs from a bookmarklet.
My first approach while I was more concerned with writing the rest of the script was just to post the information to a few lines of PHP running on my local server which zipped it and sent it back. This is obviously not suitable for a bookmarklet worth sharing.
I found JSZip earlier today, and I thought that'd be the end of it. This library works great when it works; unfortunately, the archives I'm creating frequently exceed a couple of MBs, and this breaks JSZip. (Note: I've only tested this on Chrome.)
Pure JS downloads also have the limitation of funky names due to the data URI, which I intended to solve using JSZip's recommended method, using Downloadify, which uses Flash. This made me wonder whether the size limitations on JS zip generation could be / have been overcome by using a similar interplay of Flash and JS.
I Googled this, but having no experience with ActionScript, I couldn't quickly figure out whether what I'm asking is possible. Is it possible to use a Flash object from JS to create relatively large (into the tens of MBs) zip files on the client side?
Thanks!
First of all, some numbers:
Flash promises that uploads will work if the file is smaller than 100 MB (I don't know whether that means base 10 or base 2).
There are two popular libraries in Flash for creating ZIP archives, but read on first.
A ZIP archiver is a program that both compresses and archives the data, and it does so in exactly that order: it compresses each file separately and then appends it to the archive. This yields a worse compression ratio but allows for iterative creation of the archive, with the benefit that you can even start sending the archive before it is entirely compressed.
An alternative to ZIP is to use a dedicated archiver first and then compress the entire archive at once. This can sometimes achieve several times better compression, but the cost is that you have to process all the data at once.
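The difference is easy to see outside Flash as well; here is a rough comparison with raw deflate (the same algorithm ZIP and ByteArray.compress() use), sketched with Node's zlib and made-up file contents:

const zlib = require("zlib");

const files = [
  Buffer.from("body { margin: 0; padding: 0; }".repeat(50)),
  Buffer.from("body { margin: 0; padding: 0; }".repeat(50)),
];

// ZIP-style: compress each file separately, then add up the pieces.
const perFile = files
  .map(f => zlib.deflateRawSync(f))
  .reduce((total, buf) => total + buf.length, 0);

// tar-then-compress style: concatenate first, compress the whole stream once.
const whole = zlib.deflateRawSync(Buffer.concat(files)).length;

console.log("per-file total:", perFile, "bytes");
console.log("whole-stream:  ", whole, "bytes"); // typically smaller, since repetition across files is exploited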
But Flash's ByteArray.compress() method offers you a native implementation of the deflate algorithm, which is mostly the same thing you would use in a ZIP archiver. So, if you implemented something like tar, you could significantly reduce the size of the files being sent.
But Flash is a single-threaded environment, so you would have to be careful about the size of the data you compress, and you will probably have to find the limit empirically. Or just use ZIP: more redundancy, but easier to implement.
I've used this library before: nochump. I didn't have any problems with it. It is somewhat old, though, and it might make sense to port it to use Alchemy opcodes (which are used for fast memory access, significantly reducing the cost of low-level binary arithmetic operations such as binary OR, binary AND, etc.). This library implements the CRC32 algorithm, which is an essential part of a ZIP archive, and it uses Alchemy, so it should be considerably faster, but you would have to implement the rest on your own.
Yet another option you might consider is Google's NaCl: there you would be able to choose from archiver and compression implementations because it essentially runs native code, so you could even use bz2 and other modern options. Unfortunately, it works only in Chrome (and users must enable it) or Firefox (which needs a plugin).
While digging through Facebook's CSS and HTML code I found some comments which seem to be encrypted in order to hide information. This could be some kind of debugging information which might be useful to keep for later use. The comments look like this example:
/*[XnbHYrH~LGxMu]p`KYO^fXoOK]wFpBtjKdzjYssGm~[xISvmX0J]xhEMxwV_NjvnWm]jAyo`#}VtxqZ{QC`M|yxHMBLE[ZsaeCgU[aG}|K|`Icu`hxiAzM|j~RRkiO|AF`_KuuEnfd_I[P}BDo`ykXBjUjt_nza#^hh?CEQp~KHR|z`llKuTxM_lJp*/
A quick analysis of the encrypted text with this python snippet ''.join(sorted(set(comment))) shows that only 64 different characters are used.
'0?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~'
In terms of performance, size and browser-compatibility one cheap approach would be a base64 encoding of the raw text with a custom char mapping.
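A minimal sketch of that approach: encode normally, then translate each output character through a substitution alphabet. The "secret" alphabet below is just the standard one reversed, chosen purely as an example, and btoa only accepts Latin-1 input, so full Unicode text would need escaping first.

const STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";
const CUSTOM = STANDARD.split("").reverse().join(""); // any fixed permutation works

function encode(text) {
  return btoa(text)
    .split("")
    .map(c => CUSTOM[STANDARD.indexOf(c)])
    .join("");
}

function decode(encoded) {
  return atob(
    encoded
      .split("")
      .map(c => STANDARD[CUSTOM.indexOf(c)])
      .join("")
  );
}

const hidden = encode("nothing sensitive, just obscured");
console.log(hidden);
console.log(decode(hidden)); // "nothing sensitive, just obscured"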
Update: Some of the constraints I would define for a best solution are fast encoding with low computation time and a small output size for reduced bandwidth. On the other hand, it should be easy to retrieve the original information with a script and some kind of secret if needed. The usage is more for hiding non-sensitive data, so there is no need for strong encryption; it just should be economically unattractive for someone to spend time on it.
I use a Huffman code and base64 to encode some data on my website. I think it's very hard to bypass, and I get some compression too; that part was more of an accident. But it would be nice if you could explain how you define "best" in this context. Do you have constraints?
I don't know what they're doing here, but I'd say you should never intentionally send sensitive data or anything you want to hide to a client, regardless of whether it's encrypted or not. Not only is it dangerous (if by some chance your encryption is broken) but it is wasting bandwidth.
If you absolutely need to keep stuff in your sourcecode for some reason, then you should have a pre-release job to strip it out so it never gets published.
Why was it decided to ship JavaScript programs in plain text? Was it to achieve a performance enhancement, or did the authors never imagine that JavaScript would be used in much more complex applications, where developers might want to protect the source code?
I think part of the reason was that since HTML itself was delivered in plain text to the browser then so should JavaScript. It was the way of the Web.
That's because JavaScript was never intended for large web applications; rather, it was a way to do something "cool" in the browser. JavaScript wasn't a very popular language and was despised until the advent of AJAX, which is why there was never much thought about how JavaScript should be distributed. After all, the simplest way to send JavaScript to a browser was through regular text files; why would they have bothered with anything else back in 1995?
Take a look at YUI; it will compress JavaScript files by replacing the names of variables, functions, etc. with short identifiers like a, b, c, ...
It will also remove all unnecessary whitespace, newlines, etc. So it will both obfuscate the JavaScript code and reduce its size.
Text is the one data form which transfers between any pair of computers without concern about computer architecture: endianness, word size, floating point binary representation, negative encoding, etc. Even EBCDIC computers readily handle ASCII.
Though any binary representation scheme can overcome these stumbling blocks—as TCP/IP internals do—code which does this is not completely pleasant to work with—or even to read. Experience is that even the most seasoned engineers occasionally forget to use a needed host-to-network or network-to-host conversion macro.
Indeed, many protocols which primarily transmit numeric information convert values to ASCII notation for transmission largely because of the generality. (PCL and ANSI 3.64 come to mind.) Text based transmission is handily supported by a wide universe of native numeric formatters and parsers, all of which tend to be well optimized. Also, virtually every programming language handily supports text encoded numeric data. Support for binary formatted data varies from adequate to painful.
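A small illustration of the byte-order problem that text encoding sidesteps: the same four bytes decode to different integers depending on which endianness the reader assumes.

const bytes = new Uint8Array([0x00, 0x00, 0x01, 0x00]);
const view = new DataView(bytes.buffer);

console.log(view.getUint32(0, false)); // 256    (big-endian, "network order")
console.log(view.getUint32(0, true));  // 65536  (little-endian)

// Sending the string "256" avoids the question entirely.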
It's easier to keep plain text secure than it is binary (from buffer overflows etc). It has a lower cost of entry. Minification and gzipping make it efficient. Web development is easier. Need I go on?