I am trying to develop a hybrid mobile app with QR code functionality. A QR code can only store a limited number of characters. So I am wondering: is it possible to compress the string to make it shorter, so that I can store more info in the QR code?
At lengths that short, most compression algorithms will actually make data longer, not shorter. There are some algorithms which may work well, though… smaz comes to mind. However, it is going to depend heavily on what you are trying to compress, and you haven't really provided any information about that.
Instead of thinking about compression, your best bet may be to find an encoding scheme which makes more sense for your data. For example, if you're encoding a date and time, store it as a single number instead of text. Think about whether you really need seconds. If you are storing numbers, consider using variable-length quantities. If your data is JSON, consider using protobuf instead.
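For example, here is a minimal sketch of the date-and-time idea (the epoch and the minute granularity are assumptions; pick whatever fits your data): minutes since a custom epoch, written in base 36, takes a handful of characters instead of a 16+ character timestamp string.

// Sketch: encode a date-time as minutes since a custom epoch, in base 36.
// The epoch choice and minute precision are assumptions -- adjust to your data.
const EPOCH = Date.UTC(2020, 0, 1);

function encodeTimestamp(date) {
  const minutes = Math.floor((date.getTime() - EPOCH) / 60000);
  return minutes.toString(36);
}

function decodeTimestamp(encoded) {
  return new Date(EPOCH + parseInt(encoded, 36) * 60000);
}

const enc = encodeTimestamp(new Date(Date.UTC(2024, 5, 1, 14, 30)));
console.log(enc.length); // 5 characters instead of 16 for "2024-06-01 14:30"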
If what you have really is text, it may be worth coming up with your own character set. Instead of ASCII, where each character is 8 bits, can you limit yourself to 64 characters? a-z, A-Z, 0-9, and two punctuation characters add up to exactly 64 possible symbols… if that is all you need, you could use a 6-bit encoding. If the strings aren't case-sensitive, you have tons of room for punctuation.
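As a rough sketch of that idea (the 64-symbol alphabet below is only an example; substitute whichever symbols your data actually needs):

// Sketch: pack a 64-symbol alphabet at 6 bits per character.
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .";

function pack6bit(text) {
  const bytes = [];
  let acc = 0, bits = 0;
  for (const ch of text) {
    const code = ALPHABET.indexOf(ch);
    if (code < 0) throw new Error("character not in alphabet: " + ch);
    acc = (acc << 6) | code; // append 6 more bits
    bits += 6;
    if (bits >= 8) {         // a whole byte is ready
      bits -= 8;
      bytes.push((acc >> bits) & 0xff);
      acc &= (1 << bits) - 1; // keep only the leftover bits
    }
  }
  if (bits > 0) bytes.push((acc << (8 - bits)) & 0xff); // pad the tail
  return Uint8Array.from(bytes);
}

console.log(pack6bit("Hello World").length); // 9 bytes instead of 11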
Is there any standard on how to write Unicode characters in JavaScript/JSON?
For instance, is there any difference between \u011b and \u011B? Most examples on the web use the second format. There is also the option of writing ASCII characters in a shorter format like \xe1. Which format is preferable (standard)? Is it good practice to mix these formats together, and what about performance?
For the first question: both versions are valid. It is more a matter of coding convention: prefer whatever convention is already used in your files/project, then check your community (the conventions used by programs you heavily rely on), and as a last option choose one yourself. In any case, keep it consistent.
Personally, I prefer neither of them in code: UTF-8 is so widely used, and browsers understand it, so I would put the actual character directly in the source (as a character, not as an escape sequence). If the codepoint is important, I would add it in a comment. It is reasonable to expect that all developers and tools handle UTF-8 these days.
JavaScript uses UCS-2, the precursor of UTF-16, which treats Unicode code points as just 16 bits long (so some emoji take two "characters").
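A quick console check illustrates both points:

// Hex case in the escape doesn't matter: both spell U+011B ("ě").
console.log("\u011b" === "\u011B"); // true
console.log("\u011b" === "ě");      // true

// Code points above U+FFFF take two 16-bit units in a JS string:
console.log("\u{1F600}".length);    // 2 -- one emoji, two UTF-16 code units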
The byte-escape format (\xe1) should not be used for text: it hides the meaning. There are exceptions, e.g. checking which encoding you received from the user, or handling a BOM (but then only for such signatures). For genuinely binary cases it is fine to use \x1e-style escapes, e.g. for key identification.
Note: you should really follow one coding guideline. Google for it and you will find many, e.g. this one from Google (which may be more than you need): https://google.github.io/styleguide/jsguide.html
I am setting up a little website and would like to make it international. All the content will be stored in an external XML file in different languages and parsed into the HTML via JavaScript.
Now the problem is, there are also German umlauts, Russian, Chinese and Japanese symbols, and also right-to-left languages like Arabic and Farsi.
What would be the best way/solution? Is there an "international encoding" which can display all languages properly? Or is there any other solution you would suggest?
Thanks in advance!
All of the Unicode transformations (UTF-8, UTF-16, UTF-32) can encode all Unicode characters. You pick which one to use based on size: if most of your text is in Western scripts, probably UTF-8, since it uses only one byte for most of those characters, but 2, 3, or 4 where needed. If you're mostly encoding East Asian scripts, you'll probably want one of the other transformations.
The fundamental thing here is that it's all Unicode; the transformations are just different ways of representing the same characters.
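You can see the size trade-off directly in the browser console (TextEncoder always produces UTF-8):

const utf8 = new TextEncoder();
console.log(utf8.encode("a").length);  // 1 byte  (ASCII)
console.log(utf8.encode("é").length);  // 2 bytes (accented Latin)
console.log(utf8.encode("中").length); // 3 bytes (CJK)
console.log(utf8.encode("😎").length); // 4 bytes (outside the BMP)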
The co-founder of Stack Overflow had a good article on this topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Regardless of what encoding you use for your document, note that if you're processing these strings in JavaScript, JavaScript strings are UTF-16 (except that invalid values are tolerated), even if the document is in UTF-8 or UTF-32. This means that, for instance, each of those emojis people are so excited about these days looks like two "characters" to JavaScript, because each takes two words of UTF-16 to represent. Like 😎, for instance:
console.log("😎".length); // 2
So you'll need to be careful not to split up the two halves of characters that are encoded in two words of UTF-16.
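For example, iterating with for...of (or spreading into an array) walks code points rather than UTF-16 units, so surrogate pairs stay intact:

const s = "a😎b";
console.log(s.length);                    // 4 -- UTF-16 code units
console.log([...s].length);               // 3 -- actual code points
console.log([...s].slice(0, 2).join("")); // "a😎" -- a pair-safe "substring"

for (const ch of s) console.log(ch);      // "a", "😎", "b" -- never split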
The normal (and recommended) solution for multilingual sites is to use UTF-8. That can deal with any character that has been assigned a Unicode codepoint, with a couple of caveats:
Unicode is a versioned standard, and different JavaScript implementations may support different Unicode versions.
If your text includes characters outside of the Unicode Basic Multilingual Plane (BMP), then you need to do your text processing (in JavaScript) in a way that is Unicode-aware. For instance, if you use the JavaScript String class, you need to take proper account of surrogate pairs when doing text manipulation.
(A JavaScript String is actually encoded as UTF-16. While it has methods that let you work with Unicode codepoints, methods and attributes such as substring and length use code-unit rather than codepoint indexing. If you are not careful, you can end up splitting a string between the low and high halves of a surrogate pair, and the result will be something that cannot be displayed properly. This only affects codepoints in the higher planes ... but that includes the new emoji codepoints.)
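A minimal demonstration of the hazard, and of the codepoint-aware methods:

const emoji = "😎"; // U+1F60E, stored as the surrogate pair D83D DE0E
console.log(emoji.substring(0, 1));             // lone high surrogate -- not displayable
console.log(emoji.charCodeAt(0).toString(16));  // "d83d" -- just one half
console.log(emoji.codePointAt(0).toString(16)); // "1f60e" -- the real codepoint
console.log(String.fromCodePoint(0x1f60e));     // "😎" -- rebuilt correctly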
In JavaScript, the window.atob() method decodes a base64 string and the window.btoa() method encodes a string into base64.
Then why weren't they named like base64Decode() and base64Encode()?
atob() and btoa() don't make sense because they're not semantic at all.
I want to know the reason.
The atob() and btoa() methods allow authors to transform content to and from the base64 encoding.
In these APIs, for mnemonic purposes, the "b" can be considered to stand for "binary", and the "a" for "ASCII". In practice, though, for primarily historical reasons, both the input and output of these functions are Unicode strings.
From: http://www.w3.org/TR/html/webappapis.html#atob
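A quick round trip in the browser console:

console.log(btoa("hello"));    // "aGVsbG8=" -- "binary" (a JS string) to ASCII base64
console.log(atob("aGVsbG8=")); // "hello"    -- ASCII base64 back to "binary"

// Caveat: btoa() only accepts code units up to U+00FF; anything else throws.
// btoa("😎"); // InvalidCharacterError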
I know this is old, but it recently came up on Twitter, and I thought I'd share it as it is authoritative.
Me:
#BrendanEich did you pick those names?
Him:
Old Unix names, hard to find man pages rn but see https://www.unix.com/man-page/minix/1/btoa/ …. The names carried over from Unix into the Netscape codebase. I reflected them into JS in a big hurry in 1995 (after the ten days in May but soon).
In case the Minix link breaks, here's the man page content:
BTOA(1) BTOA(1)
NAME
btoa - binary to ascii conversion
SYNOPSIS
btoa [-adhor] [infile] [outfile]
OPTIONS
-a Decode, rather than encode, the file
-d Extracts repair file from diagnosis file
-h Help menu is displayed giving the options
-o The obsolete algorithm is used for backward compatibility
-r Repair a damaged file
EXAMPLES
btoa <a.out >a.btoa # Convert a.out to ASCII
btoa -a <a.btoa >a.out # Reverse the above
DESCRIPTION
Btoa is a filter that converts a binary file to ascii for transmission over a telephone line. If two file names are provided, the first is used for input and the second for output. If only one is provided, it is used as the input file. The program is a functionally similar alternative to uue/uud, but the encoding is completely different. Since both of these are widely used, both have been provided with MINIX. The file is expanded about 25 percent in the process.
SEE ALSO
uue(1), uud(1).
Source: Brendan Eich, the creator of JavaScript. https://twitter.com/BrendanEich/status/998618208725684224
To sum up the already given answers:
atob stands for ASCII to binary
e.g.: atob("ZXhhbXBsZSELCg==") == "example!\x0b\n"
btoa stands for binary to ASCII
e.g.: btoa("\x01\x02\xfe\xff") == "AQL+/w=="
Why ASCII and binary:
ASCII (the a) is the result of base64 encoding: safe text composed only of a subset of ASCII characters(*) that can be correctly represented and transported (e.g. in an email body),
binary (the b) is any stream of 0s and 1s (in JavaScript it must be represented with a string type).
(*) in base64 these are limited to: A-Z, a-z, 0-9, +, / and = (padding, only at the end) https://en.wikipedia.org/wiki/Base64
P.S. I must admit I was initially confused by the naming myself and thought the names were swapped. I thought that b stood for "base64 encoded string" and a for "any string" :D.
The names come from a unix function with similar functionality, but you can already read that in other answers here.
Here is my mnemonic for remembering which one to use. This doesn't really answer the question itself, but it might help people figure out which of the two functions to use without keeping a tab open on this Stack Overflow question all day long.
Beautiful to Awful btoa
Take something Beautiful (aka, beautiful content that would make sense to your application: json, xml, text, binary data) and transform it to something Awful, that cannot be understood as is (aka: encoded).
Awful to Beautiful atob
The exact opposite of btoa
Note
Some may say that binary is not beautiful, but hey, this is only a trick to help you.
I can't locate a source at the moment, but it is common knowledge that in this case, the b stands for 'binary', and the a for 'ASCII'.
Therefore, the functions are actually named:
ASCII to Binary for atob(), and
Binary to ASCII for btoa().
Note that these are browser APIs, kept for legacy / backwards-compatibility purposes. In Node.js, for example, they historically didn't exist (they were only added, as legacy globals, much later); you would use Buffer instead.
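A quick sketch of the Buffer equivalent in Node.js:

// Node.js round trip via Buffer, mirroring btoa()/atob():
const encoded = Buffer.from("hello", "utf8").toString("base64");
console.log(encoded); // "aGVsbG8="

const decoded = Buffer.from(encoded, "base64").toString("utf8");
console.log(decoded); // "hello"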
I've written a query for Mongo to search for a phone number. The gotcha is that the phone entry is a String rather than a Number. At first I thought it was working fine; however, I now realize that if the query isn't formatted exactly like the stored value, it will not match.
So I guess my question is what's the easiest way of matching a phone number regardless of formatting?
Worst case scenario, I use a $where statement and check equality by stripping the formatting from both values and matching on that. Just wondering if there is a more optimal way of doing this?
I would store the phone numbers normalized in the DB in the first place (e.g. either stripped of non-numeric characters, or formatted in a standard format). Since they are not already normalized, doing it on the fly for each search request will be expensive. So if you don't have too many entries already (e.g. if this is still in development), a script that normalizes all entries in one shot should be feasible (or in several batches during off-peak hours if you have a production system).
Then your where clause just normalizes the input, and the search becomes much easier.
The same goes for addresses, by the way: you have to normalize the data to get good search results, or you'll have to develop some fuzzy-matching algorithm, which is simply going to be slower (and might take more time to build than you think).
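A minimal sketch of that approach (the collection and field names here are made up):

// Normalize to digits only; use the same function at write time and at
// query time, so stored and searched values always match.
function normalizePhone(raw) {
  return String(raw).replace(/\D/g, ""); // "(555) 123-4567" -> "5551234567"
}

// Hypothetical usage with the Node.js MongoDB driver:
async function findByPhone(db, userInput) {
  return db.collection("contacts")
    .findOne({ phone: normalizePhone(userInput) });
}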
I have raw data in text file format with a lot of repetitive tokens (~25%). I would like to know if there's any algorithm which will help:
(A) store data in compact form
(B) yet, allow at run time to re-constitute the original file.
Any ideas?
More details:
the raw data is consumed in a pure HTML+JavaScript app, for instant search using regex.
data is made of tokens containing (case-sensitive) alphabetic characters, plus a few punctuation symbols.
tokens are separated by spaces and new lines.
Most promising algorithm so far: the succinct data structures discussed in the links below, but reconstituting the original looks difficult.
http://stevehanov.ca/blog/index.php?id=120
http://ejohn.org/blog/dictionary-lookups-in-javascript/
http://ejohn.org/blog/revised-javascript-dictionary-search/
PS: server-side gzip is employed right now, but it's only a transport-layer optimization and doesn't help maximize the use of offline storage, for example. Given the massive 25% repetitiveness, it should be possible to store the data more compactly, shouldn't it?
Given that the actual use case is pretty unclear, I have no idea whether this is helpful or not, but for the smallest total size (HTML + JavaScript + data) some people came up with the idea of storing text data in a greyscale .png file, one byte per pixel. A small loader script can then draw the .png to a canvas, read it pixel by pixel, and reassemble the original data. This gives you deflate compression without having to implement it in JavaScript. See e.g. here for more detailed information.
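For the record, the loader side of that trick looks roughly like this (a sketch; it assumes the text was written one byte per pixel into the red channel of a hypothetical data.png):

// Sketch of the png-as-storage loader described above.
const img = new Image();
img.onload = () => {
  const canvas = document.createElement("canvas");
  canvas.width = img.width;
  canvas.height = img.height;
  const ctx = canvas.getContext("2d");
  ctx.drawImage(img, 0, 0);

  const rgba = ctx.getImageData(0, 0, img.width, img.height).data;
  const bytes = [];
  for (let i = 0; i < rgba.length; i += 4) bytes.push(rgba[i]); // red channel only
  const text = String.fromCharCode(...bytes); // reassemble the original data
  console.log(text.length, "bytes reassembled");
};
img.src = "data.png"; // hypothetical file name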
Please, do not use a technique like that unless you have pretty esoteric requirements, e.g. a size-constrained programming competition. Your coworkers will thank you :-)
Generally speaking, it's a bad idea to try to implement compression in JavaScript. Compression is the exact type of work that JS is the worst at: CPU-intensive calculations.
Remember that JS is single-threaded [1], so for the entire time spent decompressing data, you block the browser UI. In contrast, HTTP-gzipped content is decompressed by the browser asynchronously.
Given that you have to reconstruct the entire dataset (so as to test every record against a regex), I doubt the Succinct Trie will work for you. To be honest, I doubt you'll get much better compression than the native gzipping.
[1] Web Workers notwithstanding.