Is browser extensions' Storage API data stored UTF-8 encoded?

Is browser extensions' Storage API data stored UTF-8 encoded? - javascript

My currently "best understanding" is that Javascript Strings, while in memory are represented as DOMString, which means that the the unicode Glyph a (Latin Small Letter A) is represented by 2 bytes (in memory) using UTF-16 text encoding.
This encoding is as maintained when using the Browser's Storage API localStorage, where the documentation also states that what is stored is a DOMString, meaning contrary to popular myth one can usually store 10MB and not incorrectly 5MB in localStorage.
My question however is not with regard to window.localStorage, but instead with the web extensions Storage API browser.storage.local. With chromium I was able to test (using getBytesInUse) that the data stored was encoded using UTF-8, but I did not find any documentation/specification yet, which states what I up to know have only found out by experiment.
An answer to this question should tell if:
the browser extensions' Storage API data is stored UTF-8 encoded?
and provide a reference that specifies this to be that way.
Background / Rationale
I develop a browser extension, who stores text data, which I seek to compress before storage, to conserve space. Since the Storage API provided does not allow storing of raw binary data, I seek to tweak the compression algorithm to be least wasteful, making it counter-productive to base64 convert the binary data. To efficiently store information wihtin text, however it makes a huge difference which text-encoding is used.
The data stored in the browser extension is mostly compressed HTML markup, in English language, which would benefit most from the data storage using UTF-8 text encoding.
For reference I have checked/read through the following information regarding String types related to Browser's Javascript-engine and DOM-engine: String, DOMString, USVString`

Related

What is the logic for sending the base64 string into image?

I have generated a Base64 string, which I have shared as an image using Capacitor FileSharer,
for this I have used two approaches-
img.split(',')[1],
This I have understood as how it is giving me the image file from removing the "data:image" from the string.
img.replace(/^data:image\/[a-z]+;base64,/, "")
This I haven't understood properly as what functions it is performing to the string that I am getting a image file. Anyone If possible, do provide an explanation.
Though I have used both of them, and both works fine. It is only I am asking because ,If I am using any property in my project , I should now how it is actually working.
(PS- I am new to Javascript )

Introduction of Base64 encoding
In computer science, Base64 is a group of binary-to-text encoding schemes that represent binary data in an ASCII string format by translating it into a radix-64 representation. The term Base64 originates from a specific MIME content transfer encoding. Each Base64 digit represents exactly 6 bits of data. Three 8-bit bytes (i.e., a total of 24 bits) can therefore be represented by four 6-bit Base64 digits (you can read more here).
Where can we use Base64 encoding on images specifically?
Basically, there are multiple advantages in using base64 images or even files such as pdf, csv, etc., in web interactions:
For storying them easily in databases as string and retrieve them accordingly.
In JSON or XML based web architectures (such as REST or SOAP) usually is hard to send images along side with form data. For example, sending profile picture along side with user form data such as username, password, first name, last name, etc., in JSON format.
Security! Anyone who does not know anything about base64 encoding cannot open files easily as it should be.

is their is any size limit of the protocol buffer?

I am passing the data from my client to server and vice versa . I want to know is their is any size limit of the protocol buffer .

Citing the official source:
Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.
That said, Protocol Buffers are great for handling individual messages within a large data set. Usually, large data sets are really just a collection of small pieces, where each small piece may be a structured piece of data. Even though Protocol Buffers cannot handle the entire set at once, using Protocol Buffers to encode each piece greatly simplifies your problem: now all you need is to handle a set of byte strings rather than a set of structures.
Protocol Buffers do not include any built-in support for large data sets because different situations call for different solutions. Sometimes a simple list of records will do while other times you may want something more like a database. Each solution should be developed as a separate library, so that only those who need it need to pay the costs.
As far as I understand the protobuf encoding the following applies:
varints above 64-bit are not specified, but given how their encoding works varint bit-length is not limited by the wire-format (varint consisting of several 1xxxxxxx groups and terminated by a single 0xxxxxxx is perfectly valid -- I suppose there is no actual implementation supporting varints larger than 64-bit thought)
given the above varint encoding property, it should be possible to encode any message length (as varints are used internally to encode length of length-delimited fields and other field types are varints or have a fixed length)
you can construct arbitrarily long valid protobuf messages just by repeating a single repeated field ad-absurdum -- parser should be perfectly happy as long as it has enough memory to store the values (there are even parsers which provide callbacks for field values thus relaxing memory consumption, e.g. nanopb)
(Please do validate my thoughts)

WebSockets and text encoding

I read:
The WebSocket API accepts a DOMString object, which is encoded as
UTF-8 on the wire, or one of ArrayBuffer, ArrayBufferView, or Blob
objects for binary transfers.
A DOMString is a UTF-16 encoded string. So is it correct that UTF-8 encoding is used over the wire?

Yes, it is correct.
UTF-16 may or may not be used in memory, that is just an implementation detail of whatever framework you are using. In the case of JavaScript, strings are UTF-16.
For WebSocket communications, UTF-8 must be used over the wire for textual data (most Internet protocols use UTF-8 nowadays). That is dictated by the WebSocket protocol specification:
After a successful handshake, clients and servers transfer data back and forth in conceptual units referred to in this specification as "messages". On the wire, a message is composed of one or more frames. The WebSocket message does not necessarily correspond to a particular network layer framing, as a fragmented message may be coalesced or split by an intermediary.
A frame has an associated type. Each frame belonging to the same message contains the same type of data. Broadly speaking, there are types for textual data (which is interpreted as UTF-8 [RFC3629] text), binary data (whose interpretation is left up to the application), and control frames (which are not intended to carry data for the application but instead for protocol-level signaling, such as to signal that the connection should be closed). This version of the protocol defines six frame types and leaves ten reserved for future use.
...
Data frames (e.g., non-control frames) are identified by opcodes where the most significant bit of the opcode is 0. Currently defined opcodes for data frames include 0x1 (Text), 0x2 (Binary). Opcodes 0x3-0x7 are reserved for further non-control frames yet to be defined.
Data frames carry application-layer and/or extension-layer data. The opcode determines the interpretation of the data:
Text
The "Payload data" is text data encoded as UTF-8. Note that a particular text frame might include a partial UTF-8 sequence; however, the whole message MUST contain valid UTF-8. Invalid UTF-8 in reassembled messages is handled as described in Section 8.1.
Binary
The "Payload data" is arbitrary binary data whose interpretation is solely up to the application layer.
You will incure a small amount of overhead converting from UTF-16 to UTF-8 to UTF-16, but the overhead is minimal on modern machines, and conversions between UTFs are lossless.

Replacing Base64 - Is http/https communication 8 bit clean?

Here is an overview of what 8 bit clean means.
In the context of web applications, why are images saved as Base64? There is a 33% overhead associated with being 8 bit clean.
If the transmission method is safe there is no need for this.
But basically, my images are saved in Base64 on the server, and transferred to the client, which as we all know can read Base64.
Here is the client side version of Base 64 in an SO Post.
How can you encode a string to Base64 in JavaScript?
Is http/https 8 bit clean?
Reference
http://www.princeton.edu/~achaney/tmve/wiki100k/docs/8-bit_clean.html
http://en.wikipedia.org/wiki/8-bit_clean

You are asking two different things.
Q: Is http 8 bit clean?
A: yes HTTP is "bit 8 clean".
Q: In the context of web applications, why are images saved as Base64?
A: images are not usually saved in Base64. In fact, they are almost never. They are usually saved or transmitted or streamed in compressed binary format (PNG or JPG or similar)
Base64 is used to embed images inside the HTML.
So, you got an image logo.png. You include it statically in your page as <img src='logo.png'>. The image is transmitted thru HTTP in binary, no encoding in neither browser nor server side. This is the most common case.
Alternatively, you might decide to embed the contents of the image inside the HTML. It has some advantages: The browser will not need to do a second trip to the server to fetch the image, because the browser has already received it in the same HTTP GET response of the HTML file. But some disadvantages, because HTML files are text and certain character values may have special meaning for HTML (not for HTTP), you cannot just embed the binary values inside the HTML text. You have to encode them to avoid such collisions. The most usual encoding method is base64, which avoids all the collisions with only a 33% of overhead.

RFC 2616s abstract states:
A feature of HTTP is the typing and negotiation of data representation, allowing systems to be built independently of the data being transferred.
HTTP always starts with a text-only header and in this header the content-type is specified.
As long as sender and receiver agree on this contents type anything is possible.
HTTP relies on a reliable (recognize the wordplay) transport layer such as TCP. HTTPS only adds security to the transport layer (or between the transport layer and HTTP, not sure about this).
So yep, http(s) is 8 bit clean.
In addition to PAs answer and your question "But why use an encoding method that adds 33% overhead, when you don't need it?": because that's part of a different concept!
HTTP transfers data of any kind, and the http-content may be an html file with an embedded picture. But after receiving that html file a browser or some other renderer has to interpret the html content. And that follows different standards, which require arbitrary data to be encoded. html is not 8-bit clean, in fact it is not even 7-bit clean as there are many restrictions on the characters used and their order of appearance.

In the context of web applications, why are images saved as Base64?
There is a 33% overhead associated with being 8 bit clean.
Base64 is used to allow 8-bit binary data to be presented as printable text within the ASCII definition. This is only 7-bits, not 8 as the last 128 characters would be depending on set encoding (Latin1, UTF8 etc.) which means that the encoded data could be mangled if a different encoding type was set at client/receiver end compared to source.
As there aren't enough printable characters within ASCII to represent all 8-bit values (which has absolute values and aren't dependent on encoding itself) you need to "water out the bits" and base-64 keeps high enough numbers to enable the bytes to be represented as printable chars.
This is the 33% overhead you see as the byte values representing characters outside the printable range must be shifted to a value that becomes printable within the ASCII table; Base-64 allows this (you could also use quoted printable which was common in the past, ie. with Usenet, email etc.).
I'm thinking about writing another encoding type to remove the overhead.
Good luck :-)

Related to the query
Is HTTP 8-bit clean ?
HTTP protocol is not in entirety a 8-bit clean protocol.
HTTP Entity Body is 8-bit clean since there is a provision to suggest the content-type, allowing content-negotiation between the interacting entities as pointed by everyone in this thread.
However the request line , the headers and the status line are not 8-bit clean.
In order to send any binary information as part of
the request line, as part of query parameters / path segments
header
one must use one of the binary-to-text encoding to preserve the binary values.
For instance when sending a signature as part of query parameters or headers , which is the case of signed URL technique employed by CDN , the signature a binary information has to be encoded to preserve the binary value of it.

Passing an ActionScript JPG Byte Array to Javascript (and eventually to PHP)

Our web application has a feature which uses Flash (AS3) to take photos using the user's web cam, then passes the resulting byte array to PHP where it is reconstructed and saved on the server.
However, we need to be able to take this web application offline, and we have chosen Gears to do so. The user takes the app offline, performs his tasks, then when he's reconnected to the server, we "sync" the data back with our central database.
We don't have PHP to interact with Flash anymore, but we still need to allow users to take and save photos. We don't know how to save a JPG that Flash creates in a local database. Our hope was that we could save the byte array, a serialized string, or somehow actually persist the object itself, then pass it back to either PHP or Flash (and then PHP) to recreate the JPG.
We have tried:
- passing the byte array to Javascript instead of PHP, but javascript doesn't seem to be able to do anything with it (the object seems to be stripped of its methods)
- stringifying the byte array in Flash, and then passing it to Javascript, but we always get the same string:
ÿØÿà
Now we are thinking of serializing the string in Flash, passing it to Javascript, then on the return route, passing that string back to Flash which will then pass it to PHP to be reconstructed as a JPG. (whew). Since no one on our team has extensive Flash background, we're a bit lost.
Is serialization the way to go? Is there a more realistic way to do this? Does anyone have any experience with this sort of thing? Perhaps we can build a javascript class that is the same as the byte array class in AS?

I'm not sure why you would want to use Javascript here. Anyway, the string you pasted looks like the beginning of a JPG header. The problem is that a JPG will for sure contain NULs (characters with 0 as its value). This will most likely truncate the string (as it seems to be the case with the sample you posted). If you want to "stringify" the JPG, the standard approach is encoding it as Base 64.
If you want to persist data locally, however, there's a way to do it in Flash. It's simple, but it has some limitations.
You can use a local Shared Object for this. By default, there's a 100 Kb limit, which is rather inadequate for image files; you could ask the user to allot more space to your app, though. In any case, I'd try to store the image as JPG, not the raw pixels, since the difference in size is very significative.
Shared Objects will handle serialization / deserialization for you transparently. There are some caveats: not every object can really be serialized; for starters, it has to have a parameterless constructor; DisplayObjects such as Sprites, MovieClips, etc, won't work. It's possible to serialize a ByteArray, however, so you could save your JPGs locally (if the user allows for the extra space). You should use AMF3 as the encoding scheme (which is the default, I think); also, you should map the class you're serializing with registerClassAlias to preserve the type of serialized the object (otherwise it will be treated as an Object object). You only need to do it once in the app life cycle, but it must be done before any read / write to the Shared Object.
Something along the lines of:
registerClassAlias("flash.utils.ByteArray",ByteArray);
I'd use Shared Objects rather than Javascript. Just keep in mind that you'll most likely have to ask the user to give you more space for storing the images (which seems reasonable enough if you're allowing them to work offline), and that the user could delete the data at any time (just like he could delete their browser's cookies).
Edit
I realize I didn't really pay much attention the "we have chosen Gears to do so" part of your question.
In that case, you could give the base 64 approach a try to pass the data to JS. From the Actionscript side it's easy (grab one of the many available Base64 encoders/decoders out there), and I assume the Gear's API must have an encoder / decoder available already (or at least it shouldn't be hard to find one). At that point you'll probably have to turn that into a Blob and store it to disk (maybe using the BlobAPI, but I'm not sure as I don't have experience with Gears).

We Keep Coding

JavaScript is the programming language of the Web.