I am browsing the deck.gl repo. It ships with some examples that include text files, for example this one. These files have a .txt extension, but aren't ordinary plain text:
!OohmwFjqwbMg#[?ADKJYXF#^?N?FAD
=wnmwFvvwbM_#WNg####C?C_#UA?AD#?Of#_#UTu#??BK?A??FUVP?#JF?AVP?#JF?AVPGTA?EL#?
=urmwF|swbM_#UFS##BK?C#C#A#E?CIGA?GE?CIGA#CF?#ABA#CJ##GR]Ud#wA\T?#DB?AXP?#DB?A\T
<aymwFnvwbMaAOKCA#OKPk#CCDKAADKAADKAADKAADKAAL_#fBjAIVCCEL
The examples also contain JavaScript files that look as though they are used to decode these files, for example this one for the file above.
What exactly is going on here? I assume this is a way of reducing the size of the data, but why not just rely on browser gzipping?
And why use a plain text extension when the file clearly isn't plain text? And why have a custom decoder at all?
It looks like a custom encoding that uses byte values to encode coordinates/GeoJSON features.
For example, this line from /dist-demo/data/building-data.txt:
!GqgmwFrhwbM}C}##K#IBO#IlBh#BOBMn#PHBGd#KC
is decoded using the decodePolyline() utility function into this array:
[
[0.00004,0.00001],
[40.70541,0.00002],
[40.7062,-74.01624],
[40.70619,-74.01593],
[40.70618,-74.01587],
[40.70616,-74.01582],
[40.70615,-74.01574],
[40.7056,-74.01569],
[40.70558,-74.0159],
[40.70556,-74.01582],
[40.70532,-74.01575],
[40.70527,-74.01584],
[40.70531,-74.01586],
[40.70537,-74.01605],
[40.70537,-74.01603]
]
which is substantially larger in JSON format.
So my guess would be that the main reason is to be able to use smaller data files that are still portable/cacheable. It's still line-based clear text, so it's diffable as well.
Also, these files are still compressible. I assume that a full JSON file is not only larger to begin with but also exhibits less favorable compression characteristics than this file. A quick test on building-data.txt shows a compression ratio of roughly 2:1 for gzip/deflate (139,089 bytes down to 72,660 bytes compressed). The compressed result for the same data in raw JSON won't come anywhere near that size.
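For illustration, here is a minimal sketch of how such a decoder typically works, assuming the standard encoded-polyline scheme (characters offset by 63, 5-bit chunks, zig-zag sign encoding, delta-encoded coordinates at 5-decimal precision); the actual decodePolyline() in the deck.gl examples may differ in details:
function decodePolyline(encoded, precision = 1e5) {
  const coords = [];
  let index = 0;
  let lat = 0;
  let lng = 0;

  // Reads one delta value: a run of 5-bit chunks, least significant first,
  // each stored in a character offset by 63, then zig-zag decoded so the
  // low bit carries the sign.
  function readDelta() {
    let shift = 0;
    let result = 0;
    let byte;
    do {
      byte = encoded.charCodeAt(index++) - 63;
      result |= (byte & 0x1f) << shift;
      shift += 5;
    } while (byte >= 0x20);
    return (result & 1) ? ~(result >> 1) : (result >> 1);
  }

  while (index < encoded.length) {
    lat += readDelta();
    lng += readDelta();
    coords.push([lat / precision, lng / precision]);
  }
  return coords;
}

// Canonical example string from the encoded-polyline documentation:
decodePolyline('_p~iF~ps|U_ulLnnqC_mqNvxq`@');
// -> [[38.5, -120.2], [40.7, -120.95], [43.252, -126.453]]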
Related
The R rgl package exports an HTML widget with the rglwidget() function, built using the htmlwidgets package. Often the data for this widget is quite large, and Pandoc and webshot2 choke on it.
I would like to try compressing the data when the HTML page is created, and uncompressing it in Javascript before display. I can see that there's a Javascript package pako that appears to do what I want, and it can be "browserified", but I can't see how to make it available to rglwidget(). Can anyone describe what's necessary?
Edited to add some more detail in response to a comment: This needs to happen on the server (or actually, even before the file gets to the server). htmlwidgets produces output in a Markdown document that is converted to HTML by Pandoc, and that step is failing because Pandoc chokes on the large JSON datasets. I'm not sure if it's just the size, or the complexity (I think Pandoc parses the JSON, and appears to blow up to a huge memory footprint before it crashes). I'm hoping that by using a base64 blob Pandoc will handle it better. webshot2 also has problems that may be the same.
2nd Edit: I've got some evidence that it's the size that matters, not the complexity. I used base64 encoding on the JSON to make it simpler (just one long string), but 33% bigger, and things were worse. So I'm back to thinking that compression would help.
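For what it's worth, the browser side of that plan could look roughly like this, assuming the payload is gzip-compressed and base64-encoded before being embedded in the HTML, and that a browserified build of pako is loaded on the page (the function and variable names are placeholders, not rglwidget API):
function decompressPayload(compressedBase64) {
  // base64 -> binary string -> byte array
  const bytes = Uint8Array.from(atob(compressedBase64), c => c.charCodeAt(0));
  // pako.ungzip inflates a gzip stream; use pako.inflate instead if the
  // server produced a zlib/deflate stream.
  const json = pako.ungzip(bytes, { to: 'string' });
  return JSON.parse(json);
}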
I am building a simple application where users can load any file into the Monaco editor in a web browser.
I'm trying to work out if the file that the user has loaded is text, and therefore editable.
In JavaScript, the library I am using returns the loaded file as an ArrayBuffer. Of course, I can just convert this to text regardless of whether it is text or binary and throw the result into the editor. Presumably binary converted to text will display as garbage in the Monaco editor.
I could also examine the MIME type of the loaded file. This would go a long way towards solving the problem, but it means I somehow have to know which MIME types are text, and I have not been able to find anything that specifies this. Also, it means any file without the correct MIME type set would not be editable.
So my question is: is there a way to know whether the contents of a JavaScript ArrayBuffer are text or binary data, such as an image or executable code, by examining the data itself rather than referring to the MIME type?
EDIT:
This question is not a duplicate of questions that are simply asking how to convert an ArrayBuffer to text. Simply converting an ArrayBuffer to text doesn't tell you whether the file contains editable text or whether it is a binary file. Additional information is needed, such as the magic number suggested in the answers to this question.
You can check the magic numbers of the ArrayBuffer. Magic numbers are constant byte sequences (usually at the start of a file's buffer) that you can check to distinguish between many file formats.
Wikipedia - Magic numbers
This NPM module uses that approach. Here's a list of the module's supported file types; you can see that it doesn't support text types.
For SVG you can use https://github.com/sindresorhus/is-svg.
For CSV you can use https://www.npmjs.com/package/detect-csv, but you can't be 100% sure, as they point out here.
UPDATE: I've written an article about this which contains more explanations and a small sandbox.
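For illustration, a minimal sketch of the magic-number idea in JavaScript, with a null-byte check as a rough fallback for formats that aren't in the signature list (the signatures shown and the 8000-byte sample size are just illustrative choices, not a complete detector):
// Guess whether an ArrayBuffer holds binary data rather than editable text.
function looksBinary(arrayBuffer) {
  const bytes = new Uint8Array(arrayBuffer);

  // A few well-known magic numbers (PNG, JPEG, PDF, ZIP/OOXML, GIF).
  const signatures = [
    [0x89, 0x50, 0x4e, 0x47], // PNG
    [0xff, 0xd8, 0xff],       // JPEG
    [0x25, 0x50, 0x44, 0x46], // PDF (%PDF)
    [0x50, 0x4b, 0x03, 0x04], // ZIP / OOXML
    [0x47, 0x49, 0x46, 0x38], // GIF8
  ];
  const matchesSignature = signatures.some(sig =>
    sig.every((b, i) => bytes[i] === b)
  );
  if (matchesSignature) return true;

  // Heuristic fallback: text files almost never contain NUL bytes.
  const sample = bytes.subarray(0, 8000);
  return sample.includes(0);
}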
I am new to XML. I have an XML document into which I am inserting data manually. I wanted to know if it is possible to include an image in an XML file directly, rather than by using a file path. I have found something about encoding, but I do not understand how this works, and the option is not even available in the XML editor. After storing the images in the XML file, I will access them using JavaScript. Please provide further information on this matter.
An image is binary data, and the usual way to store binary data in an XML document is by encoding it in base64 (which turns it into ASCII characters). Libraries to convert from binary to base64, and back, are widely available, but the details depend very much on your programming environment. There are also online services where you can upload an image and get back its base64 representation: an example is here https://www.base64encode.net/base64-image-encoder
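For the JavaScript side, a minimal sketch of reading such a base64 string back out of the XML and displaying it, assuming the image bytes are stored as the text content of an element (the element name, the PNG type, and the xmlString variable are illustrative placeholders):
// xmlString: the XML document's text, already loaded (placeholder name).
const xml = new DOMParser().parseFromString(xmlString, 'application/xml');
const base64Data = xml.querySelector('image').textContent.trim();

// A data: URI lets the browser render the embedded image directly.
const img = document.createElement('img');
img.src = 'data:image/png;base64,' + base64Data;
document.body.appendChild(img);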
I'm working on something that will read a user's text messages and export them to a CSV file, which they can then download. The messages are being retrieved from a third-party web interface; I am essentially using JS to grab the HTML of each message and compiling it as needed. The content of each message is added to a variable which, once all messages are gathered, is given to a new Blob, which is then downloaded.
The problem I am having is that, in this web interface, emoji are represented as images, rather than characters. Thus, when writing a message containing an emoji to a file, the result is as so:
"Blah blah blah <img height="18px" width="18px" class="emoji adjustedSpriteForMessageDisplay spriteEMOJI sprite-1f612" data-textvalue="%F0%9F%98%92" src="assets/blank.gif">"
Now, from this image, we can get 2 workable values:
The UTF-8 hex value
F09F9892
and the Unicode codepoint (I may be referring to this wrong, I don't know much about encoding).
U+1f612
Now, what I want to do is take either of these values (whichever works better) and write it to the CSV file as the character itself, so that, when viewing the CSV file in a text editor or what have you, it would appear as the actual emoji character (😒).
Though I have no idea where to even start with this. Maybe it's as simple as throwing some syntax around the character values, but I haven't been able to get anything from Google, because I'm not familiar enough with encoding to know what to Google.
I suggest preprocessing the data as you grab it from the webpage instead of extracting it from the string afterwards.
You can then use decodeURIComponent() to decode the percent-encoded string:
decodeURIComponent('%F0%9F%98%92')
Combine that with jQuery to access the data-textvalue attribute:
decodeURIComponent($(element).data('textvalue'))
I created a simple example on JSFiddle.
For some reason the emoji doesn't render correctly in the result screen in my browser, but that is a font issue. When looking at the result using a DOM inspector (or copying the text into a different application), the result is shown with a smiley.
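As an aside, if you are starting from the code point (U+1F612) or the percent-encoded UTF-8 bytes rather than the attribute itself, you can also build the character directly; a short sketch:
String.fromCodePoint(0x1f612);      // "😒" (from the code point)
decodeURIComponent('%F0%9F%98%92'); // "😒" (from the percent-encoded UTF-8 bytes)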
The CSV file format does not carry character encoding information, so Excel usually assumes ASCII.
https://en.wikipedia.org/wiki/Comma-separated_values#General_functionality
Microsoft Excel mangles Diacritics in .csv files?
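A common workaround when building the Blob for download is to prepend a UTF-8 byte order mark, so that Excel detects UTF-8 instead of assuming ASCII; a sketch (csvContent is a placeholder for the assembled message text):
// Prepend a UTF-8 BOM so spreadsheet applications detect the encoding.
const BOM = '\uFEFF';
const blob = new Blob([BOM + csvContent], { type: 'text/csv;charset=utf-8' });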
I heard a lot about base64 encoding for images in web design.
And I saw that a lot of developers use it for their headlines, with: data:image/png;base64,iVBORw0...
Is there some automated mechanism (with JavaScript) behind this?
Or have they all converted and inserted it by hand? (I can't believe that.)
Example: http://obox-inkdrop.tumblr.com/ (the headlines)
First of all, the encoding has to be done on the server side, be it:
automated with a script that reads the original image file and returns the base64-encoded string, to be injected into the HTML that's being generated,
or done by hand and placed directly into the HTML.
The base64 encoding cannot be done on the client-side, as the goal is to avoid sending the image file from the server to the browser (to minimize the number of HTTP requests).
Depending on the language that's used on the server side, you'll probably find a function to do base64 encoding.
In PHP, for instance, you might be interested in base64_encode().
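If the server side happens to run Node.js instead, a minimal sketch (the file name and MIME type are just illustrative):
const fs = require('fs');

// Read the image and build a data: URI to embed in the generated HTML.
const base64 = fs.readFileSync('logo.png').toString('base64');
const dataUri = 'data:image/png;base64,' + base64;
// e.g. use dataUri as the src of an <img> in the page template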