I'm working on something that will read a user's text messages and export them to a CSV file, which they can then download. The messages are retrieved from a third-party web interface: I am essentially using JS to grab the HTML of each message and compile it as needed. The content of each message is appended to a variable which, once all messages are gathered, is passed to a new Blob, which is then downloaded.
The problem I am having is that, in this web interface, emoji are represented as images rather than characters. Thus, when writing a message containing an emoji to a file, the result looks like this:
"Blah blah blah <img height="18px" width="18px" class="emoji adjustedSpriteForMessageDisplay spriteEMOJI sprite-1f612" data-textvalue="%F0%9F%98%92" src="assets/blank.gif">"
Now, from this image, we can get 2 workable values:
The UTF-8 hex value
F09F9892
and the Unicode codepoint (I may be referring to this wrong, I don't know much about encoding).
U+1f612
Now, what I want to do is take either of these values (whichever works better) and write it to the CSV file as the character itself, so that, when viewing the CSV file in a text editor or what have you, it would appear as 😒
However, I have no idea where to even start with this. Maybe it's as simple as throwing some syntax around the character values, but I haven't been able to get anything from Google, because I'm not familiar enough with encoding to know what to search for.
I suggest preprocessing the data as you grab it from the webpage instead of extracting it from the string afterwards.
You can then use decodeURIComponent() to decode the percent-encoded string:
decodeURIComponent('%F0%9F%98%92')
Combine that with jQuery to access the data-textvalue-attribute:
decodeURIComponent($(element).data('textvalue'))
I created a simple example on JSFiddle.
For some reason the emoji doesn't render correctly in the result screen in my browser, but that is a font issue. When looking at the result using a DOM inspector (or copying the text into a different application), the result is shown with a smiley.
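For readers who already have the message HTML as a string, here is a minimal sketch of the same idea without jQuery. The function names replaceEmojiImages and emojiFromSpriteClass are made up for illustration, and the code assumes every emoji image carries either a percent-encoded data-textvalue attribute or a sprite-<hex code point> class, as in the sample from the question. Note that decodeURIComponent throws a URIError on malformed input, so real data may need a try/catch.

```javascript
// Sketch: replace emoji <img> tags in a message's HTML with the real character.
// Assumes each emoji image has a percent-encoded data-textvalue attribute.
function replaceEmojiImages(html) {
  return html.replace(
    /<img\b[^>]*\bdata-textvalue="([^"]*)"[^>]*>/g,
    (match, encoded) => decodeURIComponent(encoded)
  );
}

// Alternative: build the character from the hex code point in the class name.
function emojiFromSpriteClass(className) {
  const found = /\bsprite-([0-9a-f]+)\b/i.exec(className);
  return found ? String.fromCodePoint(parseInt(found[1], 16)) : '';
}

const message = 'Blah blah blah <img height="18px" width="18px" ' +
  'class="emoji adjustedSpriteForMessageDisplay spriteEMOJI sprite-1f612" ' +
  'data-textvalue="%F0%9F%98%92" src="assets/blank.gif">';

console.log(replaceEmojiImages(message));
console.log(emojiFromSpriteClass('emoji spriteEMOJI sprite-1f612'));
```

Both calls produce the same character (U+1F612), so either value from the image can be used; the percent-encoded one needs no parsing of the class list.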
The CSV file format carries no character encoding information, so Excel has to guess and usually assumes a legacy single-byte encoding (often described as ASCII or ANSI) rather than UTF-8.
https://en.wikipedia.org/wiki/Comma-separated_values#General_functionality
Microsoft Excel mangles Diacritics in .csv files?
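A commonly used workaround is to start the file with a UTF-8 byte order mark, which Excel generally takes as a hint that the file is UTF-8. The sketch below is an assumption about how the export could be put together, not the asker's actual code; buildCsv and csvField are hypothetical helpers, and the Blob constructor is assumed to be available (browsers, or a recent Node.js).

```javascript
// Sketch: prefix the CSV text with a UTF-8 byte order mark (U+FEFF).
const UTF8_BOM = '\uFEFF';

// Quote a field the usual CSV way: wrap in quotes, double any embedded quotes.
function csvField(value) {
  return '"' + String(value).replace(/"/g, '""') + '"';
}

// Join rows with CRLF and put the BOM in front of the whole text.
function buildCsv(rows) {
  const body = rows.map(row => row.map(csvField).join(',')).join('\r\n');
  return UTF8_BOM + body;
}

const csvText = buildCsv([
  ['sender', 'message'],
  ['Alice', 'Blah blah blah \u{1F612}'],
]);

// The text then goes into the Blob that is offered for download.
const csvBlob = new Blob([csvText], { type: 'text/csv;charset=utf-8' });
console.log(csvBlob.size);
```

Text editors that understand UTF-8 will show the emoji either way; the BOM mainly matters for spreadsheet applications that would otherwise guess the encoding.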
I am building a simple application where users can load any file into the Monaco editor in a web browser.
I'm trying to work out if the file that the user has loaded is text, and therefore editable.
In JavaScript, the library I am using to load returns the loaded file as an ArrayBuffer. Of course I can just convert this to text regardless of whether or not it is text or binary and throw the result into the editor. Presumably binary converted to text will display as garbage in the Monaco editor.
I could also examine the mime type of the loaded file. This would go a long way towards solving the problem, but it means I somehow have to know which mime types are text- I have not been able to find anything that specifies this. Also, it means any file without the correct mime type set would not be editable.
So my question is, is there a way to know if the contents of a JavaScript ArrayBuffer is text or binary data such as an image or executable code, by examining the data itself, rather than referring to mime type?
EDIT:
This question is not a duplicate of questions that are simply asking how to convert an ArrayBuffer to text. Simply converting an ArrayBuffer to text doesn't tell you whether or not this is a file that contains editable text or a binary file. Additional information is needed, such as the magic number suggested in the answers to this question.
You can check the magic numbers of the ArrayBuffer. Magic numbers are constant byte sequences, usually at the start of a file, that you can check to distinguish between many file formats.
Wikipedia - Magic numbers
This NPM module uses that approach. Here's a list of the module's supported file types; you can see that they don't support text types.
For SVG you can use https://github.com/sindresorhus/is-svg.
For CSV you can use https://www.npmjs.com/package/detect-csv, but you can't be 100% sure, as they say here.
UPDATE: I've written an article about this which contains more explanation and a small sandbox.
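Since magic numbers identify known binary formats but not plain text, a complementary heuristic (a different technique from the magic-number check, and not something the module above does) is to test whether the bytes contain no NUL bytes and decode as strict UTF-8. The sketch below is only a guess-based check: the name looksLikeText and the 8192-byte sample size are arbitrary choices, and UTF-16 text would be reported as binary because it contains NUL bytes.

```javascript
// Heuristic sketch: treat a buffer as text if a leading sample has no NUL
// bytes and is valid UTF-8. This is a reasonable guess, not a guarantee.
function looksLikeText(arrayBuffer, sampleSize = 8192) {
  const bytes = new Uint8Array(arrayBuffer).subarray(0, sampleSize);
  if (bytes.includes(0)) return false; // NUL bytes are rare in text files
  try {
    // fatal: true makes invalid UTF-8 throw instead of inserting U+FFFD.
    // stream: true tolerates a multi-byte character cut off by the sample.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes, { stream: true });
    return true;
  } catch (err) {
    return false;
  }
}

const textBuffer = new Uint8Array([0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F]).buffer; // "héllo"
const pngBuffer = new Uint8Array([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]).buffer;

console.log(looksLikeText(textBuffer)); // true
console.log(looksLikeText(pngBuffer));  // false
```

A file that passes this check could still be something the editor should not open (for example a huge log file), so it works best combined with the MIME type and magic-number checks rather than as a replacement.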
I have a successfully running script that loads Word files from SharePoint and inserts them into Word 2017 (Office 365 Word local client, not online).
The current script reads the files using Ajax, extracts the base64 file, and uses
body.insertFileFromBase64(myBase64, end)
I now need to extend the functionality to support Word 2013 (i.e. use Office.js instead of the Word JavaScript API). So the code has changed to
Office.context.document.setSelectedDataAsync(file, someCoercionType)
I hoped to be able to use a variant of
Office.context.document.setSelectedDataAsync(myBase64, {coercionType: Office.CoercionType.Ooxml}, function (
But I get an error back "The Format of the specified data object is invalid", which is correct enough as the Office API assumes a base64 file is an image.
Is it possible to convert the Base64 file to XML in JavaScript? (Elsewhere in my code I unzip the docx and extract bookmarks, but only from document.xml which lacks all formatting and images, footers etc.)
Base64 is simply a binary encoding and is blissfully unaware of the underlying content type. So if your source content was OOXML, decoding it would give you that OOXML back. What you cannot do is type conversion. For example, a Base64-encoded JPEG cannot be decoded directly into a BMP. To do that you would need to first decode and then convert from JPEG to BMP using some other tool.
If you're seeking to manipulate or extract content from an existing document, you may want to look at Aspose.Words. Aspose provides tools that allow you to programmatically work with Word documents (they have similar tools for a slew of other file types as well). Using this, you may be able to extract the OOXML you're looking for so you can then insert it into Word using Office.js.
At the moment, the only Coercion Type that accepts Base64 encoded content is Office.CoercionType.Image.
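To make the "decoding only gives back what was encoded" point concrete, here is a small sketch that decodes a Base64 string and looks at the first bytes. The helper names base64ToBytes and describePayload are invented for this example, and the check is deliberately crude: ZIP archives (which is what a .docx file is) start with the bytes "PK", while XML text normally starts with "<" unless it has a byte order mark.

```javascript
// Decode a Base64 string into raw bytes. atob is assumed to be available
// (browsers and recent Node.js versions provide it globally).
function base64ToBytes(base64) {
  const binary = atob(base64);
  return Uint8Array.from(binary, ch => ch.charCodeAt(0));
}

// Guess what the decoded payload is from its first bytes.
function describePayload(base64) {
  const bytes = base64ToBytes(base64);
  if (bytes[0] === 0x50 && bytes[1] === 0x4B) return 'zip'; // "PK": .docx, .zip, ...
  if (bytes[0] === 0x3C) return 'xml';                      // "<": markup text
  return 'unknown';
}

console.log(describePayload(btoa('PK\x03\x04')));          // zip
console.log(describePayload(btoa('<w:body></w:body>')));   // xml
```

If the payload reports as a ZIP container, the Base64 string holds a whole .docx package, so the XML would still have to be extracted from it before it could be passed as OOXML.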
I'm trying to use ☰ in an external JavaScript file
$('<div />',{
text: '☰',
......
But I couldn't save the file, and it's saying:
The document's current encoding can not correctly save all of the characters within the document. You may want to change to UTF-8 or an encoding that supports the special characters in this document.
What should I do?
You should convert the file to UTF-8, and then try pasting the character in, again, after it's converted and saved.
Your file could be in one of many, many formats, depending on your editor, but if you're just using a text-editor like Notepad, it's going to cause you problems with things that don't fit happily into ASCII.
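If changing the file's encoding is not an option, another approach (not mentioned in the answer above) is to keep the source file pure ASCII and write the character as a Unicode escape. The sketch below assumes the character is U+2630; the variable names are illustrative only.

```javascript
// Sketch: write the character as an escape so the file itself stays ASCII.
const menuIcon = '\u2630';                          // escape for the trigram character
const viaCodePoint = String.fromCodePoint(0x2630);  // same character, built from its code point

console.log(menuIcon === viaCodePoint); // true

// With jQuery this would be used the same way as the literal character:
// $('<div />', { text: '\u2630' });
```

The escape is resolved by the JavaScript engine when the script runs, so the result on the page is identical to typing the character directly into a UTF-8 file.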
I am trying to use docx.js to generate a Word document but I can't seem to get it to work.
I copied the raw code into the Google Chrome console after amending line 247 to fix a "'textAlign' undefined error"
if (inNode.style && inNode.style.textAlign){..}
Which makes the function convertContent available. The result of which is an Object e.g.
JSON.stringify( convertContent($('<p>Word!</p>')[0]) )
Results in -
"{"string":
"<w:body>
<w:p>
<w:r>
<w:t xml:space=\"preserve\">Word!</w:t>
</w:r>
</w:p>
</w:body>"
,"charSpaceCount":5
,"charCount":5,
"pCount":1}"
I copied
<w:body>
<w:p>
<w:r>
<w:t xml:space="preserve">Word!</w:t>
</w:r>
</w:p>
</w:body>
into Notepad++ and saved it as a file with an extension of 'docx', but when I open it in MS Word it says 'cannot be opened because there is a problem with the contents'.
Am I missing some attribute or XML tags or something?
You can generate a docx document from a template using docxtemplater (a library I created).
It can replace tags by their values (like a template engine), and also replace images in a paid version.
Here is a demo of the templating engine: https://docxtemplater.com/demo/
This code can't work on JSFiddle because of the Ajax calls to local files (everything that is in the blankfolder); alternatively, you would have to enter all files in ByteArray format and use the jsFiddle echo API: http://doc.jsfiddle.net/use/echo.html
I know this is an older question and you already have an answer, but I struggled getting this to work for a day, so I thought I'd share my results.
Like you, I had to fix the textAlign bug by changing the line to this:
if (inNode.style && inNode.style.textAlign)
Also, it didn't handle HTML comments. So, I had to add the following line above the check for a "#text" node in the for loop:
if (inNodeChild.nodeName === '#comment') continue;
To create the docx was tricky since there is absolutely no documentation on this thing as of yet. But looking through the code, I see that it is expecting the HTML to be in a File object. For my purposes, I wanted to use the HTML I rendered, not some HTML file the user has to select to upload. So I had to trick it by making my own object with the same property that it was looking for and pass it in. To save it to the client, I use FileSaver.js, which requires a blob. I included this function that converts base64 into a blob. So my code to implement it is this:
var result = docx({ DOM: $('#myDiv')[0] });
var blob = b64toBlob(result.base64, "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
saveAs(blob, "test.docx");
In the end, this would work for simple Word documents, but it isn't nearly sophisticated enough for anything more. I couldn't get any of my styles to render and I didn't even attempt to get images working. I've since abandoned this approach and am now researching DocxgenJS or some server-side solution.
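The b64toBlob function referred to in this answer is not included in the text. A minimal sketch of what such a helper might look like follows; this is an assumption for illustration, not the answerer's actual function, and it relies on atob and Blob being available (browsers, or a recent Node.js).

```javascript
// Hypothetical b64toBlob: decode Base64 and wrap the raw bytes in a Blob.
function b64toBlob(base64, contentType = '') {
  const binary = atob(base64);                 // one character per decoded byte
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Blob([bytes], { type: contentType });
}

// Small demonstration with four bytes instead of a real document.
const demoBlob = b64toBlob(btoa('PK\x03\x04'), 'application/zip');
console.log(demoBlob.size, demoBlob.type); // 4 application/zip
```

For large documents, decoding in chunks would avoid building one very long intermediate string, but for small files the straightforward loop is usually enough.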
You may find this link useful,
http://evidenceprime.github.io/html-docx-js/
An online demo here:
http://evidenceprime.github.io/html-docx-js/test/sample.html
You are doing the correct thing codewise, but your file is not a valid docx file. If you look through the docx() function in docx.js, you will see that a docx file is actually a zip containing several xml files.
I am using Open Xml SDK for JavaScript.
http://ericwhite.com/blog/open-xml-sdk-for-javascript/
Basically, on the web server, I have an empty docx file as a new template.
When the user clicks "new docx file" in the browser, I retrieve the empty docx file as a template, convert it to Base64, and return it as the Ajax response.
In the client script, you convert the Base64 string to a byte array and use openxmlsdk.js to load the byte array as a JavaScript OpenXmlPackage object.
Once you have the package loaded, you can use regular OpenXmlPart to create a real document (inserting images, creating tables/rows).
The last step is to stream it out to the end user as a document. This part is security related. In my code I send it back to the web server, where it is saved temporarily, and prepare an HTTP response to notify the end user to download it.
Check the URL above, there are useful samples of doing this in JavaScript.
I have an HTML page where I can fill in some text and send it (with JavaScript) to an SQL database.
On my PC everything works fine, but on another one (a French Windows) it doesn't save my characters correctly.
French characters like é, è, â, ... were saved as 'Ã‰', or something like that.
I googled a lot but still have not found a solution, and I'm also not able to reproduce the problem on my own PC.
"Ã‰" occurs when a character encoded in UTF-8 (2 bytes) is read as Latin-1 (1 byte per character). The problem can be on the client side (e.g. through the use of escape) or on the server side (wrong parsing of the form's POST data, database encoding).
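The effect described in this answer can be reproduced in a few lines, which may help when the problem only shows up on someone else's machine. The sketch below encodes a string as UTF-8 and then deliberately misreads each byte as one ISO-8859-1 (Latin-1) character; the function names are invented for the example. Windows-1252 differs from Latin-1 for bytes 0x80 to 0x9F, so the exact garbage characters can vary with the encoding that does the misreading.

```javascript
// Encode text as UTF-8, then misread each byte as one Latin-1 character.
function misreadUtf8AsLatin1(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, b => String.fromCharCode(b)).join('');
}

// Reverse the damage: turn each character back into a byte, decode as UTF-8.
// Only valid when every character of the garbled text is below U+0100.
function repairMojibake(garbled) {
  const bytes = Uint8Array.from(garbled, ch => ch.charCodeAt(0));
  return new TextDecoder('utf-8').decode(bytes);
}

const garbled = misreadUtf8AsLatin1('\u00e9'); // é
console.log(garbled.length);                   // 2: one character became two
console.log(repairMojibake(garbled) === '\u00e9'); // true
```

Seeing two characters where one was typed is the usual sign that UTF-8 bytes were interpreted with a single-byte encoding somewhere between the form and the database.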
Make sure that your HTML page's encoding is set to something like UTF-8 or UTF-16. Also make sure that your strings are escaped properly in JavaScript.
You need to encode the file in ANSI. I do this myself. For example, in Notepad 2 you would click File->Encoding->ANSI and then save.