JavaScript export CSV encoding UTF-8 issue

I need to export a JavaScript array to a CSV file and download it. I did that, but the characters 'ı,ü,ö,ğ,ş' appear as 'Ä± Ã¼ Ã¶ ÄŸ ÅŸ' in the CSV file. I have tried many solutions recommended on this site, but none of them worked for me.
I have added my code snippet below. Can anyone solve this problem?
var csvString = 'ı,ü,ö,ğ,ş';
var a = window.document.createElement('a');
a.setAttribute('href', 'data:text/csv; charset=utf-8,' + encodeURIComponent(csvString));
a.setAttribute('download', 'example.csv');
a.click();

This depends on which program opens the example.csv file. In a text editor the file is read as UTF-8 and the characters are not malformed. Excel, however, assumes ANSI rather than UTF-8 as the default encoding for CSV, so unless Excel is forced to use UTF-8 instead of ANSI, the characters will be malformed.
Excel can be forced to use UTF-8 for CSV by putting a BOM (Byte Order Mark) at the very start of the file. The UTF-8 BOM is the byte sequence 0xEF,0xBB,0xBF, so one could think that simply putting "\xEF\xBB\xBF" as the first bytes of the string would be the solution. But surely that would be too simple, wouldn't it? ;-) The problem is stopping JavaScript from treating those bytes as three separate characters. The "solution" is to use the "universal BOM" "\uFEFF", as mentioned in Special Characters (JavaScript).
Example:
var csvString = 'ı,ü,ü,ğ,ş,#Hashtag,ä,ö';
var universalBOM = '\uFEFF'; // U+FEFF, serialized as EF BB BF in UTF-8
var a = window.document.createElement('a');
a.setAttribute('href', 'data:text/csv;charset=utf-8,' + encodeURIComponent(universalBOM + csvString));
a.setAttribute('download', 'example.csv');
window.document.body.appendChild(a);
a.click();
See also Adding UTF-8 BOM to string/Blob.
Using this, the encoding will be correct. Nevertheless, it only works properly if comma is the default list separator in your Windows locale settings. If instead, for example, semicolon is the default list separator, all content ends up in the first column because Excel does not split on commas. In that case you have to use semicolon as the delimiter in the CSV as well, as sketched below. But that is another problem, and it leads to the conclusion of not using CSV at all but instead using libraries that can create Excel files (*.xls or *.xlsx) directly.
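A minimal sketch of the same download using semicolon as the delimiter (only needed for locales where Excel expects ';' as the list separator; the file name is illustrative):
// Same BOM trick, but with semicolon-separated values for ';' locales.
var csvString = 'ı;ü;ö;ğ;ş';
var universalBOM = '\uFEFF';
var a = window.document.createElement('a');
a.setAttribute('href', 'data:text/csv;charset=utf-8,' + encodeURIComponent(universalBOM + csvString));
a.setAttribute('download', 'example-semicolon.csv');
window.document.body.appendChild(a);
a.click();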

Related

How to change the encoding of a CSV file to ISO-8859-1 (ISO-Latin-1)

I have searched for far too long to solve my problem.
There is a piece of software that can only read CSV files with ISO-8859-1 (ISO-Latin-1) encoding. I have tried almost everything I could find on the web, but nothing worked. I don't want to change the text, I want to change the encoding.
I've tried working with this library: https://github.com/inexorabletash/text-encoding, and with the PapaParse library, and much more. But they just convert the text, so weird symbols replace ä, ö, ü and other characters.
The CSV file itself contains characters that are not part of the character set you are encoding into. You might find that the line endings contain a character that does not exist in the character set you want to use. If you open the CSV file in a basic text editor, or at the DOS prompt use "type myFile.csv", you will be able to see which characters are there in the most basic form. Then strip them out and you should have a file that can be converted. Always work on a copy of the original. The easiest way is to search and replace in a text editor, replacing each unwanted character with nothing, not even a space. Keep in mind that if the character is part of the CSV structure, e.g. a delimiter, you will want to replace it with the Latin-1 equivalent character instead.
I found it myself, but thank you anyway.
In JS:
const uint8array = new TextEncoder(
  'windows-1252',
  { NONSTANDARD_allowLegacyEncoding: true } // non-standard option provided by the encoding.js polyfill
).encode(csv_data);
const blob = new Blob([uint8array]);
(For those who found this question while searching: csv_data is a string containing the values from the CSV file that was read. Make sure to add a ';' after every value to move to the next column and a '\n' to move to the next line.)
In HTML:
<script>
window.TextEncoder = window.TextDecoder = null;
</script>
<script src="encoding-indexes.js"></script>
<script src="encoding.js"></script>
And you have to download encoding.js and encoding-indexes.js from https://github.com/inexorabletash/text-encoding
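A sketch of how the encoded bytes might then be turned into a download (this part is not in the original answer; it follows the same object-URL pattern used elsewhere on this page, and the file name is illustrative):
// Requires the encoding.js polyfill loaded as shown above.
const uint8array = new TextEncoder(
  'windows-1252',
  { NONSTANDARD_allowLegacyEncoding: true }
).encode(csv_data);
const blob = new Blob([uint8array], { type: 'text/csv' });
const a = document.createElement('a');
a.href = URL.createObjectURL(blob); // an object URL keeps the raw bytes intact
a.download = 'latin1.csv';
document.body.appendChild(a);
a.click();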

Parse UTF-8 XML in javascript

I'm trying to load and parse a simple UTF-8-encoded XML file in JavaScript using Node and the xpath and xmldom packages. No XML namespaces are used, and the same XML parses fine when converted to ASCII. In the VS Code debugger I can see that the string has spaces embedded between each character (surely due to loading the UTF-8 file incorrectly), but I can't find a way to properly load and parse the UTF-8 file.
Code:
var xpath = require('xpath'),
    dom = require('xmldom').DOMParser;
const fs = require('fs');

var myXml = "path_to_my_file.xml";
var xmlContents = fs.readFileSync(myXml, 'utf8').toString();
// this line causes errors parsing every single tag, as the tag names
// have spaces in them from improper utf-8 decoding
var doc = new dom().parseFromString(xmlContents, 'application/xml');
var cvNode = xpath.select1("//MyTag", doc);
console.log(cvNode.textContent);
The code works fine if the file is ASCII (textContent has the proper data), but if it is UTF-8 then there are a number of parsing errors and cvNode is undefined.
Is there a proper way to parse UTF-8 XML in node/javascript? I can't for the life of me find a decent example.
When you see additional white space between each letter, this suggests that the file isn't actually encoded as UTF-8 but uses a 16-bit Unicode encoding.
Try 'utf16le'.
For a list of supported encodings see Buffers and Character Encodings.
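A minimal sketch of the one-line change (assuming the file really is UTF-16LE, which the apparent "spaces" between characters suggest):
const fs = require('fs');
// read as UTF-16 little endian instead of utf8
var xmlContents = fs.readFileSync('path_to_my_file.xml', 'utf16le');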

JavaScript: how to Download CSV file containing French text with UTF-8 encoding

I am trying to download a CSV file using JavaScript. That CSV contains text in the French language.
I am creating the file using a Blob:
var blob = new Blob([csv], {
  type: 'text/csv;charset=utf-8'
});
var csvUrl = window.URL.createObjectURL(blob);
But when I open the CSV file, it shows the content as ANSI, due to which the French characters are converted to unknown characters.
But if I open the same file as text, the encoding is fine (UTF-8).
Example: the text I am trying to download is "Opération"
Text visible in Excel (.csv) ==> Opération <== with unknown characters
Text if opened in Notepad++ ==> Opération <== correct UTF-8
How can I make the downloaded CSV file open directly with UTF-8 encoding?
I don't want the user to change anything in Excel. I want some JavaScript approach so that Excel recognizes all my characters regardless of encoding. Is it possible to do something?
I met the same issue, and I found the following solution:
similar problem and solution
The only thing to do is add "\ufeff" at the beginning of your CSV string:
var csv = "\ufeff" + CSV; // CSV holds the original csv content
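For completeness, a minimal sketch combining the BOM with the Blob download from the question (the CSV content and file name are illustrative):
var csvContent = '\ufeff' + 'Opération;Préférence\n1;2';
var blob = new Blob([csvContent], { type: 'text/csv;charset=utf-8' });
var csvUrl = window.URL.createObjectURL(blob);
var a = document.createElement('a');
a.href = csvUrl;
a.download = 'french.csv';
document.body.appendChild(a);
a.click();
window.URL.revokeObjectURL(csvUrl); // release the object URL once clicked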

JavaScript FileReader readAsText function not understanding utf-8 encoded characters like ä and ö

I have searched a lot for this and nothing helped me. I have an import-from-CSV feature, and JavaScript code reads the CSV content line by line. The characters ä, ö, etc. are simply not recognized. FileReader readAsText uses UTF-8 as its default encoding, but in this case it is not working for some reason. Here is my code (CoffeeScript):
reader = new FileReader()
reader.onload = (e) =>
  result = e.target.result
  console.log result
  # file content
  fileContent = result.split("\r")
reader.readAsText(e.target.files.item(0))
I have tried defining the encoding like below, and whatever I put there didn't help.
encoding = "UTF-8"
reader.readAsText(e.target.files.item(0), encoding)
I got this to work by using ISO Latin 4 encoding:
reader.readAsText(e.target.files.item(0), 'ISO-8859-4');
That should work for you, but remember to use this particular encoding only for certain Scandinavian and Baltic characters; if it works, the file was not actually UTF-8 in the first place.
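For reference, a plain-JavaScript sketch of the same read with the explicit encoding (the file input element and handler wiring are assumptions, not part of the original answer):
var input = document.querySelector('input[type="file"]'); // assumed file input
input.addEventListener('change', function (e) {
  var reader = new FileReader();
  reader.onload = function () {
    // file content, split line by line
    var fileContent = reader.result.split('\r');
    console.log(fileContent);
  };
  reader.readAsText(e.target.files.item(0), 'ISO-8859-4');
});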

Adding UTF-8 BOM to string/Blob

I need to add a UTF-8 byte-order-mark to generated text data on client side. How do I do that?
Using new Blob(['\xEF\xBB\xBF' + content]) yields 'ï»¿"my data"', of course.
Neither did '\uBBEF\x22BF' work (with '\x22' == '"' being the next character in content).
Is it possible to prepend the UTF-8 BOM in JavaScript to a generated text?
Yes, I really do need the UTF-8 BOM in this case.
Prepend \ufeff to the string. See http://msdn.microsoft.com/en-us/library/ie/2yfce773(v=vs.94).aspx
See the discussion between @jeff-fischer and @casey for details on UTF-8 and UTF-16 and the BOM. What actually makes the above work is that the string \ufeff is always used to represent the BOM, regardless of whether UTF-8 or UTF-16 is used.
See p. 36 in The Unicode Standard 5.0, Chapter 2 for a detailed explanation. A quote from that page:
The endian order entry for UTF-8 in Table 2-4 is marked N/A because UTF-8 code units are 8 bits in size, and the usual machine issues of endian order for larger code units do not apply. The serialized order of the bytes must not depart from the order defined by the UTF-8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.
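Applied to the Blob from the question, that means (a minimal sketch, reusing the question's content variable):
// U+FEFF at the start of the string serializes as EF BB BF once the
// Blob encodes the string as UTF-8.
var blob = new Blob(['\ufeff' + content], { type: 'text/plain;charset=utf-8' });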
I had the same issue and this is the solution I came up with:
var blob = new Blob([
  new Uint8Array([0xEF, 0xBB, 0xBF]), // UTF-8 BOM
  "Text",
  ... // Remaining data
], { type: "text/plain;charset=utf-8" });
Using Uint8Array prevents the browser from converting those bytes into a string (tested on Chrome and Firefox).
You should replace text/plain with your desired MIME type.
I'm editing my original answer. The above answer really demands elaboration, as this is a convoluted solution on Node.js's part.
The short answer is: yes, this code works.
The long answer is: no, U+FEFF is not the byte order mark for UTF-8. Apparently Node took some sort of shortcut for writing encodings within files. When written as UTF-16 little endian, U+FEFF becomes the bytes FF FE, as can be seen in the Byte Order Mark Wikipedia article and also in a binary text editor after writing the file. I've verified this is the case.
http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
Apparently, Node.js uses \ufeff to signify any number of encodings. It takes the \ufeff marker and converts it into the correct byte order mark based on the third options parameter of writeFile, in which you pass the encoding string. Node.js takes that encoding string and converts the fixed \ufeff character into the actual encoding's byte order mark.
UTF-8 example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf8' }, function(err) {
  /* The actual byte order mark written to the file is EF BB BF */
});
UTF-16 little endian example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf16le' }, function(err) {
  /* The actual byte order mark written to the file is FF FE */
});
So, as you can see, \ufeff is simply a marker standing in for any number of resulting encodings. The actual encoding that makes it into the file depends directly on the encoding option specified. The marker used within the string is really irrelevant to what gets written to the file.
I suspect the reasoning behind this is that they chose not to write byte order marks automatically, and the 3-byte mark for UTF-8 isn't easily encoded into a JavaScript string to be written to disk. So they used U+FEFF as a placeholder mark within the string, which gets substituted at write time.
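One way to verify this behaviour (a sketch; the file name is illustrative):
const fs = require('fs');
// U+FEFF is substituted with the UTF-8 BOM bytes at write time
fs.writeFileSync('bom-test.txt', '\ufeff' + 'hi', { encoding: 'utf8' });
console.log(fs.readFileSync('bom-test.txt')); // <Buffer ef bb bf 68 69>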
This is my solution:
var blob = new Blob(["\uFEFF" + csv], {
  type: 'text/csv;charset=utf-8'
});
This works for me:
let blob = new Blob(["\ufeff", csv], { type: 'text/csv;charset=utf-8' });
A BOM (Byte Order Mark) might be necessary because some programs need it to pick the correct character encoding.
Example:
When opening a CSV file without a BOM in MS Excel on a system whose default character encoding is Shift_JIS rather than UTF-8, Excel will open it in the default encoding. This results in garbage characters. If you specify the BOM for UTF-8, it fixes it.
This fixed it for me. I was getting a BOM back from the Authorize.net API on Cloudflare Workers; trimming the response before parsing removes it:
const data = JSON.parse((await res.text()).trim());
