How to read the encoding of the file - javascript

I am not sure how to find the encoding of a file. I want to accept CSV uploads only if they are UTF-8, and show an error message if the file contains any non-UTF-8 characters.
I am using Papa Parse for parsing.
How can I detect the encoding of the file, either in Java or in JavaScript?
var fileElement = element.find('.csv-upload');
var file = fileElement[0].files[0];
var parseConfig = {
  skipEmptyLines: true,
  header: true,
  encoding: 'UTF-8',
  trimHeaders: true,
  complete: function (content) {
    scope.onFileRead(content, fileElement[0]);
  }
};
if (scope.rowsToParse) {
  parseConfig.preview = scope.rowsToParse;
}
if (file) {
  Papa.parse(file, parseConfig);
}

A CSV file does not carry any encoding information. It is just a sequence of bytes.
You have to know beforehand whether the file contains non-UTF-8 characters and which encoding to use when reading it.
If you are using Java (I am not sure you are, but the question has the Spring tag), you could use one of the libraries suggested in the question below to try to infer the file type.
Is there a java library equivalent to file command in unix
Maybe there is something similar for JavaScript.
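For the narrower goal in the question (reject anything that is not UTF-8, rather than guess the real encoding), the bytes can be checked directly. The following is a minimal sketch, not a definitive implementation: it relies on the standard TextDecoder with the fatal option, and the helper names isValidUtf8 and fileIsUtf8 are made up for illustration. Passing this check only means the bytes form valid UTF-8; it does not prove the author intended UTF-8.

```javascript
// Returns true when the bytes are well-formed UTF-8.
// With { fatal: true } the decoder throws a TypeError on the first
// invalid sequence instead of silently inserting U+FFFD.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

// Hypothetical browser-side wrapper: read the selected File as bytes
// and validate it before handing it to the CSV parser.
async function fileIsUtf8(file) {
  const bytes = new Uint8Array(await file.arrayBuffer());
  return isValidUtf8(bytes);
}

// "café" encoded as UTF-8 (é = 0xC3 0xA9) and as Latin-1 (é = 0xE9).
const utf8Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xc3, 0xa9]);
const latin1Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);
console.log(isValidUtf8(utf8Bytes), isValidUtf8(latin1Bytes));
```

In the upload handler, fileIsUtf8(file) could be awaited first, with Papa.parse(file, parseConfig) called only when it resolves to true.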

Related

Generating files in nodejs escape character issue

I have a NestJs project where I am trying to generate js files into the CDN blob storage.
For example, I have this object:
{
"prop1":"${resolvedprop1}",
"prop2":"John's pet"
}
The content of the generated js file should look like:
var metadata=()=>JSON.parse('{ "prop1":"${resolvedprop1}", "prop2":"John's pet" }')
Code for generation of the file:
function writeToCDN(metadata){
const content = `var metadata = ()=>JSON.parse('${JSON.stringify(metadata)}')`;
//write content string to a file in CDN BLOB.
}
When I load this file into the browser and execute metadata(), I get a parsing error because the single quote is not escaped.
Important note:
I cannot use backticks instead of single quotes inside JSON.parse, since I am expecting ${} in the prop1 value.
Thank you for your help in advance!
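One possible way around the quoting problem, offered as a sketch rather than a confirmed fix for this project: let JSON.stringify produce the string literal as well. Stringifying the JSON text a second time yields a double-quoted JavaScript string literal with every inner quote and backslash escaped, so neither the apostrophe nor the ${} placeholder needs special handling. The function name buildMetadataSource is hypothetical, and the write to the CDN blob is left out.

```javascript
// Build the source text of the generated file.
// First stringify: object -> JSON text.
// Second stringify: JSON text -> double-quoted JS string literal,
// with inner double quotes and backslashes escaped.
function buildMetadataSource(metadata) {
  const json = JSON.stringify(metadata);
  const literal = JSON.stringify(json);
  return `var metadata = () => JSON.parse(${literal})`;
}

const source = buildMetadataSource({
  prop1: '${resolvedprop1}', // stays literal text: the output uses a plain double-quoted string
  prop2: "John's pet",
});
console.log(source);
```

Because the generated literal is double-quoted, a single quote inside a value is just an ordinary character, and ${...} is not interpolated when the file is loaded.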

Some Random characters appearing at beginning of .txt file after converting from base64 node js

I am dropping a .txt file into an uploader, which converts it into base64 data like this:
const {getRootProps, getInputProps} = useDropzone({
  onDrop: async acceptedFiles => {
    let font = ''; // its not actually a font just reusing some code i'll change it later its a .txt file so wherever you see font assume its NOT a font.
    let reader = new FileReader();
    let filename = acceptedFiles[0].name.split(".")[0];
    console.log(filename);
    reader.readAsDataURL(acceptedFiles[0]);
    reader.onload = await function (){
      font = reader.result;
      console.log(font);
      dispatch({type:'SET_FILES',payload:font})
    };
    setFontSet(true);
  }
});
Then a POST request is made to the node js server and I indeed receive the base64 value. I then proceed to convert it back into a .txt file by writing it into a file called signals.txt like this:
server.post('/putInDB',(req,res)=>{
  console.log(req.body);
  var bitmap = new Buffer(req.body.data, 'base64');
  let dirpath = `${process.cwd()}/signals.txt`;
  let signalPath = path.normalize(dirpath);
  connection.connect();
  fs.writeFile(signalPath, bitmap, async (err) => {
    if (err) throw err;
    console.log('Successfully updated the file data');
    //all the ending brackets and stuff
Now the thing is, the original file looks like this:
Time,1,2,3,4,5,6,7,8,9,10,11,12
0.000000,7.250553,14.951141,5.550423,2.850217,-1.050080,-3.050233,1.850141,2.850217,-3.150240,1.350103,-2.950225,1.150088
But the file written back from base64 looks like this:
u«Zµìmþ™ZŠvÚ±î¸Time,1,2,3,4,5,6,7,8,9,10,11,12
0.000000,1.250095,0.250019,-4.150317,-0.350027,3.650278,1.950149,0.950072,-1.250095,-1.150088,-7.750591,-1.850141,-0.050004
See the weird characters at the beginning? Why is this happening?
Remember to read up on what the functions you use actually do. You're using readAsDataURL, which does not give you the base64-encoded version of your data: it gives you a Data-URL, and Data-URLs have a header prefix that tells URL parsers what kind of data follows and how to decode it.
To quote the MDN article:
Note: The blob's result cannot be directly decoded as Base64 without first removing the Data-URL declaration preceding the Base64-encoded data. To retrieve only the Base64 encoded string, first remove data:*/*;base64, from the result.
If you don't, blindly converting the Data-URL from base64 to plain text will give you some nonsense data at the start:
> Buffer.from('data:*/*;base64', 'base64').toString('utf-8')
'u�Z���{�'
Which raises another point: you would have caught this with POST data validation, because the Data-URL that you sent contains characters that are not allowed in base64. POST validation is always a good idea.
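Both points (strip the Data-URL header, and validate before decoding) can be sketched in one small server-side helper. This is an illustration under assumptions, not the only way to do it: the name dataUrlToBuffer is made up, and the validation is a simple character check on the payload rather than a full parser.

```javascript
// Convert a Data-URL such as "data:text/plain;base64,VGltZQ==" into a Buffer.
// Throws if the input is not a base64 Data-URL or if the payload contains
// characters outside the base64 alphabet.
function dataUrlToBuffer(dataUrl) {
  const comma = dataUrl.indexOf(',');
  if (!dataUrl.startsWith('data:') || comma === -1) {
    throw new Error('Not a Data-URL');
  }
  const header = dataUrl.slice('data:'.length, comma); // e.g. "text/plain;base64"
  const payload = dataUrl.slice(comma + 1);
  if (!header.endsWith(';base64')) {
    throw new Error('Data-URL is not base64-encoded');
  }
  if (!/^[A-Za-z0-9+/]*={0,2}$/.test(payload)) {
    throw new Error('Payload is not valid base64');
  }
  return Buffer.from(payload, 'base64');
}

const original = 'Time,1,2,3\n0.000000,7.250553,14.951141\n';
const url = 'data:text/plain;base64,' + Buffer.from(original, 'utf8').toString('base64');
console.log(dataUrlToBuffer(url).toString('utf8') === original);
```

Alternatively the header can be removed on the client before the POST, for example by sending only the part of reader.result after the first comma.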
I know this isn't the exact code, but it is difficult to reproduce your problem with the code you provided. But the data you are sending needs to be a URL/URI encoded form.
So essentially:
encodeURI(base64data);
Encode URI is built into javascript: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURI
EDIT:
I saw you used the function readAsDataURL(), but try using the encodeURI function and then readAsDataURL().

How to find out charset of text file loaded by input[type="file"] in Javascript

I want to read a user's file and give them a modified version of it. I use an input with type file to get the text file, but how can I get the charset of the loaded file? It can vary from case to case. The uploaded file is a .txt or something similar, not .html :)
var handler = document.getElementById('handler');
var reader = new FileReader();
handler.addEventListener('click', function() {
  reader.readAsText(firstSub.files[0], /* Here I need to use the correct charset */);
});
reader.addEventListener("loadend", function() {
  console.dir(reader.result.split('\n'));
});
In my case (I made a small web app that accepts subtitle .srt files and removes time codes and line breaks, making a printable text), it was enough to foresee 2 types of encoding: UTF-8 and CP1251 (in all cases I tried – with both Latin and Cyrillic letters – these two types are enough). At first I try encoding with UTF-8, and if it is not successful, some characters are replaced by '�'-signs. So, I check the result for presence of these signs, and, if found, the procedure is repeated with CP1251 encoding. So, here is my code:
function onFileInputChange(inputDomElement, utf8 = true) {
  const file = inputDomElement.files[0];
  const reader = new FileReader();
  reader.readAsText(file, utf8 ? 'UTF-8' : 'CP1251');
  reader.onload = () => {
    const result = reader.result;
    if (utf8 && result.includes('�')) {
      onFileInputChange(inputDomElement, false);
      console.log('The file encoding is not utf-8! Trying CP1251...');
    } else {
      document.querySelector('#textarea1').value = file.name.replace(/\.(srt|txt)$/, '').replace(/_+/g, ' ').toUpperCase() + '\n' + result;
    }
  };
}
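The same try-UTF-8-then-fall-back idea can also be written against the raw bytes, so the file is read only once and a legitimate '�' in the text is not mistaken for a decoding failure. This is a sketch under assumptions: decodeWithFallback is a hypothetical helper, the candidate list is only an example, and 'windows-1251' must be supported by the runtime's TextDecoder (it is in current browsers and in Node builds with full ICU).

```javascript
// Try each encoding in order and return the first strict decode that succeeds.
// { fatal: true } makes the decoder throw on invalid input instead of
// substituting U+FFFD. Single-byte encodings such as windows-1251 accept
// any byte sequence, so they belong at the end of the list.
function decodeWithFallback(bytes, encodings = ['utf-8', 'windows-1251']) {
  for (const encoding of encodings) {
    try {
      const text = new TextDecoder(encoding, { fatal: true }).decode(bytes);
      return { encoding, text };
    } catch (err) {
      // Not valid in this encoding: try the next candidate.
    }
  }
  throw new Error('None of the candidate encodings matched');
}

// The bytes of the Russian word "Привет" encoded as CP1251.
const cp1251Bytes = new Uint8Array([0xcf, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2]);
console.log(decodeWithFallback(cp1251Bytes));
```

In the browser, the bytes would come from new Uint8Array(await file.arrayBuffer()) for the selected file.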
You should check out this library encoding.js
They also have a working demo. I would suggest you first try it out with the files that you'll typically work with to see if it detects the encoding correctly and then use the library in your project.
The other solutions didn't work for what I was trying to do, so I decided to create my own module that can detect the charset and language of any file loaded via input[type='file'] / FileReader API.
You load it via the <script> tag and then use the languageEncoding function to retrieve the charset/encoding:
// index.html
<script src="https://unpkg.com/detect-file-encoding-and-language/umd/language-encoding.min.js"></script>
// app.js
languageEncoding(file).then(fileInfo => console.log(fileInfo));
// Possible result: { language: english, encoding: UTF-8, confidence: { language: 0.96, encoding: 1 } }
For a more complete example/instructions check out this part of the documentation!

Unable to read accented characters from csv file stream in node

To start off: I am currently using the npm package fast-csv, a CSV reader/writer that is pretty straightforward and simple. What I'm attempting to do is use it in conjunction with iconv to process accented and other non-ASCII characters, and either convert them to an ASCII equivalent or remove them, depending on the character.
My current process with fast-csv is to bring in a chunk (one row) via a read stream, pause the read stream, process the data, pipe the data to a write stream, and then resume the read stream using a callback. fast-csv currently knows where to separate the chunks based on the format of the data coming in from the read stream.
The entire process looks like this:
var stream = fs.createReadStream(inputFileName);
function csvPull(source) {
  csvWrite = csv.createWriteStream({ headers: true });
  writableStream = fs.createWriteStream(outputFileName);
  csvStream = csv()
    .on("data", function (data) {
      csvStream.pause();
      processRow(data, function () {
        csvStream.resume();
      });
    })
    .on("end", function () {
      console.log('END OF CSV FILE');
    });
  csvWrite.pipe(writableStream);
  source.pipe(csvStream);
}
csvPull(stream);
The problem I am currently running into is that, for some reason, my JavaScript does not inherently recognise non-ASCII characters when it runs, so I am resorting to npm iconv-lite to convert the data stream into something usable as it comes in. However, this presents a bigger issue, as fast-csv will no longer know where to split the chunks (rows) once the data has been converted. This matters because of the sizes of the CSVs I will be working with; loading the entire CSV into a buffer and then decoding it is not an option.
Are there any suggestions on how I might get around this without writing my own CSV parser into my code?
Try reading your file with binary as the encoding option. I had to read a few CSV files with accented characters and it worked fine with that.
var stream = fs.createReadStream(inputFileName, { encoding: 'binary' });
Unless I misunderstand, you should be able to fix this by setting the encoding on the stream to utf-8 (docs).
for the first line:
var stream = fs.createReadStream(inputFileName, {encoding: 'utf8'});
and if needed:
writableStream = fs.createWriteStream(outputFileName, {defaultEncoding: 'utf8'});
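The concern about decoding a large file piece by piece is that a chunk boundary can fall in the middle of a multi-byte character. A streaming decoder deals with this by holding incomplete bytes back until the next chunk arrives, so the whole file never has to sit in memory. The sketch below illustrates that behaviour with Node's built-in TextDecoder, assuming UTF-8 input; decodeChunks is a hypothetical helper that stands in for what a stream's data events would deliver.

```javascript
// Decode a sequence of byte chunks as if they arrived from a read stream.
// { stream: true } tells the decoder that more data may follow, so a
// trailing partial character is buffered instead of being replaced.
function decodeChunks(chunks, encoding = 'utf-8') {
  const decoder = new TextDecoder(encoding);
  let text = '';
  for (const chunk of chunks) {
    text += decoder.decode(chunk, { stream: true });
  }
  text += decoder.decode(); // flush anything still buffered
  return text;
}

// "café,1\n" split so that the two bytes of "é" (0xC3 0xA9) land in
// different chunks.
const firstChunk = new Uint8Array([0x63, 0x61, 0x66, 0xc3]);
const secondChunk = new Uint8Array([0xa9, 0x2c, 0x31, 0x0a]);
console.log(JSON.stringify(decodeChunks([firstChunk, secondChunk])));
```

For encodings other than UTF-8, iconv-lite offers a streaming decoder (iconv.decodeStream) that can be piped between the file stream and the CSV parser, so the parser still receives ordinary text and can split rows as before.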

check uploaded file format on client side

I am creating a web portal where the end user uploads a CSV file and I do some manipulation on that file on the server side (Python). There is some latency on the server side, so I don't want to wait for the server to tell the client that the uploaded file is badly formatted. Is there any way to do the heavy lifting on the client side, maybe using JS or jQuery, to check whether the uploaded file is comma-separated and so on?
I know we can use accept=".csv" in the HTML so that the file has a .csv extension, but how can I check the contents to be sure?
Accessing local files from Javascript is only possible by using the File API (https://developer.mozilla.org/en-US/docs/Using_files_from_web_applications) - by using this you might be able to check the content whether it matches your expectations or not.
Here are some bits of code I used to display a preview image client-side when a file is selected. You should be able to use this as a starting point to do something else with the file data. Determining whether it's CSV is up to you.
Obvious caveat:
You still have to check server side. Anyone can modify your clientside javascript to pretend a bad file is good.
Another caveat:
I'm pretty sure that you can have escaped comma characters in a valid csv file. I think the escape character might be different across some implementations too...
// Fired when the user chooses a file in the OS dialog box
// They will have clicked <input id="fileId" type="file">
document.getElementById('fileId').onchange = function (evt) {
  if (!evt.target.files || evt.target.files.length === 0) {
    console.log('No files selected');
    return;
  }
  var file = evt.target.files[0];
  var uploadTitle = file.name;
  var uploadSize = file.size;
  var uploadType = file.type;
  // To manipulate the file you set a callback for the whole contents:
  var FR = new FileReader();
  FR.onload = function (evt2) {
    var evtData = {
      filesEvent: evt,
    };
    // The contents are on the reader (evt2.target), not on the event itself
    var uploadData = evt2.target.result;
    console.log(uploadTitle, uploadSize, uploadType, uploadData);
  };
  // I've only used this readAsDataURL which will encode the file like ...
  // I'm sure there's a similar call for plaintext
  FR.readAsDataURL(file);
};
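As for the content check itself, one rough heuristic is to count the fields in every record and require that the count is the same throughout and greater than one. The sketch below is only that, a heuristic under assumptions: it treats double quotes as the quoting character with "" as an escaped quote, and the names countFieldsPerRecord and looksLikeCsv are made up for illustration. The text would come from reading the selected file as plain text, and the server still has to validate the upload.

```javascript
// Count the fields in each record, honouring double-quoted fields so that
// commas and line breaks inside quotes are not treated as separators.
function countFieldsPerRecord(text, delimiter = ',') {
  const counts = [];
  let fields = 1;
  let inQuotes = false;
  let recordHasContent = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"') {
        if (text[i + 1] === '"') i++; // "" is an escaped quote
        else inQuotes = false;
      }
      continue;
    }
    if (ch === '"') {
      inQuotes = true;
      recordHasContent = true;
    } else if (ch === delimiter) {
      fields++;
      recordHasContent = true;
    } else if (ch === '\n') {
      if (recordHasContent) counts.push(fields); // blank lines are skipped
      fields = 1;
      recordHasContent = false;
    } else if (ch !== '\r') {
      recordHasContent = true;
    }
  }
  if (recordHasContent) counts.push(fields);
  return counts;
}

// True when every record has the same number of fields and more than one.
function looksLikeCsv(text, delimiter = ',') {
  const counts = countFieldsPerRecord(text, delimiter);
  return counts.length > 0 && counts[0] > 1 && counts.every(n => n === counts[0]);
}

console.log(looksLikeCsv('name,city\n"Doe, Jane",Paris\n'));
console.log(looksLikeCsv('name;city\nJane;Paris\n'));
```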
