Unable to read accented characters from csv file stream in node - javascript

To start off, I am currently using npm fast-csv, which is a nice CSV reader/writer that is pretty straightforward and simple. What I'm attempting to do is use it in conjunction with iconv to process "accented" and other non-ASCII characters and either convert them to an ASCII equivalent or remove them, depending on the character.
My current process with fast-csv is to bring in a chunk for processing (it comes in as one row) via a read stream, pause the read stream, process the data, pipe the data to a write stream, and then resume the read stream using a callback. Fast-csv currently knows where to separate the chunks (rows) based on the format of the data coming in from the read stream.
The entire process looks like this:
var fs = require('fs');
var csv = require('fast-csv');

var stream = fs.createReadStream(inputFileName);

function csvPull(source) {
    var csvWrite = csv.createWriteStream({ headers: true });
    var writableStream = fs.createWriteStream(outputFileName);

    var csvStream = csv()
        .on("data", function (data) {
            // pause parsing while the current row is processed
            csvStream.pause();
            processRow(data, function () {
                csvStream.resume();
            });
        })
        .on("end", function () {
            console.log('END OF CSV FILE');
        });

    csvWrite.pipe(writableStream);
    source.pipe(csvStream);
}

csvPull(stream);
The problem I am currently running into is that, for some reason, my JavaScript does not inherently recognize the non-ASCII characters, so I am resorting to npm iconv-lite to encode the data stream as it comes in into something usable. However, this presents a bigger issue: fast-csv will no longer know where to split the chunks (rows) because of the now-encoded data. This is a problem due to the sizes of the CSVs I will be working with; it is not an option to load the entire CSV into the buffer and then decode it.
Are there any suggestions on how I might get around this without writing my own CSV parser into my code?

Try reading your file with 'binary' as the encoding option. I had to read a few CSVs with some accented characters and it worked fine with that.
var stream = fs.createReadStream(inputFileName, { encoding: 'binary' });

Unless I misunderstand, you should be able to fix this by setting the encoding on the stream to utf-8 (docs).
for the first line:
var stream = fs.createReadStream(inputFileName, {encoding: 'utf8'});
and if needed:
writableStream = fs.createWriteStream(outputFileName, {defaultEncoding: 'utf8'});
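If the file isn't actually UTF-8, another option is to decode the stream before it reaches fast-csv, so the parser only ever sees proper JavaScript strings and still splits rows correctly. A rough sketch, assuming the iconv-lite package and that the input is in a single-byte encoding such as Windows-1252 (replace 'win1252' with the file's real encoding):

var fs = require('fs');
var csv = require('fast-csv');
var iconv = require('iconv-lite');

// decode the raw bytes into UTF-8 strings before the CSV parser sees them
var stream = fs.createReadStream(inputFileName)
    .pipe(iconv.decodeStream('win1252')); // assumption: the file's actual encoding

stream.pipe(csv())
    .on("data", function (data) {
        // rows arrive here already decoded, so accented characters are intact
    })
    .on("end", function () {
        console.log('END OF CSV FILE');
    });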

Related

JavaScript FileReader class is not reading the contents properly

I am using the readAsText method of the FileReader class (JavaScript) with the encoding type set to "UTF-8" to read a file from the client. It works well for all kinds of characters with code points ranging from 1 to 65000. The only problem I have is that when I read the file chunk by chunk, a character with a code point above 3000 is sometimes not read properly. After investigating, I found that it happens only when I do this reading for big files and that particular character happens to sit as the first letter of a chunk. I tested with multiple chunks of a file: the problem does not happen for all the chunks, only for one or two chunks out of ten. This is weird and strange. Am I missing something here? And do we have any other options to read a local file in JavaScript? Any help will be much appreciated.
This might be one solution:
new Blob(['hi']).text().then(console.log)
Here is another, not so cross-browser friendly... but this could work:
new Blob(['foo']) // or new File(['foo'], 'test.txt')
  .stream()
  .pipeThrough(new TextDecoderStream('utf-8'))
  .pipeTo(new WritableStream({
    write(part) {
      console.log(part)
    }
  }))
Another, lower-level solution that doesn't depend on WritableStream or TextDecoderStream would be to use a regular TextDecoder with the stream option, which keeps multi-byte sequences that are split across chunk boundaries intact:
var res = ''
const decoder = new TextDecoder()
res += decoder.decode(chunk1, { stream: true })
res += decoder.decode(chunk2, { stream: true })
res += decoder.decode(chunk3, { stream: true })
res += decoder.decode() // flush the end
How you get each chunk could be by using new Response(blob).body, by blob.stream(), or simply by slicing the blob with blob.slice(start, end) and using FileReader.prototype.readAsArrayBuffer().
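As an illustration of that last point, here is a rough sketch (the helper name and the 1 MB slice size are my own choices, not from the original answer) that reads a large File in slices and decodes them with a streaming TextDecoder, so a multi-byte character split between two slices is still decoded correctly:

// Sketch: read a File in 1 MB slices, decoding incrementally
async function readFileChunked(file) {
  const decoder = new TextDecoder('utf-8')
  const chunkSize = 1024 * 1024 // assumption: 1 MB slices
  let text = ''
  for (let offset = 0; offset < file.size; offset += chunkSize) {
    const buffer = await file.slice(offset, offset + chunkSize).arrayBuffer()
    text += decoder.decode(buffer, { stream: true }) // stream: true carries partial sequences over
  }
  text += decoder.decode() // flush any trailing bytes
  return text
}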

Some random characters appearing at the beginning of a .txt file after converting from base64 in Node.js

So I am dropping a .txt file into an uploader, which converts it into base64 data like this:
const { getRootProps, getInputProps } = useDropzone({
  onDrop: async acceptedFiles => {
    let font = ''; // it's not actually a font, just reusing some code; it's a .txt file, so wherever you see "font" assume it is NOT a font.
    let reader = new FileReader();
    let filename = acceptedFiles[0].name.split(".")[0];
    console.log(filename);
    reader.readAsDataURL(acceptedFiles[0]);
    reader.onload = await function () {
      font = reader.result;
      console.log(font);
      dispatch({ type: 'SET_FILES', payload: font });
    };
    setFontSet(true);
  }
});
Then a POST request is made to the Node.js server, and I indeed receive the base64 value. I then proceed to convert it back into a .txt file by writing it to a file called signals.txt like this:
server.post('/putInDB', (req, res) => {
  console.log(req.body);
  var bitmap = new Buffer(req.body.data, 'base64');
  let dirpath = `${process.cwd()}/signals.txt`;
  let signalPath = path.normalize(dirpath);
  connection.connect();
  fs.writeFile(signalPath, bitmap, async (err) => {
    if (err) throw err;
    console.log('Successfully updated the file data');
    //all the ending brackets and stuff
Now the thing is, the original file looks like this:
Time,1,2,3,4,5,6,7,8,9,10,11,12
0.000000,7.250553,14.951141,5.550423,2.850217,-1.050080,-3.050233,1.850141,2.850217,-3.150240,1.350103,-2.950225,1.150088
But the file, when written back from base64, looks like this:
u«Zµìmþ™ZŠvÚ±î¸Time,1,2,3,4,5,6,7,8,9,10,11,12
0.000000,1.250095,0.250019,-4.150317,-0.350027,3.650278,1.950149,0.950072,-1.250095,-1.150088,-7.750591,-1.850141,-0.050004
See the weird characters at the beginning? Why is this happening?
Remember to read up on what the functions you use do: you're using readAsDataURL, which does not give you the base64-encoded version of your data. It gives you a Data-URL, and Data-URLs have a header prefix that tells URL parsers what kind of data this is and how to decode the data directly following the header.
To quote the MDN article:
Note: The blob's result cannot be directly decoded as Base64 without first removing the Data-URL declaration preceding the Base64-encoded data. To retrieve only the Base64 encoded string, first remove data:*/*;base64, from the result.
If you don't, blindly converting the Data-URL from base64 to plain text will give you some nonsense data at the start:
> Buffer.from('data:*/*;base64', 'base64').toString('utf-8')
'u�Z���{�'
Which raises another point: you would have caught this with POST data validation, because the Data-URL that you sent contains characters that are not allowed in base64. POST validation is always a good idea.
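In practice that means stripping the prefix on the server before decoding. A minimal sketch (the regular expression is just one illustrative way to drop the data:*/*;base64, header):

server.post('/putInDB', (req, res) => {
  // drop the Data-URL header (e.g. "data:text/plain;base64,") before decoding
  const base64Data = req.body.data.replace(/^data:[^;]*;base64,/, '');
  const bitmap = Buffer.from(base64Data, 'base64');
  // ...write bitmap to signals.txt as before
});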
I know this isn't the exact code, and it is difficult to reproduce your problem with the code you provided, but the data you are sending needs to be in URL/URI-encoded form.
So essentially:
encodeURI(base64data);
encodeURI is built into JavaScript: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURI
EDIT:
I saw you used the function readAsDataURL(); try applying the encodeURI function to the result of readAsDataURL() before sending it.

how to send encoded data from python to nodejs

I want to send some string data from Python 3 to Node.js. The string is Korean characters, and I am encoding it to UTF-8 (because I don't know another way to send the data safely). When I send it from Python it is a byte stream, and in Node.js I receive it as an array. I convert this array to a string, but now I cannot decode the string back to the original Korean characters.
Here are some codes I am using.
python
input = sys.argv[1]
d = bot.get_response(input)
data = str(d).encode('utf8')
print(data)
nodeJs
var utf = require('utf8');
var python = require('python-shell');

var pyt = path.normalize('path/to/my/python.exe'),
    scrp = path.normalize('path/to/my/scriptsFolder/');

var options = {
    mode: 'text',
    pythonPath: pyt,
    pythonOptions: ['-u'],
    scriptPath: scrp,
    encoding: 'utf8',
    args: [message]
};

python.run('test.py', options, function (err, results) {
    // here I need to decode 'results'
    var originalString = utf.encode(results.toString()); // that code is not working for me
});
I have used several libs like utf8 to decode it, but that didn't help.
Can someone please give me some idea of how to make it work?
EDIT
I have to add some more info.
I have tried @smarx's approach, but it did not work.
I have two cases:
1. If I send the data as a string from Python, here is what I get in Node.js: b'\xec\x95\x88\xeb\x85\x95\xed\x95\x98\xec\x8b\xad\xeb\x8b\x88\xea\xb9\x8c? \xec\x9d\xb4\xed\x9a\xa8\xec\xa2\x85 \xea\xb3\xa0\xea\xb0\x9d\xeb\x8b\x98! \xeb\x8f\x99\xec\x96\x91\xeb\xa7\xa4\xec\xa7\x81\xec\x9e\x85\xeb\x8b\x88\xeb\x8b\xa4
2. If I encode the data and send it, I get: �ȳ��Ͻʴϱ�? ��ȿ�� ������! �
I had exactly the same issue on my project, and now I have finally found the answer.
I solved my problem with the code below.
It matters on Windows (on macOS and Linux the default system encoding is already UTF-8, so the issue doesn't happen there).
I hope it might help you, too!
# In the Python file that your JavaScript file calls via the python-shell module, put this code:
import sys
sys.stdout.reconfigure(encoding='utf-8')
I found the hint in the python-shell description of its features: "Simple and efficient data transfers through stdin and stdout streams".
I'm still not sure what python.run does, since you won't share that code, but here's my version of the code, which is working fine:
test.py
print("안녕 세상")
app.js
const { exec } = require('child_process');

exec('python3 test.py', function (err, stdout, stderr) {
    console.log(stdout);
});
// Output:
// 안녕 세상
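If the output still comes back garbled on Windows, one possible workaround (my own assumption, not something the original answer tested) is to force the child Python process to write UTF-8 by setting the PYTHONIOENCODING environment variable when spawning it:

const { exec } = require('child_process');

// Force the Python child process to encode its stdout as UTF-8
exec('python3 test.py', {
    env: Object.assign({}, process.env, { PYTHONIOENCODING: 'utf-8' })
}, function (err, stdout, stderr) {
    console.log(stdout);
});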
I had the same issue when using python-shell.
Here is my solution:
The value returned by .encode('utf-8') is a byte string, so you need to write it to stdout directly.
In test.py, print a UTF-8 JSON object that includes some Chinese characters:
sys.stdout.buffer.write(json.dumps({"你好":"世界"}, ensure_ascii=False).encode('utf8'))
print() # print \n at ending to support python-shell in json mode
In main.js:
let opt = { mode: 'json', pythonOptions: ['-u'], pythonPath: 'python', encoding: 'utf8' };
let pyshell = new PythonShell('lyric.py', opt);

pyshell.on('message', function (message) {
    console.log(message); // *** the console msg may still look wrong (still ���)
    let json = JSON.stringify(message);
    let fs = require('fs');
    fs.writeFile('myjsonfile.json', json, 'utf8', function () {
    }); // *** the output json file will be correct utf8 output
});
Result:
This shows the message is correctly received in UTF-8, because the JSON output is correct.
However, the console.log output apparently fails.
I don't know whether there is any way to fix the console.log output. (Windows 10)
I had the same trouble using string data from Python in Node.js.
I solved the problem this way:
Try changing the default code page of the Windows console to UTF-8, if your console's code page is not already UTF-8.
(In my case the default code page was CP949.)
In my case:
I got a message like ������ 2���� ��������.
I tried the encoding on an online converter (http://code.cside.com/3rdpage/us/url/converter.html)
and found that my strings were being encoded as CP949 but decoded as UTF-8.
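For reference, the Windows console code page can be switched to UTF-8 (code page 65001) before starting the Node.js process; the script name below is just a placeholder:

chcp 65001
node app.js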

Cropping a Base64 PNG in-memory using PURE JavaScript on the client side w/o using canvas

Context: JavaScript, as part of an SDK (can be on Node.js or in the browser).
Starting point: I have a base64 string that's actually a base64-encoded PNG image (I got it from Selenium WebDriver's takeScreenshot).
Question: How do I crop it?
The techniques involving the canvas seem irrelevant (or am I wrong?). My code runs as part of tests, probably on Node.js. The canvas approach doesn't seem to fit here and might also cause additional noise in the image.
All the libraries I found either deal with streams (maybe I should convert the string to a stream somehow?) or deal directly with the UI by adding a control (irrelevant for me).
Isn't there something like (promises and callbacks omitted for brevity):
var base64png = driver.takeScreenshot();
var png = new PNG(base64png);
return png.crop(50, 100, 20, 80).toBase64();
?
Thanks!
Considering you wish to start with a base64 string and end with a cropped base64 string (image), here is the code:
var stream = require('stream');
var gm = require('gm');

var base64png = driver.takeScreenshot();

// turn the base64 string into a readable stream of raw PNG bytes
var imageStream = new stream.PassThrough();
imageStream.end(new Buffer(base64png, 'base64'));

gm(imageStream, 'my_image.png').crop(WIDTH, HEIGHT, X, Y).stream(function (err, stdout, stderr) {
    var chunks = [];
    stdout.on('data', function (chunk) {
        chunks.push(chunk);
    });
    stdout.on('end', function () {
        var croppedBase64 = Buffer.concat(chunks).toString('base64');
        // DO something with your new base64 cropped img
    });
});
Be aware that it is unfinished and might need some polishing or debugging (I am by no means a Node.js guru), but the idea is this:
Convert the string into a stream
Read the stream into the GM module
Manipulate the image
Save it into a stream
Convert the stream back into a base64 string
Adding my previous comment as an answer:
Anyone looking to do this will need to decode the image to get the raw image data using a library such as node-pngjs, and then manipulate the data yourself (perhaps there is a library for such operations that doesn't rely on the canvas).
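As a rough sketch of that idea (assuming the pngjs package; the helper name and the use of PNG.bitblt to copy the region are my own, not taken from the original answer):

var PNG = require('pngjs').PNG;

// hypothetical helper: crop a base64 PNG without a canvas
function cropBase64Png(base64png, x, y, width, height) {
    var src = PNG.sync.read(new Buffer(base64png, 'base64'));
    var dst = new PNG({ width: width, height: height });
    // copy the requested rectangle from the source into the destination
    PNG.bitblt(src, dst, x, y, width, height, 0, 0);
    return PNG.sync.write(dst).toString('base64');
}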

How to convert character encoding from CP932 to UTF-8 in nodejs javascript, using the nodejs-iconv module (or other solution)

I'm attempting to convert a string from CP932 (aka Windows-31J) to UTF-8 in JavaScript. Basically, I'm crawling a site that ignores the UTF-8 request in the request header and returns CP932-encoded text (even though the HTML meta tag indicates that the page is shift_jis).
Anyway, I have the entire page stored in a string variable called "html". From there I'm attempting to convert it to UTF-8 using this code:
var Iconv = require('iconv').Iconv;
var conv = new Iconv('CP932', 'UTF-8//TRANSLIT//IGNORE');
var myBuffer = new Buffer(html.length * 3);
myBuffer.write(html, 0, 'utf8')
var utf8html = (conv.convert(myBuffer)).toString('utf8');
The result is not what it's supposed to be. For example, the string: "投稿者さんの 稚内全日空ホテル のクチコミ (感想・情報)" comes out as "ソスソスソスeソスメゑソスソスソスソスソス ソスtソスソスソスSソスソスソスソスソスzソスeソスソス ソスフクソス`ソスRソス~ (ソスソスソスzソスEソスソスソスソス)"
If I remove //TRANSLIT//IGNORE (which should cause it to return similar characters for missing characters and, failing that, omit non-transcodable characters), I get this error:
Error: EILSEQ, Illegal character sequence.
I'm open to using any solution that can be implemented in nodejs, but my search results haven't yielded many options outside of the nodejs-iconv module.
nodejs-iconv ref: https://github.com/bnoordhuis/node-iconv
Thanks!
Edit 24.06.2011:
I've gone ahead and implemented a solution in Java. However, I'd still be interested in a JavaScript solution to this problem if somebody can solve it.
I ran into the same trouble today :)
It depends on libiconv. You need libiconv-1.13-ja-1.patch.
Please check the following:
http://d.hatena.ne.jp/ushiboy/20110422/1303481470
http://code.xenophy.com/?p=1529
Or you can avoid the problem by using iconv-jp instead:
npm install iconv-jp
I had the same problem, but with CP1250. I looked for the problem everywhere and everything was OK except the call to request, where I had to add encoding: 'binary'.
request = require('request')
Iconv = require('iconv').Iconv

request({ uri: url, encoding: 'binary' }, function (err, response, body) {
  body = new Buffer(body, 'binary')
  iconv = new Iconv('CP1250', 'UTF8')
  body = iconv.convert(body).toString()
  // ...
})
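A variation on the same idea (a sketch on my part, not from the original answer) is to ask request for a raw Buffer with encoding: null and decode it with the pure-JavaScript iconv-lite package, which also understands Shift_JIS/CP932:

var request = require('request');
var iconv = require('iconv-lite');

// encoding: null makes request hand back the body as a raw Buffer
request({ uri: url, encoding: null }, function (err, response, body) {
    var utf8html = iconv.decode(body, 'Shift_JIS');
    // ...
});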
https://github.com/bnoordhuis/node-iconv/issues/19
I tried running /Users/Me/node_modules/iconv/test.js with node test.js.
It returns an error.
On Mac OS X Lion, this problem seems to depend on gcc.
