I want to send some string data from Python 3 to Node.js. The string contains Korean characters, so I am encoding it to UTF-8 (because I don't know another way to send the data safely). When I send it from Python it goes out as a byte stream, and in Node.js I receive it as an array. I convert this array to a string, but now I cannot decode the string back into the original Korean characters.
Here are some codes I am using.
Python

import sys

input = sys.argv[1]
d = bot.get_response(input)
data = str(d).encode('utf8')
print(data)
Node.js

var path = require('path');
var utf = require('utf8');
var python = require('python-shell');

var pyt = path.normalize('path/to/my/python.exe'),
    scrp = path.normalize('path/to/my/scriptsFolder/');

var options = {
    mode: 'text',
    pythonPath: pyt,
    pythonOptions: ['-u'],
    scriptPath: scrp,
    encoding: 'utf8',
    args: [message]
};

python.run('test.py', options, function (err, results) {
    // here I need to decode 'results'
    var originalString = utf.encode(results.toString()); // that code is not working for me
});
I have tried several libraries such as utf8 to decode the data, but it didn't help.
Can someone please give me an idea of how to make this work?
EDIT
Adding some more info.
I tried @smarx's approach, but it did not work.
I have two cases:
1. If I send the data as a string from Python, here is what I get in Node.js: b'\xec\x95\x88\xeb\x85\x95\xed\x95\x98\xec\x8b\xad\xeb\x8b\x88\xea\xb9\x8c? \xec\x9d\xb4\xed\x9a\xa8\xec\xa2\x85 \xea\xb3\xa0\xea\xb0\x9d\xeb\x8b\x98! \xeb\x8f\x99\xec\x96\x91\xeb\xa7\xa4\xec\xa7\x81\xec\x9e\x85\xeb\x8b\x88\xeb\x8b\xa4
2. If I encode the data and send it, I get �ȳ��Ͻʴϱ�? ��ȿ�� ������! �
I had exactly the same issue in my project, and I finally found the answer.
I solved my problem with the code below.
This matters on Windows (on macOS and Linux the default system encoding is UTF-8, so the issue doesn't happen).
I hope it might help you, too!
# In the Python file that your JavaScript file calls via the python-shell module, put this code:
import sys
sys.stdout.reconfigure(encoding='utf-8')
I found the hint in the python-shell description, which lists this feature:
> Simple and efficient data transfers through stdin and stdout streams
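For reference, here is a minimal sketch of the matching Node.js side, assuming the script above is saved as test.py and a callback-style version of python-shell as used in the question (message holds the text passed as an argument):

const { PythonShell } = require('python-shell');

// With stdout reconfigured to UTF-8, the Korean text arrives intact in `results`.
PythonShell.run('test.py', { mode: 'text', encoding: 'utf8', args: [message] }, function (err, results) {
    if (err) throw err;
    console.log(results.join('\n'));
});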
I'm still not sure what python.run does, since you won't share that code, but here's my version of the code, which is working fine:
test.py
print("안녕 세상")
app.js
const { exec } = require('child_process');
exec('python3 test.py', function (err, stdout, stderr) {
console.log(stdout);
});
// Output:
// 안녕 세상
I have the same issue when using python-shell.
Here is my solution:
After .encode('utf-8') you have a bytes object, so you need to write it to stdout's binary buffer directly.
In test.py, print a UTF-8 JSON document that includes some Chinese characters:

import sys
import json
sys.stdout.buffer.write(json.dumps({"你好": "世界"}, ensure_ascii=False).encode('utf8'))
print()  # print a trailing \n so python-shell's json mode can split the message
In main.js:

const { PythonShell } = require('python-shell');
const fs = require('fs');

let opt = { mode: 'json', pythonOptions: ['-u'], pythonPath: 'python', encoding: 'utf8' };
let pyshell = new PythonShell('lyric.py', opt);

pyshell.on('message', function (message) {
    console.log(message); // *** The console output may still look wrong (still ���)
    let json = JSON.stringify(message);
    fs.writeFile('myjsonfile.json', json, 'utf8', function () {
    }); // *** The output JSON file will contain correct UTF-8
});
Result:
This shows the message is correctly received as UTF-8, because the JSON file output is correct.
However, the console.log output apparently still fails.
I don't know whether there is any way to fix the console.log output (Windows 10).
I had the same trouble using data (a string) from Python in Node.js.
I solved the problem this way:
If the code page of your Windows console is not UTF-8, try changing the default code page of the console to UTF-8.
(In my case the default code page was CP949.)
In my case:
I got a message like ������ 2���� ��������.
I tried the online converter at http://code.cside.com/3rdpage/us/url/converter.html
and found that my strings were being encoded as CP949 and then decoded as UTF-8.
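(The console code page can be switched to UTF-8 with the chcp 65001 command.) A related trick that does not depend on the console at all, sketched here with an assumed script name of test.py: force Python's stdout encoding through the PYTHONIOENCODING environment variable when spawning it from Node.

const { exec } = require('child_process');

// PYTHONIOENCODING makes Python encode its stdout/stderr as UTF-8 regardless
// of the Windows code page, so the bytes Node receives are always UTF-8.
exec('python test.py', {
    encoding: 'utf8',
    env: Object.assign({}, process.env, { PYTHONIOENCODING: 'utf-8' })
}, function (err, stdout) {
    if (err) throw err;
    console.log(stdout);
});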
I'm converting my code from PHP to Node.js, and I need to translate a part of my code where there is the gzuncompress() function.
For that I'm using zlib.inflateSync, but I don't know which encoding I should use to create the buffer so that I get the same result as in PHP.
Here's what I do with php to decompress a string:
gzuncompress(substr($this->raw, 8))
and here's what I've tried in node.js
zlib.inflateSync(new Buffer(this.raw.substr(8), "encoding"))
So what encoding should I use to make zlib.inflateSync return the same data as gzuncompress?
I am not sure what the exact encoding would be here; however, this repo has some PHP translations for Node.js (https://github.com/gamalielmendez/node-fpdf/blob/master/src/PHP_CoreFunctions.js). According to it, the following could work:
const zlib = require('zlib');

const gzuncompress = (data) => {
    const chunk = (!Buffer.isBuffer(data)) ? Buffer.from(data, 'binary') : data;
    const Z1 = zlib.inflateSync(chunk);
    return Z1.toString('binary'); // or 'ascii'
};
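A usage sketch, assuming raw stands in for this.raw from the question and is a 'binary' (latin1) string (PHP's gzuncompress expects zlib-wrapped data, which is exactly what inflateSync handles):

// The helper above converts the binary string to a Buffer itself, and the
// 8-byte prefix is skipped just like in the PHP substr call.
const decompressed = gzuncompress(raw.substr(8));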
I'm a JS developer just learning python. This is my first time trying to use node (v6.7.0) and python (v2.7.1) together. I'm using restify with python-runner as a bridge to my python virtualenv. My python script uses a RAKE NLP keyword-extraction package.
I can't figure out for the life of me why the data returned in server.js has a random comma inserted at character 8192 and at rough multiples of it. There is no pattern except the location; sometimes it is in the middle of an object key string, other times in the value, and other times right after the comma separating the object pairs. This completely breaks JSON.parse() on the returned data. Example outputs are below. When I run the script from a Python shell, this doesn't happen.
I seriously can't figure out why this is happening, any experienced devs have any ideas?
Sample output in browser
[..., {...ate': 1.0, 'intended recipient': 4.,0, 'correc...}, ...]
Sample output in python shell
[..., {...ate': 1.0, 'intended recipient': 4.0, 'correc...}, ...]
DISREGARD ANY DISCREPANCIES REGARDING OBJECT CONVERSION AND HANDLING IN THE FILES BELOW. THE CODE HAS BEEN SIMPLIFIED TO SHOWCASE THE ISSUE
server.js
var restify = require('restify');
var py = require('python-runner');

var server = restify.createServer({...});

server.get('/keyword-extraction', function (req, res, next) {
    py.execScript(__dirname + '/keyword-extraction.py', {
        bin: '.py/bin/python'
    })
    .then(function (data) {
        var fData = JSON.parse(data); // <---- ERROR
        res.json(fData);
    })
    .catch(function (err) {...});
    return next();
});

server.listen(8001, 'localhost', function () {...});
keyword-extraction.py
import csv
import json
import RAKE

f = open('emails.csv', 'rb')
f.readline()  # skip line containing col names

outputData = []
try:
    reader = csv.reader(f)
    for row in reader:
        email = {}
        emailBody = row[7]
        Rake = RAKE.Rake('SmartStoplist.txt')
        rakeOutput = Rake.run(emailBody)
        for tuple in rakeOutput:
            email[tuple[0]] = tuple[1]
        outputData.append(email)
finally:
    f.close()

print(json.dumps(outputData))
This looks suspiciously like a bug related to the size of some buffer, since 8192 is a power of two.
The main thing here is to isolate exactly where the failure is occurring. If I were debugging this, I would:
1. Take a closer look at the output of json.dumps by printing several characters on either side of position 8191, ideally the integer character code (Unicode, ASCII, or whatever).
2. If that looks OK, try capturing the output of the Python script as a file and reading that directly in the Node server (i.e. don't run a Python script).
3. If that works, create a Python script that takes that file and outputs it without manipulation, and have your Node server execute that script instead of the one it is using now.
That should help you figure out where the problem is occurring. From comments, I suspect that this is essentially a bug that you cannot control, unless you can increase the python buffer size enough to guarantee your data will never blow the buffer. 8K is pretty small, so that might be a realistic solution.
If that is inadequate, then you might consider processing the data on the Node server to remove every character at n * 8192, if you can consistently rely on that; see the sketch below. Good luck.
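A sketch of that last-resort workaround, under the (untested) assumption that exactly one spurious character is injected after every 8192 characters of the original output:

// The k-th stray character then sits at index 8192*k + (k-1) in the received
// string, i.e. wherever (i + 1) is a multiple of 8193.
function stripBufferArtifacts(raw, blockSize) {
    blockSize = blockSize || 8192;
    var out = '';
    for (var i = 0; i < raw.length; i++) {
        if ((i + 1) % (blockSize + 1) === 0) {
            continue; // skip the injected character
        }
        out += raw[i];
    }
    return out;
}

var fData = JSON.parse(stripBufferArtifacts(data));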
To start off: I am currently using npm fast-csv, which is a nice CSV reader/writer that is pretty straightforward and simple. What I'm attempting to do is use it in conjunction with iconv to process "accented" and non-ASCII characters and either convert them to an ASCII equivalent or remove them, depending on the character.
My current process with fast-csv is to bring in a chunk for processing (it comes in as one row) via a read stream, pause the read stream, process the data, pipe the data to a write stream, and then resume the read stream using a callback. fast-csv currently knows where to separate the chunks (rows) based on the format of the data coming in from the read stream.
The entire process looks like this:
var fs = require('fs');
var csv = require('fast-csv');

var stream = fs.createReadStream(inputFileName);

function csvPull(source) {
    csvWrite = csv.createWriteStream({ headers: true });
    writableStream = fs.createWriteStream(outputFileName);
    csvStream = csv()
        .on("data", function (data) {
            csvStream.pause();
            processRow(data, function () {
                csvStream.resume();
            });
        })
        .on("end", function () {
            console.log('END OF CSV FILE');
        });
    csvWrite.pipe(writableStream);
    source.pipe(csvStream);
}

csvPull(stream);
The problem I am currently running into is that, for some reason, my JavaScript does not inherently recognise the non-ASCII characters, so I am resorting to using npm iconv-lite to encode the data stream as it comes in into something usable. However, this presents a bigger issue, as fast-csv will no longer know where to split the chunks (rows) because of the now-encoded data. This is a problem due to the sizes of the CSVs I will be working with; it will not be an option to load the entire CSV into the buffer and then decode it.
Are there any suggestions on how I might get around this without writing my own CSV parser into my code?
Try reading your file with binary as the encoding option. I had to read a few CSVs with some accented characters, and it worked fine with that.
var stream = fs.createReadStream(inputFileName, { encoding: 'binary' });
Unless I misunderstand, you should be able to fix this by setting the encoding on the stream to utf-8 (docs).
for the first line:
var stream = fs.createReadStream(inputFileName, {encoding: 'utf8'});
and if needed:
writableStream = fs.createWriteStream(outputFileName, {defaultEncoding: 'utf8'});
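If the source file is not UTF-8 to begin with, another option (a sketch, assuming iconv-lite is installed and the input is win1252-encoded) is to pipe the raw bytes through a decode stream before fast-csv sees them, so the parser splits rows on already-decoded text:

var fs = require('fs');
var csv = require('fast-csv');
var iconv = require('iconv-lite');

fs.createReadStream(inputFileName)           // raw bytes
    .pipe(iconv.decodeStream('win1252'))     // decode to text (assumed source encoding)
    .pipe(csv())                             // fast-csv now splits rows on proper characters
    .on('data', function (row) {
        // processRow(row, ...) as before
    })
    .on('end', function () {
        console.log('END OF CSV FILE');
    });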
I'm having some trouble saving a byte array (fetched from a Microsoft Office task pane using Office.js) to a Word document file (on the server side). This is what I'm doing:
I'm getting the content of the Word document using this library.
JavaScript
$('#push').click(function () {
    $.when(OffQuery.getContent({ sliceSize: 1000000 }, function (j, data, result, file, opt) {
        // ...nothing interesting here
    })).then(function (finalByteArray, file, opt) {
        // (1) this line is changed...see the answer
        var fileContent = Base64.encode(finalByteArray); // encode the byte array into a base64 string
        $.ajax({
            url: '/webcontext/api/v1/documents',
            // (2) missing setting (see the answer)
            data: fileContent,
            type: 'POST'
        }).then(function () {
            // updateStatus('Done sending contents into server...');
        });
    }).progress(function (j, chunkOfData, result, file, opt) {
        // ...nothing interesting here
    });
});
Then in a Spring controller I'm doing this:
Java / Spring
@RequestMapping(method = RequestMethod.POST) // As OOXML
public void create(@RequestBody String fileContent, HttpServletRequest request) throws Exception { // TODO
    LOGGER.debug("{} {}", request.getMethod(), request.getRequestURI());
    // LOGGER.debug("fileContent: {}", fileContent);
    try {
        val base64 = Base64.decodeBase64(fileContent); // from Apache Commons Codec
        FileUtils.writeByteArrayToFile(new File("assets/tests/output/some_file.docx"), base64);
    } catch (IOException e) {
        LOGGER.error("Crash! Something went wrong here while trying to save that...this is why: ", e);
    }
}
...but the file is getting saved as-is; basically it is saving the byte array into the file as a text document.
Am I missing something? Do you have any clues? Has anyone worked with Office.js, task panes and things like that?
Thanks in advance...
UPDATE 1
It turns out that the finalByteArray is getting converted into a Base64 string (fileContent), but when I try to do the reverse operation in Java it is not working...if somebody has done that before, please let me know. I have tried:
The sample in this Mozilla page
Unibabel
base64-js
...on the Java side (to decode the Base64 String into a byte array):
The default Base64 encoder/decoder
The Base64 Apache codec
Actually, I found the error. It was on the client side. There is a function included in the Office.js SDK that does the conversion from the byte array to a Base64 string (although I'm not sure if it's shipped with all versions; I'm using Office.js SDK 1.1).
So I changed the conversion to:
var fileContent = OSF.OUtil.encodeBase64(finalByteArray);
...and in the AJAX call I added the contentType setting:
$.ajax({
    // ...
    contentType: 'application/octet-stream',
    // ...
    type: 'POST'
}).always(function () {
    // ...
});
By setting the content type correctly I was able to post "the right content" to the server.
Even if I do the correct Base64 conversion without setting the correct Content Type, the received data in the Spring controller is different (larger in this case) than the one reported on the client side.
I hope this may help someone else in the future. The examples on the Microsoft site are quite clear, but for some reason "there is always something different between environments".
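In case OSF.OUtil.encodeBase64 is not available in a given Office.js build (as noted above, it may not ship with all versions), a plain JavaScript fallback along these lines should produce an equivalent Base64 string; a sketch, assuming finalByteArray is a Uint8Array or a plain array of byte values:

// Hypothetical fallback: convert the byte array to Base64 without OSF.OUtil.
function bytesToBase64(bytes) {
    var binary = '';
    for (var i = 0; i < bytes.length; i++) {
        binary += String.fromCharCode(bytes[i]);
    }
    return btoa(binary); // btoa is available in the task pane's browser context
}

var fileContent = bytesToBase64(finalByteArray);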
There is no obvious error in your code, except (as commented) that I don't know why you cut off the last character from your code.
Why don't you use a JavaScript debugger like Firebug and a remote Java debugger for your web server to check every step of your processing and inspect the content of the various variables (JavaScript fileContent, Java fileContent, Java base64) to find out where the error creeps in?
I'm attempting to convert a string from CP932 (aka Windows-31J) to UTF-8 in JavaScript. Basically I'm crawling a site that ignores the UTF-8 request header and returns CP932-encoded text (even though the HTML meta tag indicates that the page is shift_jis).
Anyway, I have the entire page stored in a string variable called "html". From there I'm attempting to convert it to UTF-8 using this code:
var Iconv = require('iconv').Iconv;
var conv = new Iconv('CP932', 'UTF-8//TRANSLIT//IGNORE');
var myBuffer = new Buffer(html.length * 3);
myBuffer.write(html, 0, 'utf8')
var utf8html = (conv.convert(myBuffer)).toString('utf8');
The result is not what it's supposed to be. For example, the string: "投稿者さんの 稚内全日空ホテル のクチコミ (感想・情報)" comes out as "ソスソスソスeソスメゑソスソスソスソスソス ソスtソスソスソスSソスソスソスソスソスzソスeソスソス ソスフクソス`ソスRソス~ (ソスソスソスzソスEソスソスソスソス)"
If I remove //TRANSLIT//IGNORE (which should make it return similar characters for missing ones and, failing that, omit characters that cannot be transcoded), I get this error:
Error: EILSEQ, Illegal character sequence.
I'm open to using any solution that can be implemented in nodejs, but my search results haven't yielded many options outside of the nodejs-iconv module.
nodejs-iconv ref: https://github.com/bnoordhuis/node-iconv
Thanks!
Edit 24.06.2011:
I've gone ahead and implemented a solution in Java. However I'd still be interested in a javascript solution to this problem if somebody can solve it.
I ran into the same trouble today :)
It depends on libiconv. You need libiconv-1.13-ja-1.patch.
Please check the following:
http://d.hatena.ne.jp/ushiboy/20110422/1303481470
http://code.xenophy.com/?p=1529
Or you can avoid the problem by using iconv-jp instead; try
npm install iconv-jp
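A sketch of how the conversion from the question might look with it, assuming iconv-jp exposes the same Iconv constructor as node-iconv and that html holds the raw CP932 bytes as a 'binary' (latin1) string rather than an already-decoded string:

var Iconv = require('iconv-jp').Iconv;

// Convert the raw CP932 bytes straight to UTF-8; do not re-encode the
// JavaScript string as UTF-8 first.
var conv = new Iconv('CP932', 'UTF-8');
var utf8html = conv.convert(new Buffer(html, 'binary')).toString('utf8');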
I had the same problem, but with CP1250. I looked for the problem everywhere and everything was OK, except the call to request: I had to add encoding: 'binary'.
request = require('request')
Iconv = require('iconv').Iconv
request({uri: url, encoding: 'binary'}, function(err, response, body) {
body = new Buffer(body, 'binary')
iconv = new Iconv('CP1250', 'UTF8')
body = iconv.convert(body).toString()
// ...
})
https://github.com/bnoordhuis/node-iconv/issues/19
I tried /Users/Me/node_modules/iconv/test.js with
node test.js
and it returned an error.
On Mac OS X Lion, this problem seems to depend on gcc.