Parsing huge binary files in Node.js - javascript

I want to create a Node.js module that can parse huge binary files (some larger than 200GB). Each file is divided into chunks, and each chunk can be larger than 10GB. I tried using the flowing and non-flowing methods of reading the file, but the problem is that the end of the buffer being read is reached while a chunk is still being parsed, so parsing of that chunk has to stop before the next onData event occurs. This is what I've tried:
var s = getStream();
s.on('data', function(a){
  parseChunk(a);
});
function parseChunk(a){
  /*
  There is a lot of code and many functions here.
  One chunk is larger than the buffer passed to this function,
  so when the end of this buffer is reached, parseChunk
  must return before the parsing process is finished.
  Also, when the next buffer is passed in, it is not the start of
  a new chunk, because the previous chunk was not parsed to the end.
  */
}
Loading a whole chunk into process memory isn't possible because I have only 8GB of RAM. How can I synchronously read data from the stream, or how can I pause the parseChunk function when the end of the buffer is reached and wait until new data is available?
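For illustration, one common pattern for exactly this problem (a record spanning several 'data' buffers) is to keep the unparsed tail of each buffer and prepend it to the next one. A minimal sketch, where tryParseRecords() is a hypothetical incremental parser that returns how many bytes it consumed:
var leftover = Buffer.alloc(0);

s.on('data', function (data) {
  // Prepend whatever was left unparsed from the previous buffer.
  var buf = Buffer.concat([leftover, data]);
  // tryParseRecords() is hypothetical: it parses as many complete
  // records as it can and returns the number of bytes it consumed.
  var consumed = tryParseRecords(buf);
  // Keep the partial record for the next 'data' event.
  leftover = buf.slice(consumed);
});

s.on('end', function () {
  if (leftover.length > 0) {
    // The stream ended with a partial record still buffered.
    tryParseRecords(leftover);
  }
});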

Maybe I'm missing something, but as far as I can tell, I don't see a reason why this couldn't be implemented using streams with a different syntax. I'd use
let chunk;
let Nbytes; // # of bytes to read into a chunk
stream.on('readable', () => {
  while ((chunk = stream.read(Nbytes)) !== null) {
    // call whatever you like on the chunk of data of size Nbytes
  }
});
Note that if you specify the size of the chunk yourself, as done here, null will be returned if the number of bytes requested is not yet available. That doesn't mean there is no more data to stream. So be aware that at the end of the file you should expect back a 'trimmed' buffer object of size < Nbytes.
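To make the end-of-stream behaviour concrete, here is a sketch of telling that final, shorter piece apart inside the same loop (handleFullChunk and handleTail are hypothetical handlers):
stream.on('readable', () => {
  let chunk;
  while ((chunk = stream.read(Nbytes)) !== null) {
    if (chunk.length === Nbytes) {
      handleFullChunk(chunk); // a full Nbytes piece
    } else {
      handleTail(chunk); // the trimmed piece at the end of the file
    }
  }
});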

Related

Java Script File Reader class is not reading the contents properly

I am using the readAsText method of the FileReader class (JavaScript) with the encoding type set to "UTF-8" to read a file from the client. It works well for all kinds of characters with code points ranging from 1 to 65000. The only problem I have is that when I read the file chunk by chunk, a character with a code point above 3000 is sometimes not read properly. After investigating, I found that it happens only when I read big files this way and the particular character happens to sit as the first letter of a chunk. I tested with multiple chunks of a file: the problem does not happen for all the chunks, only for one or two chunks out of ten. This is weird and strange. Am I missing something here? And do we have any other options to read a local file in JavaScript? Any help will be much appreciated.
This might be one solution:
new Blob(['hi']).text().then(console.log)
Here is another, not so cross-browser friendly... but this could work:
new Blob(['foo']) // or new File(['foo'], 'test.txt')
  .stream()
  .pipeThrough(new TextDecoderStream('utf-8'))
  .pipeTo(new WritableStream({
    write(part) {
      console.log(part)
    }
  }))
Another, lower-level solution that doesn't depend on WritableStream or TextDecoderStream would be to use a regular TextDecoder with the stream option:
var res = ''
const decoder = new TextDecoder()
res += decoder.decode(chunk1, { stream: true })
res += decoder.decode(chunk2, { stream: true })
res += decoder.decode(chunk3, { stream: true })
res += decoder.decode() // flush the end
How you get each chunk could be by using new Response(blob).body, or blob.stream(), or simply by slicing the blob with blob.slice(start, end) and using FileReader.prototype.readAsArrayBuffer().
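Putting the slicing idea together with the stream option, a rough sketch that feeds Blob slices through a single TextDecoder (the 1 MiB slice size is arbitrary, and Blob.prototype.arrayBuffer() is used here instead of a FileReader for brevity):
async function blobToText(blob, chunkSize = 1024 * 1024) {
  const decoder = new TextDecoder('utf-8')
  let res = ''
  for (let start = 0; start < blob.size; start += chunkSize) {
    const part = blob.slice(start, start + chunkSize)
    const buf = await part.arrayBuffer()
    // stream: true keeps multi-byte characters that were split
    // across two slices from being mangled
    res += decoder.decode(buf, { stream: true })
  }
  res += decoder.decode() // flush the end
  return res
}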

Node.js monitor file for changes and parse them

I need to monitor a file for changes. Due to the large number of new entries being written to it, I need to 'monitor' the file and get the newly inserted content so that I can parse it.
I found this code:
fs.watchFile('var/log/query.log', function() {
  console.log('File Changed ...');
  //how to get the new line which is now inserted?
});
Here is an example of how I used fs.watchFile to monitor a log file for a game called Hearthstone, picking up new log entries to track game events as they happened while playing. https://github.com/chevex-archived/hearthstone-log-watcher/blob/master/index.js
var fs = require('fs');
var options = {
  logFile: '~/Library/Preferences/Blizzard/Hearthstone/log.config',
  endOfLineChar: require('os').EOL
};
// Obtain the initial size of the log file before we begin watching it.
var fileSize = fs.statSync(options.logFile).size;
fs.watchFile(options.logFile, function (current, previous) {
  // Check if file modified time is less than last time.
  // If so, nothing changed so don't bother parsing.
  if (current.mtime <= previous.mtime) { return; }
  // We're only going to read the portion of the file that
  // we have not read so far. Obtain new file size.
  var newFileSize = fs.statSync(options.logFile).size;
  // Calculate size difference.
  var sizeDiff = newFileSize - fileSize;
  // If less than zero then Hearthstone truncated its log file
  // since we last read it in order to save space.
  // Set fileSize to zero and set the size difference to the current
  // size of the file.
  if (sizeDiff < 0) {
    fileSize = 0;
    sizeDiff = newFileSize;
  }
  // Create a buffer to hold only the data we intend to read.
  // (Buffer.alloc replaces the deprecated `new Buffer(size)` constructor.)
  var buffer = Buffer.alloc(sizeDiff);
  // Obtain reference to the file's descriptor.
  var fileDescriptor = fs.openSync(options.logFile, 'r');
  // Synchronously read from the file starting from where we read
  // to last time and store data in our buffer.
  fs.readSync(fileDescriptor, buffer, 0, sizeDiff, fileSize);
  fs.closeSync(fileDescriptor); // close the file
  // Set old file size to the new size for next read.
  fileSize = newFileSize;
  // Parse the line(s) in the buffer.
  parseBuffer(buffer);
});
function stop () {
  fs.unwatchFile(options.logFile);
}
function parseBuffer (buffer) {
  // Iterate over each line in the buffer.
  buffer.toString().split(options.endOfLineChar).forEach(function (line) {
    // Do stuff with the line :)
  });
}
It first calculates the initial size of the file, because in this log watcher module I only want to read new data as it's being written by the game; I don't care about existing data. It then starts watching the file for changes. When the change handler fires, we check whether the modified time is really newer, because other changes to the file can trigger the handler even when no data we care about has actually changed. We wanted this watcher to be as performant as possible.
We then read the new size of the file and calculate the difference from the last time. This tells us exactly how much data to read from the file to get only the newly written data. We then store that data in a buffer and parse it as a string: just split the string on newline characters. Using the core os module to get os.EOL gives you the correct line-ending character for the operating system you are running on (the Windows line ending is different from Linux/Unix).
Now you have an array of lines written to the file :)
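For example, the "do stuff with the line" step could be a simple pattern match; the regex below is purely illustrative and not the actual Hearthstone log format:
function parseBuffer (buffer) {
  buffer.toString().split(options.endOfLineChar).forEach(function (line) {
    // Illustrative only: react to lines that match some pattern.
    var match = line.match(/ZoneChange.* name=(.+?) id=(\d+)/);
    if (match) {
      console.log('Entity %s (id %s) changed zones', match[1], match[2]);
    }
  });
}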
In bash you would do something like that with tail --follow.
There is also a package, tail, available.
You can watch a file and get new lines with an event:
const Tail = require('tail').Tail;
var tail = new Tail("var/log/query.log");
tail.watch();
tail.on("line", data => {
  console.log(data);
});

Unable to read accented characters from csv file stream in node

To start off, I am currently using npm fast-csv, which is a nice CSV reader/writer that is pretty straightforward and simple. What I'm attempting to do is use it in conjunction with iconv to process "accented" and non-ASCII characters and either convert them to an ASCII equivalent or remove them, depending on the character.
My current process with fast-csv is to bring in a chunk for processing (it comes in as one row) via a read stream, pause the read stream, process the data, pipe the data to a write stream, and then resume the read stream using a callback. fast-csv currently knows where to separate the chunks based on the format of the data coming in from the read stream.
The entire process looks like this:
var stream = fs.createReadStream(inputFileName);
function csvPull(source) {
  csvWrite = csv.createWriteStream({ headers: true });
  writableStream = fs.createWriteStream(outputFileName);
  csvStream = csv()
    .on("data", function (data) {
      csvStream.pause();
      processRow(data, function () {
        csvStream.resume();
      });
    })
    .on("end", function () {
      console.log('END OF CSV FILE');
    });
  csvWrite.pipe(writableStream);
  source.pipe(csvStream);
}
csvPull(stream);
The problem I am currently running into is that, for some reason, my JavaScript does not inherently recognise the non-ASCII characters, so I am resorting to npm iconv-lite to encode the incoming data stream into something usable. However, this presents a bigger issue, as fast-csv will no longer know where to split the chunks (rows) because of the now-encoded data. This is a problem due to the size of the CSVs I will be working with; loading the entire CSV into the buffer and then decoding it is not an option.
Are there any suggestions on how I might get around this without writing my own CSV parser into my code?
Try reading your file with binary as the encoding option. I had to read a few CSVs with some accented characters, and it worked fine with that.
var stream = fs.createReadStream(inputFileName, { encoding: 'binary' });
Unless I misunderstand, you should be able to fix this by setting the encoding on the stream to utf-8 (docs).
for the first line:
var stream = fs.createReadStream(inputFileName, {encoding: 'utf8'});
and if needed:
writableStream = fs.createWriteStream(outputFileName, {defaultEncoding: 'utf8'});
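If the file is not UTF-8 to begin with, another option is to decode the byte stream before fast-csv ever sees it, so row splitting happens on already-decoded text. A sketch, assuming the source encoding is Windows-1252 and that iconv-lite is installed (both are assumptions; adjust the encoding to your data):
var fs = require('fs');
var csv = require('fast-csv');
var iconv = require('iconv-lite');

fs.createReadStream(inputFileName)
  .pipe(iconv.decodeStream('win1252')) // assumed source encoding
  .pipe(csv())
  .on('data', function (row) {
    // row arrives already decoded; process it here
  })
  .on('end', function () {
    console.log('END OF CSV FILE');
  });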

Changing the size of an array in Javascript or node.js efficiently

In my node.js code, there is a buffer to which I store the contents of a received image each time an image is received over a TCP connection, with node.js acting as the TCP client and another machine as the TCP server:
var byBMP = new Buffer (imageSize);
However, the size of the image, imageSize, differs every time, which makes the size of byBMP change accordingly. That means something like this happens constantly:
var byBMP = new Buffer (10000);
.... wait to receive another image
byBMP = new Buffer (30000);
.... wait to receive another image
byBMP = new Buffer (10000);
... etc
Question:
Is there any more efficient way of resizing the array byBMP? I looked into How to empty an array in JavaScript?, which gave me some ideas about efficient ways of emptying an array, but for resizing it I would like your input.
The efficient way would be to use streams rather than a buffer. The buffer keeps the content of the image in memory, which is not very scalable.
Instead, simply pipe the download stream directly to a file output:
imgDownloadStream.pipe( fs.createWriteStream('tmp-img') );
If you're keeping the image in memory to perform a transformation, then simply apply the transformation to the stream's chunks as they pass through.
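For the "transform it on the way through" case, a sketch using a Transform stream; transformChunk() is a hypothetical per-chunk operation:
var fs = require('fs');
var Transform = require('stream').Transform;

// Processes each chunk as it passes through, so the whole image
// never has to sit in memory at once.
var imgTransform = new Transform({
  transform: function (chunk, encoding, callback) {
    callback(null, transformChunk(chunk)); // transformChunk is hypothetical
  }
});

imgDownloadStream
  .pipe(imgTransform)
  .pipe(fs.createWriteStream('tmp-img'));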

Non blocking loop through binary buffer and push to socket

I am uploading a file to FTP via chrome.sockets, but the socket buffer size is limited, so I need to loop through the blob and send out smaller chunks of data. I have tried several methods with closures and callbacks, but the only way that works for me is a do/while loop, which is of course blocking. Part of the problem is the multiple variables that need to be kept in the closure. Can you please suggest a better way of looping through the blob?
do {
  chunk = blob.slice(start, end);
  start = end;
  end = end + 8192;
  chrome.socket.write(this.info.socketId, Socket.string2buffer(chunk), function(writeInfo) {});
} while (chunk.length > 0);
Complete code of the extension (single purpose ftp manager) https://github.com/vanous/minime-content-manager/tree/master/chromium-ext-broadcast
I believe something along the lines of the following should work:
var self = this;
var writeChunk = function(start, end) {
  var chunk = blob.slice(start, end);
  chrome.socket.write(self.info.socketId, Socket.string2buffer(chunk), function(writeInfo) {
    if (chunk.length > 0) writeChunk(end, end + 8192);
  });
};
writeChunk(0, 8192);
