I have code which reads a .csv line by line and saves each parsed row to the database:
const csv = require('csv-parse')

const errors = []
csv.parse(content, {})
  .on('data', async function (row: any) {
    const error = await tryToSaveToDatabase(row);
    if (error) {
      errors.push(error)
    }
  })
  .on('end', function () {
    // somehow process all errors
  })
but, unfortunately, the .on('end', ...) block is called before all of the awaited save operations have completed.
I have read NodeJS CSV parser async operations - it seems we cannot use await inside the .on('data', ...) callback.
What is the correct way to do this if I want to read a .csv line by line (files might be very huge, so it must be done in a streaming manner) and collect any errors that occur while saving to the database? (These errors are then displayed on the frontend.)
The async iterator API (https://csv.js.org/parse/api/async_iterator/) reads the .csv line by line in a streaming manner and lets you await each row before moving on to the next one.
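A minimal sketch of that approach, reusing content and tryToSaveToDatabase from the question and wrapping the loop in a hypothetical saveAllRows function (an illustration of the pattern, not a drop-in implementation):

const { parse } = require('csv-parse')

// hypothetical wrapper; `content` and `tryToSaveToDatabase` come from the question
async function saveAllRows(content) {
  const errors = []
  const parser = parse(content, {})
  // the parser is an async iterable, so each record is awaited in turn
  for await (const row of parser) {
    const error = await tryToSaveToDatabase(row)
    if (error) {
      errors.push(error)
    }
  }
  // every row has been processed by the time we get here
  return errors
}

Because the loop only pulls the next record after the previous save has resolved, the file is still processed in a streaming manner and the errors array is complete when the function returns.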
Requirement: I am supposed to hit around 100 links given in a file and download each one sequentially. The response content type is an octet-stream file (a .ts file). I am using the "got" library to stream. However, I am not able to do the task sequentially even though I am using await.
I do not want to hit all the links asynchronously and then do Promise.allSettled(); I want it to be sequential.
However, in the output I see the for loop executing without waiting for the previous file to be completely downloaded and written to disk: all the console.log statements for each file are printed first, and the writes to the files only finish later, before the process exits.
(Each file is around 2 MB.)
How do I execute this sequentially?
const allFileLines = allFileContents.split(/\r?\n/);

async function makeApiCall() {
  for await (const line of allFileLines) {
    if (line.startsWith("https")) {
      const fileName = new URL(line).pathname.split('/')[1];
      const downloadPathAndFile = `${download_path}${dirName}/${fileName}`;
      console.log(`Downloading file : ${fileName}`);
      await downloadFileWithGot(line, downloadPathAndFile, getOptions(line))
    }
  }
}

await makeApiCall();
// in a different utility file
export async function downloadFileWithGot(url, outputLocationPath, options) {
  console.log('entered inside fn');
  const pipeResp = await got.stream(url, undefined, options)
  console.log('got pipe resp');
  const writeResp = await pipeResp.pipe(createWriteStream(outputLocationPath));
  console.log('successfully wrote to file');
  return writeResp;
}
/*
File in which the links are present (it's a simple txt file):
https://example.com/__segment:12345/stream_1/file_0.ts
https://example.com/__segment:12345/stream_1/file_1.ts
.
.
------------ output ------------
Downloading file : file_0.ts
entered inside fn
got pipe resp
successfully wrote to file
Downloading file : file_1.ts
entered inside fn
got pipe resp
successfully wrote to file
.
.
.
*/
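The loop does not actually wait because pipeResp.pipe(...) returns the destination stream immediately rather than a promise, so the await has nothing to wait on. A sketch of one way to make each download finish before the loop continues, using pipeline from Node's stream/promises module (the question's function name is kept; I'm assuming the options object can be passed directly as got.stream(url, options)):

// in the utility file
import { pipeline } from 'stream/promises';
import { createWriteStream } from 'fs';
import got from 'got';

export async function downloadFileWithGot(url, outputLocationPath, options) {
  console.log('entered inside fn');
  // pipeline() resolves only after the response has been fully written to
  // disk, and rejects if either the download or the write fails, so the
  // caller's await really does block until this file is finished.
  await pipeline(got.stream(url, options), createWriteStream(outputLocationPath));
  console.log('successfully wrote to file');
}

With this in place, the for loop in makeApiCall only moves on to the next link once the previous .ts file is completely on disk.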
I am trying the following code (from the parquetjs-lite sample and Stack Overflow) to read a parquet file in Node.js:
const readParquetFile = async () => {
  try {
    // create new ParquetReader that reads from test.parquet
    let reader = await parquet.ParquetReader.openFile('test.parquet');
  }
  catch (e) {
    console.log(e);
    throw e;
  }

  // create a new cursor
  let cursor = reader.getCursor();

  // read all records from the file and print them
  let record = null;
  while (record = await cursor.next()) {
    console.log(record);
  }

  await reader.close();
};
When I run this code nothing happens. There is nothing written to the console. For testing purposes I have only used a small CSV file, which I converted to parquet using Python.
Is it because I converted from CSV to parquet using Python? (I couldn't find any JS equivalent that works for the large files I ultimately have to handle.)
I want my application to be able to take in any parquet file and read it. Is there any limitation in parquetjs-lite in this regard?
There are NaN values in my CSV - could that be a problem?
Any pointers would be helpful.
Thanks
Possible failure cases:
You are calling this function in some file without a web server running.
In this case the function runs asynchronously: its continuation goes onto the callback queue, and because your main call stack is empty the program ends, so the code waiting in the callback queue never runs or logs anything.
To solve this, try running a web server, or better, use synchronous calls.
//app.js (without webserver)
const readParquetFile = async () => {
  console.log("running")
}

readParquetFile()
console.log("exit")
When you run the above code, the output will be
exit
//syncApp.js
const readParquetFile = () => {
  console.log("running")
  // all functions should be sync
}

readParquetFile()
console.log("exit")
Here the console log will be
running
exit
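For completeness, a sketch of the reading code with reader kept in scope of the cursor loop and the returned promise handled, so that any error actually surfaces instead of being swallowed (same parquetjs-lite calls as in the question; I'm assuming the module is loaded as shown):

const parquet = require('parquetjs-lite');

const readParquetFile = async () => {
  let reader;
  try {
    // keep `reader` in scope for the cursor loop below
    reader = await parquet.ParquetReader.openFile('test.parquet');
    const cursor = reader.getCursor();

    // read all records from the file and print them
    let record = null;
    while (record = await cursor.next()) {
      console.log(record);
    }
  } finally {
    if (reader) await reader.close();
  }
};

// awaiting the call (or at least attaching .catch) makes failures visible
readParquetFile().catch(err => console.error(err));

Note that in the original snippet reader is declared with let inside the try block, so the reader.getCursor() call below it is out of scope.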
I read that createReadStream doesn't put the whole file into memory; instead it works with chunks. However, I have a situation where I am simultaneously writing to and reading from a file. The write finishes first, then I delete the file from disk. Somehow, the read stream was able to finish reading the whole file without any error.
Does anyone have an explanation for this? Am I wrong to think that streams don't load the file into memory?
Here's the code for writing to a file
const fs = require('fs');
const file = fs.createWriteStream('./bigFile4.txt');

function write(stream, data) {
  if (!stream.write(data))
    return new Promise(resolve => stream.once('drain', resolve));
  return true;
}

(async () => {
  for (let i = 0; i < 1e6; i++) {
    const res = write(file, 'a')
    if (res instanceof Promise)
      await res;
  }
  write(file, 'success');
})();
For reading I used this:
const file = fs.createReadStream('bigFile4.txt')

file.on('data', (chunk) => {
  console.log(chunk.toString())
})

file.on('end', () => {
  console.log('done')
})
At least on UNIX-type OSes, if you open a file and then remove it, the file data will still be available to read until you close the file.
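A small sketch that demonstrates this on a UNIX-like OS (demo.txt is just a hypothetical file name): the open file descriptor keeps the data readable even after the directory entry has been removed.

const fs = require('fs');

fs.writeFileSync('demo.txt', 'still readable after unlink');
const fd = fs.openSync('demo.txt', 'r'); // hold an open descriptor
fs.unlinkSync('demo.txt');               // remove the directory entry

const buf = Buffer.alloc(64);
const bytesRead = fs.readSync(fd, buf, 0, buf.length, 0);
console.log(buf.toString('utf8', 0, bytesRead)); // prints the original content
fs.closeSync(fd); // the data is only released once the last descriptor closes

This is the same situation as the read stream in the question: createReadStream opened the file before it was deleted, so the underlying descriptor could keep serving chunks until the stream ended.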
I have the following code working for every file except one, which keeps hanging without emitting end or error events (I tried other stream events too).
const fs = require('fs');

const rs = fs.createReadStream(filePath, {
  encoding: 'base64',
});

rs.on('data', () => {
  console.log('data');
});

rs.on('end', () => {
  console.log('end');
});

rs.on('error', e => {
  console.log('error', e);
});
If I move the read start point with the start option to 1 instead of 0, it works properly. The same happens if highWaterMark is set to a value other than the default. That doesn't really help, as it seems it could fail the same way with other "corrupted" files.
It seems like a Node bug, but maybe there's something I'm missing here.
I'll post the file here too, but first I need to strip it down to only the corrupting part, as it's somewhat private.
Update
Here's a file to recreate the issue:
http://s3.eu-west-1.amazonaws.com/jjapitest/file
Update
Here's an interactive demo of the issue:
https://repl.it/repls/AnimatedDisguisedNumerator
I tried to write a program with highland.js that downloads several files, unzips them, and parses them into objects, then merges the object streams into one stream with flatMap and prints them out.
function download(url) {
  return _(request(url))
    .through(zlib.createGunzip())
    .errors((err) => console.log('Error in gunzip', err))
    .through(toObjParser)
    .errors((err) => console.log('Error in OsmToObj', err));
}
const urlList = ['url_1', 'url_2', 'url_3'];

_(urlList)
  .flatMap(download)
  .each(console.log);
When all the URLs are valid, it works fine. If a URL is invalid, no file is downloaded and gunzip reports an error. I suspect that the stream closes when the error occurs. I expected flatMap to continue with the other streams; however, the program doesn't download the other files and nothing is printed out.
What's the correct way to handle errors in a stream, and how do I make flatMap not stop after one stream has an error?
In imperative programming I can add debug logs to trace where an error happens. How do I debug streaming code?
PS. toObjParser is a Node Transform Stream. It takes a readable stream of OSM XML and outputs a stream of objects compatible with Overpass OSM JSON. See https://www.npmjs.com/package/osm2obj
2017-12-19 update:
I tried to call push in errors as @amsross suggested. To verify that push really works, I pushed an XML document; it was parsed by the following parser and I saw it in the output. However, the stream still stopped and url_3 was not downloaded.
function download(url) {
  console.log('download', url);
  return _(request(url))
    .through(zlib.createGunzip())
    .errors((err, push) => {
      console.log('Error in gunzip', err);
      push(null, Buffer.from(`<?xml version='1.0' encoding='UTF-8'?>
<osmChange version="0.6">
<delete>
<node id="1" version="2" timestamp="2008-10-15T10:06:55Z" uid="5553" user="foo" changeset="1" lat="30.2719406" lon="120.1663723"/>
</delete>
</osmChange>`));
    })
    .through(new OsmToObj())
    .errors((err) => console.log('Error in OsmToObj', err));
}
const urlList = ['url_1_correct', 'url_2_wrong', 'url_3_correct'];

_(urlList)
  .flatMap(download)
  .each(console.log);
Update 12/19/2017:
OK, so I can't give you a good "why" on this, but I can tell you that switching from consuming the streams resulting from download in sequence to merge'ing them together will probably give you the result you're after. Unfortunately (or not?), you will no longer get the results back in any prescribed order.
const request = require('request')
const zlib = require('zlib')
const h = require('highland')

// just so you can see there isn't some sort of race
const rnd = (min, max) => Math.floor((Math.random() * (max - min))) + min
const delay = ms => x => h(push => setTimeout(() => {
  push(null, x)
  push(null, h.nil)
}, ms))

const download = url => h(request(url))
  .flatMap(delay(rnd(0, 2000)))
  .through(zlib.createGunzip())
h(['url_1_correct', 'url_2_wrong', 'url_3_correct'])
  .map(download).merge()
  // vs .flatMap(download) or .map(download).sequence()
  .errors(err => h.log(err))
  .each(h.log)
Update 12/03/2017:
When an error is encountered on the stream, it ends that stream. To avoid this, you need to handle the error. You are currently using errors to report the error, but not to handle it. You can do something like this to move on to the next value in the stream:
.errors((err, push) => {
  console.log(err)
  push(null) // push no error forward
})
Original:
It's difficult to answer without knowing what the input and output types of toObjParser are.
Because through passes a stream of values to the provided function and expects a stream of values in return, your issue may reside in toObjParser having a signature like Stream -> Object or Stream -> Stream Object, where the errors occur on the inner stream, which will not emit any errors until it is consumed.
What is the output of .each(console.log)? If it is logging a stream, that is most likely your problem.