Save csv-parse output to a variable - javascript

I'm new to using csv-parse and this example from the project's github does what I need with one exception. Instead of outputting via console.log I want to store data in a variable. I've tried assigning the fs line to a variable and then returning data rather than logging it but that just returned a whole bunch of stuff I didn't understand. The end goal is to import a CSV file into SQLite.
var fs = require('fs');
var parse = require('..');
var parser = parse({delimiter: ';'}, function(err, data){
console.log(data);
});
fs.createReadStream(__dirname+'/fs_read.csv').pipe(parser);
Here is what I have tried:
const fs = require("fs");
const parse = require("./node_modules/csv-parse");
const sqlite3 = require("sqlite3");
// const db = new sqlite3.Database("testing.sqlite");
let parser = parse({delimiter: ","}, (err, data) => {
// console.log(data);
return data;
});
const output = fs.createReadStream(__dirname + "/users.csv").pipe(parser);
console.log(output);

I was also struggling to figure out how to get the data from csv-parse back to the top-level that invokes parsing. Specifically I was trying to get parser.info data at the end of processing to see if it was successful, but the solution for that can work to get the row data as well, if you need.
The key was to wrap all the stream event listeners into a Promise, and within the parser's callback resolve the Promise.
function startFileImport(myFile) {
// THIS IS THE WRAPPER YOU NEED
return new Promise((resolve, reject) => {
let readStream = fs.createReadStream(myFile);
let fileRows = [];
const parser = parse({
delimiter: ','
});
// Use the readable stream api
parser.on('readable', function () {
let record
while (record = parser.read()) {
if (record) { fileRows.push(record); }
}
});
// Catch any error
parser.on('error', function (err) {
console.error(err.message)
});
parser.on('end', function () {
const { lines } = parser.info;
// RESOLVE OUTPUT THAT YOU WANT AT PARENT-LEVEL
resolve({ status: 'Successfully processed lines: ', lines });
});
// This will wait until we know the readable stream is actually valid before piping
readStream.on('open', function () {
// This just pipes the read stream to the response object (which goes to the client)
readStream.pipe(parser);
});
// This catches any errors that happen while creating the readable stream (usually invalid names)
readStream.on('error', function (err) {
resolve({ status: null, error: 'readStream error' + err });
});
});
}

This is a question that suggests confusion about an asynchronous streaming API and seems to ask at least three things.
How do I get output to contain an array-of-arrays representing the parsed CSV data?
That output will never exist at the top-level, like you (and many other programmers) hope it would, because of how asynchronous APIs operate. All the data assembled neatly in one place can only exist in a callback function. The next best thing syntactically is const output = await somePromiseOfOutput() but that can only occur in an async function and only if we switch from streams to promises. That's all possible, and I mention it so you can check it out later on your own. I'll assume you want to stick with streams.
An array consisting of all the rows can only exist after reading the entire stream. That's why all the rows are only available in the author's "Stream API" example only in the .on('end', ...) callback. If you want to do anything with all the rows present at the same time, you'll need to do it in the end callback.
From https://csv.js.org/parse/api/ note that the author:
uses the on readable callback to push single records into a previously empty array defined externally named output.
uses the on error callback to report errors
uses the on end callback to compare all the accumulated records in output to the expected result
...
const output = []
...
parser.on('readable', function(){
let record
while (record = parser.read()) {
output.push(record)
}
})
// Catch any error
parser.on('error', function(err){
console.error(err.message)
})
// When we are done, test that the parsed output matched what expected
parser.on('end', function(){
assert.deepEqual(
output,
[
[ 'root','x','0','0','root','/root','/bin/bash' ],
[ 'someone','x','1022','1022','','/home/someone','/bin/bash' ]
]
)
})
As to the goal on interfacing with sqlite, this is essentially building a customized streaming endpoint.
In this use case, implement a customized writable stream that accepts the output of parser and sends rows to the database.
Then you simply chain pipe calls as
fs.createReadStream(__dirname+'/fs_read.csv')
.pipe(parser)
.pipe(your_writable_stream)
Beware: This code returns immediately. It does not wait for the operations to finish. It interacts with a hidden event loop internal to node.js. The event loop often confuses new developers who are arriving from another language, used to a more imperative style, and skipped this part of their node.js training.
Implementing such a customized writable stream can get complicated and is left as an exercise for the reader. It will be easiest if the parser emits a row, and then the writer can be written to handle single rows. Make sure you are able to notice errors somehow and throw appropriate exceptions, or you'll be cursed with incomplete results and no warning or reason why.
A hackish way to do it would have been to replace console.log(data) in let parser = ... with a customized function writeRowToSqlite(data) that you'll have to write anyway to implement a custom stream. Because of asynchronous API issues, using return data there does not do anything useful. It certainly, as you saw, fails to put the data into the output variable.
As to why output in your modified posting does not contain the data...
Unfortunately, as you discovered, this is usually wrong-headed:
const output = fs.createReadStream(__dirname + "/users.csv").pipe(parser);
console.log(output);
Here, the variable output will be a ReadableStream, which is not the same as the data contained in the readable stream. Put simply, it's like when you have a file in your filesystem, and you can obtain all kinds of system information about the file, but the content contained in the file is accessed through a different call.

Related

Javascript : fs Read function is ignored

I'm using javascript because I have no choice : the api that i'm using is in javascript and only in javascript. I've no knowledge about it at all. I'm trying to do a quit function that store data (polls list) in case in of an error before exiting. In parallel of that I have an init function that restore the data.
function quit() {
fs.readFile(pollsName, "utf8", function readFileCallback(err, data) {
// never reach this line
if (err) {
console.log(err);
} else {
obj = JSON.parse(data); //now it an object
obj.poll = polls.map((poll) => poll.serialize);
json = JSON.stringify(obj); //convert it back to json
fs.writeFile(jsonName, json, (e) => {
if (e) throw e;
console.log("Polls saved");
}); // write it back
}
});
exit(1);
}
As you can see I never enter in the readFile() function. However I have another function write() that correctly store logs in similar file. The file name is correct, it is a json file with a key "poll" that represent an empty list. Both fs and exit package are import with "require".
I currently have no clue to solve this situation and i'm struggling on it. At first I was handling signal but I realised it wasn't possible because write/read are unsafe. With this solution I'm going to catch every error and start this function.
Thank you if you can help me to find a solution
You call readFile, but then you immediately terminate the porgram using exit(1). Simply move the exit(1) into the callback function readFileCallback, so your program terminates after the polls are saved.

Returning ffprobe metadata to another function using fluent-ffmpeg

I'm trying to use fluent-ffmpeg's ffprobe to get the metadata of a file and add it to a list, but I want to separate the process of getting the metadata from the method related to checking the file, mostly because the addFileToList() function is quite long as is and the ffprobe routine is quite long as well.
I've tried the following code, but it doesn't give the results I'm expecting:
export default {
// ...
methods: {
getVideoMetadata (file) {
const ffmpeg = require('fluent-ffmpeg')
ffmpeg.ffprobe(file.name, (err, metadata) => {
if (!err) {
console.log(metadata) // this shows the metadata just fine
return metadata
}
})
},
addFileToList (file) {
// file checking routines
console.log(this.getVideoMetadata(file)) // this returns null
item.metadata = this.getVideoMetadata(file)
// item saving routines
}
}
}
I've already tried to nest the getVideoMetadata() routines inside addFileToList(), and it works, but not as intended, because the actions are carried, but not the first time, only the second time. It seems to be an async issue, but I don't know how can I tackle this.
What can I do? Should I stick to my idea of decoupling getVideoMetadata() or should I nest it inside addFileToList() and wrestle with async/await?
It turns out that probing for metadata introduces a race condition, so we should structure the code so that the flow continues after the callback:
const ffmpeg = require('fluent-ffmpeg')
var data
ffmpeg.ffprobe(file.name, (err, metadata) => {
if (!err) {
data = metadata
continueDoingStuff()
}
})

Duplicate Array Data Web Scraping

I can't seem to get the article duplicates out of my web scraper results, this is my code:
app.get("/scrape", function (req, res) {
request("https://www.nytimes.com/", function (error, response, html) {
// Load the HTML into cheerio and save it to a variable
// '$' becomes a shorthand for cheerio's selector commands, much like jQuery's '$'
var $ = cheerio.load(html);
var uniqueResults = [];
// With cheerio, find each p-tag with the "title" class
// (i: iterator. element: the current element)
$("div.collection").each(function (i, element) {
// An empty array to save the data that we'll scrape
var results = [];
// store scraped data in appropriate variables
results.link = $(element).find("a").attr("href");
results.title = $(element).find("a").text();
results.summary = $(element).find("p.summary").text().trim();
// Log the results once you've looped through each of the elements found with cheerio
db.Article.create(results)
.then(function (dbArticle) {
res.json(dbArticle);
}).catch(function (err) {
return res.json(err);
});
});
res.send("You scraped the data successfully.");
});
});
// Route for getting all Articles from the db
app.get("/articles", function (req, res) {
// Grab every document in the Articles collection
db.Article.find()
.then(function (dbArticle) {
res.json(dbArticle);
})
.catch(function (err) {
res.json(err);
});
});
Right now I am getting five copies of each article sent to the user. I have tried db.Article.distinct and various versions of this to filter the results down to only unique articles. Any tips?
In Short:
Switching the var results = [] from an Array to an Object var results = {} did the trick for me. Still haven't figured out the exact reason for the duplicate insertion of documents in database, will update as soon I find out.
Long Story:
You have multiple mistakes and points of improvement there in your code. I will try pointing them out:
Let's follow them first to make your code error free.
Mistakes
1. Although mongoose's model.create, new mongoose() does seem to work fine with Arrays but I haven't seen such a use before and it does not even look appropriate.
If you intend to create documents one after another then represent your documents using an object instead of an Array. Using an array is more mainstream when you intend to create multiple documents at once.
So switch -
var results = [];
to
var results = {};
2. Sending response headers after they are already sent will create for you an error. I don't know if you have already noticed it or not but its pretty much clear upfront as once the error is popped up the remaining documents won't get stored because of PromiseRejection Error if you haven't setup a try/catch block.
The block inside $("div.collection").each(function (i, element) runs asynchronously so your process control won't wait for each document to get processed, instead it would immediately execute res.send("You scraped the data successfully.");.
This will effectively terminate the Http connection between the client and the server and any further issue of response termination calls like res.json(dbArticle) or res.json(err) will throw an error.
So, just comment the res.json statements inside the .create's then and catch methods. This will although terminate the response even before the whole articles are saved in the DB but you need not to worry as your code would still work behind the scene saving articles in database for you (asynchronously).
If you want your response to be terminated only after you have successfully saved the data then change your middleware implementation to -
request('https://www.nytimes.com', (err, response, html) => {
var $ = cheerio.load(html);
var results = [];
$("div.collection").each(function (i, element) {
var ob = {};
ob.link = $(element).find("a").attr("href");
ob.title = $(element).find("a").text();
ob.summary = $(element).find("p.summary").text().trim();
results.push(ob);
});
db.Article.create(results)
.then(function (dbArticles) {
res.json(dbArticles);
}).catch(function (err) {
return res.json(err);
});
});
After making above changes and even after the first one, my version of your code ran fine. So if you want you can continue on with your current version, or you may try reading some points of improvement.
Points of Improvements
1. Era of callbacks is long gone:
Convert your implementation to utilise Promises as they are more maintainable and easier to reason about. Here are the things you can do -
Change request library from request to axios or any one which supports Promises by default.
2. Make effective use of mongoose methods for insertion. You can perform bulk inserts of multiple statements in just one query. You may find docs on creating documents in mongodb quite helpful.
3. Start using some frontend task automation library such as puppeteer or nightmare.js for data scraping related task. Trust me, they make life a hell lot easier than using cheerio or any other library for the same. Their docs are really good and well maintained so you won't have have hard time picking these up.

Node read line by line, process and store

I have the following code in Node.js which reads from a file, line by line. I want to do stuff to each line and store it in an array. The array would then be used in other functions in the same file. The problem I'm running into is the async nature of reading the stream which results in an empty array. The solutions I've come across all seem to rely on modules.
function processLine(file) {
const fs = require('fs');
const readline = require('readline');
const input = fs.createReadStream(file);
const rl = readline.createInterface(input);
const arr = []
rl.on('line', (line) => {
// do stuff to data and store in array
})
// return array;
}
I am aware of being able to store the chunks and operate on the whole file with input.on('end', cb)... However, I feel like this would put too much functionality within the cb. Plus I still can't use its return value since its async. I guess my question is, is there a way to store data being read and use it within the file?
If you would like to process elements like chunks - take a look on
highWaterMark
https://nodejs.org/api/stream.html#stream_types_of_streams
Proably you will be instered in:
objectMode
as well.
Also there are interfaces which you could use while use streams:
Readable
Writable
Duplex
Transform
https://nodejs.org/api/stream.html#stream_transform_transform_chunk_encoding_callback
Where you could use any Promise based function and simply use callback to finish processing element at right point of time:
_transform = function(data, encoding, callback) {
this.push(data);
callback();
};
or
https://nodejs.org/api/stream.html#stream_class_stream_transform
_write(chunk, encoding, callback) {
// ...
}
However there is another solution - rxjs binding for node stream - which you could use while process elements.

How to read an entire text stream in node.js?

In RingoJS there's a function called read which allows you to read an entire stream until the end is reached. This is useful when you're making a command line application. For example you may write a tac program as follows:
#!/usr/bin/env ringo
var string = system.stdin.read(); // read the entire input stream
var lines = string.split("\n"); // split the lines
lines.reverse(); // reverse the lines
var reversed = lines.join("\n"); // join the reversed lines
system.stdout.write(reversed); // write the reversed lines
This allows you to fire up a shell and run the tac command. Then you type in as many lines as you wish to and after you're done you can press Ctrl+D (or Ctrl+Z on Windows) to signal the end of transmission.
I want to do the same thing in node.js but I can't find any function which would do so. I thought of using the readSync function from the fs library to simulate as follows, but to no avail:
fs.readSync(0, buffer, 0, buffer.length, null);
The file descriptor for stdin (the first argument) is 0. So it should read the data from the keyboard. Instead it gives me the following error:
Error: ESPIPE, invalid seek
at Object.fs.readSync (fs.js:381:19)
at repl:1:4
at REPLServer.self.eval (repl.js:109:21)
at rli.on.self.bufferedCmd (repl.js:258:20)
at REPLServer.self.eval (repl.js:116:5)
at Interface.<anonymous> (repl.js:248:12)
at Interface.EventEmitter.emit (events.js:96:17)
at Interface._onLine (readline.js:200:10)
at Interface._line (readline.js:518:8)
at Interface._ttyWrite (readline.js:736:14)
How would you synchronously collect all the data in an input text stream and return it as a string in node.js? A code example would be very helpful.
As node.js is event and stream oriented there is no API to wait until end of stdin and buffer result, but it's easy to do manually
var content = '';
process.stdin.resume();
process.stdin.on('data', function(buf) { content += buf.toString(); });
process.stdin.on('end', function() {
// your code here
console.log(content.split('').reverse().join(''));
});
In most cases it's better not to buffer data and process incoming chunks as they arrive (using chain of already available stream parsers like xml or zlib or your own FSM parser)
The key is to use these two Stream events:
Event: 'data'
Event: 'end'
For stream.on('data', ...) you should collect your data data into either a Buffer (if it is binary) or into a string.
For on('end', ...) you should call a callback with you completed buffer, or if you can inline it and use return using a Promises library.
Let me illustrate StreetStrider's answer.
Here is how to do it with concat-stream
var concat = require('concat-stream');
yourStream.pipe(concat(function(buf){
// buf is a Node Buffer instance which contains the entire data in stream
// if your stream sends textual data, use buf.toString() to get entire stream as string
var streamContent = buf.toString();
doSomething(streamContent);
}));
// error handling is still on stream
yourStream.on('error',function(err){
console.error(err);
});
Please note that process.stdin is a stream.
There is a module for that particular task, called concat-stream.
If you are in async context and have a recent version of Node.js, here is a quick suggestion:
const chunks = []
for await (let chunk of readable) {
chunks.push(chunk)
}
console.log(Buffer.concat(chunks))
On Windows, I had some problems with the other solutions posted here - the program would run indefinitely when there's no input.
Here is a TypeScript implementation for modern NodeJS, using async generators and for await - quite a bit simpler and more robust than using the old callback based APIs, and this worked on Windows:
import process from "process";
/**
* Read everything from standard input and return a string.
*
* (If there is no data available, the Promise is rejected.)
*/
export async function readInput(): Promise<string> {
const { stdin } = process;
const chunks: Uint8Array[] = [];
if (stdin.isTTY) {
throw new Error("No input available");
}
for await (const chunk of stdin) {
chunks.push(chunk);
}
return Buffer.concat(chunks).toString('utf8');
}
Example:
(async () => {
const input = await readInput();
console.log(input);
})();
(consider adding a try/catch, if you want to handle the Promise rejection and display a more user-friendly error-message when there's no input.)

Categories