Node.js WriteStream synchronous - javascript

I'm writing a purely synchronous, single threaded command line program in node.js, which needs to write a single binary file, for which I'm using WriteStream. My usage pattern is along the lines of:
var stream = fs.createWriteStream(file)
stream.write(buf1)
stream.write(buf2)
This seems to work, but the documentation says it's asynchronous and I want to make sure I'm not writing code that works 99% of the time. I don't care exactly when the data gets written as long as it's written in the specified order and no later than when the program exits, and the quantity of data is small so speed and memory consumption are not issues.
I've seen mention of stream.end() but it seems to work without it and I've also seen suggestions that calling it may actually be a bad idea if you're not using callbacks because it might end up getting called before all the data is written.
Is my approach correct (given that I want purely synchronous) or is there anything I need to watch out for?

You can do this; the only problem arises if you create two or more concurrent streams for the same path: the order of writes from different streams will be undefined. By the way, there is a synchronous fs write-stream implementation in Node: fs.SyncWriteStream. It's kind of private and requires an fd as an argument, but if you really want it...
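If you truly want everything synchronous, a minimal sketch using the low-level synchronous fs calls could look like this (assuming file, buf1 and buf2 are as in the question):
const fs = require('fs');

// Open the file, write the buffers in order, and close the descriptor.
// writeSync only returns after the data has been handed off to the OS.
const fd = fs.openSync(file, 'w');
fs.writeSync(fd, buf1);
fs.writeSync(fd, buf2);
fs.closeSync(fd);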

I'm working on a timing-critical API, where a new file has to have been written and its stream completely handled before the next action can be performed. The solution, in my case (and, quite possibly, that of the OP's question) was to use:
writer.on('finish', () => {
    console.error('All writes are now complete.');
});
as per the stream Event: 'finish' documentation.
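Applied to the original pattern, that might look like the following sketch: write the buffers in order, call end(), and only treat the file as complete once 'finish' fires (file, buf1 and buf2 as in the question):
const stream = fs.createWriteStream(file);
stream.write(buf1);
stream.write(buf2);
stream.end();                      // signal that no more data will be written
stream.on('finish', () => {
    // All data has been flushed to the underlying file descriptor.
    console.log('File written in order; safe to exit or continue.');
});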

const writeToLocalDisk = (stream, path) => {
    return new Promise((resolve, reject) => {
        const istream = stream;
        const ostream = fs.createWriteStream(path);
        istream.pipe(ostream);
        istream.on("end", () => {
            console.log(`Fetched ${path} from elsewhere`);
            resolve();
        });
        istream.on("error", (err) => {
            // Note: this resolves rather than rejects, so a failed fetch
            // does not break the sequential flow; adjust if you need the error.
            console.log(JSON.stringify(err, null, 2));
            resolve();
        });
    });
};

// Then use an async function to perform sequential-like operation
async function sequential(stream) {
    const path = "";
    await writeToLocalDisk(stream, path);
    console.log('other operation here');
}

Related

Asynchronous readline loop without async / await

I'd like to be using this function (which runs fine on my laptop), or something like it, on my embedded device. But two problems:
Node.JS on our device is so old that the JavaScript doesn't support async / await. Given our hundreds of units in the field, updating is probably impractical.
Even if that weren't a problem, we have another problem: the file may be tens of megabytes in size, not something that could fit in memory all at once. And this for-await loop will set thousands of asynchronous tasks into motion, more or less all at once.
async function sendOldData(pathname) {
    const reader = ReadLine.createInterface({
        input: fs.createReadStream(pathname),
        crlfDelay: Infinity
    });
    for await (const line of reader) {
        const record = JSON.parse(line);
        sendOldRecord(record);
    }
}

function sendOldRecord(record) {...}
Promises work in this old version. I'm sure there is a graceful syntax for doing this sequentially with Promises:
read one line
massage its data
send that data to our server
sequentially, but asynchronously, so that the JavaScript event loop is not blocked while the data is sent to the server.
Please, could someone suggest the right syntax for doing this in my outdated JavaScript?
Make a queue, so that each completed task takes the next one off the array:
function foo() {
    const reader = [
        '{"foo": 1}',
        '{"foo": 2}',
        '{"foo": 3}',
        '{"foo": 4}',
        '{"foo": 5}',
    ];

    function doNext() {
        if (!reader.length) {
            console.log('done');
            return;
        }
        const line = reader.shift();
        const record = JSON.parse(line);
        sendOldRecord(record, doNext);
    }

    doNext();
}

function sendOldRecord(record, done) {
    console.log(record);
    // whatever your async task is
    setTimeout(function () {
        done();
    }, Math.floor(2000 * Math.random()));
}

foo();
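If you'd rather express the same queue with Promises (as the question asks), here is a minimal sketch, assuming sendOldRecord is rewritten to return a Promise (the sendOldRecordAsync name below is hypothetical):
// Sketch only: a Promise-returning variant of the same queue.
// sendOldRecordAsync(record) is assumed to resolve once the record
// has been sent to the server.
function sendAllSequentially(lines) {
    function doNext() {
        if (!lines.length) {
            return Promise.resolve();                         // queue drained
        }
        const record = JSON.parse(lines.shift());
        return sendOldRecordAsync(record).then(doNext);       // recurse after each send
    }
    return doNext();
}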
The Problem
Streams and asynchronous processing are somewhat of a pain to get working well together while handling all possible error conditions, and things are even worse for the readline module. Since you seem to be saying that you can't use the for await () construct on the readable (which, even when it is supported, has various issues), things are a bit more complicated still.
The main problem with readline.createInterface() on a stream is that it reads a chunk of the file, parses that chunk for full lines and then synchronously sends all the lines in a tight for loop.
You can literally see the code here:
for (let n = 0; n < lines.length; n++) this[kOnLine](lines[n]);
The implementation of kOnLine does this:
this.emit('line', line);
So, this is a tight for loop that emits all the lines it has read. If you try to do something asynchronous in response to the line event, then the moment you hit an await or an asynchronous callback, this readline code will send the next line event before you're done processing the previous one. That makes it a pain to process the line events asynchronously in sequential order, where you finish the asynchronous processing of one line before starting on the next. IMO, this is a very busted design, as it only really works with synchronous processing. You will also notice that this for loop doesn't care whether the readline object was paused; it just pumps out all the lines it has without regard for anything.
Discussion of Possible Solutions
So, what to do about that. Some part of a fix for this is in the asynchronous iterator interface to readline (but it has other problems which I've filed bugs on). But, the supposition of your question seems to be that you can't use that asynchronous iterator interface because your device may have an older version of nodejs. If that's the case, then I only know of two options:
Option #1: Ditch the readline.createInterface() functionality entirely and either use a 3rd-party module or do your own line-boundary processing (a rough sketch follows below).
Option #2: Cover the line event with your own code that supports asynchronous processing of lines without receiving the next line while you're still processing the previous one.
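For completeness, option #1 might look roughly like this minimal sketch of manual line splitting with pause/resume; handleLine is a hypothetical async callback (returning a Promise) that you would supply:
import fs from 'fs';

// Sketch only: split incoming chunks on newlines, pause the stream while
// lines from the current chunk are processed sequentially, then resume.
function processFileByLine(pathname, handleLine) {
    return new Promise((resolve, reject) => {
        const input = fs.createReadStream(pathname, 'utf8');
        let leftover = '';
        input.on('data', chunk => {
            input.pause();                                   // stop reading while we work
            const pieces = (leftover + chunk).split('\n');
            leftover = pieces.pop();                         // keep the partial last line
            pieces.reduce((p, line) => p.then(() => handleLine(line)), Promise.resolve())
                .then(() => input.resume())
                .catch(reject);
        });
        input.on('end', () => {
            const tail = leftover ? handleLine(leftover) : Promise.resolve();
            Promise.resolve(tail).then(resolve).catch(reject);
        });
        input.on('error', reject);
    });
}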
A Solution
I've written an implementation for option #2, covering the line event with your own code. In my implementation, we just acknowledge that line events will arrive during our asynchronous processing of previous lines, but instead of notifying you about them, the input stream gets paused and these "early" lines get queued. With this solution, the readline code will read a chunk of data from the input stream, parse it into full lines, and synchronously send all the line events for those full lines. But upon receipt of the first line event, we pause the input stream and start queueing subsequent line events. So, you can asynchronously process a line, and you won't get another one until you ask for the next line.
This code has a different way of communicating incoming lines to your code. Since we're in the age of promises for asynchronous code, I've added a promise-based reader.getNextLine() function to the reader object.
This lets you write code like this:
import fs from 'fs';

async function run(filename) {
    let reader = createLineReader({
        input: fs.createReadStream(filename),
        crlfDelay: Infinity
    });
    let line;
    let cntr = 0;
    while ((line = await reader.getNextLine()) !== null) {
        // simulate some asynchronous operation in the processing of the line
        console.log(`${++cntr}: ${line}`);
        await processLine(line);
    }
}

run("temp.txt").then(result => {
    console.log("done");
}).catch(err => {
    console.log(err);
});
And, here's the implementation of createLineReader():
import * as ReadLine from 'readline';

function createLineReader(options) {
    const stream = options.input;
    const reader = ReadLine.createInterface(options);

    // state machine variables
    let latchedErr = null;
    let isPaused = false;
    let readerClosed = false;
    const queuedLines = [];

    // resolves with line
    // resolves with null if no more lines
    // rejects with error
    reader.getNextLine = async function() {
        if (latchedErr) {
            // once we get an error, we're done
            throw latchedErr;
        } else if (queuedLines.length) {
            // if something in the queue, return the oldest from the queue
            const line = queuedLines.shift();
            if (queuedLines.length === 0 && isPaused) {
                reader.resume();
            }
            return line;
        } else if (readerClosed) {
            // if nothing in the queue and the reader is closed, then signify end of data
            return null;
        } else {
            // waiting for more line data to arrive
            return new Promise((resolve, reject) => {
                function clear() {
                    reader.off('error', errorListener);
                    reader.off('queued', queuedListener);
                    reader.off('done', doneListener);
                }

                function queuedListener() {
                    clear();
                    resolve(queuedLines.shift());
                }

                function errorListener(e) {
                    clear();
                    reject(e);
                }

                function doneListener() {
                    clear();
                    resolve(null);
                }

                reader.once('queued', queuedListener);
                reader.once('error', errorListener);
                reader.once('done', doneListener);
            });
        }
    };

    reader.on('pause', () => {
        isPaused = true;
    }).on('resume', () => {
        isPaused = false;
    }).on('line', line => {
        queuedLines.push(line);
        if (!isPaused) {
            reader.pause();
        }
        // tell any queue listener that something was just added to the queue
        reader.emit('queued');
    }).on('close', () => {
        readerClosed = true;
        if (queuedLines.length === 0) {
            reader.emit('done');
        }
    });

    return reader;
}
Explanation
Internally, the implementation takes each new line event and puts it into a queue. Then, reader.getNextLine() just pulls items from the queue or waits (with a promise) for the queue to get something put in it.
During operation, the readline object will get a chunk of data from your readstream, it will parse that into whole lines. The whole lines will all get added to the queue (via line events). The readstream will be paused so it won't generate any more lines until the queue has been drained.
When the queue becomes empty, the readstream will be resumed so it can send more data to the reader object.
This is scalable to very large files because it only queues the whole lines found in one chunk of the file being read. Once those lines are queued, the input stream is paused so it won't put more into the queue. After the queue is drained, the input stream is resumed so it can send more data, and the cycle repeats...
Any errors in the readstream will trigger an error event on the readline object which will either reject a reader.getNextLine() that is already waiting for the next line or will reject the next time reader.getNextLine() is called.
Disclaimers
This has only been tested with file-based readstreams.
I would not recommend having more than one reader.getNextLine() call in flight at once, as this code does not anticipate that, and it's not even clear what that should do.
Basically, you can achieve this using a functional approach:
const arrayOfValues = [1, 2, 3, 4, 5];

const chainOfPromises = arrayOfValues.reduce((acc, item) => {
    return acc.then((result) => {
        // Here you can add your logic for parsing/sending the request
        // And here you are chaining the next promise request
        return yourAsyncFunction(item);
    });
}, Promise.resolve());

// Basically this will do
// Promise.resolve().then(_ => yourAsyncFunction(1)).then(_ => yourAsyncFunction(2)) and so on...

// Start
chainOfPromises.then();
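Applied to the original problem, the same pattern might look like this sketch, under the assumption that sendOldRecord is changed to return a Promise and that lines is an array of JSON strings (both assumptions, not the OP's actual signatures):
// Sketch: chain one send per line, sequentially.
const sendAll = lines.reduce(
    (acc, line) => acc.then(() => sendOldRecord(JSON.parse(line))),
    Promise.resolve()
);

sendAll.then(() => console.log('all records sent'));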

Unit test handler that contains promise

I have TDD knowledge, and I've been trying to start a project in javascript applying the same principles.
I am building an API that, once hit, fires a request to an external service in order to gather some data, and once retrieved, parses it and returns the response.
So far, I've been unlucky with my saga, I've searched a lot and the most similar issue on SO I've found is this one. But I've had no success in order to apply the same solution.
On the implementation side, my handler is as follows:
//... other handlers
weather(request, response) {
    //... some setup and other calls
    this.externalService.get(externalURL)
        .then(serviceResponse => {
            this.externalResponseParser.parse(serviceResponse)
        });
    //... some more code
}
And on the test side:
let requester;
let mockParser;
let handler;
let promiseResolve;
let promiseReject;

beforeEach(function () {
    requester = new externalRequesterService();
    mockParser = sinon.createStubInstance(...);
    handler = new someHandler({
        externalService: requester,
        externalResponseParser: mockParser
    });
});

it('returns data', function () {
    sinon.stub(requester, 'get').callsFake((url) => {
        return new Promise((resolve, reject) => {
            // so I can be able to handle when the promise gets resolved
            promiseResolve = resolve;
            promiseReject = reject;
        });
    });

    handler.weather(request, response);
    // assertions of what happens before externalService.get gets called - all green

    promiseResolve(expectedServiceResponse);
    assert.ok(mockParser.parse.calledOnce, 'Expected externalResponseParser.parse to have been called once');
});
In the last line of the test, it fails, even though I am calling what I am supposed to.
At some point I added some logging, and I was able to see that the code of the then block seems to get executed after the assertion in the test, which might be the source of the problem.
I've tried to find out if there is some sort of eventually that could be used, so my assertion after resolving the promise would be something like:
assert.eventually(mockParser.parse.calledOnce, 'Expected externalResponseParser.parse to eventually be called once');
but no luck.
Does anyone have some clear explanation of what is missing? Many thanks in advance
P.S.- As per request, please find a stripped down version of my code here. Just run npm install, followed by npm test in order to get the same output.
Thank you all for the time spent.
I ended up finding this very good article on Medium which allowed me to solve my issue. It has a very nice explanation of moving from callbacks to promises, which was exactly the scenario I had on my hands.
I have updated the github repository I created on how to fix it.
If you want to TL;DR here goes just the highlight of the changes. Implementation side:
async weather(request, response) {
    //...
    let serviceResponse = await this.requesterService.get(weatherURL);
    //...
};
And on the test side:
it('returns weather on success', async function () {
    sinon.stub(requester, 'get').callsFake((_) => {
        return new Promise((resolve, _) => {
            resolve(expectedServiceResponse);
        });
    });

    await handler.weather(request, response);
    //...
    assert.ok(mockWeatherParser.parseWeather.calledOnce, 'Expected WeatherParser.parseWeather to have been called once'); // no longer fails here
    //...
});
Now keep in mind that in this example it is still synchronous. However, I evolved my API step by step, and after migrating to this synchronous version using Promises, it was much easier to migrate to an async version, both in terms of testing and implementation.
If you have the same issue and need help or have questions, let me know. I'll be happy to help.

Functional Programming and async/promises

I'm refactoring some old node modules into a more functional style. I'm like a second year freshman when it comes to FP :) Where I keep getting hung up is handling large async flows. Here is an example where I'm making a request to a db and then caching the response:
// Some external xhr/promise lib
const fetchFromDb = make => {
    return new Promise(resolve => {
        console.log('Simulate async db request...'); // just simulating an async request/response here.
        setTimeout(() => {
            console.log('Simulate db response...');
            resolve({ make: 'toyota', data: 'stuff' });
        }, 100);
    });
};

// memoized fn
// this caches the response to getCarData(x) so that whenever it is invoked with 'x' again, the same response gets returned.
const getCarData = R.memoizeWith(R.identity, (carMake, response) => response.data);

// Is this function pure? Or is it setting something outside the scope (i.e., getCarData)?
const getCarDataFromDb = (carMake) => {
    return fetchFromDb(carMake).then(getCarData.bind(null, carMake));
    // Note: This return statement is essentially the same as:
    // return fetchFromDb(carMake).then(result => getCarData(carMake, result));
};

// Initialize the request for 'toyota' data
const toyota = getCarDataFromDb('toyota'); // must be called no matter what

// Approach #1 - Just rely on thenable
console.log(`Value of toyota is: ${toyota.toString()}`);
toyota.then(d => console.log(`Value in thenable: ${d}`)); // -> Value in thenable: stuff

// Approach #2 - Just make sure you do not call this fn before db response.
setTimeout(() => {
    const car = getCarData('toyota'); // so nice!
    console.log(`later, car is: ${car}`); // -> 'later, car is: stuff'
}, 200);

<script src="https://cdnjs.cloudflare.com/ajax/libs/ramda/0.25.0/ramda.min.js"></script>
I really like memoization for caching large JSON objects and other computed properties. But with a lot of asynchronous requests whose responses are dependent on each other for doing work, I'm having trouble keeping track of what information I have and when. I want to get away from using promises so heavily to manage flow. It's a node app, so making things synchronous to ensure availability was blocking the event loop and really affecting performance.
I prefer approach #2, where I can get the car data simply with getCarData('toyota'). But the downside is that I have to be sure that the response has already been returned. With approach #1 I'll always have to use a thenable which alleviates the issue with approach #2 but introduces its own problems.
Questions:
Is getCarDataFromDb a pure function as it is written above? If not, how is that not a side-effect?
Is using memoization in this way an FP anti-pattern? That is, calling it from a thenable with the response so that future invocations of that same method return the cached value?
Question 1
It's almost a philosophical question here as to whether there are side-effects here. Calling it does update the memoization cache. But that itself has no observable side-effects. So I would say that this is effectively pure.
Update: a comment pointed out that as this calls IO, it can never be pure. That is correct. But that's the essence of this behavior. It's not meaningful as a pure function. My answer above is only about side-effects, and not about purity.
Question 2
I can't speak for the whole FP community, but I can tell you that the Ramda team (disclaimer: I'm a Ramda author) prefers to avoid Promises, preferring more lawful types such as Futures or Tasks. But the same questions you have here would be in play with those types substituted for Promises. (More on these issues below.)
In General
There is a central point here: if you're doing asynchronous programming, it will spread to every bit of the application that touches it. There is nothing you can do that changes this basic fact. Using Promises/Tasks/Futures helps avoid some of the boilerplate of callback-based code, but it requires you to put the post-response/rejection code inside a then/map function. Using async/await helps you avoid some of the boilerplate of Promise-based code, but it requires you to put the post-response/rejection code inside async functions. And if one day we layer something else on top of async/await, it will likely have the same characteristics.
(While I would suggest that you look at Futures or Tasks instead of Promises, below I will only discuss Promises. The same ideas should apply regardless.)
My suggestion
If you're going to memoize anything, memoize the resulting Promises.
However you deal with your asynchrony, you will have to put the code that depends on the result of an asynchronous call into a function. I assume that the setTimeout of your second approach was just for demonstration purposes: using a timeout to wait for a DB result over the network is extremely error-prone. But even with setTimeout, the rest of your code is running from within the setTimeout callback.
So rather than trying to separate the cases for when your data has already been cached and when it hasn't, simply use the same technique everywhere: myPromise.then(... my code ... ). That could look something like this:
// getCarData :: String -> Promise AutoInfo
const getCarData = R.memoizeWith(R.identity, make => new Promise(resolve => {
    console.log('Simulate async db request...')
    setTimeout(() => {
        console.log('Simulate db response...')
        resolve({ make: 'toyota', data: 'stuff' });
    }, 100)
}))

getCarData('toyota').then(carData => {
    console.log('now we can go', carData)
    // any code which depends on carData
})

// later
getCarData('toyota').then(carData => {
    console.log('now it is cached', carData)
})

<script src="//cdnjs.cloudflare.com/ajax/libs/ramda/0.25.0/ramda.min.js"></script>
In this approach, whenever you need car data, you call getCarData(make). Only the first time will it actually call the server. After that, the Promise is served out of the cache. But you use the same structures everywhere to deal with it.
I only see one reasonable alternative. I couldn't tell whether your discussion about having to wait for the data before making the remaining calls means that you would be able to pre-fetch your data. If that's the case, then there is one additional possibility, one which would allow you to skip the memoization as well:
// getCarData :: String -> Promise AutoInfo
const getCarData = make => new Promise(resolve => {
    console.log('Simulate async db request...')
    setTimeout(() => {
        console.log('Simulate db response...')
        resolve({ make: 'toyota', data: 'stuff' });
    }, 100)
})

const makes = ['toyota', 'ford', 'audi']

Promise.all(makes.map(getCarData)).then(allAutoInfo => {
    const autos = R.zipObj(makes, allAutoInfo)
    console.log('cooking with gas', autos)
    // remainder of app that depends on auto data here
})

<script src="//cdnjs.cloudflare.com/ajax/libs/ramda/0.25.0/ramda.min.js"></script>
But this one means that nothing will be available until all your data has been fetched. That may or may not be all right with you, depending on all sorts of factors. And for many situations, it's not even remotely possible or desirable. But it is possible that yours is one where it is helpful.
One technical point about your code:
const getCarDataFromDb = (carMake) => {
return fetchFromDb(carMake).then(getCarData.bind(null, carMake));
};
Is there any reason to use getCarData.bind(null, carMake) instead of () => getCarData(carMake)? The latter seems much more readable.
Is getCarDataFromDb a pure function as it is written above?
No. Pretty much anything that uses I/O is impure. The data in the DB could change, the request could fail, so it doesn't give any reliable guarantee that it will return consistent values.
Is using memoization in this way an FP anti-pattern? That is, calling it from a thenable with the response so that future invocations of that same method return the cached value?
It's definitely an asynchrony antipattern. In your approach #2 you are creating a race condition where the operation will succeed if the DB query completes in less than 200 ms, and fail if it takes longer than that. You've labeled a line in your code "so nice!" because you're able to retrieve data synchronously. That suggests to me that you're looking for a way to skirt the issue of asynchrony rather than facing it head-on.
The way you're using bind and "tricking" memoizeWith into storing the value you're passing into it after the fact also looks very awkward and unnatural.
It is possible to take advantage of caching and still use asynchrony in a more reliable way.
For example:
// Some external xhr/promise lib
const fetchFromDb = make => {
return new Promise(resolve => {
console.log('Simulate async db request...')
setTimeout(() => {
console.log('Simulate db response...')
resolve({ make: 'toyota', data: 'stuff' });
}, 2000);
});
};
const getCarDataFromDb = R.memoizeWith(R.identity, fetchFromDb);
// Initialize the request for 'toyota' data
const toyota = getCarDataFromDb('toyota'); // must be called no matter what
// Finishes after two seconds
toyota.then(d => console.log(`Value in thenable: ${d.data}`));
// Wait for 5 seconds before getting Toyota data again.
// This time, there is no 2-second wait before the data comes back.
setTimeout(() => {
console.log('About to get Toyota data again');
getCarDataFromDb('toyota').then(d => console.log(`Value in thenable: ${d.data}`));
}, 5000);
<script src="https://cdnjs.cloudflare.com/ajax/libs/ramda/0.25.0/ramda.min.js"></script>
The one potential pitfall here is that if a request should fail, you'll be stuck with a rejected promise in your cache. I'm not sure what would be the best way to address that, but you'd surely need some way of invalidating that part of the cache or implementing some sort of retry logic somewhere.
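One way to address that, sketched here with a plain Map instead of R.memoizeWith (since Ramda's memoization does not expose cache eviction), is to drop a rejected promise from the cache so a later call can retry:
// Sketch: memoize the promise, but evict it from the cache if it rejects.
const carCache = new Map();

const getCarDataFromDb = make => {
    if (!carCache.has(make)) {
        const promise = fetchFromDb(make).catch(err => {
            carCache.delete(make);   // invalidate on failure
            throw err;               // re-throw so callers still see the error
        });
        carCache.set(make, promise);
    }
    return carCache.get(make);
};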

Best way to calling API inside for loop using Promises

I have 500 million objects, each of which has n number of contacts, as below
var groupsArray = [
    {'G1': ['C1','C2','C3'....]},
    {'G2': ['D1','D2','D3'....]}
    ...
    {'G2000': ['D2001','D2002','D2003'....]}
    ...
]
I have two implementations in Node.js, one based on regular promises and another using Bluebird, as shown below
Regular promises
...
var groupsArray = [
    {'G1': ['C1','C2','C3']},
    {'G2': ['D1','D2','D3']}
]

function ajax(url) {
    return new Promise(function(resolve, reject) {
        request.get(url, {json: true}, function(error, data) {
            if (error) {
                reject(error);
            } else {
                resolve(data);
            }
        });
    });
}

_.each(groupsArray, function(groupData) {
    _.each(groupData, function(contactlists, groupIndex) {
        // console.log(groupIndex)
        _.each(contactlists, function(contactData) {
            ajax('http://localhost:3001/api/getcontactdata/' + groupIndex + '/' + contactData).then(function(result) {
                console.log(result.body);
                // Code depending on result
            }).catch(function() {
                // An error occurred
            });
        })
    })
})
...
Using the Bluebird way, I have used the concurrency option to see how to control the queue of promises
...
var groups = [];

_.each(groupsArray, function(groupData) {
    _.each(groupData, function(contactlists, groupIndex) {
        var contacts = [];
        // console.log(groupIndex)
        _.each(contactlists, function(contactData) {
            contacts.push({
                contact_name: 'Contact ' + contactData
            });
        })
        groups.push({
            task_name: 'Group ' + groupIndex,
            contacts: contacts
        });
    })
})

Promise.each(groups, group =>
    Promise.map(group.contacts, contact => new Promise((resolve, reject) => {
        /*setTimeout(() =>
            resolve(group.task_name + ' ' + contact.contact_name), 1000);*/
        request.get('http://localhost:3001/api/getcontactdata/' + group.task_name + '/' + contact.contact_name, {json: true}, function(error, data) {
            if (error) {
                reject(error);
            } else {
                resolve(data);
            }
        });
    }).then(log => console.log(log.body)), {
        concurrency: 50
    }).then(() => console.log())
).then(() => {
    console.log('All Done!!');
});
...
I want to know, when dealing with hundreds of millions of API calls inside a loop using promises, the best way to call the API asynchronously and deal with the responses later. Please advise.
My answer using regular Node.js promises (this can probably easily be adapted to Bluebird or another library).
You could fire off all Promises at once using Promise.all:
var groupsArray = [
    {'G1': ['C1','C2','C3']},
    {'G2': ['D1','D2','D3']}
];

function ajax(url) {
    return new Promise(function(resolve, reject) {
        request.get(url, {json: true}, function(error, data) {
            if (error) {
                reject(error);
            } else {
                resolve(data);
            }
        });
    });
}

Promise.all(groupsArray.map(group => ajax("your-url-here")))
    .then(results => {
        // Code that depends on all results.
    })
    .catch(err => {
        // Handle the error.
    });
Using Promise.all attempts to run all your requests in parallel. This probably won't work well when you have 500 million requests to make all being attempted at the same time!
A more effective way to do it is to use the JavaScript reduce function to sequence your requests one after the other:
// ... Setup as before ...

const results = [];

groupsArray.reduce((prevPromise, group) => {
    return prevPromise.then(() => {
        return ajax("your-url-here")
            .then(result => {
                // Process a single result if necessary.
                results.push(result); // Collect your results.
            });
    });
}, Promise.resolve() /* Seed promise. */)
.then(() => {
    // Code that depends on all results.
})
.catch(err => {
    // Handle the error.
});
This example chains together the promises so that the next one only starts once the previous one completes.
Unfortunately the sequencing approach will be very slow because it has to wait until each request has completed before starting a new one. Whilst each request is in progress (it takes time to make an API request) your CPU is sitting idle whereas it could be working on another request!
A more efficient, but more complicated, approach to this problem is to use a combination of the two approaches above. You should batch your requests so that the requests in each batch (of, say, 10) are executed in parallel, and the batches themselves are sequenced one after the other, as sketched below.
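A rough, hand-rolled sketch of that batching idea (the runInBatches helper and chunkSize parameter are hypothetical names, not part of any library) could look like this:
// Sketch only: split the work into batches, run each batch in parallel with
// Promise.all, and sequence the batches one after another with reduce.
function runInBatches(items, chunkSize, makeRequest) {
    const batches = [];
    for (let i = 0; i < items.length; i += chunkSize) {
        batches.push(items.slice(i, i + chunkSize));
    }
    const allResults = [];
    return batches.reduce((prev, batch) =>
        prev.then(() =>
            Promise.all(batch.map(makeRequest))
                .then(batchResults => { allResults.push(...batchResults); })),
        Promise.resolve()
    ).then(() => allResults);
}

// e.g. runInBatches(groupsArray, 10, group => ajax("your-url-here"))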
It's tricky to get this exactly right yourself using a combination of Promise.all and the reduce function (although it's a great learning exercise), so I'd suggest using the library async-await-parallel. There are a bunch of such libraries, but I use this one, it works well, and it easily does the job you want.
You can install the library like this:
npm install --save async-await-parallel
Here's how you would use it:
const parallel = require("async-await-parallel");

// ... Setup as before ...

const batchSize = 10;

parallel(groupsArray.map(group => {
    return () => { // We need to return a 'thunk' function, so that the jobs can be started when they are needed, rather than all at once.
        return ajax("your-url-here");
    }
}), batchSize)
    .then(() => {
        // Code that depends on all results.
    })
    .catch(err => {
        // Handle the error.
    });
This is better, but it's still a clunky way to make such a large amount of requests! Maybe you need to up the ante and consider investing time in proper asynchronous job management.
I've been using Kue lately for managing a cluster of worker processes. Using Kue with the Node.js cluster library allows you to get proper parallelism happening on a multi-core PC, and you can then easily extend it to multiple cloud-based VMs if you need even more grunt.
See my answer here for some Kue example code.
In my opinion you have two problems coupled in one question - I'd decouple them.
#1 Loading of a large dataset
Operating on such a large dataset (500M records) will surely cause memory-limit issues sooner or later - node.js runs in a single thread and is limited to using approximately 1.5GB of memory - after that your process will crash.
In order to avoid that you could be reading your data as a stream from a CSV - I'll use scramjet as it'll help us with the second problem, but JSONStream or papaparse would do pretty well too:
$ npm install --save scramjet
Then let's read the data - I'd assume from a CSV:
const {StringStream} = require("scramjet");

const stream = require("fs")
    .createReadStream(pathToFile)
    .pipe(new StringStream('utf-8'))
    .csvParse()
Now we have a stream of objects that will return the data line by line, but only if we read it. Solved problem #1, now to "augment" the stream:
#2 Stream data asynchronous augmentation
No worries - that's just what you do - for every line of data you want to fetch some additional info (so augment) from some API, which by default is asynchronous.
That's where scramjet kicks in with just couple additional lines:
stream
    .flatMap(groupData => Object.entries(groupData))
    .flatMap(([groupIndex, contactList]) => contactList.map(contactData => ([contactData, groupIndex])))
    // now you have a simple stream of entries for your call
    .map(([contactData, groupIndex]) => ajax('http://localhost:3001/api/getcontactdata/' + groupIndex + '/' + contactData))
    // and here you can print or do anything you like with your data stream
    .each(console.log)
After this you'd need to accumulate the data or output it to a stream - there are a number of options - for example: .toJSONArray().pipe(fileStream).
Using scramjet you are able to split the processing into multiple steps without much impact on performance. Using setOptions({maxParallel: 32}) you can control concurrency and, best of all, this will all run with a minimal memory footprint - much, much faster than if you were to load the whole dataset into memory.
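For illustration only, a rough composition of the calls mentioned above might look like this; the exact placement of setOptions and toJSONArray is an assumption and should be checked against scramjet's documentation (fileStream is a writable stream you would provide):
// Unverified sketch combining the calls discussed above.
stream
    .setOptions({maxParallel: 32})   // limit how many API calls run at once
    .flatMap(groupData => Object.entries(groupData))
    .flatMap(([groupIndex, contactList]) => contactList.map(contactData => ([contactData, groupIndex])))
    .map(([contactData, groupIndex]) => ajax('http://localhost:3001/api/getcontactdata/' + groupIndex + '/' + contactData))
    .toJSONArray()                   // accumulate the results as a JSON array
    .pipe(fileStream);               // write the accumulated output somewhere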
Let me know if this is helpful - your question is quite complex, so let me know if you run into any problems - I'll be happy to help. :)

Do something before consuming stream, using highland.js

I'm trying to code a writable stream which takes a stream of objects and inputs them into a mongodb database. Before consuming the stream of objects, I first need to wait for the db-connection to establish, but I seem to be doing something wrong, because the program never gets to the insertion-part.
// ./mongowriter.js
let mongo = mongodb.MongoClient,
    connectToDb = _.wrapCallback(mongo.connect);

export default url => _.pipeline(s => {
    return connectToDb(url).flatMap(db => {
        console.log('Connection established!');
        return s.flatMap(x => /* insert x into db */);
    });
});
....
// Usage in other file
import mongowriter from './mongowriter.js';

let objStream = _([/* json objects */]);
objStream.pipe(mongowriter);
The program just quits without "Connection established!" ever being written to the console.
What am I missing? Is there some kind of idiom I should be following?
By reading the source and some general experimentation, I figured out how to do a single asynchronous thing and then continue processing through the stream. Basically, you use flatMap to replace the event from the asynchronous task with the stream you actually want to process.
Another quirk I didn't expect, and which was throwing me off, was that _.pipeline won't work unless the original stream is fully consumed in the callback. That's why it won't work to simply put in a _.map and log stuff (which was how I tried to debug it). Instead, one needs to make sure there is an each or done at the end. Below is a minimal example:
export default _ => _.pipeline(stream => {
    return _(promiseReturningFunction())
        .tap(_ => process.stdout.write('.'))
        .flatMap(_ => stream)
        .each(_ => process.stdout.write('-'));
});

// Will produce something like the following when called with a non-empty stream.
// Note the lone '.' in the beginning.
// => .-------------------
Basically, a '.' is output when the async function is done, and a '-' for every object of the stream.
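Applied back to the original mongowriter, that would look roughly like the sketch below; the insertOne call and the 'items' collection name are assumptions about the intended Mongo usage, and connectToDb and _ are as in the question:
// Sketch: wait for the connection, then fully consume the piped stream,
// inserting each object into the database.
export default url => _.pipeline(s => {
    return connectToDb(url).flatMap(db => {
        console.log('Connection established!');
        return s
            .flatMap(x => _(db.collection('items').insertOne(x)))  // 'items' is a placeholder collection name
            .each(result => process.stdout.write('.'));            // ensure the stream is fully consumed
    });
});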
Hopefully, this saves someone some time. Took embarrassingly long for me to figure this out. ^^
