I have 500 million objects, each of which has n number of contacts, like below:
var groupsArray = [
{'G1': ['C1','C2','C3'....]},
{'G2': ['D1','D2','D3'....]}
...
{'G2000': ['D2001','D2002','D2003'....]}
...
]
I have two implementations in Node.js, one based on regular promises and another using Bluebird, as shown below.
Regular promises
...
var groupsArray = [
{'G1': ['C1','C2','C3']},
{'G2': ['D1','D2','D3']}
]
var request = require('request');
var _ = require('lodash');

function ajax(url) {
return new Promise(function(resolve, reject) {
request.get(url,{json: true}, function(error, data) {
if (error) {
reject(error);
} else {
resolve(data);
}
});
});
}
_.each(groupsArray,function(groupData){
_.each(groupData,function(contactlists,groupIndex){
// console.log(groupIndex)
_.each(contactlists,function(contactData){
ajax('http://localhost:3001/api/getcontactdata/'+groupIndex+'/'+contactData).then(function(result) {
console.log(result.body);
// Code depending on result
}).catch(function() {
// An error occurred
});
})
})
})
...
In the Bluebird way I have used the concurrency option to control the queue of promises:
...
var Promise = require('bluebird');
var groups = [];
_.each(groupsArray,function(groupData){
_.each(groupData,function(contactlists,groupIndex){
var contacts = [];
// console.log(groupIndex)
_.each(contactlists,function(contactData){
contacts.push({
contact_name: 'Contact ' + contactData
});
})
groups.push({
task_name: 'Group ' + groupIndex,
contacts: contacts
});
})
})
Promise.each(groups, group =>
Promise.map(group.contacts,
contact => new Promise((resolve, reject) => {
/*setTimeout(() =>
resolve(group.task_name + ' ' + contact.contact_name), 1000);*/
request.get('http://localhost:3001/api/getcontactdata/'+group.task_name+'/'+contact.contact_name,{json: true}, function(error, data) {
if (error) {
reject(error);
} else {
resolve(data);
}
});
}).then(log => console.log(log.body)),
{
concurrency: 50
}).then(() => console.log())).then(() => {
console.log('All Done!!');
});
...
I want to know how to deal with 100 million API calls inside a loop using promises. Please advise the best way to call the API asynchronously and handle the responses later.
My answer uses regular Node.js promises (this can probably easily be adapted to Bluebird or another library).
You could fire off all Promises at once using Promise.all:
var groupsArray = [
{'G1': ['C1','C2','C3']},
{'G2': ['D1','D2','D3']}
];
function ajax(url) {
return new Promise(function(resolve, reject) {
request.get(url,{json: true}, function(error, data) {
if (error) {
reject(error);
} else {
resolve(data);
}
});
});
}
Promise.all(groupsArray.map(group => ajax("your-url-here")))
.then(results => {
// Code that depends on all results.
})
.catch(err => {
// Handle the error.
});
Using Promise.all attempts to run all your requests in parallel. This probably won't work well when you have 500 million requests to make, all attempted at the same time!
A more effective way to do it is to use the JavaScript reduce function to sequence your requests one after the other:
// ... Setup as before ...
const results = [];
groupsArray.reduce((prevPromise, group) => {
return prevPromise.then(() => {
return ajax("your-url-here")
.then(result => {
// Process a single result if necessary.
results.push(result); // Collect your results.
});
});
},
Promise.resolve() // Seed promise.
)
.then(() => {
// Code that depends on all results.
})
.catch(err => {
// Handle the error.
});
This example chains together the promises so that the next one only starts once the previous one completes.
Unfortunately the sequencing approach will be very slow because it has to wait until each request has completed before starting a new one. Whilst each request is in progress (it takes time to make an API request) your CPU is sitting idle whereas it could be working on another request!
A more efficient, but complicated approach to this problem is to use a combination of the above approaches. You should batch your requests so that the requests in each batch (of say 10) are executed in parallel and then the batches are sequenced one after the other.
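As a rough illustration only, that hand-rolled batching could combine the two previous approaches (reusing the ajax helper and groupsArray from above) roughly like this:
const batches = [];
for (let i = 0; i < groupsArray.length; i += 10) {
    batches.push(groupsArray.slice(i, i + 10)); // split the input into batches of 10
}
batches.reduce((prevPromise, batch) => {
    // Wait for the previous batch to finish, then run this batch in parallel.
    return prevPromise.then(() =>
        Promise.all(batch.map(group => ajax("your-url-here")))
    );
}, Promise.resolve())
.then(() => {
    // Code that depends on all results.
})
.catch(err => {
    // Handle the error.
});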
It's tricky to get all the details right yourself using a combination of Promise.all and the reduce function (a rough sketch is shown above) - although it's a great learning exercise - so I'd suggest using the library async-await-parallel. There's a bunch of such libraries, but I use this one; it works well and easily does the job you want.
You can install the library like this:
npm install --save async-await-parallel
Here's how you would use it:
const parallel = require("async-await-parallel");
// ... Setup as before ...
const batchSize = 10;
parallel(groupsArray.map(group => {
return () => { // We need to return a 'thunk' function, so that the jobs can be started when they are needed, rather than all at once.
return ajax("your-url-here");
};
}), batchSize)
.then(() => {
// Code that depends on all results.
})
.catch(err => {
// Handle the error.
});
This is better, but it's still a clunky way to make such a large number of requests! Maybe you need to up the ante and consider investing time in proper asynchronous job management.
I've been using Kue lately for managing a cluster of worker processes. Using Kue with the Node.js cluster library allows you to get proper parallelism happening on a multi-core PC, and you can then easily extend it to multiple cloud-based VMs if you need even more grunt.
See my answer here for some Kue example code.
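For orientation, a minimal sketch of what that might look like (based on Kue's basic createQueue/create/process API, not the linked answer, and assuming a local Redis server):
const kue = require('kue');
const queue = kue.createQueue(); // connects to a local Redis instance by default

// Producer: enqueue one job per contact.
queue.create('fetch-contact', { groupIndex: 'G1', contactData: 'C1' }).save();

// Worker: process up to 50 jobs at a time.
queue.process('fetch-contact', 50, (job, done) => {
    ajax('http://localhost:3001/api/getcontactdata/' + job.data.groupIndex + '/' + job.data.contactData)
        .then(result => done(null, result.body))
        .catch(done);
});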
In my opinion you have two problems coupled in one question - I'd decouple them.
#1 Loading of a large dataset
Operating on such a large dataset (500M records) will surely cause memory limit issues sooner or later - node.js runs in a single thread and is limited to roughly 1.5GB of memory by default - after that your process will crash.
In order to avoid that you could read your data as a stream from a CSV - I'll use scramjet since it'll help with the second problem, but JSONStream or papaparse would do pretty well too:
$ npm install --save scramjet
Then let's read the data - I'd assume from a CSV:
const {StringStream} = require("scramjet");
const stream = require("fs")
.createReadStream(pathToFile)
.pipe(new StringStream('utf-8'))
.csvParse()
Now we have a stream of objects that will return the data line by line, but only as we read it. That solves problem #1; now to "augment" the stream:
#2 Asynchronously augmenting the stream data
No worries - that's just what you do - for every line of data you want to fetch some additional info (i.e. augment it) from some API, which by default is asynchronous.
That's where scramjet kicks in with just a couple of additional lines:
stream
    .flatMap(groupData => Object.entries(groupData))
    .flatMap(([groupIndex, contactList]) => contactList.map(contactData => [contactData, groupIndex]))
    // now you have a simple stream of entries for your call
    .map(([contactData, groupIndex]) => ajax('http://localhost:3001/api/getcontactdata/'+groupIndex+'/'+contactData))
    // and here you can print or do anything you like with your data stream
    .each(console.log);
After this you'd need to accumulate the data or output it to a stream - there are a number of options - for example: .toJSONArray().pipe(fileStream).
Using scramjet you are able to separate the process into multiple lines without much impact on performance. Using setOptions({maxParallel: 32}) you can control concurrency and, best of all, all this will run with a minimal memory footprint - much, much faster than if you were to load the whole dataset into memory.
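For example, the concurrency setting could be applied directly on the stream before the asynchronous .map call (a sketch reusing the chain from above):
stream
    .setOptions({maxParallel: 32}) // limit how many ajax calls are in flight at once
    .flatMap(groupData => Object.entries(groupData))
    .flatMap(([groupIndex, contactList]) => contactList.map(contactData => [contactData, groupIndex]))
    .map(([contactData, groupIndex]) => ajax('http://localhost:3001/api/getcontactdata/'+groupIndex+'/'+contactData))
    .each(console.log);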
Let me know if this is helpful - your question is quite complex, so let me know if you run into any problems - I'll be happy to help. :)
I have been reading up on methods to implement a polling function and found a great article at https://davidwalsh.name/javascript-polling. Using setTimeout rather than setInterval to poll makes a lot of sense, especially with an API that I have no control over and that has shown varying response times.
So I tried to implement such a solution in my own code in order to challenge my understanding of callbacks, promises and the event loop. I have followed the guidance outlined in the posts Is this a "Deferred Antipattern"? (to avoid any anti-patterns) and .then() promise resolve before inner promise resolved (to ensure promise resolution before a .then()), and this is where I am getting stuck. I have put some code together to simulate the scenario so I can highlight the issues.
My hypothetical scenario is this:
I have an API call to a server which responds with a userID. I then use that userID to make a request to another database server, which returns a set of data after carrying out some machine learning processing that can take several minutes.
Due to the latency, the task is put onto a task queue, and once it is complete it updates a NoSQL database entry from isComplete: false to isComplete: true. This means we then need to poll the database every n seconds until we get a response indicating isComplete: true, and then cease the polling. I understand there are a number of solutions for polling an API, but I have yet to see one involving promises and conditional polling that avoids the anti-patterns mentioned in the previously linked posts. If I have missed anything and this is a repeat, I apologize in advance.
So far the process is outlined by the code below:
let state = false;
const userIdApi = () => {
return new Promise((res, rej) => {
console.log("userIdApi");
const userId = "uid123";
setTimeout(()=> res(userId), 2000)
})
}
const startProcessingTaskApi = userIdApi().then(result => {
return new Promise((res, rej) => {
console.log("startProcessingTaskApi");
const msg = "Task submitted";
setTimeout(()=> res(msg), 2000)
})
})
const pollDatabase = (userId) => {
return new Promise((res, rej) => {
console.log("Polling databse with " + userId)
setTimeout(()=> res(true), 2000)
})
}
Promise.all([userIdApi(), startProcessingTaskApi])
.then(([resultsuserIdApi, resultsStartProcessingTaskApi]) => {
const id = setTimeout(function poll(resultsuserIdApi){
console.log(resultsuserIdApi)
return pollDatabase(resultsuserIdApi)
.then(res=> {
state = res
if (state === true){
clearTimeout(id);
return;
}
setTimeout(poll, 2000, resultsuserIdApi);
})
},2000)
})
I have a question that relates to this code, as it is failing to carry out the polling as I need:
I saw in the accepted answer of the post How do I access previous promise results in a .then() chain? that one should "break the chain" to avoid huge chains of .then() statements. I followed the guidance and it seemed to do the trick (before adding the polling); however, when I console.logged every line it seems that userIdApi is executed twice: once where it is used in the startProcessingTaskApi definition, and then again on the Promise.all line.
Is this a known occurrence? It makes sense why it happens; I am just wondering whether it is fine to send two requests to execute the same promise, or if there is a way to prevent the first request from happening and restrict the execution to the Promise.all statement?
I am fairly new to JavaScript, having come from Python, so any pointers on where I may be missing some knowledge to get this seemingly simple task working would be greatly appreciated.
I think you're almost there; it seems you're just struggling with the asynchronous nature of JavaScript. Using promises is definitely the way to go here, and understanding how to chain them together is key to implementing your use case.
I would start by implementing a single method that wraps setTimeout to simplify things down.
function delay(millis) {
return new Promise((resolve) => setTimeout(resolve, millis));
}
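On its own, delay can be used like this:
delay(1000).then(() => console.log('one second has passed'));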
Then you can re-implement the "API" methods using the delay function.
const userIdApi = () => {
return delay(2000).then(() => "uid123");
};
// Make userId an argument to this method (like pollDatabase) so we don't need to get it twice.
const startProcessingTaskApi = (userId) => {
return delay(2000).then(() => "Task submitted");
};
const pollDatabase = (userId) => {
return delay(2000).then(() => true);
};
You can continue polling the database by simply chaining another promise in the chain when your condition is not met.
function pollUntilComplete(userId) {
return pollDatabase(userId).then((result) => {
if (!result) {
// Result is not ready yet, chain another database polling promise.
return pollUntilComplete(userId);
}
});
}
Then you can put everything together to implement your use case.
userIdApi().then((userId) => {
// Add task processing to the promise chain.
return startProcessingTaskApi(userId).then(() => {
// Add database polling to the promise chain.
return pollUntilComplete(userId);
});
}).then(() => {
// Everything is done at this point.
console.log('done');
}).catch((err) => {
// An error occurred at some point in the promise chain.
console.error(err);
});
This becomes a lot easier if you're able to actually use the async and await keywords.
Using the same delay function as in Jake's answer:
async function doItAll(userID) {
await startProcessingTaskApi(userID);
while (true) {
if (await pollDatabase(userID)) break;
}
}
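It could then be kicked off from the userId promise, for example (a sketch reusing userIdApi from above):
userIdApi()
    .then(userId => doItAll(userId))
    .then(() => console.log('done'))
    .catch(err => console.error(err));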
I am using node and axios (with TS, but that's not too important) to query an API. I have a suite of scripts that make calls to different endpoints and log the data (sometimes filtering it.) These scripts are used for debugging purposes. I am trying to make these scripts "better" by adding a delay between requests so that I don't "blow up" the API, especially when I have a large array I'm trying to pass. So basically I want it to make a GET request and pause for a certain amount of time before making the next request.
I have played with trying setTimeout() functions, but I'm only putting them in places where they add the delay after the requests have executed; everywhere I have inserted the function has had this result. I understand why I am getting this result, I just had to try everything I could to at least increase my understanding of how things are working.
I have thought about trying to set up a queue or trying to use interceptors, but I think I might be "straying far" from a simpler solution with those ideas.
Additionally, I have another "base script" that I wrote on the fly (sorta the birth point for this batch of scripts) that I constructed with a for loop instead of the map() function and Promise.all. I have played with trying to set the delay in that script as well, but I didn't get anywhere helpful.
var axios = require('axios');
var fs = require('fs');
const Ids = [arrayOfIds];
try {
// Promise All takes an array of promises
Promise.all(Ids.map(id => {
// Return each request as its individual promise
return axios
.get(URL + 'endPoint/' + id, config)
}))
.then((vals) =>{
// Vals is the array of data from the resolved promise all
fs.appendFileSync(`${__dirname}/*responseOutput.txt`,
vals.map((v) => {
return `${JSON.stringify(v.data)} \n \r`
}).toString())
}).catch((e) => console.log(e))
} catch (err) {
console.log(err);
}
No errors with the above code; just can't figure out how to put the delay in correctly.
You could try Promise.map from bluebird
It has the option of setting concurrency
var axios = require('axios');
var fs = require('fs');
var Promise = require('bluebird');
const Ids = [arrayOfIds];
let concurrency = 3; // only maximum 3 HTTP request will run concurrently
try {
Promise.map(Ids, id => {
console.log(`starting request`, id);
return axios.get(URL + 'endPoint/' + id, config)
}, { concurrency })
.then(vals => {
console.log({vals});
});
} catch (err) {
console.log(err);
}
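If you specifically want a pause between requests rather than just a concurrency cap, Bluebird also offers Promise.mapSeries and Promise.delay, which could be combined roughly like this (a sketch reusing URL, config and Ids from above, with a hypothetical 500 ms pause):
Promise.mapSeries(Ids, id => {
    // Wait 500 ms, then fire the next request; requests run strictly one after another.
    return Promise.delay(500)
        .then(() => axios.get(URL + 'endPoint/' + id, config));
})
.then(vals => {
    console.log({vals});
})
.catch(err => {
    console.log(err);
});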
This question already has answers here: How do I return the response from an asynchronous call? (41 answers)
Closed 5 years ago.
I have a question regarding asynchronous functions and how to send something after a function has returned its result. This is what I am trying to accomplish:
Within the handling of a GET request in Node I read the contents of a folder, returning the files in that folder. Next I want to loop over the stats of each file in that folder, loading only the files created within a certain period, and lastly send the data in those files as a response to the request. It looks something like this:
var array = []
fs.readdir(path, function(err, items) {
items.forEach(function(item) {
fs.stat(path, function(err, stats) {
if (period check) {
array.push(data)
}
})
})
})
res.send(array)
This approach ends up sending an empty array, and I've looked into Promises which seem the solution here but I can't get them to work in this scenario. Using fs.statSync instead of fs.stat does work but this greatly reduces the performance, and it feels like it should be doable with Promises but I just don't know how.
Is there anyone with a solution for this?
EDIT: Regarding to the question marked as duplicate, I tried to solve my problem with the answer there first but didn't succeed. My problem has some nested functions and loops and is more complex than the examples given there.
Use this if you prefer a Promise-based approach:
var path = require('path')
fs.readdir(myPath, function(err, items) {
var array = [];
Promise.all(items.map(function(item) {
return new Promise(function(resolve, reject) {
fs.stat(path.resolve(myPath, item), function(err, stats) {
if (err) {
return reject(err)
}
if (/* period check */) {
array.push(stats) // or whatever data you need from the stats
}
resolve()
})
})
})).then(function() {
res.send(array)
}).catch(function(error) {
// error handling
res.sendStatus(500)
})
}
Here is what I would suggest.
// This is a new API and you might need to use the util.promisify
// npm package if you are using old node versions.
const promisify = require('util').promisify;
const fs = require('fs');
// promisify transforms a callback-based API into a promise-based one.
const readdir = promisify(fs.readdir);
const stat = promisify(fs.stat);
const dataProm = readdir(path)
.then((items) => {
// Map each items to a promise on its stat.
const statProms = items.map(item => stat(item)); // Note: item names are relative to the directory that was read.
// Wait for all these promises to resolve.
return Promise.all(statProms);
})
// Remove undesirable file stats based on the result
// of period check.
.then(stats => stats.filter(stat => periodCheck(stat)));
// dataProm will resolve with your data. You might as well return it
// as is. But if you need to use `res.send`, you can do:
dataProm.then((data) => {
res.send(data);
}, (err) => {
// If you go away from the promise chain, you need to handle
// errors now or you are silently swallowing them.
res.sendError(err);
});
Here is a link to the util.promisify package I am referring to. If you are using Node v8+, you do not need it. If you do need it, do not forget to replace require('util').promisify with require('util.promisify').
I have a file where I'm writing things:
var stream = fs.createWriteStream("my_file.txt");
stream.once('open', function(fd) {
names.forEach(function(name){
doSomething(name);
});
stream.end();
});
This is working ok and I'm able to write to the file.
The problem is that the doSomething() function has some parts that are asynchronous. An example can be given with the dns.lookup function. Somewhere in my doSomething() I have:
dns.lookup(domain, (err, addresses, family) => {
if(err){
stream.write("Error:", err);
}else{
stream.write(addresses);
}
});
Now, my problem is, since the DNS check is asynchronous, the code keeps executing and closes the stream. When the DNS response finally comes, there is nowhere left to write it.
I already tried to use the async module but it didn't work. Probably I did something wrong.
Any idea?
Now that NodeJS is mostly up to speed with ES2015 features (and I notice you're using at least one arrow function), you can use the native promises in JavaScript (previously you could use a library):
var stream = fs.createWriteStream("my_file.txt");
stream.once('open', function(fd) {
Promise.all(names.map(name => doSomething(name)))
.then(() => {
// success handling
stream.end();
})
.catch(() => {
// error handling
stream.end();
});
});
(The line Promise.all(names.map(name => doSomething(name))) can be simply Promise.all(names.map(doSomething)) if you know doSomething ignores extra arguments and only uses the first.)
Promise.all (spec | MDN) accepts an iterable and returns a promise that is settled when all of the promises in the iterable are settled (non-promise values are treated as resolved promises using the value as the resolution).
Where doSomething becomes:
function doSomething(name) {
    return new Promise((resolve, reject) => {
        dns.lookup(domain, (err, addresses, family) => {
            if (err) { // <== Note: check the error first
                stream.write("Error: " + err);
                reject(/*...reason...*/);
            } else {
                stream.write(addresses);
                resolve(/*...possibly include addresses*/);
            }
        });
    });
}
There are various libs that will "promise-ify" Node-style callbacks for you so using promises is less clunky than the mix above; in that case, you could use the promise from a promise-ified dns.lookup directly rather than creating your own extra one.
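For instance, on Node 8+ a sketch of that using util.promisify might look like this (hypothetically deriving the domain from the name, since that mapping isn't shown in the question):
const util = require('util');
const dns = require('dns');
const lookup = util.promisify(dns.lookup);

function doSomething(name) {
    const domain = name; // hypothetical: however your code maps a name to a domain
    return lookup(domain)
        .then(({ address }) => stream.write(address))
        .catch(err => stream.write("Error: " + err));
}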
I'm writing a purely synchronous, single threaded command line program in node.js, which needs to write a single binary file, for which I'm using WriteStream. My usage pattern is along the lines of:
var stream = fs.createWriteStream(file)
stream.write(buf1)
stream.write(buf2)
This seems to work, but the documentation says it's asynchronous and I want to make sure I'm not writing code that works 99% of the time. I don't care exactly when the data gets written as long as it's written in the specified order and no later than when the program exits, and the quantity of data is small so speed and memory consumption are not issues.
I've seen mention of stream.end() but it seems to work without it and I've also seen suggestions that calling it may actually be a bad idea if you're not using callbacks because it might end up getting called before all the data is written.
Is my approach correct (given that I want purely synchronous) or is there anything I need to watch out for?
You can do this; the only problem would be if you create two or more concurrent streams for the same path: the order of writes from different streams will be undefined. By the way, there is a synchronous fs write stream implementation in Node: fs.SyncWriteStream. It's kind of private and requires an fd as an argument, but if you really want it...
I'm working on a timing-critical API, where a new file has to have been written and its stream completely handled before the next action can be performed. The solution, in my case (and, quite possibly, that of the OP's question) was to use:
writer.on('finish', () => {
console.error('All writes are now complete.');
});
as per the fs Event: 'finish' documentation
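Applied to the original example, that might look something like this (a sketch, assuming file, buf1 and buf2 from the question):
var fs = require('fs');

var writer = fs.createWriteStream(file);
writer.write(buf1);
writer.write(buf2);
writer.end(); // signal that no more data will be written

writer.on('finish', () => {
    // All data has been flushed to the underlying system.
    console.log('All writes are now complete.');
});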
const writeToLocalDisk = (stream, path) => {
return new Promise((resolve, reject) => {
const istream = stream;
const ostream = fs.createWriteStream(path);
istream.pipe(ostream);
istream.on("end", () => {
console.log(`Fetched ${path} from elsewhere`);
resolve();
});
istream.on("error", (err) => {
console.log(JSON.stringify(err, null, 2));
resolve();
});
});
};
// Then use an async function to perform sequential-like operation
async function sequential (stream) {
const path = "";
await writeToLocalDisk(stream, path);
console.log('other operation here');
}