I have the following code:
const rl = require('readline').createInterface({
  input: require('fs').createReadStream(__dirname + '/../resources/profiles.txt'),
  terminal: true
});

for await (const line of rl) {
  scrape_profile(line);
}
scrape_profile is a function that makes a request to the web and performs some processing. The issue is that I want to limit it so that only 5 scrape_profile calls are executed per 30 seconds. As it stands, if I have a text file with 1000 lines, it goes ahead and fires 1000 concurrent requests at once. How do I limit this?
I'm not entirely sure why you're using a readline interface if you're reading the entire file into memory anyway, so for my answer I've replaced it with a call to fs.readFileSync: it's much easier to deal with a finite array than a stream, and the question didn't explicitly state that the file IO needed to be streamed.
You could try using Bluebird Promise.reduce:
const fs = require('fs');
const lines = fs.readFileSync('./test.txt').toString().split('\r\n');
const Promise = require('bluebird');

const BATCHES = 5;

// dummy scrape_profile: resolves with a random number after a random delay
const scrape_profile = file => new Promise((resolve, reject) => {
  setTimeout(() => {
    console.log("Done with", file);
    resolve(Math.random());
  }, Math.random() * 1000);
});
const runBatch = batchNo => {
  const batchSize = Math.round(lines.length / BATCHES);
  const start = batchSize * batchNo;
  const end = batchSize * (batchNo + 1);
  // process this batch's slice of lines one after another, collecting the results
  return Promise.reduce(lines.slice(start, end), (aggregate, line) => {
    console.log({ aggregate });
    return scrape_profile(line)
      .then(result => {
        aggregate.push(result);
        return aggregate;
      });
  }, []);
};
runBatch(0).then(/* batch 1 done*/)
runBatch(1).then(/* batch 2 done*/)
runBatch(2).then(/* batch 3 done*/)
runBatch(3).then(/* batch 4 done*/)
runBatch(4).then(/* batch 5 done*/)
// ... preferably use a for loop to do this
This is a full example; you should be able to run it locally with a file called 'test.txt' that has any contents. For each line it spends a random amount of time generating a random number, and it runs 5 separate batches. Change the value of BATCHES to reflect the number of batches you need.
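For the record, a minimal sketch of driving the batches with a loop instead of repeating the runBatch calls by hand (untested, reusing runBatch and BATCHES from above):
const batchPromises = [];
for (let batchNo = 0; batchNo < BATCHES; batchNo++) {
  batchPromises.push(runBatch(batchNo));
}
Promise.all(batchPromises).then(allBatches => {
  // allBatches is an array of per-batch result arrays
  console.log('All batches done');
});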
You can use setInterval with a 30-second period to run a loop that calls scrape_profile 5 times. Your current loop runs over the full number of lines (the 1000 you mentioned) without ever stopping; instead, loop 5 times inside a function that you call with setInterval, and keep the index of the current line in a variable so that each tick continues from where the previous one left off.
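A minimal sketch of that idea (untested; it assumes the lines have already been read into an array called lines, and that scrape_profile is the function from the question):
let current = 0; // index of the next line to process

const timer = setInterval(() => {
  // fire off at most 5 scrape_profile calls per 30-second tick
  for (let i = 0; i < 5 && current < lines.length; i++) {
    scrape_profile(lines[current]);
    current++;
  }
  if (current >= lines.length) {
    clearInterval(timer); // stop once every line has been handled
  }
}, 30 * 1000);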
I'm trying to automatically delete data older than 2 hours in the Firebase Realtime Database, but after writing this code it returns a "Malformed calls from JS: field sizes are different" error.
function reloadDatas() {
  const ref = database().ref('messages/');
  const now = Date.now();
  const cutoff = now - 2 * 60 * 60 * 1000; // 2 hours in milliseconds
  const old = ref.orderByChild('timestamp').endAt(cutoff);
  old.once('value', function (snapshot) {
    snapshot.forEach(function (childSnapshot) {
      childSnapshot.ref.remove();
    });
  });
}
what am I doing wrong?
I didn't test your code, but one problem stands out: the remove() method is asynchronous, so you cannot just fire it inside a forEach() loop. One solution is to use Promise.all() in order to call it a variable number of times in parallel.
So, the following should do the trick (untested):
JS SDK v8
old.get().then((snapshot) => {
  const promises = [];
  snapshot.forEach((childSnapshot) => {
    promises.push(childSnapshot.ref.remove());
  });
  return Promise.all(promises);
});
JS SDK v9
get(old).then((snapshot) => {
  const promises = [];
  snapshot.forEach((childSnapshot) => {
    promises.push(childSnapshot.ref.remove());
  });
  return Promise.all(promises);
});
Another possibility would be to write to the different database nodes simultaneously with the update() method, passing null for each child you want to delete. See here and here in the doc for more details.
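A sketch of that variant with the v8 SDK (untested, reusing ref and old from above):
old.get().then((snapshot) => {
  const updates = {};
  snapshot.forEach((childSnapshot) => {
    updates[childSnapshot.key] = null; // setting a child to null deletes it
  });
  // one multi-path update removes all the expired children at once
  return ref.update(updates);
});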
I know the title is quite generic, but I am inserting 1 million records into an AWS DynamoDB table and it currently takes ~30 minutes to load.
I have the 1 million records in memory; I just need to improve the speed of inserting the items. AWS only allows sending batches of 25 records, and all my code is synchronous.
Each object usually holds very little data (e.g. 3-5 properties with numeric ids).
I read the 1 million entries from a CSV and basically store them in a data array.
Then I do this:
await DatabaseHandler.batchWriteItems('myTable', data); // data length is 1 Million
Which calls my insert function
const documentClient = new DynamoDB.DocumentClient();

export class DatabaseHandler {
  static batchWriteItems = async (tableName: string, data: {}[]) => {
    // AWS only allows batches of max 25 items
    while (data.length) {
      const batch = data.splice(0, 25);
      const putRequests = batch.map((elem) => {
        return {
          PutRequest: {
            Item: elem
          }
        };
      });
      const params = {
        RequestItems: {
          [tableName]: putRequests,
        },
      };
      await documentClient.batchWrite(params).promise();
    }
  };
}
I believe I am making 40,000 HTTP requests, each creating 25 records in the database.
Is there any way to improve this? Even some ideas would be great.
Your code is "blocking", in the sense that you're waiting for the previous batch to execute before sending the next one. That isn't taking advantage of promises or of JavaScript's asynchronous nature. Instead, you can send all your requests at once and let the asynchrony do the work for you, which will be significantly faster:
// in your class method:
const proms = []; // <-- create a promise array
while (data.length) {
  const batch = data.splice(0, 25);
  const putRequests = batch.map((elem) => {
    return {
      PutRequest: {
        Item: elem
      }
    };
  });
  const params = {
    RequestItems: {
      [tableName]: putRequests,
    },
  };
  proms.push(documentClient.batchWrite(params).promise()); // <-- add the promise to our array
}
await Promise.all(proms); // <-- wait for everything to be resolved asynchronously, then be done
This will speed up your requests monumentally, as long as AWS lets you send that many concurrent requests.
I'm not sure how exactly you implemented your code, but to prove that this works, here's a dummy implementation; the sequential version takes a few seconds, while the non-blocking one finishes almost instantly:
const request = (_, t = 5) => new Promise(res => setTimeout(res, t)); // implement a dummy request API

// with your approach
async function a(data) {
  while (data.length) {
    const batch = data.splice(0, 25);
    await request(batch);
  }
}

// non-blocking
async function b(data) {
  const proms = [];
  while (data.length) {
    const batch = data.splice(0, 25);
    proms.push(request(batch));
  }
  await Promise.all(proms);
}

(async function time(a, b) {
  const data = Array(10000).fill(); // create some dummy data (10,000 instead of a million or you'll be staring at this demo for a while)
  console.time("original");
  await a([...data]); // pass a copy, since splice() empties the array it is given
  console.timeEnd("original");
  console.time("optimized");
  await b([...data]); // fresh copy again so both runs do the same amount of work
  console.timeEnd("optimized");
})(a, b);
I have a piece of asynchronous (setInterval) code in a Firebase function.
export const auto_play = functions
  .runWith({ memory: "512MB", timeoutSeconds: 540 })
  .pubsub.schedule("*/15 * * * *")
  .onRun(async (context) => {
    const nums = polledDoc.data()?.nums as number[];
    setInterval(() => {
      const polledNum = nums.shift();
      // the function is supposed to run for at least 10-15 mins, because nums.length can be anywhere from 60 to 90
      // autoPollAlgo saves data to the realtime database
      autoPollAlgo({ gameId: scheduledGame.docs[0].id, number: polledNum as number });
    }, 10 * 1000);
  });
This code works fine 3 times out of 5, but it sometimes exits the interval before the nums array is finished; sometimes it just stops after a minute, sometimes after 5 minutes.
I know functions have a max timeout of 9 minutes, but how does this async code sometimes keep running even past 9 minutes?
After some digging I found out I'm not returning any promise, so this code can terminate at any time. To fix that, I added a promise at the end of the block:
return new Promise((resolve, reject) => {
  setTimeout(() => {
    resolve(true);
    // to make sure it exits after 15 mins, when the interval code is done
  }, 900000);
});
Now what happens is that the function has become consistent: it ends exactly after 9 minutes (in the middle of the setInterval execution). It won't wait for the promise to resolve.
How can I keep a function running an async task for 15 minutes consistently?
After struggling with Cloud Functions I came up with a solution.
I divided the logic into two functions.
First, I split the nums array in half.
I ran setInterval for the first half of the nums, then used Cloud Pub/Sub to trigger another function with the other half.
// somewhere in the first function
pubSubClient
  .topic("play_half")
  .publish(Buffer.from(JSON.stringify(restNum), "utf-8"), { gameId: scheduledGame.docs[0].id });

export const auto_play_2 = functions
  .runWith({ memory: "512MB", timeoutSeconds: 540 })
  .pubsub.topic("play_half")
  .onPublish((message) => {
    const non_parse_arr = Buffer.from(message.data, "base64").toString();
    // other half of the nums
    const nums = JSON.parse(non_parse_arr);
    //...
  });
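For illustration, a sketch (untested) of how each half might be processed so the function only resolves once its half is done; playNums is a hypothetical helper, and autoPollAlgo is the function from the question:
const playNums = (nums, gameId) =>
  new Promise((resolve) => {
    const interval = setInterval(() => {
      const polledNum = nums.shift();
      if (polledNum === undefined) {
        // this half is exhausted: stop the interval and let the function finish
        clearInterval(interval);
        resolve(true);
        return;
      }
      autoPollAlgo({ gameId, number: polledNum });
    }, 10 * 1000);
  });

// inside each function: return the promise so Cloud Functions keeps it alive
return playNums(nums, scheduledGame.docs[0].id);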
I have an array with thousands of image links, like this:
let imageList = ["http://img1.jpg", "http://img2.jpg", ...];
I want to loop over imageList and, after every 20 items (n items), pause before moving on to the next indexes, like this:
for (let i = 0; i <= imageList.length; i += 20) {
  // with i from 0 -> 20
  // do download image from server
  downloadImages(0, 20) // [start, end]
  // delay 5s to prevent server timeout because of too many requests
  // with i from 20 -> 40
  // do download image from server
  downloadImages(20, 40)
  // continue delay 5s
  // .... try to finish
}
Try something like this:
const imageList = ['***'];

downloadImages(imageList)
  .then(/* done */)
  .catch(/* error */);
async function downloadImages(images) {
  // process the list 20 images at a time, pausing 5 seconds between chunks
  for (let i = 0; i < images.length; i += 20) {
    const n20images = images.slice(i, i + 20);
    await fetchImages(n20images);
    await delay(5);
  }
}
function fetchImages(images) {
  return Promise.all(
    images.map(image => /* fetch(image) or smth else */)
  );
}

function delay(seconds) {
  return new Promise(resolve => setTimeout(resolve, seconds * 1000));
}
You can use the modulus operator.
let imageList = ["http://img1.jpg", "http://img2.jpg", ...];
for (let i = 0; i < imageList.length; i++) {
  if (i % n == 0) // n is the number of images after which to start a delay
    START_YOUR_DELAY_HERE
  downloadImage(20); // 20 is the number of images you want to download
}
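A more concrete sketch of that idea (untested; it assumes a downloadImage(url) function that returns a promise):
const delayMs = 5000; // pause length between groups
const n = 20;         // images per group

(async () => {
  for (let i = 0; i < imageList.length; i++) {
    if (i > 0 && i % n === 0) {
      // after every n images, wait before continuing
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
    await downloadImage(imageList[i]);
  }
})();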
I use async/await, for...of, and chunk from lodash for this kind of situation. It makes the requests in groups of 20 so the server isn't flooded:
const _ = require('lodash'); // lodash provides chunk()

let i = 0;
const imageListChunks = _.chunk(imageList, 20);
for await (const chunk of imageListChunks) {
  const chunkPromises = downloadImage(0 + i * 20, 20 + i * 20);
  const chunkResp = await Promise.all(chunkPromises);
  i = i + 1;
}
If you need more of a delay to let the server breathe, you can add a setTimeout with another await to slow it down further.
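For example, a small sketch of that extra pause (the 5-second value is just an assumption):
const wait = ms => new Promise(resolve => setTimeout(resolve, ms));

let i = 0;
for await (const chunk of imageListChunks) {
  const chunkPromises = downloadImage(0 + i * 20, 20 + i * 20);
  await Promise.all(chunkPromises);
  await wait(5000); // give the server 5 seconds between chunks
  i = i + 1;
}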
You can use async and setTimeout to achieve this:
let downloadImage = async url => {
  console.log(`Downloading ${url}`);
  // Simulate a download delay
  return new Promise(r => setTimeout(r, 100 + Math.floor(Math.random() * 500)));
};

let downloadAllImages = async (imageUrls, chunkSize=20, delayMs=2000) => {
  for (let i = 0; i < imageUrls.length; i += chunkSize) {
    console.log(`Downloading chunk [${i}, ${i + chunkSize - 1}]`);
    // This `chunk` is a subset of `imageUrls`: the ones to be downloaded next
    let chunk = imageUrls.slice(i, i + chunkSize);
    // Call `downloadImage` on each item in the chunk; wait for all downloads to finish
    await Promise.all(chunk.map(url => downloadImage(url)));
    // Unless this is the last chunk, wait `delayMs` milliseconds before continuing
    // (Note: this step may be poor practice! See explanation at bottom of this answer)
    if ((i + chunkSize) < imageUrls.length) await new Promise(r => setTimeout(r, delayMs));
  }
};
// Create an array of example urls
let imageUrls = [ ...new Array(100) ].map((v, n) => `http://example.com/image${n}.jpg`);
// Call `downloadAllImages`
downloadAllImages(imageUrls)
// Use `.then` to capture the moment when all images have finished downloading
.then(() => console.log(`Finished downloading ${imageUrls.length} images!`));
Note that if you implement downloadImage correctly, so that it returns a promise which resolves when the image is downloaded, it may be best practice to forego the timeout. The timeout is a heuristic way of ensuring not too many requests are running at once, but if you have a fine-grained sense of when a request finishes you can simply wait for a batch of requests to finish before beginning the next batch.
There is an even more efficient design to consider (for your further research). To understand, let's think about a problem with this current approach (which I'll call the "batch" approach). The batch approach is incapable of beginning a new batch until the current one completes. Imagine a batch of 20 images where 1 downloads in 1ms, 18 of them download within 5ms, but the final image takes 10+ seconds to download; even though this system ought to have the bandwidth to download 20 images at once, it winds up spending 10 entire seconds with only a single request in progress. A more efficient design (which we can call the "maximal bandwidth approach") would maintain a queue of 20 in-progress requests, and every time one of those requests completes a new request is immediately begun. Imagine that first image which downloads in 1ms; the moment it finishes, and only 19 requests are in progress, the "maximal bandwidth approach" could begin a new request right away without waiting for those other 19 to finish.
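Here's a minimal sketch of that maximal bandwidth approach (untested; it reuses the hypothetical downloadImage from above and keeps up to `limit` downloads in flight, starting a new one as soon as any request finishes):
let downloadWithPool = async (imageUrls, limit=20) => {
  let nextIndex = 0;
  // Each worker repeatedly grabs the next unclaimed url until none remain
  let worker = async () => {
    while (nextIndex < imageUrls.length) {
      let url = imageUrls[nextIndex++];
      await downloadImage(url);
    }
  };
  // Start `limit` workers and wait for all of them to drain the list
  await Promise.all([ ...new Array(limit) ].map(() => worker()));
};

downloadWithPool(imageUrls, 20).then(() => console.log('All done!'));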
Set some offset
let offset = 0;

for (let i = offset; i < imageList.length; i += 20) {
  downloadImage(offset, offset + 20);
  offset += 20;
}
I have been trying to understand Promises and I'm hitting a brick wall.
==Order I want the code to run==
I need a .txt file to load each line into an array.
WAIT for this to happen.
Run a function on each entry that returns an array.
WAIT for each index of the array to be processed before doing the next.
==My Functions==
Call this function to start the program.
async function start() {
  var data = await getData();
  console.log(data);
  for (var i = 0; i < data.length; i++) {
    console.log(await searchGoogle(data[i]));
  }
}
'await' for the data from getData
async function getData() {
  return new Promise(function(resolve, reject) {
    fs.readFile('./thingsToGoogle.txt', function(err, data) {
      if (err) throw err;
      var array = data.toString().split("\n");
      resolve(array);
    });
  });
}
Then call searchGoogle on each index in the array.
async function searchGoogle(toSearch) {
  (async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.google.com/');
    await page.type('input[name=q]', toSearch);

    try {
      console.log('Setting Search' + toSearch);
      await page.evaluate(() => {
        let elements = document.getElementsByClassName('gNO89b');
        for (let element of elements)
          element.click();
      });
      await page.waitForNavigation();
    } catch (err) {
      console.log(err)
    }

    try {
      console.log("Collecting Data");
      const [response] = await Promise.all([
        page.waitForNavigation(),
        await page.click('.rINcab'),
      ]);
    } catch (err) {
      console.log("Error2: " + err)
    }

    let test = await page.$$('.LC20lb');
    // console.log(test);
    allresults = [];
    for (const t of test) {
      const label = await page.evaluate(el => el.innerText, t);
      if (label != "") {
        allresults.push(label);
      }
    }

    await browser.close();
    resolve(allresults);
  })();
}
The problem is that this does not work; it does not wait for the file to load.
[Picture of Node.js output]
Hopefully the screenshot has uploaded, but you can see it stacking the searchGoogle function's console.logs:
console.log('Setting..')
console.log('Setting..')
console.log('Collecting..')
console.log('Collecting..')
When it should be
console.log('Setting..')
console.log('Collecting..')
console.log('Setting..')
console.log('Collecting..')
This is sort of the first time I'm dealing with promises. I have done a lot of reading up on them and written bits of code to understand them, but when I try to apply that knowledge I'm struggling. Hope someone can help.
-Peachman-
Queue with concurrency limit (using p-queue)
You need a queue with a concurrency limit. You read every single line and add it to the queue. We will be using the readline and p-queue modules for this.
First, create a queue with concurrency of 1.
const {default: PQueue} = require('p-queue');
const queue = new PQueue({concurrency: 1});
Then, create our reader instance.
const fs = require('fs');
const readline = require('readline');

const rl = readline.createInterface({
  input: fs.createReadStream('your-input-file.txt')
});
For every line of the file, add an entry to the queue.
rl.on('line', (line) => {
  console.log(`Line from file: ${line}`);
  queue.add(() => searchGoogle(line));
});
That's it! If you want to process 10 lines at once, just change the concurrency value. It will still read one line at a time, but the queue will limit how many searchGoogle calls run at once.
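If you also want to know when everything has finished, p-queue's onIdle() promise can be awaited once the file has been fully read; a small sketch, reusing the rl and queue from above:
rl.on('close', async () => {
  // every line has been queued at this point; wait for the queue to drain
  await queue.onIdle();
  console.log('All lines have been processed');
});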
Optional Fixes: Async Await
Your code has the following structure,
async function yourFunction() {
  (async () => {
    const browser = await puppeteer.launch();
    // ... rest of the code
  })();
}
While this might run as intended, you will have a hard time debugging because you will be creating an anonymous function every time you run yourFunction.
The following is enough.
async function yourFunction() {
  const browser = await puppeteer.launch();
  // ... rest of the code
}
Here's a way to process them that lets you handle N URLs at a time, where you can adjust the value of N. My guess is that you want it set to a value between 5 and 20 in order to keep your CPU busy without using too many server resources.
Here's an outline of how it works:
It uses the line-by-line module to read a file line by line; unlike the built-in readline interface, this module actually pauses line events when you call .pause(), which is important in this implementation.
It maintains a numInFlight counter that tells you how many lines are in the midst of processing.
You set a maxInFlight constant to the maximum number of lines you want to be processed in parallel.
It maintains a resultCntr that helps you keep results in the proper order.
It creates the readline interface and establishes a listener for the line event. This will start the stream flowing with line events.
On each line event, we increment our numInFlight counter. If we have reached the maximum number allowed in flight, we pause the readline stream so it won't produce any more line events. If we haven't reached the max in flight yet, then more line events will flow until we do reach the max.
We pass that line off to your existing searchGoogle() function.
When that line is done processing, we save the result in the appropriate spot in the array, decrement the numInFlight counter and resume the stream (in case it was previously paused).
We check if we're all done (by checking if numInFlight is 0 and if we've reached the end of our file). If we are done, resolve the master promise with the results.
If we're not all done, then there will either be more line events coming or more searchGoogle() functions in flight that will finish, both of which will check again to see if we're done.
Note that the way this is designed to work is that errors on any given URL are just put into the result array (the error object is in the array) and processing continues on the rest of the URLs with an eventual resolved promise. Errors while reading the input file will terminate processing and reject the return promise.
Here's the code:
const fs = require('fs');
const Readline = require('line-by-line');

function searchAll(file) {
  return new Promise(function(resolve, reject) {
    const rl = new Readline(file);
    // set maxInFlight to something between 5 and 20 to optimize performance by
    // running multiple requests in flight at the same time without
    // overusing memory and other system resources.
    const maxInFlight = 1;
    let numInFlight = 0;
    let resultCntr = 0;
    let results = [];
    let doneReading = false;

    function checkDone(e) {
      if (e) {
        reject(e);
      } else if (doneReading && numInFlight === 0) {
        resolve(results);
      }
    }

    rl.on('line', async (url) => {
      if (url) {
        let resultIndex = resultCntr++;
        try {
          ++numInFlight;
          if (numInFlight >= maxInFlight) {
            // stop flowing line events when we hit maxInFlight
            rl.pause();
          }
          let result = await searchGoogle(url);
          // store results in order
          results[resultIndex] = result;
        } catch(e) {
          // store error object as result
          results[resultIndex] = e;
        } finally {
          --numInFlight;
          rl.resume();
          checkDone();
        }
      }
    }).on('end', () => {
      // all done reading here, may still be some processing in flight
      doneReading = true;
      checkDone();
    }).on('error', (e) => {
      doneReading = true;
      checkDone(e);
    });
  });
}
FYI, you can set maxInFlight to a value of 1 and it will process the URLs one at a time, but the whole point of writing this type of function is that you can likely get better performance by setting it to a value higher than 1 (I'm guessing 5-20).
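For example, a possible way to call it (assuming searchGoogle is defined as in the question):
searchAll('./thingsToGoogle.txt')
  .then(results => console.log(results))
  .catch(err => console.error(err));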