I'm building a simple web crawler to automate a newsletter, which means I only need to scrape a set number of pages. In this example it is not a big deal, because the script will only crawl 3 extra pages. But for a different case, this would be hugely inefficient.
So my question is, would there be a way to stop executing request() in this forEach loop?
Or would I need to change my approach to crawl pages one-by-one, as outlined in this guide?
Script
'use strict';

var request = require('request');
var cheerio = require('cheerio');

var BASEURL = 'https://jobsite.procore.com';

scrape(BASEURL, getMeta);

function scrape(url, callback) {
  var pages = [];

  request(url, function(error, response, body) {
    if (!error && response.statusCode == 200) {
      var $ = cheerio.load(body);

      $('.left-sidebar .article-title').each(function(index) {
        var link = $(this).find('a').attr('href');
        pages[index] = BASEURL + link;
      });

      callback(pages, log);
    }
  });
}
function getMeta(pages, callback) {
  var meta = [];
  // using forEach's index does not work, it will loop through the array before the first request can execute
  var i = 0;
  // using a for loop does not work here
  pages.forEach(function(url) {
    request(url, function(error, response, body) {
      if (error) {
        console.log('Error: ' + error);
      }
      var $ = cheerio.load(body);
      var desc = $('meta[name="description"]').attr('content');
      meta[i] = desc.trim();
      i++;
      // Limit
      if (i == 6) callback(meta);
      console.log(i);
    });
  });
}

function log(arr) {
  console.log(arr);
}
Output
$ node crawl.js
1
2
3
4
5
6
[ 'Find out why fall protection (or lack thereof) lands on the Occupational Safety and Health Administration (OSHA) list of top violations year after year.',
'noneChances are you won’t be seeing any scented candles on the jobsite anytime soon, but what if it came in a different form? The allure of smell has conjured up some interesting scent technology in recent years. Take for example the Cyrano, a brushed-aluminum cylinder that fits in a cup holder. It’s Bluetooth-enabled and emits up to 12 scents or smelltracks that can be controlled using a smartphone app. Among the smelltracks: “Thai Beach Vacation.”',
'The premise behind the hazard communication standard is that employees have a right to know the toxic substances and chemical hazards they could encounter while working. They also need to know the protective things they can do to prevent adverse effects of working with those substances. Here are the steps to comply with the standard.',
'The Weitz Company has been using Procore on its projects for just under two years. Within that time frame, the national general contractor partnered with Procore to implement one of the largest technological advancements in its 163-year history. Click here to learn more about their story and their journey with Procore.',
'MGM Resorts International is now targeting Aug. 24 as the new opening date for the $960 million hotel and casino complex it has been building in downtown Springfield, Massachusetts.',
'So, what trends are taking center stage this year? Below are six of the most prominent. Some of them are new, and some of them are continuations of current trends, but they are all having a substantial impact on construction and the structures people live and work in.' ]
7
8
9
Aside from using slice to limit your selection, you can also refactor the code to reuse some functionality.
Sorry, I couldn't help myself after thinking about this for a second.
We can begin with the refactor:
const rp = require('request-promise-native');
const {load} = require('cheerio');

function scrape(uri, transform) {
  const options = {
    uri,
    transform: load
  };

  return rp(options).then(transform);
}

scrape(
  'https://jobsite.procore.com',
  ($) => $('.left-sidebar .article-title a').toArray().slice(0, 6).map((linkEl) => linkEl.attribs.href)
).then((links) => Promise.all(
  links.map(
    (link) => scrape(
      `https://jobsite.procore.com/${link}`,
      ($) => $('meta[name="description"]').attr('content').trim()
    )
  )
)).then(console.log).catch(console.error);
While this does make the code a bit more DRY and concise, it points out one part that might need to be improved upon: the requesting of the links.
Currently it will fire off requests for all (or up to) 6 links found on the original page nearly at once. This may or may not be what you want, depending on how many links will be requested in the other case you alluded to.
Another potential concern is error management. As the refactor stands, if any one of the requests fails then all of the requests will be discarded.
Just a couple of points to consider if you like this approach. Both can be resolved in a variety of ways.
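For instance, here is a minimal sketch of my own (assuming Node 7.6+ for async/await and reusing the scrape() helper above) that requests the links one at a time and keeps the rest of the batch even if a single request fails:

async function getDescriptions(links) {
  const meta = [];
  for (const link of links) {
    try {
      // sequential: wait for each page before requesting the next one
      const desc = await scrape(
        `https://jobsite.procore.com/${link}`,
        ($) => $('meta[name="description"]').attr('content').trim()
      );
      meta.push(desc);
    } catch (err) {
      // a single failed link no longer discards the whole batch
      console.error(`Failed to fetch ${link}:`, err.message);
    }
  }
  return meta;
}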
There's no way to stop a forEach. You can simulate a stop by checking a flag inside the forEach, but it will still loop over all the elements. By the way, firing off I/O operations from a plain loop is not optimal.
As you have stated, the best way to process a growing set of data is to do it one-by-one, but I'll add a twist: threaded one-by-one.
NOTE: By thread I don't mean actual threads. Think of it more as "multiple lines of work". Since IO operations don't block the main thread, while one or more requests are waiting for data, another "line of work" can run the JavaScript that processes the data already received, because JavaScript is single threaded (not talking about WebWorkers).
It's as easy as having an array of pages, which receives pages to be crawled on the fly, and one function that takes one page from that array, processes the result, and then returns to the starting point (taking the next page from the array and processing its result).
Now you just call that function as many times as the number of threads you want to run, and you're done. Pseudo-code:
var pages = [];

function loadNextPage() {
  if (pages.length == 0) {
    console.log("Thread ended");
    return;
  }
  var page = pages.shift(); // get the first element
  loadAndProcessPage(page, loadNextPage);
}

function loadAndProcessPage(page, callback) {
  requestOrWhatever(page, (error, data) => {
    if (error) {
      // retry or whatever
    } else {
      processData(data);
      callback();
    }
  });
}

function processData(data) {
  // Process the data and push new links to the pages array
  pages.push(data.link1);
  pages.push(data.link2);
  pages.push(data.link3);
}

console.log("Start new thread");
loadNextPage();
console.log("And another one");
loadNextPage();
console.log("And another one");
loadNextPage();
console.log("And another thread");
loadNextPage();
This code will stop when there are no more pages in the array, and if at some point there happen to be fewer pages than threads, some of the threads will close. It needs some tweaks here and there, but you get the point.
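Applied to your crawler, a minimal sketch of the same idea could look like this (MAX_PAGES and CONCURRENCY are assumptions of mine, not values from your code):

var MAX_PAGES = 6;
var CONCURRENCY = 2;
var queue = [];   // filled with URLs by your scrape() step
var meta = [];

function crawlNext() {
  if (queue.length === 0 || meta.length >= MAX_PAGES) return;
  var url = queue.shift();
  request(url, function(error, response, body) {
    if (!error && response.statusCode == 200) {
      var $ = cheerio.load(body);
      meta.push($('meta[name="description"]').attr('content').trim());
    }
    crawlNext(); // this "thread" picks up the next page
  });
}

// start the desired number of "threads"
for (var t = 0; t < CONCURRENCY; t++) crawlNext();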
I'm assuming you're trying to stop executing after some number of pages (it looks like six in your example). As some other replies have stated, you can't stop an Array.prototype.forEach() from iterating, but on each iteration you can skip running the request call.
function getMeta(pages, callback) {
  var meta = [];
  var i = 0;
  var maxPages = 6; // MaxPages you were looking for
  pages.forEach(function(url) {
    if (i < maxPages) {
      request(url, function(err, res, body) {
        // ... Request logic
      });
    }
    i++;
  });
}
You could also use a while loop to iterate over the pages; once i hits the value you want, the loop exits and the additional pages are never requested.
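For example, here is a minimal sketch of limiting the iteration up front, using slice instead of a while loop (maxPages is an assumed limit, not something from your original code):

function getMeta(pages, callback) {
  var maxPages = 6; // assumed limit
  var meta = [];
  var completed = 0;
  var toFetch = pages.slice(0, maxPages); // only these pages are ever requested

  toFetch.forEach(function(url) {
    request(url, function(error, response, body) {
      if (!error && response.statusCode == 200) {
        var $ = cheerio.load(body);
        meta.push($('meta[name="description"]').attr('content').trim());
      }
      completed++;
      if (completed === toFetch.length) callback(meta);
    });
  });
}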
Related
I'm using the npm library jsdiff, which has a function that determines the difference between two strings. This is a synchronous function, but given two large, very different strings, it will take extremely long periods of time to compute.
diff = jsdiff.diffWords(article[revision_comparison.field], content[revision_comparison.comparison]);
This function is called in a stack that handles a request through Express. How can I, for the sake of the user, make the experience more bearable? I think my two options are:
Cancelling the synchronous function somehow.
Cancelling the user request somehow. (But would this keep the function still running?)
Edit: I should note that given two very large and different strings, I want different logic to take place in the code. Therefore, simply waiting for the process to finish is unnecessary and a needless load; I definitely don't want it to run for any long period of time.
Fork a child process for that specific task; you can even create a queue to limit the number of child processes that can be running at a given moment.
Here is a basic example of a worker that receives the original Express req and res from the master, performs heavy synchronous operations without blocking the main (master) thread, and once it has finished returns the outcome back to the master.
Worker (Fork Example):

process.on('message', function(req, res) {
  /* > Your jsdiff logic goes here */
  // change this for your heavy synchronous work:
  var input = req.params.input;
  var outcome = false;
  if (input == 'testlongerstring') { outcome = true; }
  // Pass results back to parent process:
  process.send(req, res, outcome);
});
And from your Master:

var cp = require('child_process');
var child = cp.fork(__dirname + '/worker.js');

child.on('message', function(req, res, outcome) {
  // Receive results from child process
  console.log('received: ' + outcome);
  res.send(outcome); // end response with data
});
You can perfectly send some work to the child along with the req and res like this (from the Master): (imagine app = express)
app.get('/stringCheck/:input', function(req, res) {
  child.send(req, res);
});
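One caveat worth noting (my addition, not part of the answer above): Express req and res objects are not serializable, so they cannot actually travel over the child-process IPC channel, and child.send()/process.send() pass a single message. A sketch of the same idea that only passes the input string and keeps res in the parent (the pending map and message shape are my own assumptions):

// master.js (sketch)
var cp = require('child_process');
var child = cp.fork(__dirname + '/worker.js');
var pending = {};   // requestId -> res
var nextId = 0;

child.on('message', function(msg) {
  var res = pending[msg.id];
  delete pending[msg.id];
  if (res) res.send(msg.outcome);
});

app.get('/stringCheck/:input', function(req, res) {
  var id = nextId++;
  pending[id] = res;
  child.send({ id: id, input: req.params.input });
});

// worker.js (sketch)
process.on('message', function(msg) {
  var outcome = (msg.input === 'testlongerstring'); // heavy sync work goes here
  process.send({ id: msg.id, outcome: outcome });
});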
I found this on jsdiff's repository:
All methods above which accept the optional callback method will run in sync mode when that parameter is omitted and in async mode when supplied. This allows for larger diffs without blocking the event loop. This may be passed either directly as the final parameter or as the callback field in the options object.
This means that you should be able to add a callback as the last parameter, making the function asynchronous. It will look something like this:
jsdiff.diffWords(article[x], content[y], function(err, diff) {
//add whatever you need
});
Now, you have several choices:
Return directly to the user and keep the function running in the background.
Set a 2 second timeout (or whatever limit fits your application) using setTimeout as outlined in this answer.
If you go with option 2, your code should look something like this
jsdiff.diffWords(article[x], content[y], function(err, diff) {
  //add whatever you need
  return callback(err, diff);
});

//if this was called, it means that the above operation took more than 2000ms (2 seconds)
setTimeout(function() { return callback(); }, 2000);
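Note that, as written, both the diff callback and the timeout will eventually fire, so callback could be invoked twice. A minimal sketch (the finished guard is my own addition) that lets whichever path finishes first win:

var finished = false;

jsdiff.diffWords(article[x], content[y], function(err, diff) {
  if (finished) return;   // the timeout already won
  finished = true;
  callback(err, diff);
});

setTimeout(function() {
  if (finished) return;   // the diff already completed in time
  finished = true;
  callback();             // took longer than 2 seconds
}, 2000);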
I have been facing this issue for the past week and I am just confused about it.
Keeping it short and simple to explain the problem.
We have an in-memory model which stores values like budget, etc. Now when a call is made to the API it has a spend associated with it.
We then check the in-memory model, add the spend to the existing spend, and then check it against the budget; if it exceeds the budget we do not accept any more clicks for that model. For each call we also update the db, but that is an async operation.
A short example
api.get('/clk/:spent/:id', function(req, res) {
  checkbudget(spent, id);
});

function checkbudget(spent, id) {
  var obj = inMemoryModel[id];
  obj.spent += spent;
  if (obj.spent > obj.budget) { // if greater
    obj.status = 11;            // 11 is the stopped status
    // update db and rebuild model
  }
}
This used to work fine, but now with concurrent requests we are getting false spends: our spend increases to more than the budget, and it only stops after some time. We simulated the calls with JMeter and found this.
As far as we could find, Node is async, so by the time the status is updated to 11, many requests have already updated the spend for the campaign.
How can we have semaphore-like logic in Node.js so that the variable budget stays in sync with the model?
update
db.addSpend(campaignId, spent, function(err, data) {
  campaign.spent += spent;
  var totalSpent = (+camp.spent) + (+camp.cpb);
  if (totalSpent > camp.budget) {
    logger.info('Stopping it..');
    camp.status = 11; // in-memory stop
    var History = [];
    History.push(/* some data */);
    db.stopCamp(campId, function(err, data) {
      if (err) {
        logger.error('Error while stopping');
      }
      model.campMAP = buildCatMap(model);
      model.campKeyMap = buildKeyMap(model);
      db.campEventHistory(cpcHistory, false, function(err) {
        if (err) {
          logger.error(err);
        }
      });
    });
  }
});
That is the GIST of the code. Can anyone help, please?
Q: Is there semaphore or equivalent in NodeJs?
A: No.
Q: Then how do NodeJs users deal with race condition?
A: In theory you shouldn't have to, as JavaScript is single-threaded.
Before going deeper into my proposed solution I think it is important for you to know how NodeJs works.
NodeJs is driven by an event-based architecture. This means that in the Node process there is an event queue that contains all the "to-do" events.
When an event gets popped from the queue, Node will execute all of the required code until it is finished. Any async calls made during the run are spawned as other events, and they sit in the event queue until a response comes back and it is time to run them again.
Q: So what can I do to ensure that only 1 request can perform updates to the database at a time?
A: I believe there are many ways you can achieve this, but one of the easier ways out is to use the setTimeout API.
Example:
api.get('/clk/:spent/:id', function(req, res) {
  var data = {
    id: id,
    spending: spent
  };
  canProceed(data, /*function to exec after canProceed=*/ checkbudget);
});

var canProceed = function(data, next) {
  var model = inMemoryModel[data.id];
  if (model.is_updating) {
    // still locked: schedule another check and try again later
    setTimeout(function() { canProceed(data, next); }, /*try again in=*/1000/*milliseconds*/);
  } else {
    // lock is released. Proceed.
    next(data.spending, data.id);
  }
};

function checkbudget(spent, id) {
  var obj = inMemoryModel[id];
  obj.is_updating = true;       // Lock this model
  obj.spent += spent;
  if (obj.spent > obj.budget) { // if greater
    obj.status = 11;            // 11 is the stopped status
    // update db and rebuild model
  }
  obj.is_updating = false;      // Unlock the model
}
Note: What I have here is pseudo code as well, so you may have to tweak it a bit.
The idea here is to have a flag in your model to indicate whether an HTTP request can proceed into the critical code path, in this case your checkbudget function and beyond.
When a request comes in, it checks the is_updating flag to see if it can proceed. If the flag is true, it schedules an event to be fired a second later; this setTimeout call basically becomes an event and gets placed into Node's event queue for later processing.
When that event fires later, it checks again. This repeats until the is_updating flag becomes false; then the request goes on to do its work, setting is_updating to true while it runs and back to false when all the critical code is done.
Not the most efficient way, but it gets the job done; you can always revisit the solution when performance becomes a problem.
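An alternative sketch of my own (not part of the answer above) is to serialize updates per campaign with a simple in-memory FIFO queue instead of polling with setTimeout:

var queues = {}; // campaignId -> array of pending jobs

function enqueueUpdate(id, job) {
  var q = queues[id] = queues[id] || [];
  q.push(job);
  if (q.length === 1) runNext(id); // nothing was running, start now
}

function runNext(id) {
  var q = queues[id];
  if (!q || q.length === 0) return;
  q[0](function done() {   // run the job; it calls done() when finished
    q.shift();
    runNext(id);
  });
}

// usage: enqueueUpdate(campaignId, function(done) { /* checkbudget + db update, then */ done(); });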
In my node server I have a variable,
var clicks = 0;
Each time a user clicks in the webapp, a websocket event sends a message. On the server,
clicks++;
if (clicks % 10 == 0) {
  saveClicks();
}

function saveClicks() {
  var placementData = JSON.stringify({'clicks' : clicks});
  fs.writeFile(__dirname + '/clicks.json', placementData, function(err) {
  });
}
At what rate do I have to start worrying about overwrites? How would I calculate this math?
(I'm looking at creating a MongoDB json object for each click but I'm curious what a native solution can offer).
From the node.js doc for fs.writeFile():
Note that it is unsafe to use fs.writeFile() multiple times on the same file without waiting for the callback. For this scenario, fs.createWriteStream() is strongly recommended.
This isn't a math problem to figure out when this might cause a problem - it's just bad code that gives you the chance of a conflict in circumstances that cannot be predicted. The node.js doc clearly states that this can cause a conflict.
To make sure you don't have a conflict, write the code in a different way so a conflict cannot happen.
If you want to make sure that all writes happen in the proper order of incoming requests, so the last request to arrive is always the one that ends up in the file, then you may need to queue your data as it arrives (so order is preserved), write to the file in a way that opens it for exclusive access so no other request can write while the prior request is still writing, and handle contention errors appropriately.
This is an issue that databases mostly do for you automatically so it may be one reason to use a database.
Assuming you weren't using clustering and thus do not have multiple processes trying to write to this file and that you just want to make sure the last value sent is the one written to the file by this process, you could do something like this:
var saveClicks = (function() {
  var isWriting = false;
  var lastData;

  return function() {
    // always save most recent data here
    lastData = JSON.stringify({'clicks' : clicks});

    if (!isWriting) {
      writeData(lastData);
    }

    function writeData(data) {
      isWriting = true;
      lastData = null;
      fs.writeFile(__dirname + '/clicks.json', data, function(err) {
        isWriting = false;
        if (err) {
          // decide what to do if an error occurs
        }
        // if more data arrived while we were writing this, then write it now
        if (lastData) {
          writeData(lastData);
        }
      });
    }
  };
})();
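For context, a usage sketch (the websocket handler shape is an assumption of mine, not from the answer): the wrapper above coalesces concurrent writes into at most one pending write, so it can simply be called from the click handler as before.

socket.on('click', function() {
  clicks++;
  if (clicks % 10 === 0) {
    saveClicks(); // safe to call even if a previous write is still in flight
  }
});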
#jfriend00 is definitely right about createWriteStream and already made a point about the database, and everything's pretty much said, but I would like to emphasize the point about databases, because the file-saving approach seems weird to me.
So, use databases.
Not only would this save you the headache of tracking such things, it would also significantly speed things up (remember that the way Node works, the numerous file read/write operations are interleaved on a single thread, so if one of them takes ages, it can affect overall performance).
Redis is a perfect solution for storing key-value data, so you can store data like clicks per user in a Redis database, which you'll probably end up running alongside anyway once you get enough traffic :)
If you're not convinced yet, take a look at this simple benchmark:
Redis:
var async = require('async');
var redis = require("redis"),
    client = redis.createClient();

console.time("To Redis");
async.mapLimit(new Array(100000).fill(0), 1, (el, cb) => client.set("./test", 777, cb), () => {
  console.timeEnd("To Redis");
});
To Redis: 5410.383ms
fs:
var async = require('async');
var fs = require('fs');

console.time("To file");
async.mapLimit(new Array(100000).fill(0), 1, (el, cb) => fs.writeFile("./test", 777, cb), () => {
  console.timeEnd("To file");
});
To file: 20344.749ms
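If you do go the Redis route, here is a minimal sketch of counting clicks without touching the filesystem at all (the key naming is my own assumption):

var redis = require('redis');
var client = redis.createClient();

// call this from each click event
function recordClick(userId) {
  // INCR is atomic, so concurrent clicks can't clobber each other
  client.incr('clicks:' + userId, function(err, total) {
    if (err) console.error(err);
    else console.log('clicks for ' + userId + ':', total);
  });
}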
And, by the way, you can significantly increase the number of clicks after which progress gets stored (right now it's every 10) by simply moving this "click-saver" into the socket's disconnect handler (socket.on('disconnect', ...)).
I'm trying to use the Gmail API to retrieve all the thread subjects in a gmail account.
That's easy with threads.list, but that mainly gets the ID of the thread, not the subject.
The only way I've found is by using threads.list then, for each thread, calling threads.get and fetching the subject from the headers in the payload metadata.
Obviously this makes a lot of API calls, i.e. 101 calls if there are 100 threads.
Is there a better way?
Here's the code I'm currently using:
var getIndivThread = function(threads) {
  threads.threads.forEach(function(e) {
    indivThreadRequst.id = e.id;
    gmail.api.users.threads.get(indivThreadRequst).execute(showThread);
  });
};

var indivThreadRequst = {
  format: 'metadata',
  metadataHeaders: ['subject'],
  userId: myUserId,
  maxResults: 1
};

var showThread = function(thread) {
  console.log(thread.messages[0].payload.headers[0].value);
};

gmail.api.users.threads.list({userId: myUserId}).execute(getIndivThread);
Unfortunately, there isn't a way to get more than one thread subject at a time through the current API. However, there are a few things you might do to improve the performance:
Use the API's paging feature to fetch limited numbers of threads at once (see the paging sketch after the example below).
Fetch batches of messages in parallel rather than attempting to fetch all at once or one at a time. Experiment for yourself, but 5 would be a good number to start with.
If this is a browser implementation, consider making a call to your back-end instead of making the calls from the browser. This way the client only makes 1 call per page, and it allows you to potentially add caching and pre-loading mechanisms to your server that will improve the customer experience. The potential downside here is scalability; as you get more clients, you'll need considerably more processing power than a fat-client approach.
As an example of #2, you could fetch 5 initially, and then have the callback for each function fire the next call so there are always 5 fetching concurrently:
var CONCURRENCY_LIMIT = 5;

function getThread(threadId, done) {
  indivThreadRequst.id = threadId;
  gmail.api.users.threads.get(indivThreadRequst).execute(function(thread) {
    showThread(thread);
    done();
  });
}

gmail.api.users.threads.list({userId: myUserId}).execute(function(response) {
  var threads = response.threads;
  function fetchNextThread() {
    var nextThread = threads.shift();
    nextThread && getThread(nextThread.id, fetchNextThread);
  }
  for (var i = 0; i < CONCURRENCY_LIMIT; i++) {
    fetchNextThread();
  }
});
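And for point #1, here is a minimal sketch of paging, assuming the standard maxResults and pageToken parameters of threads.list:

function listPage(pageToken) {
  var params = {userId: myUserId, maxResults: 20};
  if (pageToken) params.pageToken = pageToken;
  gmail.api.users.threads.list(params).execute(function(response) {
    (response.threads || []).forEach(function(t) {
      // fetch/queue individual threads here, e.g. with the concurrency limiter above
    });
    if (response.nextPageToken) {
      listPage(response.nextPageToken); // continue with the next page when ready
    }
  });
}

listPage();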
var sendBuffer = new ArrayBuffer(4096);
var dv = new DataView(sendBuffer);
dv.setInt32(0, 1234);
var service = svcName;
for (var i = 0; i < service.length; i++) {
  dv.setUint8(i + 4, service.charCodeAt(i));
}
ws.send(sendBuffer);
How can I do this without using a for loop? The for loop decreases performance when working with huge amounts of data.
Based on the comments, your real problem is that the loop will make your UI block.
The split answer above does not give you a proper way to prevent blocking. Everything done in the JavaScript main thread will block the UI.
You need to use Web Workers (separate threads) for processing your data, so the processing does not block the UI thread:
http://updates.html5rocks.com/2011/09/Workers-ArrayBuffer
You post the data to the separate worker for processing using postMessage() and then post the resulting data back to the main thread using another postMessage().
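A minimal sketch of that setup (the worker file name and message shape are my own; the ArrayBuffer is passed as a transferable to avoid copying):

// main thread (sketch)
var worker = new Worker('encode-worker.js');

worker.onmessage = function(e) {
  ws.send(e.data); // the filled ArrayBuffer comes back from the worker
};

// transfer ownership of the buffer to the worker (zero-copy)
worker.postMessage({buffer: sendBuffer, service: svcName}, [sendBuffer]);

// encode-worker.js (sketch)
onmessage = function(e) {
  var dv = new DataView(e.data.buffer);
  dv.setInt32(0, 1234);
  var service = e.data.service;
  for (var i = 0; i < service.length; i++) {
    dv.setUint8(i + 4, service.charCodeAt(i));
  }
  postMessage(e.data.buffer, [e.data.buffer]); // transfer it back
};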
In javascript, for() loops are very tight and efficient relative to other operations. Doing this sequentially on every permutation (i.e. getting rid of the for() loop) would be inelegant and also not save you very many cycles.
If an operation is likely to cause a client to grind to a halt, you need to split the problem into smaller components and give a warning to the user that performing the operation will take some time.
I would recommend splitting this operation into smaller chunks instead of trying to find another algorithm that doesn't use for().
Perhaps like this, using callbacks to break the work into stages:
var split = service.length / 4;

function alpha(split, position, callback) {
  for (var i = split * (position - 1); i < split * position; i++) {
    dv.setUint8(i + 4, service.charCodeAt(i));
  }
  if (callback && (typeof callback == 'function')) {
    callback();
  }
}

alpha(split, 1, function() {
  // poll here for other information or to confirm user wishes to proceed
  alpha(split, 2, function() {
    // poll here for other information or to confirm user wishes to proceed
    alpha(split, 3, function() {
      // poll here for other information or to confirm user wishes to proceed
      alpha(split, 4);
    });
  });
});
This is enormously simplified and not the best way to implement this solution, but it gives you a chance to optimize the processing going on and prioritize the operations relative to other ops.
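One thing to keep in mind (my own note): nested callbacks by themselves still run synchronously, so to actually let the UI breathe between chunks each stage has to yield back to the event loop, e.g. with setTimeout. A sketch:

function encodeChunked(dv, service, chunkSize, done) {
  var i = 0;
  function nextChunk() {
    var end = Math.min(i + chunkSize, service.length);
    for (; i < end; i++) {
      dv.setUint8(i + 4, service.charCodeAt(i));
    }
    if (i < service.length) {
      setTimeout(nextChunk, 0); // let pending UI events run before continuing
    } else {
      done();
    }
  }
  nextChunk();
}

// usage: encodeChunked(dv, service, 1024, function() { ws.send(sendBuffer); });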