Currently I'm using promises to try to prevent the need for nested callbacks in my code, but I've hit a setback. In this case, I'm using Node's request-promise and cheerio to emulate jQuery on the server. However, at some point I need to call jQuery.each() to create a request for each <a> element. Is there any way I can use promises to prevent this nested callback?
request("http://url.com").then(function (html) {
var $ = cheerio.load(html);
var rows = $("tr.class a");
rows.each(function (index, el) {
//Iterate over all <a> elements, and send a request for each one.
//Can this code be modified to return a promise?
//Is there another way to prevent this from being nested?
request($(el).attr("href")).then(function (html) {
var $ = cheerio.load(html);
var url = $("td>img").attr("src");
return request(url);
})
.then(function (img) {
//Save the image to the database
});
});
});
Assuming Bluebird promises (code in other libraries is similar):
Promise.resolve(request("http://url.com").then(function (html) {
    return cheerio.load(html)("tr.class a").toArray(); // return the rows so .map has an array to work on
})).map(function (el) { // map is `then` over an array
    return el.attribs.href; // raw cheerio elements keep attributes in .attribs
}).map(request).map(function (html) {
    return cheerio.load(html)("td>img").attr("src");
}).map(request).map(function (img) {
    // save to database.
});
Alternatively, you can define actions for a single link and then process those. It would look similar.
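For instance, a minimal sketch of that per-link variant, reusing the Bluebird/request-promise setup and the selectors from the question (processLink is a name introduced here, not from the original code):

function processLink(el) {
    // fetch the linked page, pull out the image URL, then fetch the image
    return request(el.attribs.href).then(function (html) {
        return cheerio.load(html)("td>img").attr("src");
    }).then(request).then(function (img) {
        // save the image to the database
    });
}

Promise.resolve(request("http://url.com")).then(function (html) {
    return cheerio.load(html)("tr.class a").toArray();
}).map(processLink);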
This is the best solution I arrived at in the end. One incidental change I made was using url.resolve so that relative URLs work.
var $ = require('cheerio');
var request = require('request-promise');
var url = require('url');

var baseURL = "http://url.com";

request(baseURL).then(function (html) {
    return $("tr.class a", html).toArray(); // return the links so the next .map receives them
}).map(function (el) {
    return request(url.resolve(baseURL, $(el).attr("href")));
}).map(function (html) {
    var src = $("td>img", html).attr("src");
    return request(url.resolve(baseURL, src));
}).map(function (img) {
    // Save the image to the database
});
Thanks to Benjamin Gruenbaum for alerting me to the .map() method in Bluebird.
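For reference, url.resolve handles both absolute and relative hrefs (the paths here are made up for illustration):

var url = require('url');
url.resolve("http://url.com/list/page1", "/img/a.png"); // "http://url.com/img/a.png"
url.resolve("http://url.com/list/page1", "details/2");  // "http://url.com/list/details/2"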
Related
Specifically, given a list of data, I want to loop over that list and do a fetch for each element of that data before I combine it all afterward. The thing is, as written, the code iterates through the entire list immediately, starting all the operations at once. Then, even though the fetch operations are still running, the then call I have after all of that runs before the data has been processed.
I read something about putting all the Promises in an array, then passing that array to a Promise.all() call, followed by a then that will have access to all that processed data as intended, but I'm not sure how exactly to go about doing it in this case, since I have nested Promises in this for loop.
for (var i in repoData) {
    var repoName = repoData[i].name;
    var repoUrl = repoData[i].url;
    (function (name, url) {
        Promise.all([
            fetch(`https://api.github.com/repos/${username}/${repoData[i].name}/commits`),
            fetch(`https://api.github.com/repos/${username}/${repoData[i].name}/pulls`)
        ]).then(function (results) {
            Promise.all([results[0].json(), results[1].json()])
                .then(function (json) {
                    //console.log(json[0]);
                    var commits = json[0];
                    var pulls = json[1];
                    var repo = {};
                    repo.name = name;
                    repo.url = url;
                    repo.commitCount = commits.length;
                    repo.pullRequestCount = pulls.length;
                    console.log(repo);
                    user.repositories.push(repo);
                });
        });
    })(repoName, repoUrl);
}
}).then(function () {
    var payload = new Object();
    payload.user = user;
    //console.log(payload);
    //console.log(repoData[0]);
    res.send(payload);
});
Generally when you need to run asynchronous operations for all of the items in an array, the answer is to use Promise.all(arr.map(...)) and this case appears to be no exception.
Also remember that you need to return values in your then callbacks in order to pass values on to the next then (or to the Promise.all aggregating everything).
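A minimal sketch of the difference:

fetch(someUrl)
    .then(function (res) {
        return res.json(); // the return passes the parsed body to the next then
    })
    .then(function (data) {
        // without the return above, data would be undefined here
        console.log(data);
    });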
When faced with a complex situation, it helps to break it down into smaller pieces. In this case, you can isolate the code to query data for a single repo into its own function. Once you've done that, the code to query data for all of them boils down to:
Promise.all(repoData.map(function (repoItem) {
    return getDataForRepo(username, repoItem);
}))
Please try the following:
// function to query details for a single repo
function getDataForRepo(username, repoInfo) {
    return Promise
        .all([
            fetch(`https://api.github.com/repos/${username}/${repoInfo.name}/commits`),
            fetch(`https://api.github.com/repos/${username}/${repoInfo.name}/pulls`)
        ])
        .then(function (results) {
            return Promise.all([results[0].json(), results[1].json()]);
        })
        .then(function (json) {
            var commits = json[0];
            var pulls = json[1];
            var repo = {
                name: repoInfo.name,
                url: repoInfo.url,
                commitCount: commits.length,
                pullRequestCount: pulls.length
            };
            console.log(repo);
            return repo;
        });
}
Promise.all(repoData.map(function (repoItem) {
    return getDataForRepo(username, repoItem);
})).then(function (retrievedRepoData) {
    console.log(retrievedRepoData);
    var payload = new Object();
    payload.user = user;
    //console.log(payload);
    //console.log(repoData[0]);
    res.send(payload);
});
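Note that this refactoring returns each repo object from getDataForRepo rather than pushing it onto user.repositories as the original code did; if the payload should still include them, assign them in the final then (for example, user.repositories = retrievedRepoData;) before res.send(payload).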
I want to make a request to a website, get its HTML, and give it to cheerio. I need to get the "href" attribute of every element with class ".thumb". I'm logging the results to the console and I just get undefined many times, once for each element found, I assume. I get undefined when trying to loop through any other elements by tag or identifier too, but if I don't loop and just grab the first one, the correct value is given.
function firstReq() {
    req(url, (err, res, html) => {
        if (err) { console.error(err); }
        var $ = cheerio.load(html);
        var arr = [];
        $("a").each(() => {
            console.log($(this).attr("href"));
        });
    });
}
I tried console.log(html) to check that the document was alright and it is. I also tried setting setTimeout on the iteration, maybe to give "request" and "cheerio" time to load the file, and still the same. I tried first downloading the html file from the url to my computer (outside of function, before call) and then passing it to cheerio, and still undefined.
It's my first personal project with Node and I'm very confused. Any help is appreciated.
You can fix this in two ways. The root cause is that an arrow function doesn't get its own this binding, so $(this) inside your .each callback isn't the current element.
Arrow function (use the element parameter instead of this):
function firstReq() {
    req(url, (err, res, html) => {
        if (err) { console.error(err); }
        var $ = cheerio.load(html);
        var arr = [];
        $("a").each((i, elem) => {
            console.log($(elem).attr("href"));
        });
    });
}
Or a regular function, which does bind this to the current element:
function firstReq() {
    req(url, (err, res, html) => {
        if (err) { console.error(err); }
        var $ = cheerio.load(html);
        var arr = [];
        $("a").each(function () {
            console.log($(this).attr("href"));
        });
    });
}
The same fix, applied to the request callback directly:
request(url, (err, res, html) => {
    if (err) { console.error(err); }
    var $ = cheerio.load(html);
    var arr = [];
    $("a").each((index, a) => {
        console.log($(a).attr("href"));
    });
});
So I'm making a little scraper for learning purposes; in the end, I should get a tree-like structure of the pages on the website.
I've been banging my head trying to get the requests right. This is more or less what I have:
var request = require('request');

function scanPage(url) {
    // request the page at given url:
    request.get(url, function (err, res, body) {
        var pageObject = {};
        /* [... jQuery mumbo-jumbo to
           1. Fill the page object with information and
           2. Get the links on that page and store them into arrayOfLinks
        */
        var arrayOfLinks = ['url1', 'url2', 'url3'];
        for (var i = 0; i < arrayOfLinks.length; i++) {
            pageObj[arrayOfLinks[i]] = scanPage[arrayOfLinks[i]];
        }
    });
    return pageObj;
}
I know this code is wrong on many levels, but it should give you an idea of what I'm trying to do.
How should I modify it to make it work? (without the use of promises if possible)
(You can assume that the website has a tree-like structure, so every page only has links to pages further down the tree, hence the recursive approach.)
I know that you'd rather not use promises for whatever reason (and I can't ask why in the comments because I'm new), but I believe that promises are the best way to achieve this.
Here's a solution using promises that answers your question, but might not be exactly what you need:
var request = require('request');
var Promise = require('bluebird');
var get = Promise.promisify(request.get);

var maxConnections = 1; // maximum number of concurrent connections

function scanPage(url) {
    // request the page at given url:
    return get(url).then((res) => {
        var body = res.body;
        /* [... jQuery mumbo-jumbo to
           1. Fill the page object with information and
           2. Get the links on that page and store them into arrayOfLinks
        */
        var arrayOfLinks = ['url1', 'url2', 'url3'];
        return Promise.map(arrayOfLinks, scanPage, { concurrency: maxConnections })
            .then(results => {
                var res = {};
                for (var i = 0; i < results.length; i++)
                    res[arrayOfLinks[i]] = results[i];
                return res;
            });
    });
}

scanPage("http://example.com/").then((res) => {
    // do whatever with res
});
Edit: Thanks to Bergi's comment, rewrote the code to avoid the Promise constructor antipattern.
Edit: Rewrote in a much better way. By using Bluebird's concurrency option, you can easily limit the number of simultaneous connections.
I have two main loops, one for posts and one for comments. However, the comments don't display, presumably because the post ID is not in the DOM yet (?).
$.getJSON('fresh_posts.php', function (data) {
    // posts
    $.each(data.freshposts, function (id, post) {
        // set variables and append divs to document
        var id = post.id;
        ...
    });
    // comments attached to each post
    $.each(data.freshcomments, function (id, commentList) {
        $.each(commentList, function (index, c) {
            // set variables and append comments to each post div
            var postid = c.postid; // this is the same as post.id (linked)
            ...
            var full = "<div> ... </div>";
            // comment-block+postid is attached to each post div, so it tells
            // the comment which div it should be appended to.
            $('#comment-block' + postid).append(full);
        });
    });
});
Does not display comments ^
If I wrap the $.each loop for the comments in a setTimeout(function(){}, 1), the comments do display - I suppose it needs to wait 1 millisecond before the loop can commence? However, this doesn't seem like a good or fool-proof way to ensure it.
setTimeout(function () {
    $.each(data.freshcomments, function (id, commentList) {
        ...
    });
}, 1);
Displays comments ^
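(In hindsight, the 1 ms timeout only defers the comment loop until after the current call stack; since the post divs are appended from asynchronous handlers, this was a race rather than a guarantee, which is why I kept looking for a proper fix.)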
I managed to solve this after a few hours of work.
I made two functions, getPosts() and appendComments(). I used the promise method:
function getPosts() {
    var deferred = new $.Deferred(); // new deferred
    $.getJSON('fresh_posts.php', function (data) {
        global_save_json = data.freshcomments;
        var howManyPosts = Object.keys(data.freshposts).length; // how many posts there are (object length)
        var arrayCount = 0;
        $.each(data.freshposts, function (id, post) {
            // ... (closing tags for the above omitted)
    return deferred.promise();
}
Within getPosts I had an async operation (I had just learned what that was) to check whether the image was designed for retina displays.
if (retina === "true") {
var img = new Image();
img.onload = function() {
var retina_width = this.width/2;
var post = '...';
$('.main').append(post);
arrayCount++; // for each post add +1 to arrayCount
if (arrayCount == howManyPosts) { // if last post
deferred.resolve();
}
}
img.src = image;
} else {
...
I then did
getPosts().then(appendComments);
after the closing tag of the function getPosts. So basically, even though getPosts returns immediately, the promise it returns doesn't resolve until the async work inside it finishes, and only then is appendComments called. That way the post IDs already exist in the document for the comments to be appended to.
appendComments looks like this:
function appendComments() {
    if (global_save_json != null) {
        $.each(global_save_json, function (id, commentList) {
            $.each(commentList, function (index, c) {
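The full functions are elided above, but condensed, the pattern looks roughly like this (renderPost is a hypothetical stand-in for the retina/image-loading logic):

function getPosts() {
    var deferred = new $.Deferred();
    $.getJSON('fresh_posts.php', function (data) {
        global_save_json = data.freshcomments;
        var remaining = Object.keys(data.freshposts).length;
        $.each(data.freshposts, function (id, post) {
            renderPost(post, function () { // hypothetical async append (e.g. waits for img.onload)
                if (--remaining === 0) {
                    deferred.resolve(); // all post divs are now in the DOM
                }
            });
        });
    });
    return deferred.promise();
}

getPosts().then(appendComments);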
I'm looking for some function that would let me iterate through the div elements scraped by PhantomJS (which uses jQuery-like syntax), but one by one - not all at the same time like .each seems to be doing. So I guess I need it to run synchronously.
At the moment my code looks something like this
page.open("https://www.google.com" + expandedurl, function (status) {
console.log("opened google knowledge graph ", status);
page.evaluate(function () { return document.body.innerHTML; }, function (result) {
var $ = cheerio.load(result);
$(".kltat").each(function() {
var link = $(this);
var text = link.text();
launch(text);
});
ph.exit();
// Move on to the next one
});
});
I need something that would not launch all of the each iterations at the same time. Maybe there's some way of iterating that doesn't work asynchronously - that's what I need...
If launch is asynchronous and can take a completion callback, use the async library for this:
var async = require('async');

var $ = cheerio.load(result);
var callbacks = [];
$(".kltat").each(function () {
    var link = $(this);
    var text = link.text();
    callbacks.push(function (cb) {
        launch(text, cb);
    });
});
async.series(callbacks, function () {
    ph.exit();
});
Otherwise, you can either wait a fixed amount of time after each launch:
callbacks.push(function (cb) {
    launch(text);
    setTimeout(function () {
        cb(null);
    }, 1000); // assume launch has finished after a fixed delay (here one second)
});
or use something like waitFor to wait for an external condition triggered through launch.
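For completeness, a minimal sketch of such a waitFor helper, assuming launch sets some flag when it finishes (launchDone here is a hypothetical flag, not part of the original code):

function waitFor(test, onReady, timeoutMs) {
    var start = Date.now();
    (function poll() {
        if (test()) {
            onReady(null); // condition met
        } else if (Date.now() - start < (timeoutMs || 5000)) {
            setTimeout(poll, 100); // check again shortly
        } else {
            onReady(new Error("waitFor timed out"));
        }
    })();
}

callbacks.push(function (cb) {
    launchDone = false; // hypothetical flag that launch is assumed to set when done
    launch(text);
    waitFor(function () { return launchDone; }, cb);
});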