I'm eventually trying to produce a JSON file that contains all of the reviews from a Google Maps place, but I can only get one (the latest) review to output...
Can anyone help me turn this into an array so that I get all the reviews?
const puppeteer = require('puppeteer');
let scrape = async () => {
  const browser = await puppeteer.launch({args: ['--no-sandbox', '--disable-setuid-sandbox']});
  const page = await browser.newPage();
  await page.goto('https://www.google.com/maps/place/Microsoft/#36.1275216,-115.1728651,17z/data=!3m2!4b1!5s0x80c8c416a26be787:0x4392ab27a0ae83e0!4m7!3m6!1s0x80c8c4141f4642c5:0x764c3f951cfc6355!8m2!3d36.1275216!4d-115.1706764!9m1!1b1');
  await page.waitFor(1000);
  const result = await page.evaluate(async () => {
    let fullName = document.querySelector('.section-review-title').innerText;
    let postedDate = document.querySelector('.section-review-publish-date').innerText;
    let starRating = document.querySelector('.section-review-stars').getAttribute("aria-label");
    let review = document.querySelector('.section-review-text').innerText;
    return {
      fullName,
      postedDate,
      starRating,
      review
    }
  });
  browser.close();
  return result;
};
scrape().then((value) => {
  console.log(value); // Success!
});
Thank you!
In general, document.querySelectorAll gives you all matching elements, not just the first.
Specific to your use case, what you want to do is get a handle on ALL reviews first (before processing them).
I checked the URL you provided and would start this way (Puppeteer style):
page.$$('.section-review-content') returns a promise that resolves to an array with all reviews as ElementHandles, so await it to get the array.
Then you loop through the array and operate on every ElementHandle like this:
await ElementHandle.$eval('.section-review-title', el => el.innerText)
So, for example, inside your scrape function you would have (I shortened your scenario a little):
...
await page.goto('https://www.google.com/maps/place/Microsoft/#36.1275216,-115.1728651,17z/data=!3m2!4b1!5s0x80c8c416a26be787:0x4392ab27a0ae83e0!4m7!3m6!1s0x80c8c4141f4642c5:0x764c3f951cfc6355!8m2!3d36.1275216!4d-115.1706764!9m1!1b1');
await page.waitFor(1000);
const reviews = await page.$$(".section-review-content");
for (const review of reviews) {
  const reviewTitle = await review.$eval(
    ".section-review-title",
    div => div.innerText
  );
  console.log('\n' + reviewTitle);
}
...
Check out the Puppeteer API docs for how page.$$ works.
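If the goal is still a single JSON array with the full review objects, the same pattern extends naturally inside your scrape function. A minimal sketch, assuming the class names from your question (Google Maps changes these periodically, so they may need updating):

const reviews = await page.$$(".section-review-content");
const results = [];
for (const review of reviews) {
  // $eval throws if a selector is missing, so this assumes every review card has all four parts
  const fullName = await review.$eval(".section-review-title", el => el.innerText);
  const postedDate = await review.$eval(".section-review-publish-date", el => el.innerText);
  const starRating = await review.$eval(".section-review-stars", el => el.getAttribute("aria-label"));
  const text = await review.$eval(".section-review-text", el => el.innerText);
  results.push({ fullName, postedDate, starRating, review: text });
}
return results; // ready to JSON.stringify

Equivalently, you could keep your original page.evaluate and switch document.querySelector to document.querySelectorAll, mapping over the NodeList instead.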
I'm trying to write a webscraper tool that returns the url of the first result from a search based on some input. Here is the test.js file I'm using to try and test the webscraper:
const BrowserTool = async(props, websiteNum) => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(props.websites[websiteNum]);
  await page.setViewport({width: 1080, height: 1024});
  let ingredients = "";
  for (var ingredient in props.ingredients) {
    ingredients += '${ingredient} ' ;
  }
  await page.type('#typeaheadinput', '${ingredients}');
  await page.keyboard.press('Enter');
  const searchResultSelector = '#mod-site-search-results-1';
  await page.waitForSelector(searchResultSelector);
  await page.click(searchResultSelector);
  const url = page.url();
  await browser.close();
  return(url);
};
export default BrowserTool;
let object = {ingredients: ["chicken breast"], websites: ["https://www.foodnetwork.com/"]};
let returnString = BrowserTool(object, 0);
console.log(returnString);
I originally didn't have the await page.keyboard.press('Enter'); line and figured that might be the issue, but there's still nothing printing to the console. I also tried switching const url = page.url(); to const url = await page.evaluate(() => document.location.href);, and that also didn't work.
A few primary issues:
You're not awaiting the promise returned by BrowserTool.
const url = page.url(); happens immediately after await page.click(searchResultSelector); without necessarily giving the navigation a chance to resolve. You probably want to wait for the navigation, as follows:
await Promise.all([
  page.waitForNavigation(),
  page.click(searchResultSelector),
]);
const url = page.url();
There are other issues and antipatterns to consider:
Instead of navigating to a page, typing into an input and pressing a button, it's easier and more reliable to navigate directly to the search results page using the URL.
Instead of clicking a link, waiting for the page to load, then getting its URL, you can probably grab the URL from the link without clicking it. This assumes no redirects, as appears to be the case here.
Avoid passing an array index into a function; do the indexing in the caller.
Functions should be camelCased actions, not nouns (you can omit get from the start of the function, though). Noun functions are for class constructors.
Avoid vague names like object and returnString.
Put browser.close() in a finally block so it runs even if there's an error.
Here's more or less how I'd approach this:
const puppeteer = require("puppeteer"); // ^19.6.3

const firstRecipeResultURL = async searchTerm => {
  let browser;
  try {
    browser = await puppeteer.launch();
    const [page] = await browser.pages();
    const url =
      `https://www.foodnetwork.com/search/${searchTerm.replace(/\s/g, "-")}-`;
    await page.goto(url, {waitUntil: "domcontentloaded"});
    const sel = ".o-RecipeResult .m-MediaBlock__a-Headline a";
    const el = await page.waitForSelector(sel);
    return await el.evaluate(el => el.getAttribute("href"));
  }
  finally {
    await browser?.close();
  }
};

firstRecipeResultURL("chicken breast")
  .then(url => console.log(url))
  .catch(err => console.error(err));
Making a browser is expensive, so depending on your needs, you might want to create and cache a single instance for the duration of your app.
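If you do go that route, here's a minimal sketch of the caching idea (the module layout is just illustrative, not part of the answer above):

// browser.js - lazily launch one shared browser and reuse it
const puppeteer = require("puppeteer");

let browserPromise;

const getBrowser = () => {
  // Launch on first use only; later callers get the same instance
  if (!browserPromise) {
    browserPromise = puppeteer.launch();
  }
  return browserPromise;
};

const closeBrowser = async () => {
  if (browserPromise) {
    const browser = await browserPromise;
    await browser.close();
    browserPromise = undefined;
  }
};

module.exports = { getBrowser, closeBrowser };

Each scraping function would then await getBrowser(), open and close its own page, and only call closeBrowser() when the app shuts down.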
I am new to JS/React API calls and am trying to use two APIs: one to fetch a random movie
https://k2maan-moviehut.herokuapp.com/api/random
and the second to get the movie details
http://www.omdbapi.com/
where the t parameter in the second is the movie name. Here is my code,
in my main component:
const getMovies = async () => {
  let movies = await fetchMovies();
  console.log(movies);
  return movies;
}
const movies = getMovies();
and I call these functions from another file:
const getMovie = async () => {
  const omdpURL = "http://www.omdbapi.com/?i=tt3896198&apikey=xxxxxx&t=";
  let moviename = (await (await fetch("https://k2maan-moviehut.herokuapp.com/api/random")).json()).name
  let movie = await (await fetch(omdpURL.concat(moviename))).json();
  return movie;
}
export const fetchMovies = async() => {
  let values = [];
  for (let i = 0; i < 12; i++) {
    let movie = await getMovie();
    values.push(movie)
  }
  return values;
}
The problem is that when I try to see my movies in the main component, it returns a {Promise fulfilled: Array(12)}, while if I log the movies inside getMovies it gives me the result I want, which is the 12 movies I requested. How can I get the results I need?
Here's a basic setup to help you get started:
const getMovie = async () => {
  const randomMovieURL = "https://k2maan-moviehut.herokuapp.com/api/random";
  const omdpURL = "https://www.omdbapi.com/?i=tt3896198&apikey=apiKey&t=";
  // Fetch the first URL to get a random movie
  const result = await fetch(randomMovieURL);
  // Parse the JSON data coming back from this response
  const movie = await result.json();
  // Fetch OMDB using the random movie's name
  const result2 = await fetch(omdpURL + movie.name);
  // Once again, parse the JSON from the response we got back
  const movieData = await result2.json();
  // Return the JSON data which contains the movie details
  return movieData;
}

const fetchMovies = async () => {
  const requests = [];
  for (let i = 0; i < 12; i++) {
    // Fill the requests array with the asynchronous operations
    requests.push(getMovie())
  }
  // Run all 12 async operations in parallel to speed things up
  const movies = await Promise.all(requests);
  // Return the results:
  return movies;
}

// fetchMovies() is async, so we need to await the result inside
// another async function:
async function main(){
  const movies = await fetchMovies();
  console.log(movies);
}

main();

// Alternative syntax using then():
fetchMovies().then( movies => console.log( movies ) );
Updated:
Keep in mind that the correct way to loop over a list of async calls in parallel and get the results is through Promise.all.
Using a normal for loop and await (as in your case) will result in running all 12 calls sequentially, which is much slower.
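If you want to see the difference concretely, here is a tiny self-contained sketch, using a fake 1-second request instead of the real APIs:

// Simulate a request that takes about one second
const fakeRequest = () => new Promise(resolve => setTimeout(resolve, 1000));

const sequential = async () => {
  console.time("sequential");
  for (let i = 0; i < 12; i++) await fakeRequest(); // one after another
  console.timeEnd("sequential"); // roughly 12 seconds
};

const parallel = async () => {
  console.time("parallel");
  await Promise.all(Array.from({ length: 12 }, fakeRequest)); // all at once
  console.timeEnd("parallel"); // roughly 1 second
};

sequential().then(parallel);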
Regarding the question:
"the problem is that when I try to see my movies in main component it return a {Promise fulfilled Array(12)} while if I logged the movies while I call getMovies it gives me the result I want which is the 12 movie I called , how can I have the results I need"
The reason you see a Promise fulfilled Array(12) in the main component is that the getMovies function is async and therefore always returns a Promise, not a plain value.
In order to see the value, you will need to either use await before the function call (const movies = await getMovies()) or use then (getMovies().then( movies => console.log(movies) )).
This is also why you see the actual value in the console.log inside getMovies: there you are using await before the async function fetchMovies.
Lesson of the day
You have just stumbled upon one of the trickiest and hardest parts of JavaScript: asynchronous programming using Promises.
Make sure to go through the following MDN resources to get a comprehensive and solid foundation of Promises:
Using Promises
async function
await expression
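As a tiny illustration of the core lesson (an async function always hands back a Promise, never the raw value):

const answer = async () => 42;          // implicitly returns Promise.resolve(42)

console.log(answer());                  // Promise { 42 } (a Promise, not the number)
answer().then(v => console.log(v));     // 42

(async () => {
  console.log(await answer());          // 42
})();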
I am not sure why you nest your await fetch calls. If you do it like this, it works:
const getMovies = async () => {
  const omdpURL = "http://www.omdbapi.com/?i=tt3896198&apikey=myApiKey&t=";
  const res = await fetch(
    "https://k2maan-moviehut.herokuapp.com/api/random"
  );
  var data = await res.json();
  const movie = await fetch(omdpURL + data.name);
  var moviedata = await movie.json();
  console.log(moviedata);
};
getMovies();
Excuse me, but the documentation is a little incomprehensible to me.
I use this call:
const myDiv = await page.$$eval(".myDiv", myDiv => myDiv.textContent);
but console.log only returns one result, while there are more than 10 matches for this div.
How do I display them all?
Edit: this is the code I'm learning from:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('mypage');
  // await page.screenshot({path: 'example.png'});
  await page.waitForSelector(".myDiv");
  const myDiv = await page.$eval(".myDiv", myDiv => myDiv.textContent);
  console.log(myDiv);
  await browser.close();
})();
You can use page.evaluate:
const myDiv = await page.evaluate(() => {
  const divs = Array.from(document.querySelectorAll('.myDiv'))
  return divs.map(d => d.textContent)
});
The function passed to page.evaluate will be serialized and sent to the browser, so it is executed in the browser context (not Node).
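Because of that serialization, the callback can't see variables from your Node scope; anything it needs has to be passed as an extra argument to page.evaluate. A small sketch (the selector is just an example):

const selector = '.myDiv'; // defined in Node
const texts = await page.evaluate(sel => {
  // Runs in the browser; `sel` arrives as a serialized argument
  return Array.from(document.querySelectorAll(sel), el => el.textContent);
}, selector);
console.log(texts); // back in Node as a plain array of strings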
Since you did not provide more code, this answer is pretty opinionated and maybe doesn't solve your problem. But it shows you a way to understand what is happening.
Especially during development, it's very helpful to use a combination of page.exposeFunction() and page.evaluate() to see what is going on in the browser and also in Node/Puppeteer.
Here is a draft which I hope helps you understand.
const puppeteer = require('puppeteer');

(async () => {
  function executedInNodeContext(result) {
    // This prints in the Node console
    console.log(result);
  }
  function executedInBrowserContext() {
    console.log("I'm in the browser");
    // Arguments passed back to Node must be serializable, so send the text, not the DOM nodes
    const myDivTexts = [...document.querySelectorAll('.myDiv')].map(div => div.textContent);
    window.nameOfNodeFunction(myDivTexts);
  }
  // See the browser
  const browser = await puppeteer.launch({ headless: false });
  // Create a new page
  const page = await browser.newPage();
  // Callback in Node context
  await page.exposeFunction('nameOfNodeFunction', executedInNodeContext);
  // Send puppeteer to a URL
  await page.goto('http://localhost/awesome/URL');
  // Function executed in the browser on the given page
  await page.evaluate(executedInBrowserContext);
})();
page.$$eval() passes an array of matched elements to its callback, so you need something like this to get every element's data:
const myDivs = await page.$$eval(".myDiv", divs => divs.map(div => div.textContent));
SITUATION:
I am currently learning to scrape using puppeteer.
For some reason, my current code gives me this error:
"UnhandledPromiseRejectionWarning: Error: Evaluation failed: ReferenceError: page is not defined"
(EDIT)
The issue is that while the page loads and each item is clicked, the data is not scraped because the code does not seem to wait for it to load after each item is clicked.
Here is what the code should do:
Load web page (OK)
Click on each item (OK)
Every time an item is clicked, some data is loaded in a div on the left, this is the data I want to scrape. (does not currently happen)
To achieve that, I make the code wait 2 seconds after a click to let the data load. (does not currently happen)
QUESTION:
How can I fix this and appropriately scrape said data ?
CODE:
const puppeteer = require('puppeteer');
let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://website.com');
  await page.setViewport({width: ..., height: ...});
  const result = await page.evaluate(() => {
    let data = [];
    let elements = document.querySelector('.class1').querySelectorAll('.class2');
    for (var element of elements){
      page.click(element);
      page.waitFor(2000);
      let 1 = document.querySelector('.class0').querySelector('.class3').getAttribute("data-1");
      let 2 = document.querySelector('.class0').querySelector('.class4').innerText;
      let 3 = document.querySelector('.class0').querySelector('.class5').innerText;
      let 4 = document.querySelector('.class0').querySelector('.class6').innerText;
      data.push({1: 1, 2: 2, 3: 3, 4: 4}); // Push an object with the data onto our array
    }
    return data; // Return our data array
  });
  browser.close();
  return result; // Return the data
};
scrape().then((value) => {
  console.log(value); // Success!
});
There are a few issues with this code:
1, 2, etc. are not valid identifiers (I’m guessing this is just for the example, though)
.click() and .waitFor() would return promises, which you don’t wait for, but in any case…
the function you pass to evaluate is evaluated in the context of the page, not your Node.js code, so page doesn't exist
Instead, you can interact with the page directly in the function, as you do already:
const puppeteer = require('puppeteer');
let scrape = async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://website.com');
  await page.setViewport({ width: ..., height: ... });
  const result = await page.evaluate(async () => {
    const data = [];
    const elements = document.querySelector('.class1').querySelectorAll('.class2');
    for (const element of elements) {
      element.click();
      await new Promise((resolve) => setTimeout(resolve, 2000));
      const one = document.querySelector('.class0').querySelector('.class3').getAttribute("data-1");
      const two = document.querySelector('.class0').querySelector('.class4').innerText;
      const three = document.querySelector('.class0').querySelector('.class5').innerText;
      const four = document.querySelector('.class0').querySelector('.class6').innerText;
      data.push({ one, two, three, four }); // Push an object with the data onto our array
    }
    return data; // Return our data array
  });
  browser.close();
  return result; // Return the data
};
scrape().then((value) => {
  console.log(value); // Success!
});
Try changing this line:
await page.goto('https://website.com');
to:
await page.goto('https://website.com', { waitUntil: 'networkidle0' })
I can change a property and trigger a re-render of an element on this page via the console.
Open the JBrowse demo site via this link, then read on...
In the console I can run the following to update one element (in reality I'd do them all):
document.querySelectorAll('.track_jbrowse_view_track_alignments2')[0].track.displayMode = 'compact'
document.querySelectorAll('.track_jbrowse_view_track_alignments2')[0].track.layout = null
document.querySelectorAll('.track_jbrowse_view_track_alignments2')[0].track.redraw()
I'm attempting to perform this in the puppeteer code with:
const tracks = await page.$$('.track_jbrowse_view_track_alignments2');
for (let t of tracks) {
  await page.evaluate(t => {
    t.displayMode = 'compact';
    t.layout = null;
    t.redraw();
  }, t);
}
The existing functional script is under this link; the above snippet would be inserted immediately after the highlighted line.
Any guidance would be great, thanks.
You should wait until the page's JS is fully ready. It's up to you how you detect that; I just use a dummy await page.waitFor(10000);. Next, you should invoke your operations on the t.track object, not on t.
Here is a working example:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(
    'http://jbrowse.org/code/JBrowse-1.12.4/?loc=ctgA%3A16801..23754&tracks=volvox-sorted_bam&data=sample_data%2Fjson%2Fvolvox&nav=0&tracklist=0&fullviewlink=0&highlight='
  );
  await page.waitFor(10000);
  const tracks = await page.$$('.track_jbrowse_view_track_alignments2');
  for (let t of tracks) {
    await page.evaluate(t => {
      t.track.displayMode = 'compact';
      t.track.layout = null;
      t.track.redraw();
    }, t);
  }
  await page.screenshot({ path: 'image.png' });
  await browser.close();
})();
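If you'd rather not rely on a fixed 10-second pause, here is a sketch of two alternatives, assuming the track elements and their track objects only appear once JBrowse has finished initializing:

// Wait until the track elements exist in the DOM...
await page.waitForSelector('.track_jbrowse_view_track_alignments2');
// ...or go further and wait until JBrowse has attached its track object to them
await page.waitForFunction(() => {
  const el = document.querySelector('.track_jbrowse_view_track_alignments2');
  return el && el.track;
});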