I'm using phantom 6.0.3 to scrape a web page. Here is the initial setup:
(async function () {
const instance = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no', '--web-security=false'], {logLevel: 'error'});
const page = await instance.createPage();
await page.on('onResourceRequested', function (requestData) {
console.info('Requesting', requestData.url);
});
const url = // Some url
const status = await page.open(url);
const content = await page.evaluate(function () {
return document.querySelector('ul > li');
});
const contentLength = content.length // 5
//Code Block 2 goes here
})();
So far everything works fine. It was able to successfully determine that the length of the content is 5 (there are 5 li items). So what I want to do now is get the innerText of each of those li elements... and this is where I get my issue.
I've try using a for loop to retrieve the innerText of each li element, but it always returns null. Here's what I've tried:
//Code Block 2:
for (let i = 0; i < contentLength; i++) {
const info = await page.evaluate(function () {
const element = document.querySelector('ul > li');
return element[i].innerText;
});
console.log(info); // this returns null 5 times
}
I don't know what's going on. I can give a specific index to return, such as: return element[3].innerText, and this will give me the correct innerText, but I can't get this working via loop
PhantomJS evaluates the function in a different context so it's not aware of the parameter i.
You should pass i to the evaluate function in order to forward it to the browser process:
for (let i = 0; i < contentLength; i++) {
const info = await page.evaluate(function (index) { // notice index argument
const element = document.querySelector('ul > li');
return element[index].innerText;
}, i); // notice second argument is i
console.log(info);
}
Related
I am working on a simple restaurant web app which uses a mongo-db database to store the menu items. My issue is that I have a client js file that will use a routing function that then accesses the database to return all the menu items of a certain restaurant. My issue is that my endpoint for the url isn't being recognized:
Client.js
function readMenu(rest){
(async () => {
// const newURL = url + "/menus/"+rest
const resp = await fetch(url+"/menus/"+rest)
const j = await resp.json();
itemlist = j["items"]
var element = document.getElementById("menu")
var i;
for (i = 0; i < itemlist.length; i++) {
var para = document.createElement("p")
item = itemList[i]
text = item["name"]+" | "+item["cost"]+" | "+item["descr"] +"<br>";
var node = document.createTextNode(text)
para.appendChild(node)
element.appendChild(para)
}
})
}
Server-routing.ts (Routings):
this.router.get("/menus", this.getResturants.bind(this))
this.router.post("/menus", this.addResturaunt.bind(this))
this.router.get("/menus/:rest", this.getResturauntItems.bind(this))
this.router.delete("/menus/:rest",this.deleteResturaunt.bind(this))
this.router.get("/menus/:rest/:item",[this.errorHandler.bind(this),this.getItem.bind(this)])
this.router.post("/menus/:rest",this.addItem.bind(this))
this.router.delete("/menus/:rest/:item",this.deleteItem.bind(this))
Server-routing.ts (function):
public async getResturauntItems(request, response) : Promise<void> {
console.log("Getting Restaurant Items")
let rest = request.params.rest
let obj = await this.theDatabase.getResturauntItems(rest)
console.log(obj)
response.status(201).send(JSON.stringify(obj))
response.end()
}
So, what should happen is a button calls readMenu(), it then makes a GET fetch request to localhost:8080/api/menus/ and then the menu items from the collection should be returned. The issue is that when I click the button, nothing happens. I know it is not being redirected to some other function as they all have "console.log()" to keep track of them and none of them where called. I used the "inspect" tool to see if the request was being sent or received anywhere and nothing. I am unsure of what the issue happens to be. If anyone can help, it would be really appreciated.
you just never called your function, you declared the async function inside your function but never called it.
function readMenu(rest){
(async () => {
// const newURL = url + "/menus/"+rest
const resp = await fetch(url+"/menus/"+rest)
const j = await resp.json();
itemlist = j["items"]
var element = document.getElementById("menu")
var i;
for (i = 0; i < itemlist.length; i++) {
var para = document.createElement("p")
item = itemList[i]
text = item["name"]+" | "+item["cost"]+" | "+item["descr"] +"<br>";
var node = document.createTextNode(text)
para.appendChild(node)
element.appendChild(para)
}
})();
}
you need to add () after creating the functions to call it.
I am trying to scrape a one-page website. There are multiple selection combinations that would result in different search redirects. I wrote a for loop in the page.evaluate's call back function to click the different selections and did the click search in every button. However, I got error: Converting circular structure to JSON Are you passing a nested JSHandle?
Please help!
My current version of code looks like this:
const res = await page.evaluate(async (i, courseCountArr, page) => {
for (let j = 1; j < courseCountArr[i]; j++) {
await document.querySelectorAll('.btn-group > button, .bootstrap-select > button')['1'].click() // click on school drop down
await document.querySelectorAll('div.bs-container > div.dropdown-menu > ul > li > a')[`${j}`].click() // click on each school option
await document.querySelectorAll('.btn-group > button, .bootstrap-select > button')['2'].click() // click on subject drop down
const subjectLen = document.querySelectorAll('div.bs-container > div.dropdown-menu > ul > li > a').length // length of the subject drop down
for (let k = 1; k < subjectLen; k++) {
await document.querySelectorAll('div.bs-container > div.dropdown-menu > ul > li > a')[`${k}`].click() // click on each subject option
document.getElementById('buttonSearch').click() //click on search button
page.waitForSelector('.strong, .section-body')
return document.querySelectorAll('.strong, .section-body').length
}
}
}, i, courseCountArr, page);
Why the error happens
While you haven't shown enough code to reproduce the problem (is courseCountArr an array of ElementHandles? Passing page to evaluate won't work either, that's a Node object), here's a minimal reproduction that shows the likely pattern:
const puppeteer = require("puppeteer");
let browser;
(async () => {
const html = `<ul><li>a</li><li>b</li><li>c</li></ul>`;
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setContent(html);
// ...
const nestedHandle = await page.$$("li"); // $$ selects all matches
await page.evaluate(els => {}, nestedHandle); // throws
// ...
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
The output is
TypeError: Converting circular structure to JSON
--> starting at object with constructor 'BrowserContext'
| property '_browser' -> object with constructor 'Browser'
--- property '_defaultContext' closes the circle Are you passing a nested JSHandle?
at JSON.stringify (<anonymous>)
Why is this happening? All code inside of the callback to page.evaluate (and family: evaluateHandle, $eval, $$eval) is executed inside the browser console programmatically by Puppeteer. The browser console is a distinct environment from Node, where Puppeteer and the ElementHandles live. To bridge the inter-process gap, the callback to evaluate, parameters and return value are serialized and deserialized.
The consequence of this is that you can't access any Node state like you're attempting with page.waitForSelector('.strong, .section-body') inside the browser. page is in a totally different process from the browser. (As an aside, document.querySelectorAll is purely synchronous, so there's no point in awaiting it.)
Puppeteer ElementHandles are complex structures used to hook into the page's DOM that can't be serialized and passed to the page as you're trying to do. Puppeteer has to perform the translation under the hood. Any ElementHandles passed to evaluate (or have .evaluate() called on them) are followed to the DOM node in the browser that they represent, and that DOM node is what your evaluate's callback is invoked with. Puppeteer can't do this with nested ElementHandles, as of the time of writing.
Possible fixes
In the above code, if you change .$$ to .$, you'll retrieve only the first <li>. This singular, non-nested ElementHandle can be converted to an element:
// ...
const handle = await page.$("li");
const val = await page.evaluate(el => el.innerText, handle);
console.log(val); // => a
// ...
Or:
const handle = await page.$("li");
const val = await handle.evaluate(el => el.innerText);
console.log(val); // => a
Making this work on your example is a matter of either swapping the loop and the evaluate call so that you access courseCountArr[i] in Puppeteer land, unpacking the nested ElementHandles into separate parameters to evaluate, or moving most of your console browser calls to click on things back to Puppeteer (depending on your use case and goals with the code).
You could apply the evaluate call to each ElementHandle:
const nestedHandles = await page.$$("li");
for (const handle of nestedHandles) {
const val = await handle.evaluate(el => el.innerText);
console.log(val); // a b c
}
To get an array of results, you could do:
const nestedHandles = await page.$$("li");
const vals = await Promise.all(
nestedHandles.map(el => el.evaluate(el => el.innerText))
);
console.log(vals); // [ 'a', 'b', 'c' ]
You can also unpack the ElementHandles into arguments for evaluate and use the (...els) parameter list in the callback:
const nestedHandles = await page.$$("li");
const vals = await page.evaluate((...els) =>
els.map(e => e.innerText),
...nestedHandles
);
console.log(vals); // => [ 'a', 'b', 'c' ]
If you have other arguments in addition to the handles you can do:
const nestedHandle = await page.$$("li");
const vals = await page.evaluate((foo, bar, ...els) =>
els.map(e => e.innerText + foo + bar)
, 1, 2, ...nestedHandle);
console.log(vals); // => [ 'a12', 'b12', 'c12' ]
or:
const nestedHandle = await page.$$("li");
const vals = await page.evaluate(({foo, bar}, ...els) =>
els.map(e => e.innerText + foo + bar)
, {foo: 1, bar: 2}, ...nestedHandle);
console.log(vals); // => [ 'a12', 'b12', 'c12' ]
Another option may be to use $$eval, which selects multiple handles, then runs a callback in browser context with the array of selected elements as its parameter:
const vals = await page.$$eval("li", els =>
els.map(e => e.innerText)
);
console.log(vals); // => [ 'a', 'b', 'c' ]
This is probably cleanest if you're not doing anything else with the handles in Node.
Similarly, you can totally bypass Puppeteer and do the entire selection and manipulation in browser context:
const vals = await page.evaluate(() =>
[...document.querySelectorAll("li")].map(e => e.innerText)
);
console.log(vals); // => [ 'a', 'b', 'c' ]
(note that getting the inner text throughout is just a placeholder for whatever browser code of arbitrary complexity you might have)
I wrote a little utility to solve this problem
const jsHandleToJSON = (jsHandle) => {
if (jsHandle.length > 0) {
let json = []
for (let i = 0; i < jsHandle.length; i++) {
json.push(jsHandleToJSON(jsHandle[i]))
}
return json
} else {
let json = {}
const keys = Object.keys(jsHandle)
for (let i = 0; i < keys.length; i++) {
if (typeof jsHandle[keys[i]] !== 'object') {
json[keys[i]] = jsHandle[keys[i]]
} else if (['elements', 'element'].includes(keys[i])) {
json[keys[i]] = jsHandleToJSON(jsHandle[keys[i]])
} else {
console.log(`skipping field ${keys[i]}`)
}
}
return json
}
}
It will create a new object with all the primitive fields of the jsHandle (recursively) and parse some extra jsHandle properties ['elements', 'element'], skips the others.
You could add more properties in there if you need them (but adding all of them will result in a infinite loop).
To make the log into puppeteer working you need to add the following line before the evaluate
page.on('console', message => console.log(`${message.type()}: ${message.text()}`))
I am trying to make a simple webscraper using Node and Puppeteer to get the titles of posts on reddit, but am having issues accessing a global variable, SUBREDDIT_NAME from within only one function, extractItems(). It works fine with every other function, but for that one I have to make a local variable with the same value for it to work.
Am I completely misunderstanding variable scope in Javascript?
I have tried everything I can think of, and the only thing that works is to create a local variable inside of extractedItems() with the value of "news", otherwise I get nothing.
const fs = require('fs');
const puppeteer = require('puppeteer');
const SUBREDDIT = (subreddit_name) => `https://reddit.com/r/${subreddit_name}/`;
const SUBREDDIT_NAME= "news";
function extractItems() {
const extractedElements = document.querySelectorAll(`a[href*='r/${SUBREDDIT_NAME}/comments/'] h3`);
const items = [];
for (let element of extractedElements) {
items.push(element.innerText);
}
return items;
}
async function scrapeInfiniteScrollItems(
page,
extractItems,
itemTargetCount,
scrollDelay = 1000,
) {
let items = [];
try {
let previousHeight;5
while (items.length < itemTargetCount) {
items = await page.evaluate(extractItems);
previousHeight = await page.evaluate('document.body.scrollHeight');
await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);
await page.waitFor(scrollDelay);
}
} catch(e) { }
return items;
}
(async () => {
// Set up browser and page.
const browser = await puppeteer.launch({
headless: false,
args: ['--no-sandbox', '--disable-setuid-sandbox'],
});
const page = await browser.newPage();
page.setViewport({ width: 1280, height: 926 });
// Navigate to the demo page.
await page.goto(SUBREDDIT(SUBREDDIT_NAME));
// Scroll and extract items from the page.
const items = await scrapeInfiniteScrollItems(page, extractItems, 100);
// Save extracted items to a file.
fs.writeFileSync('./items.txt', items.join('\n') + '\n');
// Close the browser.
await browser.close();
})();
I expect a text file with the 100 first found titles, but it only works when I hardcode the subreddit into the extractItems() function.
The problem is that the extractItems function is converted to a string (without processing the template literal) and executed in the pages context where there is no SUBREDDIT_NAME variable.
You can fix that by doing something like this:
function extractItems(name) {
const extractedElements = document.querySelectorAll(`a[href*='r/${name}/comments/'] h3`);
const items = [];
for (let element of extractedElements) {
items.push(element.innerText);
}
return items;
}
page.evaluate(`(${extractItems})(${SUBREDDIT_NAME})`)
I cannot read the title attribute from the received variable, but as you see the collection is not undefined, and the title attribute also not undefined.
The error saying its undefined.
var received = document.getElementsByClassName("_2her");
if (received[0].title == "Delivered") {
chColorsForDelivered();
}
Your script started before DOM ready, If you are using html5, then use async with you script.
If you only want to check that first element (and make sure it's there after the page loads), I usually use this method:
window.onload = function() {
var received = document.getElementsByClassName("_2her")[0];
if (received.title == "Delivered") {
chColorsForDelivered();
}
}
You could also use querySelector to get the first occurrence of the class, which might be more suited to your needs:
var received = document.querySelector("_2her");
if (received.title == "Delivered") {
chColorsForDelivered();
}
var active;
var bg;
var received;
var rightDelivered;
var colorchint;
window.onload = function() {
main();
}
//=================== FUNCTIONS ===================
async function main() {
active = await document.getElementsByClassName("_2v6o");
bg = document.getElementsByClassName("_673w");
received = await document.getElementsByClassName("_2her");
rightDelivered = document.getElementsByClassName("_2jnt");
colorchint;
bg[0].onmouseover = function() {clearInterv();}
rightDelivered[0].onclick = function() {clearDeliv();}
//await sleep(2000);
if (active[0].innerText == "Active on Messenger") {
chColorsForActive();
}
else if (received[0].title == "Delivered") {
await chColorsForDelivered();
}
}
//for delivered
async function chColorsForDelivered() {
y = 1;
for (var i = 0; i < 6; i++) {
chColorsDelivered();
await sleep(1000);
}
}
function chColorsDelivered() {
if (y === 1) {
color = "#1e1e1e";
y = 2;
} else {
color = "orange";
y = 1;
}
rightDelivered[0].style.background = color;
}
//accessories
function sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
I put it into async. First I tried to
await for the variable received = await document.getElementsByClassName("_2her"); (it wasn't enough)
then I tried to sleep for 2 seconds (it was working)
tried to await the inner function (it was working too) But i have no idea why is it working if I wait for the inner function. I doesn't make sense to me. If I delete the await chColorsForDelivered(); await here It's complaining about the title is undefiend.
If you run this script in your messenger you will see an orange blinking span in the conversation information (where you see the pictures and stuffs, just delete the if)(It's only blinking if you have a Delivered message).
I'm just starting to play around with Puppeteer (Headless Chrome) and Nodejs. I'm scraping some test sites, and things work great when all the values are present, but if the value is missing I get an error like:
Cannot read property 'src' of null (so in the code below, the first two passes might have all values, but the third pass, there is no picture, so it just errors out).
Before I was using if(!picture) continue; but I think it's not working now because of the for loop.
Any help would be greatly appreciated, thanks!
for (let i = 1; i <= 3; i++) {
//...Getting to correct page and scraping it three times
const result = await page.evaluate(() => {
let title = document.querySelector('h1').innerText;
let article = document.querySelector('.c-entry-content').innerText;
let picture = document.querySelector('.c-picture img').src;
if (!document.querySelector('.c-picture img').src) {
let picture = 'No Link'; } //throws error
let source = "The Verge";
let categories = "Tech";
if (!picture)
continue; //throws error
return {
title,
article,
picture,
source,
categories
}
});
}
let picture = document.querySelector('.c-picture img').src;
if (!document.querySelector('.c-picture img').src) {
let picture = 'No Link'; } //throws error
If there is no picture, then document.querySelector() returns null, which does not have a src property. You need to check that your query found an element before trying to read the src property.
Moving the null-check to the top of the function has the added benefit of saving unnecessary calculations when you are just going to bail out anyway.
async function scrape3() {
// ...
for (let i = 1; i <= 3; i++) {
//...Getting to correct page and scraping it three times
const result = await page.evaluate(() => {
const pictureElement = document.querySelector('.c-picture img');
if (!pictureElement) return null;
const picture = pictureElement.src;
const title = document.querySelector('h1').innerText;
const article = document.querySelector('.c-entry-content').innerText;
const source = "The Verge";
const categories = "Tech";
return {
title,
article,
picture,
source,
categories
}
});
if (!result) continue;
// ... do stuff with result
}
Answering comment question: "Is there a way just to skip anything blank, and return the rest?"
Yes. You just need to check the existence of each element that could be missing before trying to read a property off of it. In this case we can omit the early return since you're always interested in all the results.
async function scrape3() {
// ...
for (let i = 1; i <= 3; i++) {
const result = await page.evaluate(() => {
const img = document.querySelector('.c-picture img');
const h1 = document.querySelector('h1');
const content = document.querySelector('.c-entry-content');
const picture = img ? img.src : '';
const title = h1 ? h1.innerText : '';
const article = content ? content.innerText : '';
const source = "The Verge";
const categories = "Tech";
return {
title,
article,
picture,
source,
categories
}
});
// ...
}
}
Further thoughts
Since I'm still on this question, let me take this one step further, and refactor it a bit with some higher level techniques you might be interested in. Not sure if this is exactly what you are after, but it should give you some ideas about writing more maintainable code.
// Generic reusable helper to return an object property
// if object exists and has property, else a default value
//
// This is a curried function accepting one argument at a
// time and capturing each parameter in a closure.
//
const maybeGetProp = default => key => object =>
(object && object.hasOwnProperty(key)) ? object.key : default
// Pass in empty string as the default value
//
const getPropOrEmptyString = maybeGetProp('')
// Apply the second parameter, the property name, making 2
// slightly different functions which have a default value
// and a property name pre-loaded. Both functions only need
// an object passed in to return either the property if it
// exists or an empty string.
//
const maybeText = getPropOrEmptyString('innerText')
const maybeSrc = getPropOrEmptyString('src')
async function scrape3() {
// ...
// The _ parameter name is acknowledging that we expect a
// an argument passed in but saying we plan to ignore it.
//
const evaluate = _ => page.evaluate(() => {
// Attempt to retrieve the desired elements
//
const img = document.querySelector('.c-picture img');
const h1 = document.querySelector('h1')
const content = document.querySelector('.c-entry-content')
// Return the results, with empty string in
// place of any missing properties.
//
return {
title: maybeText(h1),
article: maybeText(article),
picture: maybeSrc(img),
source: 'The Verge',
categories: 'Tech'
}
}))
// Start with an empty array of length 3
//
const evaluations = Array(3).fill()
// Then map over that array ignoring the undefined
// input and return a promise for a page evaluation
//
.map(evaluate)
// All 3 scrapes are occuring concurrently. We'll
// wait for all of them to finish.
//
const results = await Promise.all(evaluations)
// Now we have an array of results, so we can
// continue using array methods to iterate over them
// or otherwise manipulate or transform them
//
results
.filter(result => result.title && result.picture)
.forEach(result => {
//
// Do something with each result
//
})
}
Try-catch worked for me:
try {
if (await page.$eval('element')!==null) {
const name = await page.$eval('element')
}
}catch(error){
name = ''
}