I am creating a screen scraper that needs to scrape the content of a page and take a screenshot of it. For that I am using Puppeteer, but I am hitting a snag. When I try to call a function that runs page.screenshot inside of page.evaluate, I get an error that the function is not defined.
Here is my code:
async function getContent(clink, ce, networkidle, host, filepath) {
    let browser = await puppeteer.launch();
    let cpage = await browser.newPage();
    await cpage.goto(clink, { waitUntil: networkidle });
    let content = await cpage.evaluate((clink, ce, networkidle, host, filepath, pubDate) => {
        let results = '';
        let enclurl = clink;
        takeScreenshot(enclurl, filepath, networkidle)
            .then(() => {
                console.log("Screenshot taken");
            })
            .catch((err) => {
                console.log("Error occured!");
                console.dir(err);
            });
        results += '<title><![CDATA[' + 'test' + ']]</title>';
        results += '<description><![CDATA[' + '<img src="' + host + filepath.slice(1) + '">' + document.querySelector(ce).innerHTML + ']]</description>';
        results += '<link>' + clink + '</link>';
        results += '<guid>' + clink + '</guid>';
        results += '<pubDate>' + pubDate + '</pubDate>';
        return results;
    }, clink, ce, networkidle, host, filepath, pubDate);
    await cpage.close();
    await browser.close();
    return content;
}
That code should return items before an RSS-format XML file is created. The URLs of those files will then be added to WPRobot campaigns. The end goal is a search engine that uses WordPress to aggregate the main content of pages along with full screenshots of the sources.
The takeScreenshot function is as follows:
async function takeScreenshot(enclurl, filepath, networkidle) {
    let browser = await puppeteer.launch();
    let page = await browser.newPage();
    await page.goto(enclurl, { waitUntil: networkidle });
    let buffer = await page.screenshot({
        path: filepath
    });
    await page.close();
    await browser.close();
}
takeScreenshot works just fine when called outside of page.evaluate. The exact error I get says "takeScreenshot is undefined." I have another function that parses RSS feeds and takes screenshots of their source URLs, but it does not use page.evaluate at all.
I have now added the call to takeScreenshot to an earlier part of my code, right before getContent() is called, but now getContent() always returns undefined. My new getContent() reads:
async function getContent(clink, ce, networkidle) {
    let browser = await puppeteer.launch();
    let cpage = await browser.newPage();
    await cpage.goto(clink, { waitUntil: networkidle });
    let content = await cpage.evaluate((ce) => {
        let cefc = ce.charAt(0);
        if (cefc != '.') {
            ce = '#' + ce;
        }
        console.log('ce=' + ce);
        let results = document.querySelector(ce).innerHTML;
        return results;
    }, ce);
    await cpage.close();
    await browser.close();
    return content;
}
I am also not seeing console.log('ce=' + ce) written to the log. After moving the console.log out of the page.evaluate callback, it logged the appropriate value for the content, which is the HTML of the element with the specified class. Despite that, the returned content remains undefined.
page.evaluate works in a way that is not intuitive:
the code of the function (in your case: (clink, ce, networkidle, host, filepath, pubDate) => {...}) is NOT executed in your script. That function is serialized and sent to the headless browser inside Puppeteer, where it runs.
If you want to call a function from inside the evaluate callback, you can usually (but not in this case) use one of the tricks described in How to pass a function in Puppeteer's .evaluate() method?
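For illustration, one of those tricks is page.exposeFunction, which registers a Node-side function that the page can call; a minimal sketch, with a hypothetical saveText helper:

await cpage.exposeFunction('saveText', (text) => {
    // This body runs in Node, not in the browser.
    require('fs').writeFileSync('snippet.txt', text);
});
await cpage.evaluate(async () => {
    // In the page, the exposed function is available as window.saveText
    // and returns a Promise.
    await window.saveText(document.title);
});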
BUT in this case... there is a problem!
Inside takeScreenshot there are calls that CANNOT run inside Puppeteer's headless browser, such as puppeteer.launch(). These functions require a lot of dependencies (and an executable) and cannot be serialized and passed to the page.
To do what you need, move the screenshot part of your code out of evaluate:
async function getContent(clink, ce, networkidle, host, filepath, pubDate) {
    let browser = await puppeteer.launch();
    let cpage = await browser.newPage();
    await cpage.goto(clink, { waitUntil: networkidle });
    let content = await cpage.evaluate((clink, ce, networkidle, host, filepath, pubDate) => {
        let results = '';
        results += '<title><![CDATA[' + 'test' + ']]</title>';
        results += '<description><![CDATA[' + '<img src="' + host + '{REPL_ME}' + '">' + document.querySelector(ce).innerHTML + ']]</description>';
        results += '<link>' + clink + '</link>';
        results += '<guid>' + clink + '</guid>';
        results += '<pubDate>' + pubDate + '</pubDate>';
        return results;
    }, clink, ce, networkidle, host, filepath, pubDate);
    // Take the screenshot from Node, outside of evaluate.
    await takeScreenshot(clink, filepath, networkidle);
    content = content.replace('{REPL_ME}', filepath);
    await cpage.close();
    await browser.close();
    return content;
}
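As for the missing 'ce=' log: console.log inside page.evaluate runs in the browser, not in Node, so it never reaches your terminal. A minimal sketch of forwarding page console messages to Node, assuming the cpage variable above and registering the handler before evaluate is called:

// Forward browser console messages (including console.log inside
// evaluate) to the Node process's console.
cpage.on('console', (msg) => {
    console.log('PAGE LOG:', msg.text());
});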
Related
I'm doing some scraping after receiving HTML from an API. I'd like to do the following:
1. Open the HTML page in Chrome so I can find selectors in the console.
2. Immediately load the same HTML page into a jsdom instance.
3. Drop into the REPL - I can then find the right selectors in the console and test them out in a live jsdom environment to see if they work.
For 1, I have:
async function openHtml(htmlString) {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.setContent(htmlString);
    return;
    // await browser.close();
}
The code provided with the api is:
var req = http.request(options, function (res) {
    var chunks = [];
    res.on("data", function (chunk) {
        chunks.push(chunk);
    });
    res.on("end", function () {
        var body = Buffer.concat(chunks);
        response = JSON.parse(body); // response.content = html, response.cookies = cookies
        const dom = new JSDOM(response.content);
        console.log(dom.window.document.querySelector("p").textContent); // "Hello world"
        openHtml(response.content);
        console.log('hi');
    });
});
req.end();
If I run the code at the command line the browser opens as expected. However, if I set a breakpoint at:
console.log('hi');
It does not. How can I get this working?
openHtml is an async function, so you'll have to await the call, which means the enclosing callback needs to be async as well.
var req = http.request(options, function (res) {
    var chunks = []
    res.on('data', function (chunk) {
        chunks.push(chunk)
    })
    res.on('end', async function () {
        var body = Buffer.concat(chunks)
        response = JSON.parse(body) // response.content = html, response.cookies = cookies
        const dom = new JSDOM(response.content)
        console.log(dom.window.document.querySelector('p').textContent) // 'Hello world'
        await openHtml(response.content)
        console.log('hi')
    })
})
req.end()
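As a side note, the surrounding http.request callback still cannot be awaited by its caller; if the caller needs to know when everything is done, one option is to wrap the whole request in a Promise. A rough sketch, reusing the openHtml function from above:

// Resolves once the response has been parsed and the page has been
// opened in the headless browser.
function fetchAndOpen(options) {
    return new Promise((resolve, reject) => {
        const req = http.request(options, (res) => {
            const chunks = [];
            res.on('data', (chunk) => chunks.push(chunk));
            res.on('end', async () => {
                try {
                    const response = JSON.parse(Buffer.concat(chunks));
                    await openHtml(response.content);
                    resolve(response);
                } catch (err) {
                    reject(err);
                }
            });
        });
        req.on('error', reject);
        req.end();
    });
}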
I am very new to puppeteer (I started today). I have some code that is working the way that I want it to, except for an issue that I think is making it extremely inefficient. I have a function that loops through potentially thousands of URLs with incremental IDs to pull the name, position, and stats of each player, and then inserts that data into an NeDB database. Here is my code:
const puppeteer = require('puppeteer');
const Datastore = require('nedb');
const database = new Datastore('database.db');
database.loadDatabase();

async function scrapeProduct(url, id){
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    let attributes = [];
    const [name] = await page.$x('//*[@id="ctl00_ctl00_ctl00_Main_Main_name"]');
    const txt = await name.getProperty('innerText');
    const playerName = await txt.jsonValue();
    attributes.push(playerName);
    //Make sure that there is a legitimate player profile before trying to pull a bunch of 'undefined' information.
    if(playerName){
        const [role] = await page.$x('//*[@id="ctl00_ctl00_ctl00_Main_Main_position"]');
        const roleTxt = await role.getProperty('innerText');
        const playerRole = await roleTxt.jsonValue();
        attributes.push(playerRole);
        //Loop through the 12 attributes and pull their values.
        for(let i = 1; i < 13; i++){
            let vLink = '//*[@id="ctl00_ctl00_ctl00_Main_Main_SectionTabBox"]/div/div/div/div[1]/table/tbody/tr['+i+']/td[2]';
            const [e1] = await page.$x(vLink);
            const val = await e1.getProperty('innerText');
            const skillVal = await val.jsonValue();
            attributes.push(skillVal);
        }
        //Create a player profile to be pushed into the database. (I realize this is very wordy and ugly code)
        let player = {
            Name: attributes[0],
            Role: attributes[1],
            Athleticism: attributes[2],
            Speed: attributes[3],
            Durability: attributes[4],
            Work_Ethic: attributes[5],
            Stamina: attributes[6],
            Strength: attributes[7],
            Blocking: attributes[8],
            Tackling: attributes[9],
            Hands: attributes[10],
            Game_Instinct: attributes[11],
            Elusiveness: attributes[12],
            Technique: attributes[13],
            _id: id,
        };
        database.insert(player);
        console.log('player #' + id + " scraped.");
        await browser.close();
    } else {
        console.log("Blank profile");
        await browser.close();
    }
}

//Making sure the first URL is scraped before moving on to the next URL. (I removed the URL because it's unreasonably long and is not important for this part.)
(async () => {
    for(let i = 0; i <= 1000; i++){
        let link = 'https://url.com/Ratings.aspx?rid='+i+'&section=Ratings';
        await scrapeProduct(link, i);
    }
})();
What I think is making this so inefficient is the fact that every time scrapeProduct() is called, I create a new browser and a new page. Instead, I believe it would be more efficient to create one browser and one page and just change the page's URL with
await page.goto(url)
I believe that in order to do what I'm trying to accomplish here, I need to move:
const browser = await puppeteer.launch();
const page = await browser.newPage();
outside of my scrapeProduct() function, but I cannot seem to get this to work. Any time I try, I get an error in my function saying that page is not defined. I am very new to puppeteer (started today); I would appreciate any guidance on how to accomplish this. Thank you very much!
TL;DR
How do I create one Browser instance and one Page instance that a function can reuse, changing only the await page.goto(url) call?
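A minimal sketch of that idea, launching the browser once and passing the page into the scraping function (the scraping body itself is elided):

const puppeteer = require('puppeteer');

// Reuse a single browser/page across many URLs.
async function scrapeProduct(page, url, id) {
    await page.goto(url);
    // ... pull the player data from `page` as before ...
}

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    for (let i = 0; i <= 1000; i++) {
        const link = 'https://url.com/Ratings.aspx?rid=' + i + '&section=Ratings';
        await scrapeProduct(page, link, i);
    }
    await browser.close();
})();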
About a year ago I tried to make a React Native Pokemon Go helper app. Since there wasn't an API for Pokemon nests and Pokestops, I created a server that scraped thesilphroad.com, and I found the need to implement something like @Arkan said.
I wanted the server to be able to take multiple requests, so I decided to initialize the browser when the server is booted up. When a request is received, the server checks whether MAX_TABS has been reached. If it has, it waits; if not, a new tab is opened and the scrape is performed.
Here's the scraper.js
const puppeteer = require ('puppeteer')
const fs = require('fs')
const Page = require('./Page')
const exec = require('child_process').exec
const execSync = require('util').promisify(exec)
module.exports = class scraper {
constructor(){
this.browser = null
this.getPages = null
this.getTotalPages = null
this.isRunning = false
//browser permissions
this.permissions = ['geolocation']
this.MAX_TABS = 5
//when puppeteer launches
this.useFirstTab = true
}
async init(config={}){
let headless = config.headless != undefined ? config.headless : true
this.permissions = this.permissions.concat(config.permissions || [])
//get local chromium location
let browserPath = await getBrowserPath('firefox') || await getBrowserPath('chrome')
this.browser = await puppeteer.launch({
headless:headless,
executablePath:browserPath,
defaultViewport:null,
args:[
'--start-maximized',
]
})
this.getPages = this.browser.pages.bind(this.browser) // bind so pages() keeps the Browser as `this`
this.getTotalPages = ()=>{
return this.getPages().then(pages=>pages.length).catch(err=>0)
}
this.isRunning = true
}
async waitForTab(){
let time = Date.now()
let cycles = 1
await new Promise(resolve=>{
let interval = setInterval(async()=>{
let totalPages = await this.getTotalPages()
if(totalPages < this.MAX_TABS){
clearInterval(interval)
resolve()
}
if(Date.now() - time > 100)
console.log('Waiting...')
if(Date.now() - time > 20*1000){
console.log('... ...\n'.repeat(cycles)+'Still waiting...')
cycles++
time = Date.now()
}
},500)
})
}
//open new tab and go to page
async openPage(url,waitSelector,lat,long){
await this.waitForTab()
let pg
//puppeteer launches with a blank tab, use this
// if(this.useFirstTab){
// let pages = await this.browser.pages()
// pg = pages.pop()
// this.useFirstTab = false
// }
// else
pg = await this.browser.newPage()
if(lat && long){
await this.setPermissions(url)
}
let page = await new Page()
await page.init(pg,url,waitSelector,lat,long)
return page
}
async setPermissions(url){
const context = this.browser.defaultBrowserContext();
await context.overridePermissions(url,this.permissions)
}
}
// assumes that the browser is in path
async function getBrowserPath(browserName){
return execSync('command -v chromium').then(({stdout,stderr})=>{
if(stdout.includes('not found'))
return null
return stdout
}).catch(err=>null)
}
The scraper imports Page.js, which is just a wrapper for a Puppeteer Page object with the functions I used most made available:
const path = require('path')
const fs = require('fs')
const userAgents = require('./staticData/userAgents.json')
const cookiesPath = path.normalize('./cookies.json')
// a wrapper for a puppeteer page with pre-made functions
module.exports = class Page{
constructor(useCookies=false){
this.page = null
this.useCookies = useCookies
this.previousSession = this.useCookies && fs.existsSync(cookiesPath)
}
async close (){
await this.page.close()
}
async init(page,url,waitSelector,lat,long){
this.page = page
let userAgent = userAgents[Math.floor(Math.random()*userAgents.length)]
await this.page.setUserAgent(userAgent)
await this.restoredSession()
if(lat && long)
await this.page.setGeolocation({
latitude: lat || 59.95, longitude:long || 30.31667, accuracy:40
})
await this.page.goto(url)
await this.wait(waitSelector)
}
async screenshotElement(selector='body',directory='./screenshots',padding=0,offset={}) {
const rect = await this.page.evaluate(selector => {
const el = document.querySelector(selector)
const {x, y, width, height} = el.getBoundingClientRect()
return {
left: x,
top: y,
width,
height,
id: el.id
}
}, selector)
let ext = 'jpeg'
let filename = path.normalize(directory+'/'+Date.now())
return await this.page.screenshot({
type:ext,
path:filename+' - '+selector.substring(5)+'.'+ext,
clip: {
x: rect.left - padding+(offset.left || 0),
y: rect.top - padding+(offset.right || 0),
width: rect.width + padding * 2+(offset.width||0),
height: rect.height + padding * 2+ (offset.height||0)
},
encoding:'base64'
})
}
async restoredSession(){
if(!this.previousSession)
return false
let cookies = require(cookiesPath)
for(let cookie of cookies){
await this.page.setCookie(cookie)
}
console.log('Loaded previous session')
return true
}
async saveSession(){
//write cookie to file
if(!this.useCookies)
return
const cookies = await this.page.cookies()
fs.writeFileSync(cookiesPath,JSON.stringify(cookies,null,2))
console.log('Wrote cookies to file')
}
//wait for text input element and type text
async type(selector,text,options={delay:150}){
await this.wait(selector)
await this.page.type(selector,text,options)
}
//click and waits
async click(clickSelector,waitSelector=500){
await this.page.click(clickSelector)
await this.wait(waitSelector)
}
//hovers over element and waits
async hover(selector,waitSelector=500){
await this.page.hover(selector)
await this.wait(1000)
await this.wait(waitSelector)
}
//waits and suppresses timeout errors
async wait(selector=500, waitForNav=false){
try{
//waitForNav is a puppeteer's waitForNavigation function
//which for me does nothing but timeouts after 30s
waitForNav && await this.page.waitForNavigation()
await this.page.waitFor(selector)
} catch (err){
//print everything but timeout errors
if(err.name != 'TimeoutError'){
console.log('error name:',err.name)
console.log(err)
console.log('- - - '.repeat(4))
}
this.close()
}
}
}
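A rough usage sketch of the two classes together; the method names match the code above, but the URL and selector are made up:

const Scraper = require('./scraper');

(async () => {
    const scraper = new Scraper();
    await scraper.init({ headless: true });
    // openPage waits for a free tab slot, opens the URL,
    // and resolves with the Page wrapper.
    const page = await scraper.openPage('https://thesilphroad.com', '.map');
    await page.screenshotElement('.map', './screenshots');
    await page.close();
})();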
To achieve this, you'll just need to separate the browser from your requests, like in a class, for example:
class PuppeteerScraper {
    async launch(options = {}) {
        this.browser = await puppeteer.launch(options);
        // you could reuse the page instance if it was defined here
    }
    /**
     * Pass the address and the function that will scrape your data,
     * in order to maintain the page inside this object
     */
    async goto(url, callback) {
        const page = await this.browser.newPage();
        await page.goto(url);
        /** evaluate its content */
        await callback(page);
        await page.close();
    }
    async close() {
        await this.browser.close();
    }
}
and, to implement it:
/**
 * scrape function, takes the page instance as its parameter
 */
async function evaluate_page(page) {
    const titles = await page.$$eval('.col-xs-6 .star-rating ~ h3 a', (itens) => {
        const text_titles = [];
        for (const item of itens) {
            if (item && item.textContent) {
                text_titles.push(item.textContent);
            }
        }
        return text_titles;
    });
    console.log('titles', titles);
}

(async () => {
    const scraper = new PuppeteerScraper();
    await scraper.launch({ headless: false });
    for (let i = 1; i <= 6; i++) {
        let link = `https://books.toscrape.com/catalogue/page-${i}.html`;
        await scraper.goto(link, evaluate_page);
    }
    await scraper.close();
})();
Although, if you want something more complex, you could take a look at how they do it in the Apify project.
I am trying to iterate over unique YouTube video links to take a screenshot of each.
After debugging, I noticed that for the for loop below, JS spawns two concurrent executions, one for each index i. The proccessALink() call for the second index seems to start before the proccessALink() call for the first index has fully finished.
Why is this happening? I thought using async/await prevents this.
The for loop is inside an async function. The code below is just a snippet from the original source code.
for(let i = 0; i<2; i++){
    var link = linksArr[i];
    var label = labelsArr[i];
    await proccessALink(link, label)
}
Function definition for proccessALink():
var proccessALink = async (link,label)=>{
//set download path
var downloadPath = 'data/train/'+label;
//parse the url
var urlToScreenshot = parseUrl(link)
//Give a URL it will take a screen shot
if (validUrl.isWebUri(urlToScreenshot)) {
// console.log('Screenshotting: ' + urlToScreenshot + '&t=' + req.query.t)
console.log('Screenshotting: ' + link)
;(async () => {
//Logic to login to youtube below
//await login();
//go to the url and wait till all the content is loaded.
await page.goto(link, {
waitUntil: 'networkidle'
//waitUntil: 'domcontentloaded'
})
//await page.waitForNavigation();
//Find the video player in the page
const video = await page.$('.html5-video-player')
await page.content();
//Run some command on consoleDev
await page.evaluate(() => {
// Hide youtube player controls.
let dom = document.querySelector('.ytp-chrome-bottom')
if(dom != null){
dom.style.display = 'none'
}
})
await video.screenshot({path: downloadPath});
})()
} else {
res.send('Invalid url: ' + urlToScreenshot)
}
}
Remove the IIFE inside proccessALink() and it should resolve the issue of running multiple screenshots at the same time. The IIFE returns a promise that is never awaited, so proccessALink() resolves before the navigation and screenshot inside it have finished, and the loop moves on to the next link too early.
const proccessALink = async(link, label) => {
//set download path
const downloadPath = 'data/train/' + label;
//parse the url
const urlToScreenshot = parseUrl(link)
//Give a URL it will take a screen shot
if (validUrl.isWebUri(urlToScreenshot)) {
// console.log('Screenshotting: ' + urlToScreenshot + '&t=' + req.query.t)
console.log('Screenshotting: ' + link);
//Logic to login to youtube below
//await login();
//go to the url and wait till all the content is loaded.
await page.goto(link, {
waitUntil: 'networkidle'
//waitUntil: 'domcontentloaded'
})
//await page.waitForNavigation();
//Find the video player in the page
const video = await page.$('.html5-video-player')
await page.content();
//Run some command on consoleDev
await page.evaluate(() => {
// Hide youtube player controls.
let dom = document.querySelector('.ytp-chrome-bottom')
if (dom != null) {
dom.style.display = 'none'
}
})
await video.screenshot({
path: downloadPath
});
} else {
res.send('Invalid url: ' + urlToScreenshot)
}
}
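To illustrate the difference with a generic sketch (setTimeout standing in for the goto/screenshot work): an IIFE that is not awaited lets the enclosing function resolve immediately, while awaiting the work keeps the loop sequential:

// Fire-and-forget: resolves immediately, work continues in the background.
async function withIife(label) {
    (async () => {
        await new Promise((r) => setTimeout(r, 1000)); // stand-in for goto/screenshot
        console.log('done', label);
    })();
}

// Awaited: the caller's `await` really waits for the work.
async function withoutIife(label) {
    await new Promise((r) => setTimeout(r, 1000));
    console.log('done', label);
}

(async () => {
    for (const label of ['a', 'b']) {
        await withIife(label);    // both labels start almost simultaneously
    }
    for (const label of ['c', 'd']) {
        await withoutIife(label); // 'c' finishes before 'd' starts
    }
})();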
Using Puppeteer, I'm able to navigate to a certain video src URL, and the MP4 (using a custom build of Chromium) plays fine.
NOW: I want to be able to get the video data that's playing and send it to some kind of buffer in node.js that can be saved as a file or sent to a client via a websocket or sent as a response etc.... but I'm not sure how to do it, all I have is the video playing.
I'm not able to just send the URL over to node.js, because in order to view the video file you have to go through the whole puppeteer crawling process (it's not just a static URL, it's dependent on that browser session only, so only puppeteer can view it).
SO: what can I do to get the video at that src URL into a file (or buffer) in Node.js? This is my current code, if it helps:
var puppeteer = require("puppeteer-core");
var http=require("https");
var fs=require("fs");
var fetch=require("fetch-node");
(async() => {
var browser = await puppeteer.launch({
executablePath:"./cobchrome/chrome.exe"
});
console.log("Got browser", browser);
var page = await browser.newPage();
console.log(page,"got page");
var agentStr = `Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0`;
var agent = await page.setUserAgent(agentStr);
console.log(agent, "Set the user agent");
// await page.goto("https://drive.google.com/file/d/17tkL8jPlBIh5XtcX_tNhyDV5nSX8v7f8/preview");
await page.goto("https://docs.google.com/file/d/1Cyuh41yNfYZU_zL-MHLf_EPJCYnlT7oJ/preview?enablejsapi=1&playerapiid=player4");
console.log("went to page..");
await page._client.send('Page.setDownloadBehavior', {behavior: 'allow', downloadPath: './downloadscob/'})
await page.screenshot({path:"shots/onopen.png"});
// var btn = await page.$(".ndfHFb-c4YZDc ndfHFb-c4YZDc-AHmuwe-Hr88gd-OWB6Me ndfHFb-c4YZDc-vyDMJf-aZ2wEe ndfHFb-c4YZDc-i5oIFb ndfHFb-c4YZDc-e1YmVc ndfHFb-c4YZDc-TSZdd");
// var tst = await page.$("#start-of-content");
var clickEl = ".ndfHFb-c4YZDc-aTv5jf-bVEB4e-RJLb9c";
var newClickID = ".ndfHFb-c4YZDc-aTv5jf-NziyQe-LgbsSe";
var clicker = await page.waitForSelector(newClickID);
console.log(clicker,"got clicker");
await page.screenshot({path:"shots/ongotclicker.png"});
await page.click(clickEl);
console.log("clicked")
await page.screenshot({path:"shots/onclicked.png"});
var frame = await page.waitForSelector("iframe[id=drive-viewer-video-player-object-0]");
console.log(frame, "got video frame");
await page.screenshot({path:"shots/ongotframe.png"});
var cf = await frame.contentFrame();
await page.screenshot({path:"shots/oncf.png"});
console.log(cf, "got content frame");
await cf.waitFor(() => !!document.querySelector("video"))
await page.screenshot({path:"shots/videoappeared.png"});
//await cf.waitFor(30000);
// var videos = await cf.$("video");
// console.log(videos, videos.length, "all videos");
var video = await cf.$("video");
await page.screenshot({path:"shots/selectedvideo.png"});
var videoEl = await cf.evaluate(
v =>{
var result = {};
for(var k in v) {
result[k] = v[k];
}
return result;
},
video
);
var src = videoEl.src;
var file = fs.createWriteStream("down.mp4");
console.log("starting to stream");
var req = http.get(src, r => {
console.log("finished pipin");
r.pipe(file); //I REALLY thought this would work but it doesn't do anything
});
var start = Date.now();
await page.screenshot({path:"shots/evalled_vido.png"});
console.log("$$###VIDEO SOURCE::", "time it took", src);
await page.goto(src);
await page.screenshot({path:"shots/wentToNewPage.png"});
// await page.waitFor(5000);
await page.screenshot({path:"shots/maybeItsPlayingNow.png"});
console.log("ABOUT t oFETHC wit H SOURCE", src)
var content = await page.content();
fs.writeFile("outputagain.txt", content, (re) => {
console.log("saved it?");
})
console.log(content);
// await browser.close();
})();
Currently the page.content() call at the end just gets the HTML content of the page, not any binary data.
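One possible approach, sketched under the assumption that the video arrives as a response Puppeteer can observe: since the URL is only valid inside that browser session, listen for the media response and read its body with response.buffer() instead of re-requesting the URL from Node. Note that for chunked or range-requested media this may only capture part of the stream:

// Capture the video response body from within the same Puppeteer
// session and write it to disk in Node.
page.on('response', async (response) => {
    const type = response.headers()['content-type'] || '';
    if (type.startsWith('video/')) {
        try {
            const data = await response.buffer(); // Node Buffer with the raw bytes
            fs.writeFileSync('down.mp4', data);
            console.log('saved', data.length, 'bytes from', response.url());
        } catch (err) {
            console.log('could not read response body', err);
        }
    }
});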
I am attempting to scrape the HTML from this NCBI.gov page. I need to include the #see-all URL fragment so that I am guaranteed to get the search page instead of retrieving the HTML from an incorrect gene page https://www.ncbi.nlm.nih.gov/gene/119016.
URL fragments are not passed to the server; they are instead used client-side by the page's JavaScript to (in this case) create entirely different HTML, which is what you get when you view the page in a browser, and which is the HTML I want to retrieve. R's readLines() ignores the part of a URL after the #.
I tried using PhantomJS first, but it just returned the error described here: ReferenceError: Can't find variable: Map. It seems to result from PhantomJS not supporting some feature that NCBI uses, which eliminated that route to a solution.
I had more success with Puppeteer, using the following JavaScript run with Node.js:
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(
        'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');
    var HTML = await page.content();
    const fs = require('fs');
    var ws = fs.createWriteStream(
        'TempInterfaceWithChrome.js'
    );
    ws.write(HTML);
    ws.end();
    var ws2 = fs.createWriteStream(
        'finishedFlag'
    );
    ws2.end();
    browser.close();
})();
However, this returned what appears to be the pre-rendered HTML. How do I (programmatically) get the final HTML that I see in the browser?
You can try to change this:
await page.goto(
'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');
into this:
await page.goto(
'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all', {waitUntil: 'networkidle'});
Or, you can create a function listenFor() to listen to a custom event on page load:
function listenFor(type) {
return page.evaluateOnNewDocument(type => {
document.addEventListener(type, e => {
window.onCustomEvent({type, detail: e.detail});
});
}, type);
}
await listenFor('custom-event-ready'); // Listen for "custom-event-ready" custom event on page load.
Later edit:
This also might come in handy:
await page.waitForSelector('h3'); // replace h3 with your selector
Maybe try waiting for navigation first:
await page.waitForNavigation();
and afterwards:
let html = await page.content();
I had success using the following to get html content that was generated after the page has been loaded.
const browser = await puppeteer.launch();
try {
    const page = await browser.newPage();
    await page.goto(url);
    await page.waitFor(2000);
    let html_content = await page.evaluate(el => el.innerHTML, await page.$('.element-class-name'));
    console.log(html_content);
} catch (err) {
    console.log(err);
}
Hope this helps.
Waiting for network idle was not enough in my case, so I used the DOMContentLoaded event:
await page.goto(url, {waitUntil: 'domcontentloaded', timeout: 60000} );
const data = await page.content();
Indeed you need innerHTML:
fs.writeFileSync( "test.html", await (await page.$("html")).evaluate( (content => content.innerHTML ) ) );
If you want to actually await a custom event, you can do it this way.
const page = await browser.newPage();
/**
* Attach an event listener to page to capture a custom event on page load/navigation.
* @param {string} type Event name.
* @return {!Promise}
*/
function addListener(type) {
return page.evaluateOnNewDocument(type => {
// here we are in the browser context
document.addEventListener(type, e => {
window.onCustomEvent({ type, detail: e.detail });
});
}, type);
}
const evt = await new Promise(async resolve => {
// Define a window.onCustomEvent function on the page.
await page.exposeFunction('onCustomEvent', e => {
// here we are in the node context
resolve(e); // resolve the outer Promise here so we can await it outside
});
await addListener('app-ready'); // setup listener for "app-ready" custom event on page load
await page.goto('http://example.com'); // N.B! Do not use { waitUntil: 'networkidle0' } as that may cause a race condition
});
console.log(`${evt.type} fired`, evt.detail || '');
Built upon the example at https://github.com/GoogleChrome/puppeteer/blob/master/examples/custom-event.js