Retrieving JavaScript Rendered HTML with Puppeteer - javascript

I am attempting to scrape the HTML from this NCBI.gov page. I need to include the #see-all URL fragment so that I am guaranteed to get the search page rather than the HTML of an incorrect gene page such as https://www.ncbi.nlm.nih.gov/gene/119016.
URL fragments are not passed to the server; they are used client-side by the page's JavaScript to (in this case) create entirely different HTML. That is what you get when you open the page in a browser and choose "View page source", and it is the HTML I want to retrieve. R's readLines() ignores the part of the URL that follows the #.
I tried PhantomJS first, but it just returned the error described here: ReferenceError: Can't find variable: Map. It seems to result from PhantomJS not supporting some feature that NCBI uses, which rules out this route.
I had more success with Puppeteer, using the following JavaScript run with Node.js:
const puppeteer = require('puppeteer');
(async() => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(
'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');
var HTML = await page.content()
const fs = require('fs');
var ws = fs.createWriteStream(
'TempInterfaceWithChrome.js'
);
ws.write(HTML);
ws.end();
var ws2 = fs.createWriteStream(
'finishedFlag'
);
ws2.end();
browser.close();
})();
However, this returned what appeared to be the pre-rendered HTML. How do I (programmatically) get the final HTML that I see in the browser?

You can try to change this:
await page.goto(
'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');
into this:
await page.goto(
'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all', {waitUntil: 'networkidle0'}); // older releases called this 'networkidle'; current ones accept 'networkidle0' or 'networkidle2'
Or, you can create a function listenFor() to listen to a custom event on page load:
function listenFor(type) {
return page.evaluateOnNewDocument(type => {
document.addEventListener(type, e => {
window.onCustomEvent({type, detail: e.detail});
});
}, type);
}
await listenFor('custom-event-ready'); // Listen for "custom-event-ready" custom event on page load.
Later edit:
This also might come in handy:
await page.waitForSelector('h3'); // replace h3 with your selector
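Putting the suggestions above together, a minimal sketch might look like this. It assumes a Puppeteer version where `waitUntil` accepts `'networkidle0'`; the helper name `getRenderedHtml` and the `readySelector` parameter are hypothetical, and you would need to pick a selector that only exists once the page's JavaScript has finished rendering.

```javascript
// Hypothetical helper: load a URL and return the HTML after client-side rendering.
// `page` is a Puppeteer Page; `readySelector` is an assumption you must choose yourself.
async function getRenderedHtml(page, url, readySelector, timeoutMs = 30000) {
  // Wait until the network has been quiet, not just until the initial load event.
  await page.goto(url, { waitUntil: 'networkidle0', timeout: timeoutMs });
  // Then wait for an element that only the page's own JavaScript creates.
  await page.waitForSelector(readySelector, { timeout: timeoutMs });
  // page.content() serializes the current DOM, including script-generated nodes.
  return page.content();
}
```

Inside the question's async function this could be called as `const HTML = await getRenderedHtml(page, 'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all', 'h3');`, where `'h3'` is only a placeholder selector.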

Maybe try to wait
await page.waitForNavigation({ waitUntil: 'networkidle0' }); // takes an options object, not a bare number
and after
let html = await page.content();

I had success using the following to get html content that was generated after the page has been loaded.
const browser = await puppeteer.launch();
try {
const page = await browser.newPage();
await page.goto(url);
await page.waitFor(2000);
let html_content = await page.evaluate(el => el.innerHTML, await page.$('.element-class-name'));
console.log(html_content);
} catch (err) {
console.log(err);
}
Hope this helps.

Waiting for network idle was not enough in my case, so I waited for the DOMContentLoaded event instead:
await page.goto(url, {waitUntil: 'domcontentloaded', timeout: 60000} );
const data = await page.content();

Indeed you need innerHTML:
fs.writeFileSync( "test.html", await (await page.$("html")).evaluate( (content => content.innerHTML ) ) );

If you want to actually await a custom event, you can do it this way.
const page = await browser.newPage();
/**
* Attach an event listener to page to capture a custom event on page load/navigation.
* @param {string} type Event name.
* @return {!Promise}
*/
function addListener(type) {
return page.evaluateOnNewDocument(type => {
// here we are in the browser context
document.addEventListener(type, e => {
window.onCustomEvent({ type, detail: e.detail });
});
}, type);
}
const evt = await new Promise(async resolve => {
// Define a window.onCustomEvent function on the page.
await page.exposeFunction('onCustomEvent', e => {
// here we are in the node context
resolve(e); // resolve the outer Promise here so we can await it outside
});
await addListener('app-ready'); // setup listener for "app-ready" custom event on page load
await page.goto('http://example.com'); // N.B! Do not use { waitUntil: 'networkidle0' } as that may cause a race condition
});
console.log(`${evt.type} fired`, evt.detail || '');
Built upon the example at https://github.com/GoogleChrome/puppeteer/blob/master/examples/custom-event.js


I get this error when running my index.js file... throw new Error('Execution context was destroyed, most likely because of a navigation.');

I've provided the code below. Can you tell me why I get this error? I am trying to scrape some information from one website to put on a website I am creating; I already have permission to do so. The information I am trying to scrape is the name, time, location, and description of each event. I saw this tutorial on YouTube, but for some reason I get this error when running mine.
const puppeteer = require('puppeteer');
async function scrapeProduct(url){
const browser = await puppeteer.launch();
const page = await browser.newPage();
page.goto(url);
const [el] = await page.$x('//*[@id="calendar-events-day"]/ul/li[1]/h3/a');
const txt = await el.getProperty('textContent')
const rawTxt = await txt.jsonValue();
const [el1] = await page.$x('//*[@id="calendar-events-day"]/ul/li[1]/time[1]/span[2]');
const txt1 = await el1.getProperty('textContent')
const rawTxt1 = await txt1.jsonValue();
console.log({rawTxt, rawTxt1});
browser.close();
}
scrapeProduct('https://events.ucf.edu');
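A likely cause, judging only from the code shown, is that `page.goto(url)` is not awaited: the queries start against the initial blank page, and the navigation then destroys that execution context. A minimal sketch of the awaited version follows. The helper name `scrapeText` and the CSS selector in the usage note are hypothetical (the selector is my own translation of the first XPath and is untested against the live site).

```javascript
// Hypothetical helper: navigate first, then read one element's text.
async function scrapeText(page, url, selector) {
  // Awaiting goto ensures navigation has finished before the DOM is queried.
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  // Wait for the element in case it is rendered slightly after load.
  await page.waitForSelector(selector);
  // $eval runs the callback in the browser and returns its serializable result.
  const text = await page.$eval(selector, (el) => el.textContent);
  return text.trim();
}
```

For the event name this might be called as `await scrapeText(page, 'https://events.ucf.edu', '#calendar-events-day > ul > li:first-of-type > h3 > a')`.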

How to set cookies in a variable and use it again: PUPPETEER

I'm trying to store ALL browser cookies in a variable and then use them again later.
My attempt:
const puppeteer = require('puppeteer')
const fs = require('fs').promises;
(async () => {
const browser = await puppeteer.launch({
headless: false,
executablePath: 'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe',
});
const page = await browser.newPage();
await page.goto('https://www.google.com/');
Below is my code to get the cookies. It used to work, but now it prints an error message.
const client = await page.target().createCDPSession();
const all_browser_cookies = (await client.send('Network.getAllCookies')).cookies;
const current_url_cookies = await page.cookies();
var third_party_cookies = all_browser_cookies.filter(cookie => cookie.domain !== current_url_cookies[0].domain);
And below is the second page (which will use the cookies):
(async () => {
const browser2 = await puppeteer.launch({
});
const url = 'https://www.google.com/';
const page2 = await browser2.newPage();
try{
await page2.setCookie(...third_party_cookies);
await page2.goto(url);
}catch(e){
console.log(e);
}
await browser2.close()
})();
})();
Until yesterday it worked, but today this error message appears:
Error: Protocol error (Network.setCookies): Invalid parameters Failed to deserialize params.cookies.expires - BINDINGS: double value expected at position 662891
Does anyone know what this is?
OK, the solution is here: https://github.com/puppeteer/puppeteer/issues/7029
You are saving an array of cookies and page.setCookie only works with one cookie. You have to iterate through your array like this:
for (let i = 0; i < third_party_cookies.length; i++) {
await page2.setCookie(third_party_cookies[i]);
}
@julio-codesal's suggestion is OK, but according to the Puppeteer documentation it is fine to use page.setCookie with an array, as long as you spread it, e.g.:
await page.setCookie(...arr)
source: https://devdocs.io/puppeteer/index#pagesetcookiecookies
Not exactly sure what it is, but the error the OP has is something else.
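Since neither answer pins down the cause, one way to narrow it down is to set the cookies one at a time and record which ones the browser rejects. This is only a debugging sketch: `findRejectedCookies` is a hypothetical helper, not part of Puppeteer, and it relies solely on `page.setCookie` as used in the question.

```javascript
// Hypothetical debugging helper: try each cookie separately and collect failures.
async function findRejectedCookies(page, cookies) {
  const rejected = [];
  for (const cookie of cookies) {
    try {
      await page.setCookie(cookie);
    } catch (err) {
      // Keep the cookie together with the error message for inspection.
      rejected.push({ cookie, message: err.message });
    }
  }
  return rejected;
}
```

In the question's second block, `console.log(await findRejectedCookies(page2, third_party_cookies))` before `page2.goto(url)` would print the offending entries, whose `expires` field is what the error message points at.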

Cheerio selector after page loaded

I want to scrape the URL of an iframe on this website: https://lk21online.digital/nonton-profile-2021-subtitle-indonesia/
When I search for the iframe in "View page source" it is not found, so I think the iframe is added by JavaScript after the page has loaded.
Or is my selector wrong?
Could somebody please check my selector or tell me what I need to change in my code?
Sorry for my poor English...
Here is my code:
async function getDetail(res, url) {
try {
const html = await scraping(res, url)
const $ = cheerio.load(html)
const article = $('#site-container #content .gmr-maincontent #primary #main .gmr-box-content #muvipro_player_content_id #player1-tab-content')
let result = []
setTimeout(() => {
article.each(function () {
const title = $(this).find('.item-article h2').text()
const watch = $(this).find('iframe').attr('src')
result.push({
title,
watch,
})
})
res.json({ result })
}, 5000)
}
catch (err) {
console.log(err)
}
}
this is video iframe
You can't use cheerio for this. Cheerio is not dynamic and just loads whatever html is coming back from the request.
Looking at your webpage, most content is loaded async, so the initial html will be pretty empty.
In addition the video source is lazily loaded when it enters the browser window. So you have to use an actual headless browser to accomplish the task. Here's an example:
// iframeUrl.js
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
// Goto page
await page.goto("https://lk21online.digital/nonton-profile-2021-subtitle-indonesia/");
// Scroll down
await page.evaluate((_) => window.scrollBy(0, 1000));
// Wait a bit
await new Promise((resolve) => setTimeout(resolve, 5000));
// Get the src of the iframe
const iframeUrl = await page.evaluate(`$("#player1-tab-content iframe").attr("src")`);
console.log(iframeUrl);
await browser.close();
process.exit(0);
})();
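A fixed five-second sleep can be either too short or needlessly long. As a variation on the answer above, the sketch below waits for the iframe element itself. `getIframeSrc` is a hypothetical helper, and it assumes the iframe appears in the DOM without any interaction beyond scrolling.

```javascript
// Hypothetical helper: scroll, wait for the iframe, and return its src attribute.
async function getIframeSrc(page, selector, timeoutMs = 15000) {
  // Scroll so that lazily loaded content enters the viewport.
  await page.evaluate(() => window.scrollBy(0, 1000));
  // Resolves with an ElementHandle once the element exists; rejects on timeout.
  const handle = await page.waitForSelector(selector, { timeout: timeoutMs });
  // Read the attribute inside the browser context.
  return handle.evaluate((el) => el.getAttribute('src'));
}
```

With the selector from the answer, the call would be `await getIframeSrc(page, '#player1-tab-content iframe')`. Unlike the jQuery string in the answer, this does not depend on the page exposing `$`.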

Scraping filmography from a wikipedia page using nodejs and puppeteer

I'm trying to get the filmography from Wikipedia. Using Puppeteer, I'm selecting the filmography section via "Inspect element" and copying its XPath. However, I'm not getting any film data.
scrapers.js
const puppeteer = require("puppeteer")
const scrapeProduct = async (url) => {
const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto(url)
const [el] = await page.$x(`//*[@id="mw-content-text"]/div[1]/div[8]/div`)
console.log("el=>", el)
browser.close()
}
scrapeProduct("https://en.wikipedia.org/wiki/Werner_Herzog")
Here's what I'm getting in console.log(el):
https://hastebin.com/usozakisen.yaml
el is an ElementHandle, not the content itself. You can try getting the innerText of that handle:
console.log(await el.evaluate(el => el.innerText));
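To collect text from several matching elements at once, `page.$$eval` avoids handling `ElementHandle` objects altogether. A small sketch, where `extractTexts` and the selector in the usage note are hypothetical:

```javascript
// Hypothetical helper: return the trimmed, non-empty text of every match.
async function extractTexts(page, selector) {
  // $$eval runs the callback in the browser on all matches and returns plain data.
  const texts = await page.$$eval(selector, (els) =>
    els.map((el) => el.innerText.trim())
  );
  // Drop entries that were empty after trimming.
  return texts.filter((t) => t.length > 0);
}
```

For example, `await extractTexts(page, '#mw-content-text li')` would return the text of every list item in the article body; that selector is only a placeholder and is not limited to the filmography.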

Puppeteer - Link inside an iframe

I have to get the advertising link below the bullet points of this page.
I am trying with Puppeteer, but I am having trouble because the ad is inside an iframe!
I can successfully get what I need using Chrome console:
document.querySelector('#adContainer a').href
Puppeteer:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setViewport({width: 1440, height: 1000})
await page.goto('https://www.amazon.co.uk/dp/B07DDDB34D', {waitUntil: 'networkidle2'})
await page.waitFor(2500);
const elementHandle = await page.$eval('#adContainer a', el => el.href);
console.log(elementHandle);
await page.screenshot({path: 'example.png', fullPage: false});
await browser.close();
})();
Error: Error: failed to find element matching selector "#adContainer a"
EDIT:
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setViewport({width: 1440, height: 1000})
await page.goto('https://www.amazon.co.uk/dp/B07DDDB34D', {waitUntil: 'networkidle2'})
const adFrame = page.frames().find(frame => frame.name().includes('"adServer":"cs'))
const urlSelector = '#sp_hqp_shared_inner > div > a';
const url = await adFrame.$eval(urlSelector, element => element.textContent);
console.log(url);
await browser.close();
Run: https://try-puppeteer.appspot.com/
You need to do that query inside the frame itself, which can be accessed via page.frames():
const adFrame = page.frames().find(frame => frame.name().includes('<some text only appearing in name of this iFrame>'));
const urlSelector = '#sp_hqp_shared_inner > div > a';
const url = await adFrame.$eval(urlSelector, element => element.textContent);
console.log(url);
How I got the selector of that url:
Disclaimer
I haven't tried this myself. Also, I think the appropriate way to get that url inside the iFrame is something more like this:
const url = await adFrame.evaluate((sel) => {
return document.querySelectorAll(sel)[0].href;
}, urlSelector);
You have to switch to the frame you want to work on every time the page loads.
async getRequiredLink() {
return await this.page.evaluate(() => {
let iframe = document.getElementById('frame_id'); //pass id of your frame
let doc = iframe.contentDocument; // changing the context to the working frame
let ele = doc.querySelector('your-selector'); // selecting the required element
return ele.href;
});
}
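Both approaches depend on finding the right frame first. The sketch below looks a frame up by a substring of its URL and fails with a readable message if it is missing. `hrefInFrame` and its parameters are hypothetical names, and matching on the URL rather than on `frame.name()` is my own substitution.

```javascript
// Hypothetical helper: find a frame by URL substring and read a link's href in it.
async function hrefInFrame(page, urlPart, selector) {
  // page.frames() lists every frame currently attached to the page.
  const frame = page.frames().find((f) => f.url().includes(urlPart));
  if (!frame) {
    throw new Error(`No frame whose URL contains "${urlPart}"`);
  }
  // Wait inside the frame, since ad content is often injected late.
  await frame.waitForSelector(selector);
  return frame.$eval(selector, (el) => el.href);
}
```

A call could look like `await hrefInFrame(page, 'adServer', '#sp_hqp_shared_inner > div > a')`, where `'adServer'` is a guess at a distinguishing part of the ad frame's URL.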
