I'm trying to scrape YouTube Shorts from a specific YouTube Channel, using Puppeteer running on MeteorJs Galaxy.
Here's the code that I've done so far:
import puppeteer from 'puppeteer';
import { YouTubeShorts } from '../imports/api/youTubeShorts'; //meteor mongo local instance
let URL = 'https://www.youtube.com/@ummahtoday1513/shorts'
const processShortsData = (iteratedData) => {
let documentExist = YouTubeShorts.findOne({ videoId:iteratedData.videoId })
if(documentExist === undefined) { //undefined meaning this incoming shorts in a new one
YouTubeShorts.insert({
videoId: iteratedData.videoId,
title: iteratedData.title,
thumbnail: iteratedData.thumbnail,
height: iteratedData.height,
width: iteratedData.width
})
}
}
const fetchShorts = () => {
puppeteer.launch({
headless:true,
args:[
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--single-process'
]
})
.then( async function(browser){
async function fetchingData(){
new Promise(async function(resolve, reject){
const page = await browser.newPage();
await Promise.all([
await page.setDefaultNavigationTimeout(0),
await page.waitForNavigation({waitUntil: "domcontentloaded"}),
await page.goto(URL, {waitUntil:["domcontentloaded", "networkidle2"]}),
await page.waitForSelector('ytd-rich-grid-slim-media', { visible:true }),
new Promise(async function(resolve,reject){
page.evaluate(()=>{
const trialData = document.getElementsByTagName('ytd-rich-grid-slim-media');
const titles = Array.from(trialData).map(i => {
const singleData = {
videoId: i.data.videoId,
title: i.data.headline.simpleText,
thumbnail: i.data.thumbnail.thumbnails[0].url,
height: i.data.thumbnail.thumbnails[0].height,
width: i.data.thumbnail.thumbnails[0].width,
}
return singleData
})
resolve(titles);
})
}),
])
await page.close()
})
await browser.close()
}
async function fetchAndProcessData(){
const datum = await fetchingData()
console.log('DATUM:', datum)
}
await fetchAndProcessData()
})
}
fetchShorts();
I am struggling with two things here:
Async, await, and promises, and
Finding the reason why Puppeteer outputs the error ProtocolError: Protocol error (Target.createTarget): Target closed. in the console.
I'm new to Puppeteer and have been trying to learn from various examples on Stack Overflow and Google in general, but I'm still having trouble getting it right.
A general word of advice: code slowly and test frequently, especially when you're in an unfamiliar domain. Try to minimize problems so you can understand what's failing. There are many issues here, giving the impression that the code was written in one fell swoop without incremental validation. There's no obvious entry point to debugging this.
Let's examine some failing patterns.
First, basically never use new Promise() when you're working with a promise-based API like Puppeteer. This is discussed in the canonical What is the explicit promise construction antipattern and how do I avoid it? so I'll avoid repeating the answers there.
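As a quick illustration of that antipattern, here is a minimal sketch. fetchTitle is a hypothetical stand-in for any promise-returning call (such as a Puppeteer method); it is not part of the original code.

```javascript
// Hypothetical stand-in for a promise-returning API call.
const fetchTitle = async () => "Example Domain";

// Antipattern: wrapping an existing promise in new Promise().
// Rejections from fetchTitle() are never forwarded to reject(),
// so a failure would leave this promise pending forever.
const getTitleWrapped = () =>
  new Promise(resolve => {
    fetchTitle().then(title => resolve(title));
  });

// Preferred: return the promise you already have.
const getTitle = () => fetchTitle();

getTitle().then(title => console.log(title)); // logs the title
```

Both versions resolve to the same value on success, but only the second one propagates errors to the caller.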
Second, don't mix async/await and then. The point of promises is to flatten code and avoid pyramids of doom. If you find you have 5-6 deeply nested functions, you're misusing promises. In Puppeteer, there's basically no need for then.
Third, setting timeouts to infinity with page.setDefaultNavigationTimeout(0) suppresses errors. It's fine if you want a long delay, but if a navigation is taking more than a few minutes, something is wrong and you want an error so you can understand and debug it rather than having the script wait silently until you kill it, with no clear diagnostics as to what went wrong or where it failed.
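If the default 30-second timeout is too short, one option is to raise it to a finite value rather than disabling it. A minimal sketch, assuming one minute for navigations and 30 seconds for other waits are reasonable budgets for your case (both numbers and the helper name configureTimeouts are illustrative):

```javascript
// Illustrative helper: finite timeouts, so a stuck page fails loudly
// with a TimeoutError instead of hanging forever.
const configureTimeouts = page => {
  page.setDefaultNavigationTimeout(60000); // goto, waitForNavigation, ...
  page.setDefaultTimeout(30000); // waitForSelector and other waits
};

// Usage, given a Puppeteer page: configureTimeouts(page);
```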
Fourth, watch out for pointless calls to waitForNavigation. Code like this doesn't make much sense:
await page.waitForNavigation(...);
await page.goto(...);
What navigation are you waiting for? This seems ripe for triggering timeouts, or worse yet, infinite hangs after you've set navigations to never time out.
Fifth, avoid premature abstractions. You have various helper functions but you haven't established functionally correct code, so these just add to the confused state of affairs. Start with correctness, then add abstractions once the cut points become obvious.
Sixth, avoid Promise.all() when all of the contents of the array are sequentially awaited. In other words:
await Promise.all([
await foo(),
await bar(),
await baz(),
await quux(),
garply(),
]);
is identical to:
await foo();
await bar();
await baz();
await quux();
await garply();
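To see the difference, here is a small self-contained sketch that needs no browser. The task helper is hypothetical and only simulates asynchronous work with timers:

```javascript
const {setTimeout: sleep} = require("node:timers/promises");

// Simulated async work that records when it starts and ends.
const task = async (name, ms, log) => {
  log.push(`start ${name}`);
  await sleep(ms);
  log.push(`end ${name}`);
};

// Each await finishes before the next array element is even evaluated,
// so Promise.all() receives plain values and adds nothing.
const sequential = async () => {
  const log = [];
  await Promise.all([await task("a", 50, log), await task("b", 10, log)]);
  return log; // start a, end a, start b, end b
};

// Without the inner awaits, both tasks start immediately and overlap.
const concurrent = async () => {
  const log = [];
  await Promise.all([task("a", 50, log), task("b", 10, log)]);
  return log; // start a, start b, end b, end a
};
```

In a Puppeteer script the steps usually depend on each other (navigate, then wait, then extract), so plain sequential awaits are typically the right choice.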
Seventh, always return promises if you have them:
const fetchShorts = () => {
puppeteer.launch({
// ..
should be:
const fetchShorts = () => {
return puppeteer.launch({
// ..
This way, the caller can await the function's completion. Without it, it gets launched into the void and can never be connected with the caller's flow.
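A minimal sketch of the difference, using a hypothetical launch function in place of puppeteer.launch:

```javascript
// Hypothetical stand-in for puppeteer.launch().
const launch = async () => "browser";

// No return: the caller receives undefined and cannot wait for the work.
const fireAndForget = () => {
  launch().then(browser => browser);
};

// Returned promise: the caller can await the result and catch errors.
const awaitable = () => {
  return launch().then(browser => browser);
};

console.log(fireAndForget()); // undefined
awaitable().then(browser => console.log(browser)); // browser
```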
Eighth, evaluate doesn't have access to variables in Node, so this pattern doesn't work:
new Promise(resolve => {
page.evaluate(() => resolve());
});
Instead, avoid the new promise antipattern and use the promise that Puppeteer already returns to you:
await page.evaluate(() => {});
Better yet, use $$eval here since it's an abstraction of the common pattern of selecting elements first thing in evaluate.
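One way to keep that callback manageable is to define it as a standalone function and pass it to $$eval. This is a sketch, assuming the elements expose the same data shape as in the question; the name extractShorts and the fake data are illustrative. Puppeteer serializes the function and runs it in the page, so it must not reference any variables from the surrounding Node scope:

```javascript
// Self-contained callback: uses only its argument, no outer variables.
const extractShorts = els =>
  els.map(({data: {videoId, headline, thumbnail: {thumbnails}}}) => ({
    videoId,
    title: headline.simpleText,
    thumbnail: thumbnails[0].url,
    height: thumbnails[0].height,
    width: thumbnails[0].width,
  }));

// In a script: await page.$$eval("ytd-rich-grid-slim-media", extractShorts);

// Because it is a plain function, it can be tried on fake data in Node.
const fakeElements = [
  {
    data: {
      videoId: "abc123",
      headline: {simpleText: "Hello"},
      thumbnail: {
        thumbnails: [{url: "https://example.com/t.jpg", height: 720, width: 405}],
      },
    },
  },
];
console.log(extractShorts(fakeElements));
```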
Putting all of this together, here's a rewrite:
const puppeteer = require("puppeteer"); // ^19.6.3
const url = "<Your URL>";
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
await page.waitForSelector("ytd-rich-grid-slim-media");
const result = await page.$$eval("ytd-rich-grid-slim-media", els =>
els.map(({data: {videoId, headline, thumbnail: {thumbnails}}}) => ({
videoId,
title: headline.simpleText,
thumbnail: thumbnails[0].url,
height: thumbnails[0].height,
width: thumbnails[0].width,
}))
);
console.log(result);
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
Note that I ensure browser cleanup with finally so the process doesn't hang in case the code throws.
Now, all we want is a bit of text, so there's no sense in loading much of the extra stuff YouTube downloads. You can speed up the script by blocking anything unnecessary to your goal:
const [page] = await browser.pages();
await page.setRequestInterception(true);
page.on("request", req => {
if (
req.url().startsWith("https://www.youtube.com") &&
["document", "script"].includes(req.resourceType())
) {
req.continue();
}
else {
req.abort();
}
});
// ...
Note that ["domcontentloaded", "networkidle2"] is basically the same as "networkidle2" since "domcontentloaded" will happen long before "networkidle2". But please avoid "networkidle2" here since all you need is some text, which doesn't depend on all network resources.
Once you've established correctness, if you're ready to factor this to a function, you can do so:
const fetchShorts = async () => {
const url = "<Your URL>";
let browser;
try {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
await page.waitForSelector("ytd-rich-grid-slim-media");
return await page.$$eval("ytd-rich-grid-slim-media", els =>
els.map(({data: {videoId, headline, thumbnail: {thumbnails}}}) => ({
videoId,
title: headline.simpleText,
thumbnail: thumbnails[0].url,
height: thumbnails[0].height,
width: thumbnails[0].width,
}))
);
}
finally {
await browser?.close();
}
};
fetchShorts()
.then(shorts => console.log(shorts))
.catch(err => console.error(err));
But keep in mind, making the function responsible for managing the browser resource hampers its reusability and slows it down considerably. I usually let the caller handle the browser and make all of my scraping helpers accept a page argument:
const fetchShorts = async page => {
const url = "<Your URL>";
await page.goto(url, {waitUntil: "domcontentloaded"});
await page.waitForSelector("ytd-rich-grid-slim-media");
return await page.$$eval("ytd-rich-grid-slim-media", els =>
els.map(({data: {videoId, headline, thumbnail: {thumbnails}}}) => ({
videoId,
title: headline.simpleText,
thumbnail: thumbnails[0].url,
height: thumbnails[0].height,
width: thumbnails[0].width,
}))
);
};
(async () => {
let browser;
try {
browser = await puppeteer.launch();
const [page] = await browser.pages();
console.log(await fetchShorts(page));
}
catch (err) {
console.error(err);
}
finally {
await browser?.close();
}
})();
Here is the example code:
"use strict";
const puppeteer = require("puppeteer");
(async () => {
try {
const browser = await puppeteer.launch();
console.log(`browser=${browser}`);
var cnt_pages = (await browser.pages()).length;
console.log(`${cnt_pages} pages`);
} catch (error) {
console.error(error);
console.error(`can not launch`);
process.exit();
}
console.log(`browser=${browser}`);
var cnt_pages = (await browser.pages()).length;
console.log(`cnt_pages ${cnt_pages}`);
input("continue?");
})();
As a result, I get
(node:13408) UnhandledPromiseRejectionWarning: ReferenceError: browser is not defined
at S:\!kyxa\!code\play_chrome_cdp\nodejs_1\!node_tutorial\!play_async\try_catch_browser.js:15:26
at processTicksAndRejections (internal/process/task_queues.js:93:5)
at emitUnhandledRejectionWarning (internal/process/promises.js:168:15)
at processPromiseRejections (internal/process/promises.js:247:11)
at processTicksAndRejections (internal/process/task_queues.js:94:32)
(node:13408) ReferenceError: browser is not defined
at S:\!kyxa\!code\play_chrome_cdp\nodejs_1\!node_tutorial\!play_async\try_catch_browser.js:15:26
at processTicksAndRejections (internal/process/task_queues.js:93:5)
(node:13408) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
at emitDeprecationWarning (internal/process/promises.js:180:11)
at processPromiseRejections (internal/process/promises.js:249:13)
at processTicksAndRejections (internal/process/task_queues.js:94:32)
browser=[object Object]
1 pages
As far as I can see, the browser is available and working inside the try block, but after the try-catch block it is not available.
Can someone please explain what is happening?
I've explored the issue. I define browser with const inside the try block, but I also use it after the try/catch. const declarations are block-scoped, so the variable only exists inside the block where it is declared.
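A minimal sketch of that scoping rule, without Puppeteer. The function names are illustrative, and typeof is used so that the out-of-scope access yields "undefined" instead of throwing a ReferenceError:

```javascript
// const inside try is scoped to the try block only.
const scopedToTry = () => {
  try {
    const value = 42;
  } catch (err) {
    // not reached
  }
  return typeof value; // value is not visible here
};

// Declaring with let before the try makes it visible afterwards.
const declaredOutside = () => {
  let value;
  try {
    value = 42;
  } catch (err) {
    // not reached
  }
  return typeof value;
};

console.log(scopedToTry()); // undefined
console.log(declaredOutside()); // number
```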
This is the working code:
"use strict";
const puppeteer = require("puppeteer");
(async () => {
var browser = null;
try {
browser = await puppeteer.launch();
console.log(`browser=${browser}`);
var cnt_pages = (await browser.pages()).length;
console.log(`${cnt_pages} pages`);
} catch (error) {
console.error(error);
console.error(`can not launch`);
process.exit();
}
console.log(`browser=${browser}`);
var cnt_pages = (await browser.pages()).length;
console.log(`cnt_pages ${cnt_pages}`);
})();
You can hoist the declaration out of the block as let browser and drop the const, but even after fixing this scoping issue, the browser is still never closed, and any errors thrown after the try/catch block go uncaught. Here's my preferred Puppeteer boilerplate that handles these situations:
const puppeteer = require("puppeteer");
const scrape = async page => {
// write your code here
const url = "https://www.example.com";
await page.goto(url, {waitUntil: "domcontentloaded"});
console.log(await page.title());
};
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await scrape(page);
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
I have a simple piece of code
describe('My First Puppeeteer Test', () => {
it('Should launch the browser', async function() {
const browser = await puppeteer.launch({ headless: false})
const page = await browser.newPage()
await page.goto('https://github.com/login')
await page.type('#login_field', testLogin)
await page.type('#password', testPassword)
await page.click('[name="commit"]')
await page.waitForNavigation()
let [element] = await page.$x('//h3[@class="text-normal"]')
let helloText = await page.evaluate(element => element.textContent, element);
console.log(helloText);
browser.close();
})
})
Everything worked before but today I get an error + my stacktrace:
Error: Evaluation failed: TypeError: Cannot read properties of undefined (reading 'textContent')
at puppeteer_evaluation_script:1:21
at ExecutionContext._evaluateInternal (node_modules\puppeteer\lib\cjs\puppeteer\common\ExecutionContext.js:221:19)
at processTicksAndRejections (node:internal/process/task_queues:96:5)
at async ExecutionContext.evaluate (node_modules\puppeteer\lib\cjs\puppeteer\common\ExecutionContext.js:110:16)
at async Context.<anonymous> (tests\example.tests.js:16:22)
How can I resolve this?
Kind regards
I haven't tested the code because it requires a login, and I'm assuming your selectors are correct, but the main problem is almost certainly that
await page.click('[name="commit"]')
await page.waitForNavigation()
creates a race condition. The docs clarify:
Bear in mind that if click() triggers a navigation event and there's a separate page.waitForNavigation() promise to be resolved, you may end up with a race condition that yields unexpected results. The correct pattern for click and wait for navigation is the following:
const [response] = await Promise.all([
page.waitForNavigation(waitOptions),
page.click(selector, clickOptions),
]);
As a side point, it's probably better to do waitForXPath rather than $x, although this seems less likely the root problem. Don't forget to await all promises such as browser.close().
const puppeteer = require("puppeteer");
let browser;
(async () => {
browser = await puppeteer.launch({headless: true});
const [page] = await browser.pages();
await page.goto('https://github.com/login');
await page.type('#login_field', testLogin);
await page.type('#password', testPassword);
// vvvvvvvvvvv
await Promise.all([
page.click('[name="commit"]'),
page.waitForNavigation(),
]);
const el = await page.waitForXPath('//h3[@class="text-normal"]');
// ^^^^^^^^^^^^
//const el = await page.waitForSelector("h3.text-normal"); // ..or
const text = await el.evaluate(el => el.textContent);
console.log(text);
//await browser.close();
//^^^^^ missing await, or use finally as below
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
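If the click-and-wait pattern appears in several places, it could be wrapped in a small helper. This is a sketch: the name clickAndWaitForNavigation is illustrative, and page is assumed to be a Puppeteer Page.

```javascript
// Start waiting for the navigation before the click can trigger it,
// then resolve once both the click and the navigation have completed.
const clickAndWaitForNavigation = async (page, selector, waitOptions) => {
  const [response] = await Promise.all([
    page.waitForNavigation(waitOptions),
    page.click(selector),
  ]);
  return response;
};

// Usage: await clickAndWaitForNavigation(page, '[name="commit"]');
```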
Additionally, if you're using Jest, once you get things working, you might want to move the browser and page management to beforeEach/afterEach or beforeAll/afterAll blocks. It's faster to use the same browser instance for all test cases, and pages can be opened and closed before/after each case.
I'm trying to write a test for the sad path of this function:
const awaitFirstStreamForPage = async page => {
try {
await page.waitForSelector('[data-stream="true"]', {
timeout: MAX_DELAY_UNTIL_FIRST_STREAM,
})
} catch (e) {
throw new Error(`no stream found for ${MAX_DELAY_UNTIL_FIRST_STREAM}ms`)
}
}
I managed to write a test that passes, but it takes 10 seconds to run because it actually waits for the full timeout to elapse.
describe('awaitFirstStreamForPage()', () => {
it('given a page and no active stream appearing: should throw', async () => {
jest.setTimeout(15000)
const browser = await puppeteer.launch({ headless: true })
const page = await getPage(browser)
let error
try {
await awaitFirstStreamForPage(page)
} catch (err) {
error = err
}
const actual = error.message
const expected = 'no stream found for 10000ms'
expect(actual).toEqual(expected)
await browser.close()
jest.setTimeout(5000)
})
})
There is probably a way to solve it using Jest's fake timers, but I couldn't get it to work. Here is my best attempt:
const flushPromises = () => new Promise(res => process.nextTick(res))
describe('awaitFirstStreamForPage()', () => {
it('given a page and no active stream appearing: should throw', async () => {
jest.useFakeTimers()
const browser = await puppeteer.launch({ headless: true })
const page = await getPage(browser)
let error
try {
awaitFirstStreamForPage(page)
jest.advanceTimersByTime(10000)
await flushPromises()
} catch (err) {
error = err
}
const actual = error.message
const expected = 'no stream found for 10000ms'
expect(actual).toEqual(expected)
await browser.close()
jest.useRealTimers()
})
})
which fails and throws with
(node:9697) UnhandledPromiseRejectionWarning: Error: no stream found for 10000ms
Even though I wrapped the failing function in a try/catch. How do you test a function like this using fake timers?
try..catch can only catch a rejection from awaitFirstStreamForPage(page) if the promise is awaited inside the try block.
The rejection should still be handled, but only after calling advanceTimersByTime and potentially after flushPromises.
For example:
const promise = awaitFirstStreamForPage(page);
promise.catch(() => { /* suppress UnhandledPromiseRejectionWarning */ });
jest.advanceTimersByTime(10000)
await flushPromises();
await expect(promise).rejects.toThrow('no stream found for 10000ms');
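The underlying rule can be demonstrated without Jest or Puppeteer. In this sketch, failLater is a hypothetical function that rejects after a short delay:

```javascript
const {setTimeout: sleep} = require("node:timers/promises");

// Hypothetical async function that always rejects after 10 ms.
const failLater = async () => {
  await sleep(10);
  throw new Error("boom");
};

// Not awaited: the try block exits before the rejection happens.
const withoutAwait = async () => {
  try {
    const promise = failLater();
    promise.catch(() => {}); // avoid an unhandled rejection in this demo
    return "not caught";
  } catch (err) {
    return "caught";
  }
};

// Awaited: the rejection is thrown inside the try block.
const withAwait = async () => {
  try {
    await failLater();
    return "not caught";
  } catch (err) {
    return "caught";
  }
};

withoutAwait().then(result => console.log(result)); // not caught
withAwait().then(result => console.log(result)); // caught
```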
The problem doesn’t seem to be the use of fake timers: the error you expected is the one being thrown. However, when testing functions that throw errors in Jest, you should wrap the error-throwing code in a function, like this:
expect(()=> {/* code that will throw error */}).toThrow()
More details here: https://jestjs.io/docs/en/expect#tothrowerror
Edit: For an async function, you should use rejects before toThrow; see this example: Can you write async tests that expect toThrow?
Here is my code:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://google.com/');
await page.screenshot({path: 'example.png'});
await browser.close();
})();
No matter what website I attempt to screenshot, I always get the following error:
(node:9548) UnhandledPromiseRejectionWarning: TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded
I'm running node version 8.16.0. I have no idea why I always get this timeout. Any help is appreciated.
EDIT:
It does seem to work when I run it with headless mode turned off, but I need it to run as a headless browser.
Try to increase the navigation timeout:
await page.goto('https://google.com/', { waitUntil: 'load', timeout: 50000 });
and add try/catch:
try {
await page.goto('https://google.com/', { waitUntil: 'load', timeout: 50000 });
} catch(e) {
console.log('Error--->', e);
}