How to use evaluateOnNewDocument and exposeFunction? - javascript

Recently, I used Puppeteer for a new project.
I have a few questions about a part of the API I don't understand. The documentation for these methods is very brief:
page.exposeFunction
page.evaluateOnNewDocument
Can I have a detailed demo to gain a better understanding?

Summary:
The Puppeteer function page.exposeFunction() essentially allows you to access Node.js functionality within the Page DOM Environment.
On the other hand, page.evaluateOnNewDocument() evaluates a predefined function when a new document is created and before any of its scripts are executed.
The Puppeteer Documentation for page.exposeFunction() states:
page.exposeFunction(name, puppeteerFunction)
name <string> Name of the function on the window object
puppeteerFunction <function> Callback function which will be called in Puppeteer's context.
returns: <Promise>
The method adds a function called name on the page's window object. When called, the function executes puppeteerFunction in node.js and returns a Promise which resolves to the return value of puppeteerFunction.
If the puppeteerFunction returns a Promise, it will be awaited.
NOTE Functions installed via page.exposeFunction survive navigations.
An example of adding an md5 function into the page:
const puppeteer = require('puppeteer');
const crypto = require('crypto');
puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  page.on('console', msg => console.log(msg.text()));
  await page.exposeFunction('md5', text =>
    crypto.createHash('md5').update(text).digest('hex')
  );
  await page.evaluate(async () => {
    // use window.md5 to compute hashes
    const myString = 'PUPPETEER';
    const myHash = await window.md5(myString);
    console.log(`md5 of ${myString} is ${myHash}`);
  });
  await browser.close();
});
An example of adding a window.readfile function into the page:
const puppeteer = require('puppeteer');
const fs = require('fs');
puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  page.on('console', msg => console.log(msg.text()));
  await page.exposeFunction('readfile', async filePath => {
    return new Promise((resolve, reject) => {
      fs.readFile(filePath, 'utf8', (err, text) => {
        if (err)
          reject(err);
        else
          resolve(text);
      });
    });
  });
  await page.evaluate(async () => {
    // use window.readfile to read contents of a file
    const content = await window.readfile('/etc/hosts');
    console.log(content);
  });
  await browser.close();
});
Furthermore, the Puppeteer Documentation for page.evaluateOnNewDocument explains:
page.evaluateOnNewDocument(pageFunction, ...args)
pageFunction <function|string> Function to be evaluated in browser context
...args <...Serializable> Arguments to pass to pageFunction
returns: <Promise>
Adds a function which would be invoked in one of the following scenarios:
whenever the page is navigated
whenever the child frame is attached or navigated. In this case, the function is invoked in the context of the newly attached frame
The function is invoked after the document was created but before any of its scripts were run. This is useful to amend the JavaScript environment, e.g. to seed Math.random.
An example of overriding the navigator.languages property before the page loads:
// preload.js
// overwrite the `languages` property to use a custom getter
Object.defineProperty(navigator, "languages", {
  get: function() {
    return ["en-US", "en", "bn"];
  }
});
// In your Puppeteer script, assuming preload.js is in the same folder as the script
const fs = require('fs');
const preloadFile = fs.readFileSync('./preload.js', 'utf8');
await page.evaluateOnNewDocument(preloadFile);
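Because pageFunction may also be a function and ...args are passed along to it, the same override can probably be written without a separate file. The following is a minimal sketch, assuming `page` comes from `browser.newPage()`; the helper name `overrideLanguages` is hypothetical and not part of Puppeteer.

```javascript
// Hypothetical helper (not a Puppeteer API): registers a function that runs
// before any of the page's own scripts and overrides navigator.languages.
async function overrideLanguages(page, languages) {
  // `languages` is serialized and handed to the function as `langs`.
  await page.evaluateOnNewDocument(langs => {
    // This body runs in the browser, not in Node.js.
    Object.defineProperty(navigator, 'languages', {
      get: () => langs,
    });
  }, languages);
}
```

It would be called once before navigating, for example `await overrideLanguages(page, ['en-US', 'en', 'bn']);` followed by `page.goto(...)`.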

Related

How do you return an object from the browser environment to the Node environment in Puppeteer?

I have the following code that attempts to scrape all the 'Add to basket' button elements from the page, put them in an array and return that array to the Node environment.
const puppeteer = require('puppeteer');
let getArrayofButtons = async () => {
  const browser = await puppeteer.launch({
    devtools: 'true',
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 1800 });
  await page.goto('http://books.toscrape.com/', {
    waitUntil: 'domcontentloaded',
  });
  await page.waitForSelector('.product_pod');
  let buttons = [];
  await page.evaluate(() => {
    buttons = [...document.querySelectorAll('*')].filter(e =>
      [...e.childNodes].find(n => n.nodeValue?.match('basket'))
    );
    console.log(buttons);
  });
  // browser.close();
};
getArrayofButtons().then(returnedButtons => {
  console.log(returnedButtons);
});
When I console.log(buttons); I can see the array of button elements in the browser environment, but when I try to return that array to the Node environment I get undefined.
My understanding is that page.evaluate() will return the value of the function passed to it, so if I replace:
buttons = [...document.querySelectorAll('*')].filter(e => [...e.childNodes].find(n => n.nodeValue?.match('basket')) );
with:
return [...document.querySelectorAll('*')].filter(e => [...e.childNodes].find(n => n.nodeValue?.match('basket')) );
it seems like it should work. Am I not resolving the Promise correctly?
You can call evaluateHandle to get a pointer to that result.
const arrayHandle = await page.evaluateHandle(() => {
  const buttons = [...document.querySelectorAll('*')].filter(e =>
    [...e.childNodes].find(n => n.nodeValue?.match('basket'))
  );
  return buttons;
});
Notice that arrayHandle is not an array. It is a JSHandle pointing to the array in the browser.
If you want to process each button on your side, you will need to process that handle by calling its getProperties function.
const properties = await arrayHandle.getProperties();
await arrayHandle.dispose();
const buttons = [];
for (const property of properties.values()) {
  const elementHandle = property.asElement();
  if (elementHandle)
    buttons.push(elementHandle);
}
Yes, that is quite a bit of boilerplate. Alternatively, you could keep that handle (without disposing of it first) and pass it to an evaluate function.
await page.evaluate((elements) => elements[0].click(), arrayHandle);
Unfortunately, page.evaluate() can only transfer serializable data (roughly, the data JSON can handle). DOM elements are not serializable. Consider returning an array of strings or something like that (HTML markup, attributes, text content etc).
Also, buttons is declared in the puppeteer (Node.js) context and is not available in browser context (in page.evaluate() function argument context). So you need const buttons = await page.evaluate() here.
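As an illustration of returning serializable data instead of DOM elements, here is a minimal sketch. It assumes `page` is an already navigated Puppeteer page; the helper name `getBasketButtonTexts` is hypothetical, and the filter is copied from the question.

```javascript
// Hypothetical helper: returns the text of every element that has a direct
// text node mentioning "basket", as plain strings.
async function getBasketButtonTexts(page) {
  // Only the return value of the page function crosses over to Node.js,
  // so the DOM elements are mapped to strings inside the browser first.
  const texts = await page.evaluate(() =>
    [...document.querySelectorAll('*')]
      .filter(e => [...e.childNodes].find(n => n.nodeValue?.match('basket')))
      .map(e => e.textContent.trim())
  );
  return texts;
}
```

In the question's code this would replace the `let buttons = []` assignment pattern: the array is received as the resolved value of `page.evaluate()` rather than written to an outer variable.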

How to use top level Async await released in v8 typescript

I'm trying hard to understand what exactly this new feature (top-level async await) from the V8 features list means.
When I try it in vanilla JS, the results seem the same to me. Here is what I tried:
(() => {
  let test1 = async() =>
    async() => {
      return 'true';
    };
  (async() => {
    let result = await test1();
    result = await result();
    console.log('r', result)
  })();
})()
I want to know what exactly this feature means and how to use it.
Here is v8's document. To me it is fairly self-explanatory, and I personally find the feature very handy.
Previously, you couldn't just write await someAsyncFunction() out of nowhere, because await could only be used inside an async function.
Example:
main.js
const fs = require('fs');
const util = require('util');
const unlink = util.promisify(fs.unlink); // promisify unlink function
await unlink('file_path'); // delete file
The above code would not work. The last line would give you an error. So, what we did previously is something like this:
async function main() {
  const fs = require('fs');
  const util = require('util');
  const unlink = util.promisify(fs.unlink); // promisify unlink function
  await unlink('file_path'); // delete file
}
main();
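But now you don't have to do this. With top-level await the first snippet works, provided the file is loaded as an ES module (where you would use import instead of require); top-level await is not available in CommonJS scripts.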
THIS ANSWER IS BASED ON MY UNDERSTANDING
Top level async await allows you to await Promises returned by async functions at the top level of a module, without having to declare a separate async function. Most importantly, you can now conveniently export values returned by async functions.
For example, without this feature, you need to create a separate async function (the usual "main" async function), or use Promise.then in order to do something with the returned value at the top-level, and cannot simply export the returned value.
let test = async () => 'true';
test().then(result => console.log('r', result));
// or even more verbose
(async () => {
  console.log(await test());
})();
// This exports a Promise, not the returned value, "true".
export let result = test();
// This throws an Error because export should be at the top-level.
(async () => {
  export let result = await test();
})();
But with this new feature, you can simply do:
let test = async () => 'true';
export let result = await test();
console.log(result);
This feature is especially useful when you want to export a value that has to be obtained asynchronously; for example, a value you get from the network at run time, or a module such as a big encryption suite that is large and loads slowly and asynchronously.

How to execute Node.js code in the browser context?

How do I execute Node.js code (not just browser JavaScript code) within the page.evaluate() statement?
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
  await page.evaluate(() => {
    document.querySelector('button[type=submit]').click();
  });
  console.log('yes')
  await browser.close();
})();
The first parameter passed to page.evaluate() should be a function which will be evaluated in the page context in the browser.
Node.js is server-side code, and is meant to be executed on the server.
You can pass arguments from the Node.js environment to the page function using the following method:
// Node.js Environment
const hello_world = 'Hello, world! (from Node.js)';
await page.evaluate(hello_world => {
  // Browser Page Environment
  console.log(hello_world);
}, hello_world);
You can listen for the 'console' event to occur in the page context and print the result using page.on():
page.on('console', msg => {
  for (let i = 0; i < msg.args().length; i++) {
    console.log(`${i}: ${msg.args()[i]}`);
  }
});

Issue with NodeJS async/await - accessing function parameter

I am trying to do some scraping using a library, and my code uses Node's async/await pattern.
I have defined a variable 'page' in a function named 'sayhi' and I pass that variable to the function ex, but I get an error when running the code.
const puppeteer = require('puppeteer');
async function sayhi() {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://www.example.com/');
  ex(page); //FAILS
  var frames2 = await page.frames(); // WORKS
}
function ex(newpage){
  var frames = await newpage.frames(); // FAILING
}
sayhi();
You're using await in a function that isn't an async function. Try this instead:
async function ex(newpage) {
If you need frames2 to run only after ex is finished completely, you'll also want to await ex(page); in sayhi.
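Putting both changes together, a minimal sketch could look like this. It assumes `page` is passed in by the caller (the question creates it inside sayhi), and the return values are added only so the result can be inspected.

```javascript
// `newpage` is expected to be a Puppeteer Page; only frames() is used here.
async function ex(newpage) {
  // `await` is allowed because ex is now declared async.
  const frames = await newpage.frames();
  return frames.length;
}

async function sayhi(page) {
  // Awaiting ex() ensures it has finished before the next line runs.
  const count = await ex(page);
  const frames2 = await page.frames();
  return { count, frames2 };
}
```

Without the `await` in front of `ex(page)`, sayhi would continue immediately and ex would finish at some later point.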

How can I dynamically inject functions to evaluate using Puppeteer? [duplicate]

I am using Puppeteer for headless Chrome. I wish to evaluate a function inside the page that uses parts of other functions, defined dynamically elsewhere.
The code below is a minimal example / proof. In reality functionToInject() and otherFunctionToInject() are more complex and require the page's DOM.
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(someURL);
var functionToInject = function(){
  return 1+1;
}
var otherFunctionToInject = function(input){
  return 6
}
var data = await page.evaluate(function(functionToInject, otherFunctionToInject){
  console.log('woo I run inside a browser')
  return functionToInject() + otherFunctionToInject();
});
return data
When I run the code, I get:
Error: Evaluation failed: TypeError: functionToInject is not a function
Which I understand: functionToInject isn't being passed into the page's JS context. But how do I pass it into the page's JS context?
You can add functions to the page context with addScriptTag:
const browser = await puppeteer.launch();
const page = await browser.newPage();
function functionToInject (){
  return 1+1;
}
function otherFunctionToInject(input){
  return 6
}
await page.addScriptTag({ content: `${functionToInject} ${otherFunctionToInject}`});
var data = await page.evaluate(function(){
  console.log('woo I run inside a browser')
  return functionToInject() + otherFunctionToInject();
});
console.log(data);
await browser.close();
This example is a quick-and-dirty solution based on string concatenation. A cleaner approach would be to pass a url or path to the addScriptTag method.
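The addScriptTag approach relies on the fact that interpolating a function into a template literal yields its source text. The sketch below shows that mechanism in plain Node.js, without Puppeteer; `new Function` is used here only to imitate a separate JavaScript context and is not what Puppeteer does internally.

```javascript
function functionToInject() {
  return 1 + 1;
}

function otherFunctionToInject(input) {
  return 6;
}

// Interpolating a function calls its toString(), which returns its source code.
const scriptContent = `${functionToInject}\n${otherFunctionToInject}`;

// Evaluating that source elsewhere recreates both functions by name.
const runInFreshScope = new Function(
  `${scriptContent}\nreturn functionToInject() + otherFunctionToInject();`
);

console.log(runInFreshScope()); // 8
```

Because only source text is transferred, any variables a function closes over in Node.js are not carried into the page; the injected functions have to be self-contained.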
Or use exposeFunction (note that the exposed functions now return a Promise):
const browser = await puppeteer.launch();
const page = await browser.newPage();
var functionToInject = function(){
  return 1+1;
}
var otherFunctionToInject = function(input){
  return 6
}
await page.exposeFunction('functionToInject', functionToInject);
await page.exposeFunction('otherFunctionToInject', otherFunctionToInject);
var data = await page.evaluate(async function(){
  console.log('woo I run inside a browser')
  return await functionToInject() + await otherFunctionToInject();
});
console.log(data);
await browser.close();
A working example is accessible via the link; in the same repo you can see the tested component.
it("click should return option value", async () => {
  const optionToReturn = "ClickedOption";
  const page = await newE2EPage();
  const mockCallBack = jest.fn();
  await page.setContent(
    `<list-option option='${optionToReturn}'></list-option>`
  );
  await page.exposeFunction("functionToInject", mockCallBack); // Inject function
  await page.$eval("list-option", (elm: any) => {
    elm.onOptionSelected = this.functionToInject; // Assign function
  });
  await page.waitForChanges();
  const element = await page.find("list-option");
  await element.click();
  expect(mockCallBack.mock.calls.length).toEqual(1); // Check calls
  expect(mockCallBack.mock.calls[0][0]).toBe(optionToReturn); // Check argument
});
You can also use page.exposeFunction(), which makes your function return a Promise (requiring the use of async and await). This happens because your function does not run inside the browser but inside your Node.js application, and its arguments and results are sent back and forth between Node.js and the browser code.
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(someURL);
var functionToInject = function(){
  return 1+1;
}
var otherFunctionToInject = function(input){
  return 6
}
await page.exposeFunction("functionToInject", functionToInject)
await page.exposeFunction("otherFunctionToInject", otherFunctionToInject)
var data = await page.evaluate(async function(){
  console.log('woo I run inside a browser')
  return await functionToInject() + await otherFunctionToInject();
});
return data
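When several functions have to be exposed, the repeated exposeFunction calls can be folded into a loop. This is only a sketch: the helper name `exposeAll` is hypothetical and not part of Puppeteer, and `page` is assumed to be a Puppeteer page.

```javascript
// Hypothetical helper: exposes every function in `fns` under its key name.
async function exposeAll(page, fns) {
  for (const [name, fn] of Object.entries(fns)) {
    // Adds window[name] in the page; calling it there runs fn in Node.js
    // and resolves with fn's return value.
    await page.exposeFunction(name, fn);
  }
}
```

With the functions from the example above, the two calls would become `await exposeAll(page, { functionToInject, otherFunctionToInject });`.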
Related questions:
exposeFunction() does not work after goto()
exposed function queryseldtcor not working in puppeteer
How to use evaluateOnNewDocument and exposeFunction?
exposeFunction remains in memory?
Puppeteer: pass variable in .evaluate()
Puppeteer evaluate function
allow to pass a parameterized funciton as a string to page.evaluate
Functions bound with page.exposeFunction() produce unhandled promise rejections
How to pass a function in Puppeteers .evaluate() method?
Why can't I access 'window' in an exposeFunction() function with Puppeteer?
