Target an iFrame using Node.js/Puppeteer

Is it possible to target an iFrame when using the GUI Workflow builder in AWS Cloudwatch Synthetics?
I've set up the canary to log in to a website and redirect the page, which has run successfully, but one of the elements I need to check with Node.js is within an iframe, which isn't being recognised.
This is the iframe code. It is loaded by JavaScript, but all of its content comes from the same domain:
<iframe id="paramsFrame" src="empty.htm" frameborder="0" ppTabId="-1"
onload="paramsDocumentLoaded('paramsFrame', true);"></iframe>
This is the code I'm using for this section, but it's just returning a timeout error:
await synthetics.executeStep('verifyText', async function () {
  const elementHandle = await page.waitForSelector('#paramsFrame');
  const frame = await elementHandle.contentFrame();
  await frame.waitForXPath("//div[@class='css7'][contains(text(),'Specificity')]", { timeout: 30000 });
});
This code is trying to target a div with class css7 found within an iframe with id paramsFrame.
Edit: I did a null check on frame and it came back as not null; I'm not sure if that is relevant.
I also tried to target an element directly:
const next = await frame.waitForSelector('.protocol-name-link');
but I got the error message:
TimeoutError: waiting for selector .protocol-name-link

If the iframe is on a different origin (e.g. different domain), you cannot access it through Puppeteer.
You can try to disable some of the browser's security features, although this is not advised.
Specifically, you'd want to add these args to puppeteer.launch:
--disable-web-security
--disable-features=IsolateOrigins,site-per-process
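A minimal sketch of a launch call with those flags (hedged: suitable for local testing only, since disabling web security is risky):
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({
  // Both flags relax cross-origin isolation so cross-origin iframes
  // become reachable from Puppeteer.
  args: [
    '--disable-web-security',
    '--disable-features=IsolateOrigins,site-per-process',
  ],
});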

I tried running similar code on a website which had a YouTube iframe, and I didn't need the Puppeteer launch args, i.e.
args: [
  "--disable-web-security",
  "--disable-features=IsolateOrigins,site-per-process",
],
But first, I would suggest confirming that it is the same iframe that you need, for example by logging, debugging, or simply inspecting it in the dev console.
Second, try using the full XPath of the element in the frame.
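To confirm which frames Puppeteer actually sees, you can log every frame attached to the page (a quick diagnostic sketch; run it after the page has loaded):
// Print the name and URL of each frame so you can verify that the
// target iframe has really been attached.
for (const frame of page.frames()) {
  console.log(frame.name(), frame.url());
}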
Here is the code I tried running:
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  console.log("open page");
  await page.goto("https://captioncrusher.com/");
  console.log("page opened");
  // Use this if you want to wait for all the requests to be done.
  // await page.waitForNetworkIdle();
  const elementHandle = await page.waitForSelector("iframe.yt");
  const frame = await elementHandle.contentFrame();
  // These both work for me:
  const aLink = await frame.waitForXPath("/html/body/div/div/a");
  const classLink = await frame.waitForSelector(".ytp-impression-link");
  await browser.close();
})();

Related

Browser console and Puppeteer cannot find certain selector

I am trying to build a simple scraper for this website using Puppeteer.
The code goes as follows:
const puppeteer = require("puppeteer");

const browser = await puppeteer.launch({
  headless: false
});
const page = await browser.newPage();
let pagelink = "https://www.speisekarte.de/berlin/restaurants?page=1";
await page.waitFor(3 * 1000); // note: waitFor is deprecated in newer Puppeteer releases
await page.goto(pagelink);
await page.waitFor(3 * 1000);
await page.waitForSelector("#notice");
However, I cannot access the overlay notice for the cookies, which should have the id "notice".
await page.waitForSelector("#notice") does not work in my Puppeteer code,
nor does document.getElementById("notice") in Chromium, if I use Chromium's console during the session manually. It also does not work if I use it in Firefox's console. Funnily enough, chunks like
document.querySelectorAll("button")
work as expected. I checked with a colleague, and she can access the element using the above-mentioned queries in her Chrome and in her Firefox browser. She also uses a Mac. Any idea what is happening here? Any help would be much appreciated.
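One thing worth checking, given the theme of this page, is whether the cookie notice is rendered inside an iframe; in that case the top-level document would never contain #notice. A diagnostic sketch:
// Search every frame on the page for #notice; consent overlays are
// often injected inside an iframe, which would explain why the
// top-level document.getElementById("notice") finds nothing.
for (const frame of page.frames()) {
  if (await frame.$("#notice")) {
    console.log("found #notice in frame:", frame.url());
  }
}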

How to execute a command in an iframe of a popup

I'm trying to bypass a captcha on a website, and for that I need to execute a command in an iframe of a popup, and I cannot find a way to do that. Here is my code:
const cookie = {
  name: 'login_email',
  value: 'example@domain.com',
  domain: '.paypal.com',
  url: 'https://www.paypal.com/',
  path: '/',
  httpOnly: true,
  secure: true
}
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch({ headless: false, defaultViewport: null });
  const page = await browser.newPage();
  await page.setCookie(cookie);
  await page.goto('https://www.paypal-dobijeni.cz/');
  await page.waitForSelector('#login');
  await page.click('#login');
  const newPagePromise = new Promise(x => page.once('popup', x));
  const popup = await newPagePromise;
  await popup.waitForSelector('#password');
  await popup.type('#password', 'examplepassword');
  await popup.click('#btnLogin');
  await popup.waitForSelector('form[name="challenge"]');
})();
The command that I need to execute is verifyCallback('<g-recaptcha-response>').
UPDATE: This is how I do it in the console:
First I select the iframe.
Then I execute the command with the g-recaptcha-response I get from my captcha-solving service.
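The Puppeteer equivalent is to locate the captcha frame inside the popup and evaluate the call there. A hedged sketch, assuming the frame can be recognised by its URL (the 'recaptcha' substring is an assumption, as is the token variable holding the solver's response):
// token is the g-recaptcha-response obtained from the solving service.
const token = '<g-recaptcha-response>';
// Find the captcha iframe by URL, then call the page-defined
// verifyCallback inside that frame's context.
const captchaFrame = popup.frames().find(f => f.url().includes('recaptcha'));
await captchaFrame.evaluate(t => verifyCallback(t), token);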
This isn't really the solution you are looking for, but I'll post it in case you decide you want to use it.
First, I use argv to parse arguments passed to the script. One of the arguments the user can pass is headless.
When the script runs, I find some way to detect when captchas pop up, and if one is detected while the browser is headless, I log something close to "Captcha appeared, run the script with headless set to false and solve the captcha".
When the script is executed with headless set to false and a captcha is detected, I await a Promise holding a one-second interval that checks whether the captcha has left the page (a sketch of this follows below). With the browser no longer headless, you can solve the captcha manually. When the captcha is gone, the interval is cleared, the Promise is resolved, and the rest of the script executes.
If you are lucky, the captcha won't need to be solved again for that IP address.
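A minimal sketch of that polling Promise, assuming '#captcha' is the selector for the captcha container (a hypothetical selector; the post doesn't name one):
// Resolve once the captcha element has left the page, polling once
// per second while the user solves it manually in the headful browser.
await new Promise(resolve => {
  const interval = setInterval(async () => {
    if (!(await page.$('#captcha'))) {
      clearInterval(interval);
      resolve();
    }
  }, 1000);
});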

Can the browser be turned headless mid-execution when it was started normally, or vice-versa?

I want to start a Chromium browser instance headless, do some automated operations, and then turn it visible before doing the rest of the stuff.
Is this possible to do using Puppeteer, and if it is, can you tell me how? And if it is not, is there any other framework or library for browser automation that can do this?
So far I've tried the following but it didn't work.
const browser = await puppeteer.launch({'headless': false});
browser.headless = true;
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
await page.pdf({path: 'hn.pdf', format: 'A4'});
Short answer: It's not possible
Chrome only allows you to start the browser in either headless or non-headless mode. You have to specify it when you launch the browser, and it is not possible to switch during runtime.
What is possible is to launch a second browser and reuse cookies (and any other data) from the first browser.
Long answer
You would assume that you could just reuse the data directory when calling puppeteer.launch, but this is currently not possible due to multiple bugs (#1268, #1270 in the puppeteer repo).
So the best approach is to save any cookies or local storage data that you need to share between the browser instances and restore the data when you launch the second browser. You then visit the website a second time. Be aware that any state the website holds in JavaScript variables will be lost when you recrawl the page.
Process
Summing up, the whole process should look like this (or vice versa for switching from headless to headful); a code sketch follows the list:
Crawl in non-headless mode until you want to switch mode
Serialize cookies
Launch or reuse second browser (in headless mode)
Restore cookies
Revisit page
Continue crawling
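A sketch of that process, assuming cookies alone carry the session (local storage would need the same treatment) and using a placeholder URL:
const puppeteer = require('puppeteer');

(async () => {
  // Crawl headfully until it is time to switch.
  const browser1 = await puppeteer.launch({ headless: false });
  const page1 = await browser1.newPage();
  await page1.goto('https://example.com');
  const cookies = await page1.cookies(); // serialize cookies
  await browser1.close();

  // Launch the second, headless browser and restore the session.
  const browser2 = await puppeteer.launch({ headless: true });
  const page2 = await browser2.newPage();
  await page2.setCookie(...cookies); // restore cookies
  await page2.goto('https://example.com'); // revisit the page
  // ...continue crawling headlessly
  await browser2.close();
})();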
As mentioned, this isn't currently possible since the headless switch occurs via Chromium launch flags.
I usually do this with userDataDir, which the Chromium docs describe as follows:
The user data directory contains profile data such as history, bookmarks, and cookies, as well as other per-installation local state.
Here's a simple example. This launches a browser headlessly, sets a local storage value on an arbitrary page, closes the browser, re-opens it headfully, retrieves the local storage value and prints it.
const puppeteer = require("puppeteer"); // ^18.0.4
const url = "https://www.example.com";
const opts = {userDataDir: "./data"};
let browser;
(async () => {
  {
    browser = await puppeteer.launch({...opts, headless: true});
    const [page] = await browser.pages();
    await page.goto(url, {waitUntil: "domcontentloaded"});
    await page.evaluate(() => localStorage.setItem("hello", "world"));
    await browser.close();
  }
  {
    browser = await puppeteer.launch({...opts, headless: false});
    const [page] = await browser.pages();
    await page.goto(url, {waitUntil: "domcontentloaded"});
    const result = await page.evaluate(() => localStorage.getItem("hello"));
    console.log(result); // => world
  }
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Change const opts = {userDataDir: "./data"}; to const opts = {}; and you'll see null print instead of world; the user data doesn't persist.
The answer from a few years ago mentions issues with userDataDir and suggests a cookies solution. That's fine, but I haven't had any issues with userDataDir so either they've been resolved on the Puppeteer end or my use cases haven't triggered the issues.
There's a useful-looking answer from a reputable source in How to turn headless on after launch? but I haven't had a chance to try it yet.

Load any url content and follow XPATH in JS

What I would like to do is load a page and get the content of something through an XPath, a selector, or a JS path, and then use the value I get in my program. How could I do that?
For instance, on this page, doing a request using the URL of the page and following this path (while also targeting the type somehow; here it is the class):
//*[@id="question-header"]/h1/a
would give me 'Load any url content and follow XPATH in JS',
as I am getting the text inside this:
Load any url content and follow XPATH in JS
If you need the most reliable way to get some data from a web page, i.e. including data that can be generated by JavaScript execution on the client side, you can use a headless browser driver. For example, the described task can be accomplished with Node.js and puppeteer in the script below (selectors and XPath are supported, as well as all of the Web API, via evaluation of code fragments in the browser context and exchange of data between the Node.js and browser contexts):
'use strict';
const puppeteer = require('puppeteer');
(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();
    await page.goto('https://stackoverflow.com/questions/54847748/load-any-url-content-and-follow-xpath-in-js');
    const data = await page.evaluate(() => {
      return document.querySelector('#question-header > h1 > a').innerText;
    });
    console.log(data);
    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
Well, you could use something like
document.getElementById('question-header').children[0].children[0].innerText;
It's not as dynamic as XPath (note the redundancy of the children), but it should do the trick if you're facing a static structure. For Node.js, there are several libraries that could do this as well, such as libxmljs or parse5.
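If the content is server-rendered (i.e. present in the raw HTML without client-side JavaScript), one of those libraries can apply the same XPath without a browser. A hedged sketch with libxmljs, assuming a Node version with global fetch:
const libxmljs = require('libxmljs');

(async () => {
  const res = await fetch('https://stackoverflow.com/questions/54847748/load-any-url-content-and-follow-xpath-in-js');
  const html = await res.text();
  // Parse the static HTML and evaluate the XPath from the question.
  const doc = libxmljs.parseHtml(html);
  const node = doc.get('//*[@id="question-header"]/h1/a');
  console.log(node && node.text()); // => the question title
})();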

How can use puppeteer with my current chrome (keeping my credentials)

I'm currently trying to use Puppeteer for scraping, and I need to use my current Chrome profile to keep all my credentials, instead of re-logging in and typing the password each time, which is a real waste of time!
Is there a way to connect to it? How can I do that?
I'm using Node v11.1.0 and Puppeteer 1.10.0.
let scrape = async () => {
  const browser = await log();
  const page = await browser.newPage();
  const delayScroll = 200;
  // Login
  await page.goto('somesite.com');
  await page.type('#login-email', '*******');
  await page.type('#login-password', '******');
  await page.click('#login-submit');
  // Wait to login
  await page.waitFor(1000);
};
Ideally I would not need that login at all and could go on the page headlessly (I don't want to see the page opening; I'm just using the scraped info in Node) with my current Chrome, which does not need to log in to get the information I need (because in the end I want to use it as an extension of Chrome).
Thanks in advance if someone knows how to do that.
First, welcome to the community.
You can use Chrome instead of Chromium, but honestly, in my case it produced a lot of errors and made a mess of my personal tabs. So you can instead create and save a profile, and then log in with a current or a new account.
In your code you have a function called "log"; I'm guessing that is where you launch Puppeteer:
const browser = await log()
Inside that function, use launch arguments to create a relative directory for your profile data:
const browser = await puppeteer.launch({
  args: ["--user-data-dir=./Google/Chrome/User Data/"]
});
Run your application and log in with an account; the next time you run it, you should still be logged in with your credentials.
If you have any doubts, please add a comment.
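If you would rather drive your installed Chrome (with its existing profile) than the bundled Chromium, you can also point executablePath at the Chrome binary. A sketch with hypothetical, platform-dependent paths:
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({
  // Both paths are assumptions: adjust for your OS and install location.
  executablePath: '/usr/bin/google-chrome',
  userDataDir: './Google/Chrome/User Data/',
});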
