For my tests, I would like to log in to this page: https://www.ebay-kleinanzeigen.de/m-einloggen.html
When first requested, it returns a page like the following:
<html><head><meta charset="utf-8">
<script>
(function(){ /* some logic */ })();
</script>
</head><body></body></html>
The script defines some functions plus an anonymous, immediately invoked function that runs when the browser loads the page.
In a normal browser, this function fires an XHR request (in whose response the server sets cookies) and then reloads the same page, which, thanks to those cookies, now contains the login form.
To see this in action, open a private tab in your favorite browser, open the dev tools, set the network log to persist, and visit the page: the first network requests are the initial document, the XHR, and the reload.
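For reference, a minimal Puppeteer sketch (plain API, nothing site-specific assumed) that logs document and XHR requests, so the cookie-setting XHR and the reload are visible from a script as well:

import puppeteer from 'puppeteer';

const trace = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // log each document and XHR request as it fires
  page.on('request', (request) => {
    if (['document', 'xhr'].includes(request.resourceType())) {
      console.log(request.method(), request.resourceType(), request.url());
    }
  });
  await page.goto('https://www.ebay-kleinanzeigen.de/m-einloggen.html');
  // give the anonymous function a chance to fire its XHR and reload
  await page.waitForNavigation({ timeout: 10000 }).catch(() => {});
  await browser.close();
};

trace();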
Using the following Puppeteer script, the browser doesn't execute the anonymous function and gets stuck waiting for the login form, which never appears:
import puppeteer from 'puppeteer';

const main = async () => {
  try {
    const browser = await puppeteer.launch({ devtools: true });
    const page = await browser.newPage();
    await page.goto('https://www.ebay-kleinanzeigen.de/m-einloggen.html');
    await page.waitForSelector('#login-form', { visible: true });
    await page.screenshot({ path: 'login.png', fullPage: true });
    await browser.close();
  } catch (e) {
    console.log('error', e);
  }
}

main();
I can't use page.evaluate because the body of the function is dynamically generated by the server.
Is there a way to get this anonymous function executed at page load?
I am using Puppeteer to get page data, but unfortunately I can't replay all of the requests myself.
Hence the question: after opening the site, how do I get the JSON contained in the responses of all Fetch/XHR requests named v2?
As I understand it, this requires some kind of waiting.
Peeking into one request and replaying a similar one is not an option, since the request body contains code that is randomly generated each time; that is why I simply want to dump all JSON responses from the requests named v2.
I am attaching a screenshot and my code. Please point me in the right direction; I will be grateful for any help!
// puppeteer-extra is a drop-in replacement for puppeteer;
// it augments the installed puppeteer with plugin functionality
import puppeteer from "puppeteer-extra";
// add stealth plugin and use defaults (all evasion techniques)
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

export async function ProductAPI() {
  try {
    puppeteer.use(StealthPlugin());
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('here goes link for website');
    const pdata = await page.content(); // this just prints HTML
    console.log(pdata);
    await browser.close();
  } catch (err) {
    throw err;
  }
}

ProductAPI();
Link to the screenshot: https://i.stack.imgur.com/ZR6T1.png
I know that the code I wrote just returns HTML. I'm just trying to figure out how to get the data I need; I googled for a very long time but could not find the answer.
It is very important that this runs on Node.js (JavaScript); whether it uses Puppeteer or something else doesn't matter.
This works!
import puppeteer from "puppeteer-extra";
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

async function SomeFunction() {
  puppeteer.use(StealthPlugin());
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // log the body of every response whose URL matches
  page.on('response', async (response) => {
    if (response.url().includes('write_link_here')) {
      console.log('XHR response received');
      const HTMLdata = await response.text();
      console.log(HTMLdata);
    }
  });
  await page.goto('some_website_link');
}

SomeFunction();
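One small follow-up, assuming the v2 responses really are JSON: Puppeteer's response.json() parses the body directly, so the filter-and-parse step could look like this, with the 'v2' substring standing in for the request name from the question:

page.on('response', async (response) => {
  // only Fetch/XHR responses whose URL contains "v2"
  const type = response.request().resourceType();
  if ((type === 'fetch' || type === 'xhr') && response.url().includes('v2')) {
    try {
      const data = await response.json(); // parse the JSON body
      console.log(JSON.stringify(data, null, 2));
    } catch (e) {
      // body wasn't valid JSON; skip it
    }
  }
});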
Here's the scoop.
I'm trying to use Puppeteer v18.0.5 with the bundled Chromium browser against a specific website, on Node v16.16.0. However, when I enable request interception via page.setRequestInterception(true), all of the HTTP requests for image resources are lost: my handler is invoked far less often while intercepting than when not intercepting, and the page never fires any requests for images. When I disable the interception, the page loads normally. Yes, I know about invoking continue() on all requests; I'm currently doing that in the request handler on the page.
I've also pored over the Puppeteer issues pages and found similar symptoms reported against some earlier Puppeteer versions, but those were all different issues that have since been resolved. This seems unique.
I've looked through Puppeteer source code as well as CDP events to try and find any explanation, but have found none.
As an important note for anyone trying to reproduce this, you must be proxied through a server in the London general area in order to successfully load this site.
Here's my code to reproduce:
const puppeteer = require('puppeteer');

(async () => {
  const options = {
    browserWidth: 1366,
    browserHeight: 983,
    intercepting: false
  };
  const browser = await puppeteer.launch({
    args: [`--window-size=${options.browserWidth},${options.browserHeight}`],
    defaultViewport: { width: options.browserWidth, height: options.browserHeight },
    headless: false
  });
  const page = (await browser.pages())[0];
  page.on('request', async (request) => {
    console.log(`Request: ${request.method()} | ${request.url()} | ${request.resourceType()} | ${request._requestId}`);
    if (options.intercepting) await request.continue();
  });
  await page.setRequestInterception(options.intercepting);
  await page.goto('https://vegas.williamhill.com', { waitUntil: 'networkidle2', timeout: 65000 });
  // To give a moment to view the page in headful mode before closing browser.
  await new Promise(resolve => setTimeout(resolve, 5000));
  await browser.close();
})();
Here's what the page looks like with intercepting disabled:
[screenshot: expected page load]
Here's what the page looks like with intercepting enabled and continuing all requests:
[screenshot: page load while intercepting and continuing all requests]
With request interception disabled my handler is invoked for 104 different requests. But with the interception enabled it's only invoked 22 times. I'm not hitting a navigation timeout as the .goto() method returns before my timeout each time.
Any insight into what configuration/strategy I'm missing would be immensely appreciated.
Maybe you are intercepting some JavaScript files that initiate the requests that you are not seeing?
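One way to test that theory is to log failed and finished requests side by side while interception is on. Note also that, per the Puppeteer docs, enabling request interception disables page caching, which on its own can change which requests fire. A small diagnostic sketch:

page.on('requestfailed', (request) => {
  // request.failure() returns { errorText } or null
  console.log('FAILED:', request.resourceType(), request.url(), request.failure()?.errorText);
});
page.on('requestfinished', (request) => {
  console.log('FINISHED:', request.resourceType(), request.url());
});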
Is it possible to target an iframe when using the GUI Workflow builder in AWS CloudWatch Synthetics?
I've set up the canary to log in to a website and redirect the page, which has run successfully, but one of the elements I need to check with Node.js is within an iframe, which isn't being recognised.
This is the iframe code. It loads from JavaScript, but all content is from the same domain:
<iframe id="paramsFrame" src="empty.htm" frameborder="0" ppTabId="-1"
onload="paramsDocumentLoaded('paramsFrame', true);"></iframe>
This is the code I'm using for this section, but it's just returning a timeout error:
await synthetics.executeStep('verifyText', async function () {
  const elementHandle = await page.waitForSelector('#paramsFrame');
  const frame = await elementHandle.contentFrame();
  // note: XPath attribute tests use @class, not #class
  await frame.waitForXPath("//div[@class='css7'][contains(text(),'Specificity')]", { timeout: 30000 });
});
This code tries to target a div with class css7 inside an iframe with id paramsFrame.
Edit: I did a null check on frame and it came back as not null; not sure if that is relevant.
I also tried to target an element directly:
const next = await frame.waitForSelector('.protocol-name-link');
but I got the error message:
TimeoutError: waiting for selector .protocol-name-link
If the iframe is on a different origin (e.g. a different domain), you cannot access it through Puppeteer.
You can try to disable some of Chromium's security features, although this is not advised.
Specifically, you'd probably want to add these args to puppeteer.launch:
--disable-web-security
--disable-features=IsolateOrigins,site-per-process
I tried running similar code on a website which had a YouTube iframe, and I didn't need the Puppeteer launch args, i.e.:
args: [
  "--disable-web-security",
  "--disable-features=IsolateOrigins,site-per-process",
],
But first, I'd suggest confirming that the iframe is really the one you need, by logging, debugging, or even just poking around in the dev console.
Second, try using the full XPath of the element inside the frame.
Here is the code I tried:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  console.log("open page");
  await page.goto("https://captioncrusher.com/");
  console.log("page opened");
  // use this if you want to wait for all the requests to be done.
  // await page.waitForNetworkIdle();
  const elementHandle = await page.waitForSelector("iframe.yt");
  const frame = await elementHandle.contentFrame();
  // These both work for me:
  const aLink = await frame.waitForXPath("/html/body/div/div/a");
  const classLink = await frame.waitForSelector(".ytp-impression-link");
  await browser.close();
})();
I'm using Puppeteer in my Node JS app to get the URLs in a redirect chain, e.g. going from one URL to the next. Up until this point I've been creating ngrok URLs which use simple PHP header functions to redirect a user with 301 and 302 responses, and my starting URL is a page that redirects to one of the ngrok URLs after a few seconds.
However, it appears that Network.requestWillBeSent stops firing once it comes across a page that uses a JavaScript redirection, and I need it to somehow wait and pick those up as well.
Example journey of URLs:
1. START -> https://example.com/ <-- a setTimeout redirects to an ngrok URL
2. The ngrok URL uses PHP to redirect with a 301
3. Some other ngrok URL uses a JS setTimeout to redirect to, for example, another https://example.com/
4. FINISH -> https://example.com/
In this situation, Network.requestWillBeSent picks up 1 and 2, but finishes on 3 and thus doesn't get to 4.
So rather than it console logging all four URLs, I only get two.
It's difficult to create a reproduction since I can't set up all the ngrok URLs, etc., but here's a CodeSandbox link and a GitHub link; attached below is my code:
const dayjs = require('dayjs');
const AdvancedFormat = require('dayjs/plugin/advancedFormat');
dayjs.extend(AdvancedFormat);
const puppeteer = require('puppeteer');

async function runEmulation () {
  const goToUrl = 'https://example.com/';

  // vars
  const journey = [];
  let hopDataToReturn;

  // initiate a Puppeteer instance with options and launch
  const browser = await puppeteer.launch({
    headless: false
  });

  // launch a new page
  const page = await browser.newPage();

  // initiate a new CDP session
  const client = await page.target().createCDPSession();
  await client.send('Network.enable');

  await client.on('Network.requestWillBeSent', async (e) => {
    // if not a document, skip
    if (e.type !== 'Document') return;
    console.log(`adding URL to journey: ${e.documentURL}`);

    // the journey
    journey.push({
      url: e.documentURL,
      type: e.redirectResponse ? e.redirectResponse.status : 'JS Redirection',
      duration_in_ms: 0,
      duration_in_sec: 0,
      loaded_at: dayjs().valueOf()
    });
  });

  await page.goto(goToUrl);
  await page.waitForNavigation();
  await browser.close();

  console.log('=== JOURNEY ===');
  console.log(journey);
}

// init
runEmulation();
What am I missing inside Network.requestWillBeSent, or what do I need to add, in order to pick up sites in the middle of the chain that use JS to redirect to another site after a few seconds?
Since client.on('Network.requestWillBeSent') takes a callback function, you cannot use await on it; await only makes sense with an expression that produces a Promise, and every async function returns one.
As you need to wait for the callback function to finish executing, you can move your follow-up code inside the callback:
client.on('Network.requestWillBeSent', async (e) => {
  // if not a document, skip
  if (e.type !== 'Document') return;
  console.log(`adding URL to journey: ${e.documentURL}`);

  // the journey
  journey.push({
    url: e.documentURL,
    type: e.redirectResponse ? e.redirectResponse.status : 'JS Redirection',
    duration_in_ms: 0,
    duration_in_sec: 0,
    loaded_at: dayjs().valueOf()
  });

  await page.goto(goToUrl);
  await page.waitForNavigation();
  await browser.close();

  console.log('=== JOURNEY ===');
  console.log(journey);
});
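Separately, for picking up the JS redirects themselves, page.on('framenavigated') may be worth a try: it fires for every navigation of a frame, including client-side redirects that never produce a redirectResponse. A minimal sketch reusing the asker's journey array (the type value is a stand-in, since no HTTP status exists for a JS hop):

page.on('framenavigated', (frame) => {
  // record main-frame navigations only, not nested iframes
  if (frame === page.mainFrame()) {
    journey.push({
      url: frame.url(),
      type: 'navigation', // no CDP-level status is available here
      loaded_at: dayjs().valueOf()
    });
  }
});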
I'm trying to bypass a captcha on a website, and for that I need to execute a command in an iframe of a popup, and I cannot find a way to do that. Here is my code:
const puppeteer = require('puppeteer');

const cookie = {
  name: 'login_email',
  value: 'example#domain.com',
  domain: '.paypal.com',
  url: 'https://www.paypal.com/',
  path: '/',
  httpOnly: true,
  secure: true
};

(async () => {
  const browser = await puppeteer.launch({ headless: false, defaultViewport: null });
  const page = await browser.newPage();
  await page.setCookie(cookie);
  await page.goto('https://www.paypal-dobijeni.cz/');
  await page.waitForSelector('#login');
  await page.click('#login');
  const newPagePromise = new Promise(x => page.once('popup', x));
  const popup = await newPagePromise;
  await popup.waitForSelector('#password');
  await popup.type('#password', 'examplepassword');
  await popup.click('#btnLogin');
  await popup.waitForSelector('form[name="challenge"]');
})();
The command that I need to execute is verifyCallback('<g-recaptcha-response>').
UPDATE: This is how I do it in the console:
First I select the iframe.
Then I execute the command with the g-recaptcha-response token I get from my captcha-solving service.
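From Puppeteer, those two console steps could be sketched like this, assuming (hypothetically) that the captcha iframe's URL contains 'recaptcha', that verifyCallback is defined inside that frame, and that token holds the g-recaptcha-response from the solving service:

// find the captcha iframe inside the popup by URL substring (assumed pattern)
const frame = popup.frames().find(f => f.url().includes('recaptcha'));
if (frame) {
  // call the page's own callback with the solved token
  await frame.evaluate((t) => verifyCallback(t), token);
}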
This isn't really the solution you are looking for, but I'll post it in case you decide you want to use it.
First, I use argv to parse arguments passed to the script. One of the arguments the user can pass is headless.
When the script runs, I detect captchas popping up in some way; if one is detected and the browser is headless, I log something close to "Captcha appeared, run the script with headless set to false and solve the captcha".
When the script is executed with headless set to false and a captcha is detected, I await a Promise holding a one-second interval that checks whether the captcha has left the page. With the browser no longer headless, you can solve the captcha manually; when it is gone, the interval is cleared, the Promise resolves, and the rest of the script executes. A sketch of that wait follows.
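A minimal sketch of that polling wait, assuming (hypothetically) that the captcha can be detected through a .g-recaptcha container:

// resolves once the (assumed) captcha element disappears from the page
function waitForCaptchaGone(page) {
  return new Promise((resolve) => {
    const interval = setInterval(async () => {
      const captcha = await page.$('.g-recaptcha'); // hypothetical selector
      if (!captcha) {
        clearInterval(interval);
        resolve();
      }
    }, 1000);
  });
}

// later, when a captcha is detected in headful mode:
// await waitForCaptchaGone(page);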
If you are lucky, the captcha won't need to be solved again for that IP address.