I have a script that uses Puppeteer to log in to a corporate portal automatically. The login uses SAML, so when Puppeteer opens an instance of Chromium and visits the page, a popup appears on screen to confirm the user's identity. All I need to do is click the "OK" button manually or press Enter on the keyboard.
I have tried to simulate pressing the Enter key with Puppeteer, but it does not work.
The login screen:
Script:
const puppeteer = require('puppeteer');
async function startDb() {
const browser = await puppeteer.launch({
headless:false,
defaultViewport:null
});
const page = await browser.newPage();
await page.goto("https://example.com");
await page.waitFor(3000);
await page.keyboard.press('Enter');
console.log('Opened')
};
startDb();
**EDIT**
There is a solution proposed in this issue:
Basically, just intercept the request, fire it off yourself using your favorite HTTP client library, and respond to the intercepted request with the response info.
const puppeteer = require('puppeteer');
const request = require('request');
const fs = require('fs');
(async () => {
const browser = await puppeteer.launch();
let page = await browser.newPage();
// Enable Request Interception
await page.setRequestInterception(true);
// Client cert files
const cert = fs.readFileSync('/path/to/cert.crt.pem');
const key = fs.readFileSync('/path/to/cert.key.pem');
page.on('request', interceptedRequest => {
// Intercept Request, pull out request options, add in client cert
const options = {
uri: interceptedRequest.url(),
method: interceptedRequest.method(),
headers: interceptedRequest.headers(),
body: interceptedRequest.postData(),
cert: cert,
key: key
};
// Fire off the request manually (this example uses the 'request' lib)
request(options, function(err, resp, body) {
// Abort interceptedRequest on error
if (err) {
console.error(`Unable to call ${options.uri}`, err);
return interceptedRequest.abort('connectionrefused');
}
// Return retrieved response to interceptedRequest
interceptedRequest.respond({
status: resp.statusCode,
contentType: resp.headers['content-type'],
headers: resp.headers,
body: body
});
});
});
await page.goto('https://client.badssl.com/');
await browser.close();
})();
Before page.goto(), put this code:
page.on('dialog', async dialog => {
await dialog.accept();
});
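For example, dropped into the original script, it might look like the sketch below. Note this only fires for JavaScript dialogs (alert, confirm, prompt, beforeunload); a native browser popup such as a client-certificate prompt will not trigger this event, which is what the interception approach in the EDIT above is for.
const puppeteer = require('puppeteer');

async function startDb() {
  const browser = await puppeteer.launch({ headless: false, defaultViewport: null });
  const page = await browser.newPage();

  // Register the handler before goto so the dialog is accepted as soon as it appears
  page.on('dialog', async dialog => {
    await dialog.accept();
  });

  await page.goto('https://example.com');
  console.log('Opened');
}

startDb();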
Related
I am trying to send a PUT request to the final URL, but before the final URL there is a redirect. Sending a fetch request from inside the final URL's page would also be fine: when I open the devtools console and run fetch from there, it works, but of course I need to do it inside the code.
When I set await page.setRequestInterception(true); and page.once('request', (req) => {...}), it sends the PUT request to the first page, which I don't want.
Let's say the first URL is https://example.com/first --> this redirects to the final URL.
The final URL is https://example.com/final --> this is where I want to send the PUT request and retrieve the status code. I have tried setting a timer, and getting the current URL with page.url() with some if/else statements, but neither worked.
Here is my current code:
app.get('/cookie', async (req, res) => {
puppeteer.use(StealthPlugin());
const browser = await puppeteer.launch({
headless: false,
executablePath: `C:/Program Files (x86)/Google/Chrome/Application/chrome.exe`,
defaultViewport: null,
args: ['--start-maximized'],
slowMo: 150,
});
const page = await browser.newPage();
await page.setUserAgent(randomUserAgent.getRandom());
page.setDefaultNavigationTimeout(0);
page.setJavaScriptEnabled(true);
await page.goto(
'finalURL',
{ waitUntil: 'load', timeout: 0 }
);
await delay(5000);
await page.setRequestInterception(true);
page.once('request', (request) => {
request.continue({
method: 'PUT',
});
page.setRequestInterception(false);
});
let statusCode;
await page.waitForResponse((response) => {
statusCode = response.status();
return true;
});
res.json(statusCode);
});
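For what it's worth, one approach that may help here, shown as a sketch only using the example URLs above: enable interception before navigating and filter on request.url(), so only the request to the final URL is rewritten to a PUT while the redirecting first request passes through untouched.
await page.setRequestInterception(true);
page.on('request', (request) => {
  if (request.url() === 'https://example.com/final') {
    // Rewrite only the navigation to the final URL as a PUT
    request.continue({ method: 'PUT' });
  } else {
    // Let the redirecting first request (and everything else) through unmodified
    request.continue();
  }
});

// Register the response wait before navigating so the response isn't missed
const [response] = await Promise.all([
  page.waitForResponse((r) => r.url() === 'https://example.com/final'),
  page.goto('https://example.com/first', { waitUntil: 'load' }),
]);
console.log(response.status());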
My app starts with a simple HTML form; the inputs are PIN# and Date of Birth.
My Express server runs on the same port, 3000. When the user submits their data, Puppeteer starts and logs into a specific webpage. Then I scrape the image on that webpage, the Google API extracts the text from that image and saves it in an array, and I POST that array string to src/results.html. But as soon as the user hits submit, they are redirected to the /results route immediately and the page says it cannot post the data. When I see in the console (roughly a minute later) that the post was successful, I refresh the page and I get the array of text I wanted to see.
How can I wait for the data to finish being posted to the route before the page loads it? I'm using React on the client side. Below is my server-side code; the client side is just a basic React login page and a static /results page meant for the data.
const puppeteer = require("puppeteer");
const express = require("express");
const app = express();
const morgan = require("morgan");
const fs = require("fs");
const cors = require("cors");
const request = require("request-promise-native").defaults({ Jar: true });
const poll = require("promise-poller").default;
app.use(morgan("combined"));
const port = 3000;
// Imports the Google Cloud client library
const vision = require("#google-cloud/vision");
require("dotenv").config();
app.use(cors());
const textArray = [];
const App = (pinNum, dateOfB) => {
const config = {
sitekey: process.env.SITEKEY,
pageurl: process.env.PAGEURL,
apiKey: process.env.APIKEY,
apiSubmitUrl: "http://2captcha.com/in.php",
apiRetrieveUrl: "http://2captcha.com/res.php",
};
const chromeOptions = {
executablePath: "/Program Files/Google/Chrome/Application/chrome.exe",
headless: true,
slowMo: 60,
defaultViewport: null,
};
async function main() {
const browser = await puppeteer.launch(chromeOptions);
const page = await browser.newPage();
console.log(`Navigating to ${config.pageurl}`);
await page.goto(config.pageurl);
try {
const requestId = await initiateCaptchaRequest(config.apiKey);
// const pin = getPIN();
console.log(`Typing PIN ${pinNum}`);
await page.type("#PIN", pinNum);
// const dob = getDOB();
console.log(`Typing DOB ${dateOfB}`);
const input = await page.$("#DOB");
await input.click({ clickCount: 3 });
await input.type(dateOfB);
const response = await pollForRequestResults(config.apiKey, requestId);
console.log(`Entering recaptcha response ${response}`);
await page.evaluate(
`document.getElementById("g-recaptcha-response").innerHTML="${response}";`
);
console.log(`Submitting....`);
page.click("#Submit");
} catch (error) {
console.log(
"Your request could not be completed at this time, please check your pin number and date of birth. Also make sure your internet connection is working and try again."
);
console.error(error);
}
await page.waitForSelector(
"body > div.container.body-content > div:nth-child(1) > div:nth-child(2) > p"
);
const image = await page.$(
"body > div.container.body-content > div:nth-child(1) > div:nth-child(2) > p"
);
await image.screenshot({
path: "testResults.png",
});
await getImageText();
await page.close(); // Close the website
await browser.close(); //close browser
await deleteImage();
}
main();
//This section grabs the text off the image that was gathered from the web scraper.
async function getImageText() {
// Creates a client
const client = new vision.ImageAnnotatorClient();
console.log(`Looking for text in image`);
// Performs label detection on the image file
const [result] = await client.textDetection("./testResults.png");
const [annotation] = result.textAnnotations;
const text = annotation ? annotation.description : "";
console.log("Extracted text from image:", text);
//Pushed the text into a globally available array.
textArray.push(text);
//Sent a NOTIFICATION ALERT to the client with the text gathered from the image.
var axios = require("axios");
var data = JSON.stringify({
to: "dp8vGNkcYKb-k-72j7t4Mo:APA91bEfrI3_ht89t5X1f3_Y_DACZc9DbWI4VzcYehaQoXtD_IHIFSwm9H1hgXHNq46BQwDTlCKzkWNAHbBGauEXZNQtvhQc8glz4sHQr3JY3KM7OkUEcNB7qMMpCPxRe5GzzHbe3rkE",
notification: {
body: text,
title: "AverHealth Schedule",
},
});
var config = {
method: "post",
url: "https://fcm.googleapis.com/fcm/send",
headers: {
"Content-Type": "application/json",
Authorization: `key=${process.env.FCM_SERVER_KEY}`,
},
data: data,
};
axios(config)
.then(function (response) {
console.log(JSON.stringify(response.data));
})
.catch(function (error) {
console.log(error);
});
}
//Captcha Solver for the web scraper
async function initiateCaptchaRequest(apiKey) {
const formData = {
key: apiKey,
method: "userrecaptcha",
googlekey: config.sitekey,
json: 1,
pageurl: config.pageurl,
};
console.log(
`Submitting recaptcha request to 2captcha for ${config.pageurl}`
);
const response = await request.post(config.apiSubmitUrl, {
form: formData,
});
console.log(response);
return JSON.parse(response).request;
}
async function pollForRequestResults(
key,
id,
retries = 90,
interval = 5000,
delay = 1500
) {
console.log(`Waiting for ${delay} milliseconds....`);
await timeout(delay);
return poll({
taskFn: requestCaptchaResults(key, id),
interval,
retries,
});
}
function requestCaptchaResults(apiKey, requestId) {
const url = `${config.apiRetrieveUrl}?key=${apiKey}&action=get&id=${requestId}&json=1`;
console.log(url);
return async function () {
return new Promise(async function (resolve, reject) {
console.log(`Polling for response...`);
const rawResponse = await request.get(url);
console.log(rawResponse);
const resp = JSON.parse(rawResponse);
console.log(resp);
if (resp.status === 0) return reject(resp.request);
console.log("Response received");
console.log(resp);
resolve(resp.request);
});
};
}
// DELETES THE FILE CREATED BY GOOGLEAPI
function deleteImage() {
const path = "./testResults.png";
try {
fs.unlinkSync(path);
console.log("File removed:", path);
} catch (err) {
console.error(err);
}
}
const timeout = (ms) => new Promise((res) => setTimeout(res, ms));
};
app.use(express.urlencoded({ extended: false }));
// Route to results Page
app.get("/results", (req, res) => {
res.sendFile(__dirname + "/src/results.html");
res.send(textArray);
});
app.post("/results", (req, res) => {
// Insert Login Code Here
let username = req.body.username;
let password = req.body.password;
App(username, password);
});
app.listen(port, () => {
console.log(`Scraper app listening at http://localhost:${port}`);
});
I think I see the problem.
In the React app, you may not be calling e.preventDefault() when you click submit. By default, the browser redirects to the page the form's action attribute points at; if the action attribute is empty, the browser reloads the same page. I would recommend using e.preventDefault() on form submission and then using the fetch API to make the request.
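A sketch of that client-side change (assuming username and password live in the component's state, and that the form posts to the /results route from the server code above):
// In the React login form component
const handleSubmit = async (e) => {
  e.preventDefault(); // stop the default form navigation
  await fetch("/results", {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: new URLSearchParams({ username, password }),
  });
  // Only navigate once the server has finished
  window.location.href = "/results";
};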
In the Express server, the POST "/results" route never sends a response back to the user. You should always send a response. In your case you call the App function, which contains many async functions, but you don't await App() in the POST route, so Express finishes handling the request as soon as App() has been invoked; it does not wait for App() to complete.
You can make the (req, res) => { ... } handler async: async (req, res) => { ... }. Then make App an async function as well, so you can await App(...) in the route handler. You also need to await the main() call inside App(). Once the App() call has finished, you can send a redirect response to the user.
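A sketch of the route with those changes applied (this assumes App has been made async and awaits main() internally):
app.post("/results", async (req, res) => {
  const { username, password } = req.body;
  // Wait for the whole scrape to finish before responding
  await App(username, password);
  // textArray is populated now, so the client can load the results
  res.redirect("/results");
});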
I built an app that uses Puppeteer to scrape data from LinkedIn. I log in using email and password but would like to pass in cookies to authenticate. Here is what I currently use:
const puppeteer = require("puppeteer");
(async () => {
try {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto("https://www.linkedin.com/login");
await page.waitForSelector(loginBtn);
await page.type("#username", username);
await page.type("#password", password);
await page.click(loginBtn, { delay: 30 });
await browser.close();
} catch (error) {
console.log(`Our error = ${error}`);
}
})();
I've seen websites like Phantombuster that use "li_at" cookies to authenticate. https://i.imgur.com/PI8fzao.png
How can I authenticate using cookies?
Disclaimer: I work at Phantombuster ;)
Since logging in sets a cookie in your browser on success, you can replace that step with the direct result:
await page.setCookie({ name: "li_at", value: "[cookie here]", domain: "www.linkedin.com" })
You should then be able to goto any page of the website as if you had been authenticated by the login form.
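In context, that might look like the following sketch (the feed URL is just an example destination; the cookie value is the li_at value copied from a logged-in browser):
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // Set the session cookie before navigating, replacing the login form step
  await page.setCookie({
    name: "li_at",
    value: "[cookie here]",
    domain: "www.linkedin.com",
  });
  await page.goto("https://www.linkedin.com/feed/");
  // ...scrape as an authenticated user...
  await browser.close();
})();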
Using Puppeteer (https://github.com/GoogleChrome/puppeteer), I have a page that is an application/pdf. With headless: false, the page loads through the Chromium PDF viewer, but I want to use headless mode. How can I download the original .pdf file, or use it as a blob with another library such as pdf-parse (https://www.npmjs.com/package/pdf-parse)?
Since Puppeteer does not currently support navigation to a PDF document in headless mode via page.goto() due to the upstream issue, you can use page.setRequestInterception() to enable request interception, and then you can listen for the 'request' event and detect whether the resource is a PDF before using the request client to obtain the PDF buffer.
After obtaining the PDF buffer, you can use request.abort() to abort the original Puppeteer request, or if the request is not for a PDF, you can use request.continue() to continue the request normally.
Here's a full working example:
'use strict';
const puppeteer = require('puppeteer');
const request_client = require('request-promise-native');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', request => {
if (request.url().endsWith('.pdf')) {
request_client({
uri: request.url(),
encoding: null,
headers: {
'Content-type': 'application/pdf',
},
}).then(response => {
console.log(response); // PDF Buffer
request.abort();
});
} else {
request.continue();
}
});
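// Navigating to a PDF fails in headless mode (the upstream issue mentioned above),
// so the navigation error is deliberately swallowed here: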
await page.goto('https://example.com/hello-world.pdf').catch(error => {});
await browser.close();
})();
Grant Miller's solution didn't work for me because I was logged in to the website. But if the PDF is public, that solution works well.
The solution in my case was to add the cookies:
await page.setRequestInterception(true);
page.on('request', async request => {
if (request.url().indexOf('exibirFat.do') > 0) { // This condition is true only on the PDF page (in my case, of course)
const options = {
encoding: null,
method: request.method(),
uri: request.url(),
body: request.postData(),
headers: request.headers()
}
/* add the cookies */
const cookies = await page.cookies();
options.headers.Cookie = cookies.map(ck => ck.name + '=' + ck.value).join(';');
/* resend the request */
const response = await request_client(options);
//console.log(response); // PDF Buffer
const buffer = response;
const filename = 'file.pdf';
fs.writeFileSync(filename, buffer); // Save the file
request.abort(); // settle the intercepted request so the page doesn't hang
} else {
request.continue();
}
});
I'm having a hard time navigating relative URLs with Puppeteer for a specific use case. Below you can see the basic setup and a pseudo-example describing the problem.
Essentially, I want to change the current URL the browser thinks it is at.
What I already tried:
Manipulating the response body by resolving all relative URLs myself. This collides with some JavaScript-based links.
Triggering a new page.goto(response.url) if the request URL doesn't match the response URL, and returning the response from the previous request. I can't seem to pass custom options through, so I don't know which request is a fake page.goto.
Can somebody lend me a helping hand? Thanks in advance.
Setup:
const browser = await puppeteer.launch({
headless: false,
});
const [page] = await browser.pages();
await page.setRequestInterception(true);
page.on('request', async (request) => {
const resourceType = request.resourceType();
if (['document', 'xhr', 'script'].includes(resourceType)) {
// fetching takes place on a different instance and handles redirects internally
const response = await fetch(request);
request.respond({
body: response.body,
statusCode: response.statusCode,
url: response.url // no effect
});
} else {
request.abort('aborted');
}
});
Navigation:
await page.goto('https://start.de');
// redirects to https://redirect.de
await page.click('a');
// relative href '/demo.html' resolves to https://start.de/demo.html instead of https://redirect.de/demo.html
await page.click('a');
Update 1
Solution
Manipulating the browser's current location directly via window.location.
await page.goto('https://start.de');
// redirects to https://redirect.de internally
await page.click('a');
// changing current window location
await page.evaluate(() => {
window.location.href = 'https://redirect.de';
});
// correctly resolves to https://redirect.de/demo.html instead of https://start.de/demo.html
await page.click('a');
When you match the request whose body you want to edit, just take its URL and make a new call with the "node-fetch" or "request" module; when you receive the body, edit it, then send it as the response to the original request.
For example:
const requestModule = require("request");
const cheerio = require("cheerio");
page.on("request", async (request) => {
// Match the url that you want
const isMatched = /page-12/.test(request.url());
if (isMatched) {
// Make a new call
requestModule({
url: request.url(),
resolveWithFullResponse: true,
})
.then((response) => {
const { body, headers, statusCode, statusMessage } = response;
const contentType = headers["content-type"];
// Edit body using cheerio module
const $ = cheerio.load(body);
$("a").each(function () {
$(this).attr("href", "/fake_pathname");
});
// Send response
request.respond({
ok: statusMessage === "OK",
status: statusCode,
contentType,
body: $.html(),
});
})
.catch(() => request.continue());
} else request.continue();
});