I'm practicing with headless browser and I plan to make a little viewerbot. The goal would be to be able to put a site where a stream is broadcasted and to be able to choose a number of viewers to send on the stream with the possibility to increase the number or to reduce it without relaunching the app.
Currently I have some problems with the use of puppeteer-cluster.
1/ I can't find a way to handle the number of active tasks at the same time, how to add or remove at any time. That is to say my number of viewers in this case. Would Puppeteer be better than Puppeteer-cluster for my use?
2/ When the tasks of the cluster go live if I have a timeout problem on a single task it's all the others that crash as well. How can I fix that?
3/ Once the task is launched how to make sure that it never ends, that the viewer is on the page without being detected AFK or that the task is finished.
const {Cluster} = require('puppeteer-cluster');
const vanillaPuppeteer = require('puppeteer')
const {addExtra} = require('puppeteer-extra')
const Stealth = require('puppeteer-extra-plugin-stealth')
async function main() {
const puppeteer = addExtra(vanillaPuppeteer)
puppeteer.use(Stealth())
let viewers = 3;
let live = 'https://a-live-stream.com';
const browserArgs = [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-infobars'
];
const proxies = [
'proxy:port',
'proxy:port',
'proxy:port',
];
let perBrowserOptions = [];
for (let i = 0; i < viewers; i++) {
perBrowserOptions = [...perBrowserOptions, {args: browserArgs.concat(['--proxy-server=' + proxies[i]])}]
}
const cluster = await Cluster.launch({
puppeteerOptions: {
headless: false,
args: browserArgs,
executablePath: 'C:/Program Files (x86)/Google/Chrome/Application/chrome.exe'
},
monitor: false,
puppeteer,
concurrency: Cluster.CONCURRENCY_BROWSER,
maxConcurrency: viewers,
perBrowserOptions: perBrowserOptions
});
cluster.on('taskerror', (err, data) => {
console.log(`Error crawling ${data}: ${err.message}`);
});
const viewer = async ({page, data: url}) => {
await page.goto(url, {waitUntil: 'networkidle2'})
const element = await page.$('iframe')
await element.click()
console.log('#Viewer live')
await page.waitFor(3000000)
console.log('#Closed')
};
cluster.queue(live, viewer)
cluster.queue(live, viewer)
cluster.queue(live, viewer)
await cluster.idle()
await cluster.close()
}
main().catch(console.warn)
Related
I've accessed the spotify website (open.spotify.com) through playwright fine, but when i manually navigate around trying to play a song and click play nothing happens. I've tried both Firefox and Chromium and it seems that they meet the version requirement for spotify. So im not sure what is going wrong.
const baseURL = 'https://open.spotify.com/'
const playwright = require('playwright')
const browserType = ['firefox']
function sleep(ms) {
Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}
async function login(email, password, proxy) {
const browser = await playwright[browserType].launch({headless:false})
/*const browser = await playwright[browserType].launch({
headless: false,
proxy: {
server: proxy,
},
});*/
const page = await browser.newPage();
await page.goto('https://accounts.spotify.com/en/login?continue=https%3A%2F%2Fopen.spotify.com%2F');
let emailField = page.getByPlaceholder('Email address or username')
let passwordField = page.getByPlaceholder('Password')
await emailField.type(email, {delay: 5})
await passwordField.type(password, {delay: 5})
let loginButton = page.getByTestId('login-button')
await loginButton.click()
}
I am trying to build a scraper to monitor web projects automatically.
So far so good, the script is running, but now I want to add a feature that automatically analyses what libraries I used in the projects. The most powerful script for this job is wappalyser. They have a node package (https://www.npmjs.com/package/wappalyzer) and it's written that you can use it combined with pupperteer.
I managed to run pupperteer and to log the source code of the sites in the console, but I don't get the right way to pass the source code to the wappalyzer analyse function.
Do you guys have a hint for me?
I tryed this code but a am getting a TypeError: url.split is not a function
function getLibarys(url) {
(async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto(url);
// get source code with puppeteer
const html = await page.content();
const wappalyzer = new Wappalyzer();
(async function () {
try {
await wappalyzer.init()
// Optionally set additional request headers
const headers = {}
const site = await wappalyzer.open(page, headers)
// Optionally capture and output errors
site.on('error', console.error)
const results = await site.analyze()
console.log(JSON.stringify(results, null, 2))
} catch (error) {
console.error(error)
}
await wappalyzer.destroy()
})()
await browser.close()
})()
}
Fixed it by using the sample code from wappalyzer.
function getLibarys(url) {
const Wappalyzer = require('wappalyzer');
const options = {
debug: false,
delay: 500,
headers: {},
maxDepth: 3,
maxUrls: 10,
maxWait: 5000,
recursive: true,
probe: true,
proxy: false,
userAgent: 'Wappalyzer',
htmlMaxCols: 2000,
htmlMaxRows: 2000,
noScripts: false,
noRedirect: false,
};
const wappalyzer = new Wappalyzer(options)
;(async function() {
try {
await wappalyzer.init()
// Optionally set additional request headers
const headers = {}
const site = await wappalyzer.open(url, headers)
// Optionally capture and output errors
site.on('error', console.error)
const results = await site.analyze()
console.log(JSON.stringify(results, null, 2))
} catch (error) {
console.error(error)
}
await wappalyzer.destroy()
})()
}
I do not know if you still need an answer to this. But this is what a wappalyzer collaborator told me:
Normally you'd run Wappalyzer like this:
const Wappalyzer = require('wappalyzer')
const wappalyzer = new Wappalyzer()
await wappalyzer.init() // Launches a Puppeteer instance
const site = await wappalyzer.open(url)
If you want to use your own browser instance, you can skip wappalyzer.init() and assign the instance to wappalyzer.browser:
const Wappalyzer = require('wappalyzer')
const wappalyzer = new Wappalyzer()
wappalyzer.browser = await puppeteer.launch() // Use your own Puppeteer launch logic
const site = await wappalyzer.open(url)
You can find the discussion here.
Hope this helps.
I am using tessearct.js library in my angular code.
I want to preserve the white spaces, the indentation as it is. How to do it?
Currently I am using this piece of code to do it.
async doOCR {
const worker = createWorker({
logger: m => console.log(m),
});
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const value = await worker.recognize(this.selectedFile);
}
I am looking a method to do it on client side only, that's why not using its python library.
You can give it a try after version (3.04), they have added the preserve_interword_spaces`. You can try this and check if this works:
async doOCR {
const worker = createWorker({
logger: m => console.log(m),
});
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
// there is no proper documentation, but they have added this flag
// to run it as a command
await worker.setParameters({
preserve_interword_spaces: 1,
});
const value = await worker.recognize(this.selectedFile);
}
I'm using Puppeteer.js to crawl some URL. I'm using the default Chromium browser of Puppeteer.All is working well, but the problem is, that when I run the crawling script, and doing other things in the background and the focus is no longer on the Chromium browser of Puppeteer, it's not working: waiting for elements way too long, and abort operations, or in other words: puppeteer is paused (or freeze).
P.S, I'm also using puppeteer-extra and puppeteer-extra-plugin-stealth NPM packages for advance options.
Here is how I create the browser and the page:
async initiateCrawl(isDisableAsserts) {
// Set the browser.
this.isPlannedClose = false;
const browser = await puppeteerExtra.launch({
headless: false,
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--start-maximized',
'--disable-background-timer-throttling',
'--disable-backgrounding-occluded-windows',
'--disable-renderer-backgrounding'
]
});
const pid = browser.process().pid;
browser.on('disconnected', () => {
systemUtils.killProcess(pid);
if (!this.isPlannedClose) {
systemUtils.exit(Status.BROWSER_CLOSE, Color.RED, 0);
}
});
process.on('SIGINT', () => {
this.close(browser, true);
});
// Set the page and close the first empty tab.
const page = await browser.newPage();
const pages = await browser.pages();
if (pages.length > 1) {
await pages[0].close();
}
await page.setRequestInterception(true);
await page.setJavaScriptEnabled(false);
await page.setDefaultNavigationTimeout(this.timeout);
page.on('request', (request) => {
if (isDisableAsserts && ['image', 'stylesheet', 'font', 'script'].indexOf(request.resourceType()) !== -1) {
request.abort();
} else {
request.continue();
}
});
return {
browser: browser,
page: page
};
}
I already looked at:
https://github.com/puppeteer/puppeteer/issues/3339
https://github.com/GoogleChrome/chrome-launcher/issues/169
https://www.gitmemory.com/issue/GoogleChrome/puppeteer/3339/530620329
Not working solutions:
const session = await page.target().createCDPSession();
await session.send('Page.enable');
await session.send('Page.setWebLifecycleState', {state: 'active'});
const chromeArgs = [
'--disable-background-timer-throttling',
'--disable-backgrounding-occluded-windows',
'--disable-renderer-backgrounding'
];
var ops = {args:[
'--kiosks',
'--disable-background-timer-throttling',
'--disable-backgrounding-occluded-windows',
'--disable-renderer-backgrounding',
'--disable-canvas-aa',
'--disable-2d-canvas-clip-aa',
'--disable-gl-drawing-for-tests',
'--disable-dev-shm-usage',
'--no-zygote',
'--use-gl=desktop',
'--enable-webgl',
'--hide-scrollbars',
'--mute-audio',
'--start-maximized',
'--no-first-run',
'--disable-infobars',
'--disable-breakpad',
'--user-data-dir='+tempFolder,
'--no-sandbox',
'--disable-setuid-sandbox'
], headless: false, timeout:0 };
puppeteer = require('puppeteer');
browser = await puppeteer.launch(ops);
page = await browser.newPage();
Has anyone faced this issue before and have any idea how to solve this? Thanks.
My issue was solved when I updated to the latest puppeteer version (9.0.0).
I got this code from another Stackoverflow Question:
import electron from "electron";
import puppeteer from "puppeteer-core";
const delay = (ms: number) =>
new Promise(resolve => {
setTimeout(() => {
resolve();
}, ms);
});
(async () => {
try {
const app = await puppeteer.launch({
executablePath: electron,
args: ["."],
headless: false,
});
const pages = await app.pages();
const [page] = pages;
await page.setViewport({ width: 1200, height: 700 });
await delay(5000);
const image = await page.screenshot();
console.log(image);
await page.close();
await delay(2000);
await app.close();
} catch (error) {
console.error(error);
}
})();
Typescript compiler complains about executablePath property of launch method options object cause it needs to be of type string and not Electron. So how to pass electron chromium executable path to puppeteer?
You cannot use electron executable with Puppeteer directly without some workarounds and flag changes. They have tons of differences in the API. Specially electron doesn't have all of the chrome.* API which is needed for chromium browser to work properly, many flags still doesn't have proper replacements such as the headless flag.
Below you will see two ways to do it. However you need to make sure of two points,
Make sure the puppeteer is connected before the app is initiated.
Make sure you get the correct version puppeteer or puppeteer-core for the version of Chrome that is running in Electron!
Use puppeteer-in-electron
There are lots of workarounds, but most recently there is a puppeteer-in-electron package which allows you to run puppeteer within electron app using the electron.
First, install the dependencies,
npm install puppeteer-in-electron puppeteer-core electron
Then run it.
import {BrowserWindow, app} from "electron";
import pie from "puppeteer-in-electron";
import puppeteer from "puppeteer-core";
const main = async () => {
const browser = await pie.connect(app, puppeteer);
const window = new BrowserWindow();
const url = "https://example.com/";
await window.loadURL(url);
const page = await pie.getPage(browser, window);
console.log(page.url());
window.destroy();
};
main();
Get the debugging port and connect to it
The another way is to get the remote-debugging-port of the electron app and connect to it. This solution is shared by trusktr on electron forum.
import {app, BrowserWindow, ...} from "electron"
import fetch from 'node-fetch'
import * as puppeteer from 'puppeteer'
app.commandLine.appendSwitch('remote-debugging-port', '8315')
async function test() {
const response = await fetch(`http://localhost:8315/json/versions/list?t=${Math.random()}`)
const debugEndpoints = await response.json()
let webSocketDebuggerUrl = debugEndpoints['webSocketDebuggerUrl ']
const browser = await puppeteer.connect({
browserWSEndpoint: webSocketDebuggerUrl
})
// use puppeteer APIs now!
}
// ... make your window, etc, the usual, and then: ...
// wait for the window to open/load, then connect Puppeteer to it:
mainWindow.webContents.on("did-finish-load", () => {
test()
})
Both solution above uses webSocketDebuggerUrl to resolve the issue.
Extra
Adding this note because most people uses electron to bundle the app.
If you want to build the puppeteer-core and puppeteer-in-electron, you need to use hazardous and electron-builder to make sure get-port-cli works.
Add hazardous on top of main.js
// main.js
require ('hazardous');
Make sure the get-port-cli script is unpacked, add the following on package.json
"build": {
"asarUnpack": "node_modules/get-port-cli"
}
Result after building:
the toppest answer dones't work for me use electron 11 and puppeteer-core 8.
but start puppeteer in main process other then in the renderer process works for me.you can use ipcMain and ipcRenderer to comunicate each other.the code below
main.ts(main process code)
import { app, BrowserWindow, ipcMain } from 'electron';
import puppeteer from 'puppeteer-core';
async function newGrabBrowser({ url }) {
const browser = await puppeteer.launch({
headless: false,
executablePath:
'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
});
const page = await browser.newPage();
page.goto(url);
}
ipcMain.on('grab', (event, props) => {
newGrabBrowser(JSON.parse(props));
});
home.ts (renderer process code)
const { ipcRenderer } = require('electron');
ipcRenderer.send('grab',JSON.stringify({url: 'https://www.google.com'}));
There is also another option, which works for electron 5.x.y and up (currently up to 7.x.y, I did not test it on 8.x.y beta yet):
// const assert = require("assert");
const electron = require("electron");
const kill = require("tree-kill");
const puppeteer = require("puppeteer-core");
const { spawn } = require("child_process");
let pid;
const run = async () => {
const port = 9200; // Debugging port
const startTime = Date.now();
const timeout = 20000; // Timeout in miliseconds
let app;
// Start Electron with custom debugging port
pid = spawn(electron, [".", `--remote-debugging-port=${port}`], {
shell: true
}).pid;
// Wait for Puppeteer to connect
while (!app) {
try {
app = await puppeteer.connect({
browserURL: `http://localhost:${port}`,
defaultViewport: { width: 1000, height: 600 } // Optional I think
});
} catch (error) {
if (Date.now() > startTime + timeout) {
throw error;
}
}
}
// Do something, e.g.:
// const [page] = await app.pages();
// await page.waitForSelector("#someid")//
// const text = await page.$eval("#someid", element => element.innerText);
// assert(text === "Your expected text");
// await page.close();
};
run()
.then(() => {
// Do something
})
.catch(error => {
// Do something
kill(pid, () => {
process.exit(1);
});
});
Getting the pid and using kill is optional. For running the script on some CI platform it does not matter, but for local environment you would have to close the electron app manually after each failed try.
Please see this sample repo.