I want to launch only one Chromium instance from the first script and then attach to it from other scripts. I know about puppeteer.connect(), but the problem is that when I start the script that is supposed to launch Chromium:
const puppeteer = require('puppeteer');
const fs = require('fs');

const logger = fs.createWriteStream('log.txt', {
  flags: 'a' // 'a' means appending (old data will be preserved)
});

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  logger.write('-----Browser is launched\n');
  logger.write(browser.wsEndpoint());
})();
...and it never ends because I didn't call browser.close(). Thus, I can't start running the other scripts. How can I launch Chromium, obtain its endpoint, and end the script while leaving Chromium running?
Answer
Basically, you can spawn a child_process with detached set to true, then exit your main script with process.exit(); for launching Chromium, see 1.js.
The script responsible for launching Chromium and saving the WebSocket endpoint is chromiumLauncher.js.
Once the WebSocket endpoint is saved, you can connect to the running browser via puppeteer.connect(); see 2.js.
I've pushed it to GitHub (dirty code).
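A minimal sketch of those three files, assuming the endpoint is shared through a wsEndpoint.txt file (the file name and details are assumptions, not the exact code from the repository):

1.js:

const { spawn } = require('child_process');

// Spawn the launcher as a detached process so it outlives this script
const child = spawn('node', ['chromiumLauncher.js'], {
  detached: true,
  stdio: 'ignore'
});
child.unref();  // let this process exit without waiting for the child
process.exit();

chromiumLauncher.js:

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  // Persist the endpoint so other scripts can find this browser later
  fs.writeFileSync('wsEndpoint.txt', browser.wsEndpoint());
  // This process stays alive holding Chromium; that's fine, it's detached
})();

2.js:

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browserWSEndpoint = fs.readFileSync('wsEndpoint.txt', 'utf8');
  const browser = await puppeteer.connect({ browserWSEndpoint });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // ... work with the page, then disconnect; Chromium itself stays open
  browser.disconnect();
})();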
Related
I'm trying to run a simple javascript file from the terminal (Ubuntu) that clicks a button on a website. However, I haven't been able to find how to do so, since I've learned that you can't interact with the browser in Node (for doing things like running commands such as window.location.href).
(source: ReferenceError: window is not defined at Object.<anonymous> in Node.js).
For example, I'd like to be able to create a script (let's call it test.js) where when I run ./test.js or node test.js in the terminal, it will:
Go to www.google.com
Click on the "Images" button in the top right.
I wrote out how I understand to do that below:
window.location.href = "https://www.google.com"
document.getElementById('the id of the image button').click()
It seems extremely straightforward, but I am a beginner to JavaScript, am not aware of its limitations, and could most definitely be wrong about Node. Could someone help explain how I should go about doing something as simple as this? Thanks
EDIT: For clarification on the context, this is just a part of me trying to automate form submissions. I also want to be able to enter specified text into input fields and so on.
Node.js does not include a browser or any browser-like control required to execute the code you posted. Fortunately, this is fairly straightforward with the addition of some extra Node.js software.
What you're looking for is Puppeteer. It's a Node.js library that downloads its own bundled Chromium browser and allows you to remote-control that browser with some very easy Node.js functions / methods.
In a directory of your choosing, install puppeteer with npm like so:
npm install -S puppeteer
This will install the library locally into a node_modules/ directory.
Then, you'll need a single javascript file (like test.js in your example) in which you write code like the example in the README (linked above):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.google.com');
  await page.click('the id of the "images" link or some selector');
  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();
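Since the edit mentions automating form submissions: typing into inputs works the same way, via page.type() and page.click(). A minimal sketch, assuming a hypothetical login form (the URL and selectors are placeholders):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/login'); // hypothetical form page
  await page.type('#username', 'myUser');       // type into an input field
  await page.type('#password', 'secret');
  await Promise.all([
    page.waitForNavigation(),            // resolves once the next page loads
    page.click('button[type="submit"]'), // submit the form
  ]);
  await browser.close();
})();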
Is there a way to get a screenshot of another website's pages?
e.g. you enter a URL in an input, hit enter, and a script gives you a screenshot of the site you put in. I managed to do it with headless browsers, but I fear that could take too much time and too many resources: with, say, phantomjs, each time the input is used the headless browser would need to launch and fetch the new data. I investigated HotJar; it does something similar to what I'm looking for, but it gives you a script that you must put into the page header (which is fine by me), and afterwards you get a preview. How does it work, and how can one replicate it?
Do you want a print screen of your page or someone else's?
Own page
Use puppeteer or phantomJS with every build of your site; this way you will only run it when the site changes, and you'll have a screenshot ready at any time.
Foreign page
You have access to it (the owner runs your script)
Either try to get into their build pipeline and use the solution from above.
Or use this solution: Using HTML5/Canvas/JavaScript to take in-browser screenshots.
You don't have any access
Use some long-running process that will give you screenshot when asked.
Imagine a server with one URL endpoint: screenshot.example.com?facebook.com.
The long-running server has a puppeteer/phantomJS instance ready to go; when given a URL, it will load that page, take the screenshot, and send it back. The browser will actually think of it as a slow image request.
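A minimal sketch of such a long-running server, assuming Express and a ?url= query parameter (both are assumptions; the real service could look quite different):

const express = require('express');
const puppeteer = require('puppeteer');

(async () => {
  // One browser instance stays alive and serves every request
  const browser = await puppeteer.launch();
  const app = express();

  app.get('/screenshot', async (req, res) => {
    const url = req.query.url; // e.g. /screenshot?url=https://facebook.com
    if (!url) return res.status(400).send('Missing url parameter');
    const page = await browser.newPage();
    try {
      await page.goto(url);
      const png = await page.screenshot();
      res.type('png').send(png);
    } finally {
      await page.close();
    }
  });

  app.listen(3000);
})();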
You can do this with puppeteer.
install with: npm i puppeteer
save the following code to example.js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();
and run it with:
node example.js
Is it somehow possible to attach Puppeteer to a running Chrome instance (a manually started browser) and then take over control within a tab? I'm assuming that it's related to starting the Chrome browser with the --no-sandbox flag, but I don't know how to continue from there.
Thanks for any help
You can use puppeteer.connect(options) (see here):
const puppeteer = require('puppeteer');

(async () => {
  const browserWSEndpoint = 'a browser websocket endpoint to connect to';
  const browser = await puppeteer.connect({ browserWSEndpoint });
  // continue from here
})();
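To obtain that endpoint from a manually started browser, launch Chrome with --remote-debugging-port=9222 and read webSocketDebuggerUrl from its /json/version endpoint. A sketch, assuming Node 18+ for the global fetch:

const puppeteer = require('puppeteer');

(async () => {
  // Chrome must have been started with: chrome --remote-debugging-port=9222
  const res = await fetch('http://localhost:9222/json/version');
  const { webSocketDebuggerUrl } = await res.json();

  const browser = await puppeteer.connect({ browserWSEndpoint: webSocketDebuggerUrl });
  const [page] = await browser.pages(); // take over an already-open tab
  console.log(await page.title());
  browser.disconnect(); // leave the manually started browser running
})();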
I am trying to use Puppeteer for end-to-end tests. These tests require accessing the network emulation capabilities of DevTools (e.g. to simulate offline browsing).
So far I am using chrome-remote-interface, but it is too low-level for my taste.
As far as I know, Puppeteer does not expose the network DevTools features (emulateNetworkConditions in the DevTools protocol).
Is there an escape hatch in Puppeteer to access those features, e.g. a way to execute a JavaScript snippet in a context in which the DevTools API is accessible?
Thanks
Edit:
OK, so it seems that I can work around the lack of an API using something like this:
const client = page._client;
const res = await client.send('Network.emulateNetworkConditions', {
  offline: true,
  latency: 40,
  downloadThroughput: 40 * 1024 * 1024,
  uploadThroughput: 40 * 1024 * 1024
});
But I suppose it is bad form and may break under my feet at any time?
Update: headless Chrome now supports network throttling!
In Puppeteer, you can emulate devices (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pageemulateoptions) but not network conditions. It's something we're considering, but headless Chrome needs to support network throttling first.
To emulate a device, I'd use the predefined devices found in DeviceDescriptors:
const puppeteer = require('puppeteer');
const devices = require('puppeteer/DeviceDescriptors');
const iPhone = devices['iPhone 6'];

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto('https://www.google.com');
  // other actions...
  browser.close();
});
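Until then, the CDP workaround from the question can be combined with device emulation. A sketch; note that page._client is a private API and may change between Puppeteer versions:

const puppeteer = require('puppeteer');
const devices = require('puppeteer/DeviceDescriptors');
const iPhone = devices['iPhone 6'];

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.emulate(iPhone);

  // Throttle the network through the raw DevTools protocol
  await page._client.send('Network.emulateNetworkConditions', {
    offline: false,
    latency: 200,                       // added round-trip latency in ms
    downloadThroughput: 780 * 1024 / 8, // bytes/sec, roughly 780 kbit/s
    uploadThroughput: 330 * 1024 / 8,
  });

  await page.goto('https://www.google.com');
  await browser.close();
})();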
I'm trying to fetch an entire webpage using JavaScript by plugging in the URL. However, the website is built as a Single Page Application (SPA) that uses JavaScript / backbone.js to dynamically load most of its contents after rendering the initial response.
So for example, when I route to the following address:
https://connect.garmin.com/modern/activity/1915361012
And then enter this into the console (after the page has loaded):
var $page = $("html")
console.log("%c✔: ", "color:green;", $page.find(".inline-edit-target.page-title-overflow").text().trim());
console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
Then I'll get the dynamically loaded activity title as well as the statically loaded page footer.
However, when I try to load the webpage via an AJAX call with either $.get() or .load(), I only get the initial response delivered (the same as the content seen in view-source):
view-source:https://connect.garmin.com/modern/activity/1915361012
So if I use either of the following AJAX calls:
// jQuery.get()
var url = "https://connect.garmin.com/modern/activity/1915361012";
jQuery.get(url, function(data) {
  var $page = $("<div>").html(data);
  console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim());
  console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});

// jQuery.load()
var url = "https://connect.garmin.com/modern/activity/1915361012";
var $page = $("<div>");
$page.load(url, function(data) {
  console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim());
  console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});
I'll still get the initial footer, but won't get any of the other page contents.
I've tried the solution here to eval() the contents of every script tag, but that doesn't appear robust enough to actually load the page:
jQuery.get(url, function(data) {
  var $page = $("<div>").html(data);
  $page.find("script").each(function() {
    var scriptContent = $(this).html(); // Grab the content of this tag
    eval(scriptContent);                // Execute the content
  });
  console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim());
  console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});
Q: Are there any options to fully load a webpage so that it's scrapable via JavaScript?
You will never be able to fully replicate by yourself what an arbitrary (SPA) page does.
The only way I see is using a headless browser such as PhantomJS, Headless Chrome, or Headless Firefox.
I wanted to try Headless Chrome so let's see what it can do with your page:
Quick check using internal REPL
Load that page with Chrome Headless (you'll need Chrome 59 on Mac/Linux, Chrome 60 on Windows), and find the page title with JavaScript from the REPL:
% chrome --headless --disable-gpu --repl https://connect.garmin.com/modern/activity/1915361012
[0830/171405.025582:INFO:headless_shell.cc(303)] Type a Javascript expression to evaluate or "quit" to exit.
>>> $('body').find('.page-title').text().trim()
{"result":{"type":"string","value":"Daily Mile - Round 2 - Day 27"}}
NB: to get chrome command line working on a Mac I did this beforehand:
alias chrome="'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'"
Using it programmatically with Node & Puppeteer
Puppeteer is a Node library (by Google Chrome developers) which provides a high-level API to control headless Chrome over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome.
(Step 0 : Install Node & Yarn if you don't have them)
In a new directory:
yarn init
yarn add puppeteer
Create index.js with this:
const puppeteer = require('puppeteer');

(async () => {
  const url = 'https://connect.garmin.com/modern/activity/1915361012';
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Go to URL and wait for the network to go idle
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Wait for the results to show up
  await page.waitForSelector('.page-title');

  // Extract the results from the page
  const text = await page.evaluate(() => {
    const title = document.querySelector('.page-title');
    return title.innerText.trim();
  });

  console.log(`Found: ${text}`);
  await browser.close();
})();
Result:
$ node index.js
Found: Daily Mile - Round 2 - Day 27
First off: avoid eval - your content security policy should block it and it leaves you open to easy XSS attacks. Scraping bots definitely won't run it.
The problem you're describing is common to all SPAs: when a person visits, they get your app shell script, which then loads in the rest of the content. All good. When a bot visits, it ignores the scripts and sees only the empty shell.
The solution is server-side rendering. One way to do this: if you're using a JS renderer (say React) and Node.js on the server, you can fairly easily build the JS and serve it statically.
However, if you aren't then you'll need to run a headless browser on your server that executes all the JS a user would and then serves up the result to the bot.
Fortunately, someone else has already done all the work here; they've put a demo online that you can try out with your site.
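As a sketch of that idea: an Express middleware that serves bots a Puppeteer-rendered version of the page (the bot-detection regex, host URL, and static directory are all assumptions):

const express = require('express');
const puppeteer = require('puppeteer');

const BOT_UA = /bot|crawler|spider|crawling/i; // crude user-agent check

(async () => {
  const browser = await puppeteer.launch();
  const app = express();

  app.use(async (req, res, next) => {
    if (!BOT_UA.test(req.headers['user-agent'] || '')) return next();
    // Render the SPA in headless Chrome and hand the bot the final HTML
    const page = await browser.newPage();
    try {
      await page.goto(`https://your-spa.example${req.originalUrl}`, {
        waitUntil: 'networkidle0',
      });
      res.send(await page.content()); // fully rendered DOM as an HTML string
    } finally {
      await page.close();
    }
  });

  app.use(express.static('dist')); // normal users get the SPA shell

  app.listen(3000);
})();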
I think you should know the concept of an SPA.
An SPA (Single Page Application) is served as a single static HTML file; when the route changes, the page creates or modifies DOM nodes dynamically with JavaScript to achieve the effect of switching pages.
Therefore, if you use $.get(), the server responds with that static HTML file of the bare page, so you won't get what you want.
If you want to use $.get(), there are two ways. The first is using a headless browser, for example headless Chrome, PhantomJS, etc.; it will load the page for you, and you can then read the DOM nodes of the loaded page. The second is SSR (Server-Side Rendering): with SSR, $.get() returns the HTML of the page directly, because the server responds with the HTML of the corresponding page when different routes are requested.
Reference:
SSR
The SSR framework for Vue: Nuxt.js
PhantomJS
Node API of Headless Chrome