How to get a screenshot/preview of another website - javascript

Is there a way to get a screenshot of another website's pages?
For example: you enter a URL in an input, hit Enter, and a script gives you a screenshot of the site you entered. I managed to do it with headless browsers, but I fear that could take too many resources and too much time to launch. With PhantomJS, say, each time the input is used the headless browser would need to fetch the new data. I investigated HotJar, which does something similar to what I'm looking for, but it gives you a script that you must put into the page header, which is fine by me. Afterwards, you get a preview. How does it work, and how can one replicate it?

Do you want a print screen of your page or someone else's?
Own page
Use puppeteer or phantomJS with every build of your site; this way you will only run it when the site changes, and have a screenshot ready at any time.
Foreign page
You have access to it (the owner runs your script)
Either try to get into their build pipeline and use the solution from above.
Or use the approach from "Using HTML5/Canvas/JavaScript to take in-browser screenshots".
You don't have any access
Use a long-running process that will give you a screenshot when asked.
Imagine a server with one URL endpoint: screenshot.example.com?facebook.com.
The long-running server keeps a puppeteer/phantomJS instance ready to go; when given a URL, it will load that page, take the screenshot and send it back. The browser will actually treat it as a slow PNG image request.
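A minimal sketch of such a long-running service, assuming Node.js with the puppeteer package installed. The port 3000, the function names, the `serve` argument and the rule that a bare host such as facebook.com is treated as https are illustrative assumptions, not part of any existing service. The browser is launched once and reused, so each request only pays for a new tab.

```javascript
const http = require('http');

// Turn "/?facebook.com" into "https://facebook.com/"; return null if invalid.
function parseTarget(requestUrl) {
  const query = requestUrl.split('?')[1];
  if (!query) return null;
  try {
    const raw = decodeURIComponent(query);
    // Assumption: a bare host name is meant as an https URL.
    return new URL(/^https?:\/\//i.test(raw) ? raw : `https://${raw}`).href;
  } catch (err) {
    return null; // malformed encoding or not a parsable URL
  }
}

// Create the HTTP server; `browser` is an already-launched Puppeteer browser.
function createServer(browser) {
  return http.createServer(async (req, res) => {
    const target = parseTarget(req.url);
    if (!target) {
      res.writeHead(400, { 'Content-Type': 'text/plain' });
      res.end('Usage: /?example.com');
      return;
    }
    let page;
    try {
      page = await browser.newPage();
      await page.goto(target, { waitUntil: 'networkidle2', timeout: 30000 });
      const png = await page.screenshot({ type: 'png' });
      res.writeHead(200, { 'Content-Type': 'image/png' });
      res.end(png);
    } catch (err) {
      res.writeHead(502, { 'Content-Type': 'text/plain' });
      res.end(`Could not capture ${target}`);
    } finally {
      // Close the tab, not the browser, so the next request starts quickly.
      if (page) await page.close().catch(() => {});
    }
  });
}

// Start only when invoked as: node screenshot-server.js serve
if (process.argv[2] === 'serve') {
  const puppeteer = require('puppeteer'); // assumed installed: npm i puppeteer
  puppeteer.launch().then((browser) => createServer(browser).listen(3000));
}
```

With the server running, an image tag pointing at http://localhost:3000/?facebook.com would receive the PNG once the page has loaded. A real deployment would also want caching and a limit on concurrent tabs.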

You can do this with Puppeteer.
Install it with: npm i puppeteer
Save the following code to example.js:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
and run it with:
node example.js

Related

Interacting with browser using javascript

I'm trying to run a simple javascript file from the terminal (Ubuntu) that clicks a button on a website. However, I haven't been able to find how to do so, since I've learned that you can't interact with the browser in Node (for doing things like running commands such as window.location.href).
(source: ReferenceError : window is not defined at object. <anonymous> Node.js).
For example, I'd like to be able to create a script (let's call it test.js) where when I run ./test.js or node test.js in the terminal, it will:
go to www.google.com
Click on the "Images" button in the top right.
I wrote out how I understand to do that below:
window.location.href = "https://www.google.com"
document.getElementById('the id of the image button').click()
It seems extremely straightforward, but I am a beginner to Javascript and am not aware of its limitations and could most definitely be wrong about Node. Could someone help explain how I should go about doing something as simple as this? Thanks
EDIT: For clarification on the context, this is just a part of me trying to automate form submissions. I also want to be able to enter specified text into input fields and so on.
Node.js does not include a browser or any browser-like control required to execute the code you posted. Fortunately, this is fairly straightforward with the addition of some extra Node.js software.
What you're looking for is Puppeteer. It's a Node.js library that comes with a small Chrome browser and allows you to remote control that browser from some very easy Node.js functions / methods.
In a directory of your choosing, install puppeteer with npm like so:
npm install -S puppeteer
This will install the library locally into a node_modules/ directory.
Then, you'll need a single javascript file (like test.js in your example) in which you write code like the example in the README (linked above):
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.google.com');
  await page.click('the id of the "images" link or some selector');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
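Since the question's edit mentions filling in input fields, here is a rough sketch of how that could look with Puppeteer's page.type() and page.click(). The helper name fillAndSubmit and the selectors in the usage note are placeholders for illustration; substitute the ones from the real form.

```javascript
// Type each value into its field, then click the submit control and wait
// for the resulting navigation. `page` is a Puppeteer Page.
async function fillAndSubmit(page, fields, submitSelector) {
  for (const [selector, text] of Object.entries(fields)) {
    await page.waitForSelector(selector); // make sure the input exists
    await page.type(selector, text);      // sends the text as key presses
  }
  // Start waiting before clicking so the navigation is not missed.
  await Promise.all([
    page.waitForNavigation(),
    page.click(submitSelector),
  ]);
}
```

Inside an async function like the one above, after page.goto(...), it could be called as await fillAndSubmit(page, {'#name': 'Ada'}, '#submit'). If the form submits without a page navigation, the waitForNavigation() call would need to be replaced by a wait on something the form actually changes.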

Puppeteer getting response from pdf download link

I'm automating regression testing for a website and one of the tasks is to verify pdf downloads. I'm using Puppeteer and Chromium for this. I've found that it's rather difficult to download files in headless mode. Instead of downloading the file, I thought it might be prudent to look for a response from the page and the size of the file. My issue: when I try to navigate to the page, nothing seems to happen. I receive a timeout error. Here is the code I'm attempting to use:
const filename = new RegExp('\S*(\.pdf)');
await page.waitForSelector('#download-pdf', {timeout: timeout});
console.log('Clicking on "Download PDF" button');
const link = await page.$eval('#download-pdf', el => el.href);
await Promise.all([
  page.goto(link),
  page.on('response', response => {
    if(response._headers['content-disposition'] === `attachment;filename=${filename}`){
      console.log('Size: ', response._headers['content-length']);
    }
  })
]);
EDIT
If anyone understands how page.goto() ignores .pdf pages, that will be very useful to me.
Let me define the problem better. Upon clicking the download pdf button on the webpage, an event is triggered that generates the pdf file and sends the user along a unique url. This url is destroyed after a short period. In order to get to this point, I believe that I must use page.click() to trigger the event and generate the url. However, page.click() is also attempting to navigate to the pdf url, which is rejected in headless mode. What I need to do is get the url and test for a response from it.
I figured out a solution. I'll post it here for anyone else who encounters a similar problem in the days ahead. The idea here is to create an event listener to listen for any and all responses. Since I only cared about responses from pages ending with .pdf I only act on those responses.
page.on('response', intercept=>{
  if(intercept.url().endsWith('.pdf')){
    console.log(intercept.url());
    console.log('HTTP status code: %d', intercept.status());
    console.log(intercept.headers());
  }
});
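The same listener idea can be wrapped so that a test can await the PDF response directly. The sketch below uses Puppeteer's page.waitForResponse(); the helper names isPdfResponse and clickAndGetPdfInfo and the 30-second default timeout are assumptions made for illustration. It also checks the Content-Type header, in case the generated URL does not end in .pdf.

```javascript
// True when a response looks like a PDF, judged by URL path or Content-Type.
function isPdfResponse(response) {
  const path = new URL(response.url()).pathname.toLowerCase();
  const type = response.headers()['content-type'] || ''; // names are lower-case
  return path.endsWith('.pdf') || type.includes('application/pdf');
}

// Click the download control and resolve with details of the PDF response.
async function clickAndGetPdfInfo(page, selector, timeout = 30000) {
  // Register the wait before clicking so a fast response is not missed.
  const [response] = await Promise.all([
    page.waitForResponse(isPdfResponse, { timeout }),
    page.click(selector),
  ]);
  return {
    url: response.url(),
    status: response.status(),
    size: response.headers()['content-length'], // undefined if not sent
  };
}
```

A regression test could then assert on the returned status and size, for example after const info = await clickAndGetPdfInfo(page, '#download-pdf').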

Capturing application screen with JavaScript

Is it possible to capture the entire window as screenshot using JavaScript?
The application might contain many iframes and divs where content is loaded asynchronously.
I have explored canvas2image, but it works on a single HTML element, and using it discards any iframe present on the page.
I am looking for a solution where the capture will take care of all the iframes present.
The only way to capture the contents of an iframe using ONLY JavaScript in the webpage (No extensions, or application running outside the browser on a users system) is to use the HTMLIFrameElement.getScreenshot() API in Firefox. This API is non-standard, and ONLY works in Firefox.
For any other browser, no. An iframe is typically sandboxed, and as such its contents are not accessible to the embedding page's scripts by design.
The best way to get a screenshot of a webpage that I have found and use, is an instance of Headless Chrome or Headless Firefox. These will take a screenshot of everything on the page, just as a user would see it.
Yes, with Puppeteer it is possible.
1 - Just install the dependency:
npm i puppeteer
2 - Create a JavaScript file, screenshot.js:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://yourweb.com');
  await page.screenshot({path: 'screenshot.png'});
  await browser.close();
})();
3 - Run:
node screenshot.js
Web pages are not the easiest things to screenshot, because of their nature: they can include async elements, frames and the like, and they are usually responsive.
For your purpose the best way is to use an external API or an external service; I don't think it is a good idea to try doing that with JS.
You should try https://www.url2png.com/

Load a SPA webpage via AJAX

I'm trying to fetch an entire webpage using JavaScript by plugging in the URL. However, the website is built as a Single Page Application (SPA) that uses JavaScript / backbone.js to dynamically load most of its contents after rendering the initial response.
So for example, when I route to the following address:
https://connect.garmin.com/modern/activity/1915361012
And then enter this into the console (after the page has loaded):
var $page = $("html")
console.log("%c✔: ", "color:green;", $page.find(".inline-edit-target.page-title-overflow").text().trim());
console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
Then I'll get the dynamically loaded activity title as well as the statically loaded page footer:
However, when I try to load the webpage via an AJAX call with either $.get() or .load(), I only get the initial response (the same content you see with view-source):
view-source:https://connect.garmin.com/modern/activity/1915361012
So if I use either of the following AJAX calls:
// jQuery.get()
var url = "https://connect.garmin.com/modern/activity/1915361012";
jQuery.get(url,function(data) {
  var $page = $("<div>").html(data)
  console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim());
  console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});
// jQuery.load()
var url = "https://connect.garmin.com/modern/activity/1915361012";
var $page = $("<div>")
$page.load(url, function(data) {
  console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim() );
  console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});
I'll still get the initial footer, but won't get any of the other page contents:
I've tried the solution here to eval() the contents of every script tag, but that doesn't appear robust enough to actually load the page:
jQuery.get(url,function(data) {
  var $page = $("<div>").html(data)
  $page.find("script").each(function() {
    var scriptContent = $(this).html(); //Grab the content of this tag
    eval(scriptContent); //Execute the content
  });
  console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim());
  console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});
Q: Are there any options to fully load a webpage so that it is scrapable via JavaScript?
You will never be able to fully replicate by yourself what an arbitrary (SPA) page does.
The only way I see is using a headless browser such as PhantomJS or Headless Chrome, or Headless Firefox.
I wanted to try Headless Chrome so let's see what it can do with your page:
Quick check using internal REPL
Load that page with Chrome Headless (you'll need Chrome 59 on Mac/Linux, Chrome 60 on Windows), and find page title with JavaScript from the REPL:
% chrome --headless --disable-gpu --repl https://connect.garmin.com/modern/activity/1915361012
[0830/171405.025582:INFO:headless_shell.cc(303)] Type a Javascript expression to evaluate or "quit" to exit.
>>> $('body').find('.page-title').text().trim()
{"result":{"type":"string","value":"Daily Mile - Round 2 - Day 27"}}
NB: to get chrome command line working on a Mac I did this beforehand:
alias chrome="'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'"
Using programmatically with Node & Puppeteer
Puppeteer is a Node library (by Google Chrome developers) which provides a high-level API to control headless Chrome over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome.
(Step 0 : Install Node & Yarn if you don't have them)
In a new directory:
yarn init
yarn add puppeteer
Create index.js with this:
const puppeteer = require('puppeteer');
(async() => {
  const url = 'https://connect.garmin.com/modern/activity/1915361012';
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Go to URL and wait for page to load.
  // Current Puppeteer versions name this option 'networkidle0' or 'networkidle2'
  // (at most 0 or 2 open network connections for at least 500 ms).
  await page.goto(url, {waitUntil: 'networkidle2'});
  // Wait for the results to show up
  await page.waitForSelector('.page-title');
  // Extract the results from the page
  const text = await page.evaluate(() => {
    const title = document.querySelector('.page-title');
    return title.innerText.trim();
  });
  console.log(`Found: ${text}`);
  // Wait for the browser to shut down before the script ends
  await browser.close();
})();
Result:
$ node index.js
Found: Daily Mile - Round 2 - Day 27
First off: avoid eval - your content security policy should block it and it leaves you open to easy XSS attacks. Scraping bots definitely won't run it.
The problem you're describing is common to all SPAs - when a person visits they get your app shell script, which then loads in the rest of the content - all good. When a bot visits they ignore the scripts and return the empty shell.
The solution is server side rendering. One way to do this is if you're using a JS renderer (say React) and Node.js on the server you can fairly easily build the JS and serve it statically.
However, if you aren't then you'll need to run a headless browser on your server that executes all the JS a user would and then serves up the result to the bot.
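A rough sketch of that last option, assuming Puppeteer is the headless browser. The function names and the short crawler pattern are illustrative assumptions, not a complete bot list or a production prerenderer.

```javascript
// Very rough crawler check; the pattern is an illustrative assumption.
function isBot(userAgent) {
  return /bot|crawler|spider/i.test(userAgent || '');
}

// Load `url` in a fresh tab, let its scripts run, and return the final HTML.
// `browser` is an already-launched Puppeteer browser.
async function renderHtml(browser, url) {
  const page = await browser.newPage();
  try {
    // 'networkidle0' waits until no network connections remain for 500 ms.
    await page.goto(url, { waitUntil: 'networkidle0' });
    return await page.content(); // serialized DOM after scripts have run
  } finally {
    await page.close();
  }
}
```

A server would call renderHtml() only when isBot(req.headers['user-agent']) is true, and serve the normal app shell to everyone else.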
Fortunately someone else has already done all the work here. They've put a demo online that you can try out with your site:
I think you should understand the concept of an SPA.
An SPA is a Single Page Application: it is only a static HTML file. When the route changes, the page creates or modifies DOM nodes dynamically with JavaScript to achieve the effect of switching pages.
Therefore, if you use $.get(), the server will respond with the static HTML file of the initial page, so you won't load what you want.
If you want to use $.get(), there are two ways. The first is to use a headless browser, for example headless Chrome or PhantomJS. It will load the page for you, and you can then get the DOM nodes of the loaded page. The second is SSR (Server-Side Rendering): if you use SSR, you will get the HTML of the page directly via $.get, because the server responds with the HTML of the corresponding page for each requested route.
Reference:
SSR
the SSR framework for Vue: Nuxt.js
PhantomJS
Node API of Headless Chrome

JavaScript: screenshot rendered web page Browser style

I assume this question might have been asked before but after hours of searching, I haven't found anything satisfying.
Here's my question: Is it possible to screenshot a fully rendered web page using JavaScript? A little like what most browsers do on Windows when you press Ctrl+P.
I have looked into a lot of alternative solutions like html2Canvas.js
but none suits my needs. The biggest issue is that my web page is almost entirely rendered on the client side using JavaScript. This is also why server-side solutions like PhantomJS are hardly applicable.
I need the screenshots to be printed as image or PDF.
Any idea?
Thanks.
Have you looked into Puppeteer by Google?
If you're able to run it on your server, it might be exactly what you're looking for. See their example code:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
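Since the question asks for an image or a PDF, the same page object can produce both. This is a sketch assuming Puppeteer's page.screenshot() and page.pdf(); the helper name captureBoth, the output file names and the A4 format are illustrative choices. Because Puppeteer drives a real Chrome, client-side JavaScript runs before the capture.

```javascript
// Save a full-page PNG and an A4 PDF of whatever `page` currently shows.
// `page` is a Puppeteer Page that has already navigated somewhere.
async function captureBoth(page, baseName) {
  const files = { png: `${baseName}.png`, pdf: `${baseName}.pdf` };
  // fullPage captures the whole scrollable page, not just the viewport.
  await page.screenshot({ path: files.png, fullPage: true });
  // printBackground keeps background colors and images in the PDF.
  await page.pdf({ path: files.pdf, format: 'A4', printBackground: true });
  return files;
}
```

It would be called after something like await page.goto(url, {waitUntil: 'networkidle0'}), so that client-rendered content has had a chance to appear before the capture.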
