Capture HTML canvas of third-party site as image, from command line - javascript

I know one can use tools such as wget or curl to perform HTTP requests from the command line, or use HTTP client requests from various programming languages. These tools also support fetching images or other files that are referenced in the HTML code.
What I'm searching for is a mechanism that also executes the JavaScript of that web page, which renders an image into an HTML canvas. I then want to extract that rendered image as an image file. The goal is to grab a time series of those images (e.g. weather maps or other diagrams that plot time-variant data into a constant DOM object) via a cron job.
I'd prefer a solution that works from a script. How could this be done?

You can use Puppeteer to load the page inside a headless Chrome instance:
Open the page and wait for it to load.
Using page.evaluate, return the data URL of the canvas.
Convert the data URL to a buffer and write the result to a file.
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://games.novatoz.com/jigsaw-puzzle');
  const dataUrl = await page.evaluate(async () => {
    // Give the page a few seconds to draw into the canvas before reading it back.
    const sleep = (time) => new Promise((resolve) => setTimeout(resolve, time));
    await sleep(5000);
    return document.getElementById('canvas').toDataURL();
  });
  // Strip the "data:image/png;base64," prefix and decode the rest into a buffer.
  const data = Buffer.from(dataUrl.split(',').pop(), 'base64');
  fs.writeFileSync('image.png', data);
  await browser.close();
})();
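To grab the time series the question mentions, one option is a small variation of the script above that writes each capture to a timestamped filename and is scheduled from cron. A minimal sketch, assuming the script is saved as capture-canvas.js (the paths and schedule are placeholders):

const fs = require('fs');

// Hypothetical helper: write the decoded canvas buffer to a timestamped file
// so every cron run produces a new image, e.g. canvas-2024-05-01T12-00-00-000Z.png.
function saveTimestamped(buffer) {
  const stamp = new Date().toISOString().replace(/[:.]/g, '-');
  fs.writeFileSync(`canvas-${stamp}.png`, buffer);
}

// Example crontab entry (assumed paths), capturing one image every hour:
// 0 * * * * /usr/bin/node /path/to/capture-canvas.js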

Related

Puppeteer to save image open in the browser

I have a link to a (gif) image, obtained manually via 'open in new tab'. I want Puppeteer to open the image and then save it to a file. If doing this in a normal browser I would right-click and choose 'save' from the context menu. Is there a simple way to perform this action in Puppeteer?
The code below will save the Wikipedia logo image to a file named logo.png:
import * as fs from 'fs'
import puppeteer from 'puppeteer'

;(async () => {
  const wikipedia = 'https://www.wikipedia.org/'
  const browser = await puppeteer.launch()
  const page = (await browser.pages())[0]
  await page.goto(wikipedia)
  const image = await page.waitForSelector('img[src][alt="Wikipedia"]')
  const imgURL = await image.evaluate(img => img.getAttribute('src'))
  const pageNew = await browser.newPage()
  const response = await pageNew.goto(wikipedia + imgURL, {timeout: 0, waitUntil: 'networkidle0'})
  const imageBuffer = await response.buffer()
  await fs.promises.writeFile('./logo.png', imageBuffer)
  await page.close()
  await pageNew.close()
  await browser.close()
})()
In Puppeteer it's possible to right-click, but it's not possible to automate navigation through the "save as" menu. However, there is a solution outlined in the top answer here:
How can I download images on a page using puppeteer?
You can write the images to disk directly from the page response.
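A minimal sketch of that approach, assuming the image arrives as an ordinary image request when the page loads (the URL and output filename are placeholders):

const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Listen for image responses and write their bodies straight to disk.
  page.on('response', async (response) => {
    if (response.request().resourceType() === 'image') {
      fs.writeFileSync('downloaded-image.gif', await response.buffer());
    }
  });

  await page.goto('https://example.com/page-with-the-gif', {waitUntil: 'networkidle0'});
  await browser.close();
})();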

Web scraping, I can't select the tags that I want

I was trying to do some web scraping and I ran into a problem. I have this JS script:
const request = require('request');
const cheerio = require('cheerio');

// This is an Italian betting site.
const url = 'https://www.sisal.it/scommesse-matchpoint?filtro=0&schede=man:1:21';

request(url, (error, response, html) => {
  if (!error && response.statusCode == 200) {
    const $ = cheerio.load(html);
    let squadre = $("div");
    console.log(squadre.text());
  }
});
This returns a very long string with the text of all the site's divs, but the text I want isn't in it. I wrote this script because after trying:
$("div.*class*")
it returned nothing, even though the selectors were correct. Do you have any ideas on why I can't select the divs I want?
This page is created dynamically: if you make the request with cheerio, you get the boilerplate code of a SPA, and the data you need is loaded later.
To scrape this kind of site you need something more advanced than cheerio.
An easy-to-use option is puppeteer.
And the code would look something like this:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Use waitUntil here so navigation resolves only once the additional requests
  // have been made and the page is fully loaded.
  await page.goto('https://www.sisal.it/scommesse-matchpoint?filtro=0&schede=man:1:21', {waitUntil: 'networkidle2'});
  const data = await page.evaluate(() => {
    // Perform all your JS actions here and return the data as a JSON string.
    // You can access the DOM with document.querySelector
    // and other DOM-manipulation methods.
    return JSON.stringify({});
  });
  await browser.close();
})();
Just play around with the puppeteer API and find your own way to handle this task.

How to write data to a file using Puppeteer?

Puppeteer exposes a page.screenshot() method for saving a screenshot locally on your machine. Here are the docs.
See: https://github.com/GoogleChrome/puppeteer
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
await page.screenshot({path: 'example.png'});
Is there a way to save a data file in a similar fashion? I'm seeking something analogous to...
page.writeToFile({data, path,});
Since any puppeteer script is an ordinary Node.js script, you can use anything you would use in Node, say the good old fs module:
const fs = require('fs');
fs.writeFileSync('path/to/file.json', data);
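A minimal end-to-end sketch of that idea, assuming you want to dump some scraped values as JSON (the URL, extracted fields, and output path are placeholders):

const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Extract whatever data you need from the page...
  const data = await page.evaluate(() => ({
    title: document.title,
    headings: [...document.querySelectorAll('h1')].map(h => h.textContent),
  }));

  // ...then write it to disk with the ordinary fs module.
  fs.writeFileSync('data.json', JSON.stringify(data, null, 2));
  await browser.close();
})();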

Way to scrape a JS-Rendered page?

I'm currently scraping a list of URLs on my site using the request-promise npm module.
This works well for what I need; however, I'm noticing that not all of my divs are appearing because some are rendered afterwards with JS. I know I can't run that JS code remotely to force the render, but is there any way to scrape the pages only after those elements have been added?
I'm doing this currently with Node, and would prefer to keep using Node if possible.
Here is what I have:
const urls = ['fake.com/link-1', 'fake.com/link-2', 'fake.com/link-3']

urls.forEach(url => {
  request(url)
    .then(function(html){
      // get dummy dom
      const d_dom = new JSDOM(html);
      ....
    })
});
Any thoughts on how to accomplish this? Or if there is currently an alternative to Selenium as an npm module?
You will want to use puppeteer, a headless Chrome browser (owned and maintained by the Chrome/Google team), for loading and parsing dynamic web pages.
Use page.goto() to navigate to a specific page, then use page.content() to get the HTML content of the rendered page.
Here is an example of how to use it:
const { JSDOM } = require("jsdom");
const puppeteer = require('puppeteer');

const urls = ['fake.com/link-1', 'fake.com/link-2', 'fake.com/link-3'];

urls.forEach(async url => {
  let dom = new JSDOM(await makeRequest(url));
  console.log(dom.window.document.title);
});

async function makeRequest(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  let html = await page.content();
  await browser.close();
  return html;
}

Extract all CSS with Puppeteer?

I am performing some analysis on website complexity. What is the best way to extract all CSS (external stylesheets, <style> tags, and inline CSS), for all nodes in a web page, using headless Chrome/Puppeteer?
I'm ideally looking for compiled CSS, in a format similar to the "Styles" tab in the Chrome dev-tools.
You are asking for two different things:
Scraping: for web scraping in Node.js, the cheerio package is usually the better fit.
Sniffing network requests: if you want to capture the CSS files the page requests, you can do something like this:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  page.on('response', async response => {
    if (response.request().resourceType() === 'stylesheet') {
      console.log(await response.text());
    }
  });
  await page.goto('https://myurl.com');
  await browser.close();
})();
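The response listener above only covers external stylesheets. For <style> tags and inline style attributes you can read them out of the rendered DOM with page.evaluate; a minimal sketch, where the selectors and the shape of the returned object are assumptions for illustration:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://myurl.com');

  const embeddedCss = await page.evaluate(() => ({
    // Contents of every <style> element on the page.
    styleTags: [...document.querySelectorAll('style')].map(s => s.textContent),
    // Inline style="" attributes, paired with the tag they belong to.
    inlineStyles: [...document.querySelectorAll('[style]')].map(el => ({
      tag: el.tagName.toLowerCase(),
      style: el.getAttribute('style'),
    })),
  }));

  console.log(JSON.stringify(embeddedCss, null, 2));
  await browser.close();
})();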
