Using Puppeteer to screenshot an EJS template with NodeJS and Express - javascript

I have a NodeJS/Express app in which I would like to open a new browser window and render a local EJS view into it. I am trying to do this using Puppeteer.
const ejs = require('ejs');
const puppeteer = require('puppeteer');

router.post('/new_window', async (req, res) => {
  try {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();
    const pageContent = ejs.render('../views/mypage.ejs', {})
    await page.setContent(pageContent)
    //await page.screenshot({path: 'example.png'});
    // await browser.close();
  } catch (err) {
    res.status(500)
    console.log(err)
    res.send(err.message)
  }
})
In the browser, instead of the page layout, I get:
../views/mypage.ejs

Instead of:
await page.goto(...); // This acts like your browser's address bar
Try
const pageContent = ejs.render('../views/mypage.ejs', {data to populate your .ejs page}) // This is pseudocode; check the EJS docs on how to do this
await page.setContent(pageContent)
The code above will let you create your page on your server.
With page.setContent(...) you can load any string of HTML.

OP made an edit that correctly uses page.setContent rather than page.goto; however, there's still an issue. ejs.render() runs EJS on a template given as a string, so here it's treating the file path as the template itself. If you want to read the file into a string first (possibly when your app starts, if the template never changes), ejs.render() will work.
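For instance, a minimal sketch of that first approach; the views/mypage.ejs path and the {name: "world"} data here are just assumed examples:
const fs = require("fs");
const ejs = require("ejs");

// Read the template into a string once at startup (the path is an assumed example).
const template = fs.readFileSync("views/mypage.ejs", "utf8");

// ejs.render() now receives template text rather than a file path.
const pageContent = ejs.render(template, {name: "world"});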
The other approach is to use the EJS method that accepts a path, ejs.renderFile(). Here's a minimal example showing the usage:
const ejs = require("ejs"); // 3.1.8
const express = require("express"); // ^4.18.1
const puppeteer = require("puppeteer"); // ^19.1.0

express()
  .get("/greet", (req, res) => {
    let browser;
    (async () => {
      browser = await puppeteer.launch();
      const [page] = await browser.pages();
      const html = await ejs.renderFile("greet.ejs", {name: "world"});
      await page.setContent(html);
      const buf = await page.screenshot();
      res.contentType("image/png");
      res.send(buf);
    })()
      .catch(err => {
        console.error(err);
        res.sendStatus(500);
      })
      .finally(() => browser?.close());
  })
  .listen(3000);
Where greet.ejs contains:
<!DOCTYPE html>
<html>
<body>
<h1>Hello, <%- name %></h1>
</body>
</html>
To make a PDF with Express, EJS and Puppeteer, see Express/Puppeteer: generate PDF from EJS template and send as response.
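Not the linked answer's exact code, but as a rough sketch of the idea, the example above can be adapted by swapping page.screenshot() for page.pdf() (the /greet.pdf route name is just an assumption):
const ejs = require("ejs");
const express = require("express");
const puppeteer = require("puppeteer");

express()
  .get("/greet.pdf", (req, res) => {
    let browser;
    (async () => {
      browser = await puppeteer.launch();
      const [page] = await browser.pages();
      await page.setContent(await ejs.renderFile("greet.ejs", {name: "world"}));
      const pdf = await page.pdf({format: "A4"}); // page.pdf() resolves to a Buffer
      res.contentType("application/pdf");
      res.send(pdf);
    })()
      .catch(err => {
        console.error(err);
        res.sendStatus(500);
      })
      .finally(() => browser?.close());
  })
  .listen(3000);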
To reuse the browser across routes, see this answer for a possible approach.
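One possible shape for that (a sketch, not necessarily the linked answer's approach): launch the browser once and share the resulting promise across handlers, closing only the page per request:
const ejs = require("ejs");
const express = require("express");
const puppeteer = require("puppeteer");

const browserP = puppeteer.launch(); // started once, reused by every request

express()
  .get("/greet", async (req, res) => {
    let page;
    try {
      page = await (await browserP).newPage();
      await page.setContent(await ejs.renderFile("greet.ejs", {name: "world"}));
      res.contentType("image/png");
      res.send(await page.screenshot());
    } catch (err) {
      console.error(err);
      res.sendStatus(500);
    } finally {
      await page?.close(); // close the page, not the shared browser
    }
  })
  .listen(3000);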

Related

Scrape multiple domains with axios, cheerio and handlebars on node js

I am trying to make a web scraper that outputs certain data from Node.js into the JavaScript or HTML file I'm working on. It's important that the data of multiple sub-pages (that I have no code access to) can be scraped and displayed in the same HTML or JS file. The problem is that I can't output the results I get from the axios function into the global scope. If I could, my problem would be solved.
So far I have been trying to use axios to get the data I need and cheerio to modify it. I created a const named "articles" into which I push every title I need from the website I'm scraping.
const axios = require('axios')
const cheerio = require('cheerio')
const express = require('express')
const hbs = require('hbs')

const url = 'https://www.google.com/'
const articles = []

axios(url)
  .then(response => {
    const html = response.data
    const $ = cheerio.load(html)
    $('.sprite', html).parent().children('a').each(function () {
      const text = $(this).attr('title')
      articles.push({
        text
      })
    })
    console.log(articles)
    const finalArray = articles.map(a => a.text);
    console.log(finalArray)
  }).catch(err => console.log(err))
That works well so far. If I output the finalArray I get the array I want. But once I'm outside of the axios function the array is empty. The only way it worked for me is when I put the following code inside the axios function, but in that case I won't be able to scrape multiple websites.
console.log(finalArray) // outputs empty array

// With this function I want to get the array displayed in my home.hbs file.
app.get('/', function (req, res) {
  res.render('views/home', {
    array: finalArray
  })
})
Basically all I need is to get the finalArray into the global scope so I can use it in the app.get function to render the website with the scraped data.
There are two cases here. Either you want to re-run your scraping code on each request, or you want to run the scraping code once when the app starts and re-use the cached result.
New scrape per incoming request:
const axios = require("axios");
const cheerio = require("cheerio");
const express = require("express");

const scrape = () =>
  axios
    .get("https://www.example.com")
    .then(({data}) => cheerio.load(data)("h1").text());

express()
  .get("/", (req, res) => {
    scrape().then(text => res.json({text}));
  })
  .listen(3000);
Up-front, one-off request:
const scrapingResultP = axios
  .get("https://www.example.com")
  .then(({data}) => cheerio.load(data)("h1").text());

express()
  .get("/", (req, res) => {
    scrapingResultP.then(text => res.json({text}));
  })
  .listen(3000);
Result:
$ curl localhost:3000
{"text":"Example Domain"}
It's also possible to do a one-off request without a callback or promise, relying on a race condition to populate a variable that's in scope of both the request handlers and the scraping response handler. Realistically, though, the scrape will have resolved by the time the first request comes in, so it's common to see this:
let result;
axios
  .get("https://www.example.com")
  .then(({data}) => (result = cheerio.load(data)("h1").text()));

express()
  .get("/", (req, res) => {
    res.json({text: result});
  })
  .listen(3000);
Eliminating the race by chaining your Express routes and listener from the axios response handler:
axios.get("https://www.example.com").then(({data}) => {
const text = cheerio.load(data)("h1").text();
express()
.get("/", (req, res) => {
res.json({text});
})
.listen(3000);
});
If you have multiple requests you need to complete before you start the server, try Promise.all. Top-level await or an async IIFE can work too.
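A rough sketch of the Promise.all variant, with placeholder URLs:
const axios = require("axios");
const cheerio = require("cheerio");
const express = require("express");

// Placeholder URLs; scrape them all before the server starts listening.
const urls = ["https://www.example.com", "https://www.example.org"];

Promise.all(urls.map(url => axios.get(url)))
  .then(responses => {
    const titles = responses.map(({data}) => cheerio.load(data)("h1").text());
    express()
      .get("/", (req, res) => {
        res.json({titles});
      })
      .listen(3000);
  });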
Error handling has been left as an exercise.
The problem has been resolved. I used this code instead of the plain axios.get(url) call:
axios.all(urls.map((endpoint) => axios.get(endpoint))).then(
  axios.spread(({data: user}, {data: repos}) => {
    // ... handle both responses here
  }));
With "user" and "repos" I can now access the data from both URLs and run code against whichever URL I want within that one function.

Read data from url in javascript

I'm trying to read data stored in Google Drive (or another cloud storage service) using JavaScript.
let btn = document.getElementById('action-btn');
btn.addEventListener('click', () => {
  // let baseurl = document.getElementById('inputurl').value;
  // let guidDcoument = baseurl.split('#guid=')[1];
  // const query = encodeURIComponent('Select *')
  // fetch('', {
  //   mode: "no-cors",
  //   method: "get",
  // }).then(Response => console.log(Response))
  fetch('https://docs.google.com/spreadsheets/d/1kwfv6L2lBrPw8OjHGyhO7YHOXFNwHYyPI_noM5TUMLw/edit?pli=1#gid=1097732605', {
    mode: "no-cors"
  })
    .then((response) => { console.log(response.error); })
    .catch(console.error)
})
What I need is that, given a URL, I can read the data from the file and then show it on my website.
I think that when I try to access any other cloud storage I will have the same problem: it will be necessary to sign in to the account to be able to read the data, which would be a CSV document.
First of all, what you're trying to achieve is called 'web scraping', and you can't use fetch for that. Instead you should use Puppeteer (on the server side), which is the most popular library for web scraping.
Run npm init -y to initialize an npm project, install Puppeteer with npm i puppeteer, and also install Express and CORS with npm i express cors in order to create a server that scrapes the data and sends it back to your client. So, instead of trying to scrape the information directly from the client, you do it from the server with Puppeteer.
Try the following .js server code:
const express = require('express')
const cors = require('cors')
const puppeteer = require('puppeteer')

const app = express()
app.use(express.json())
app.use(cors())

;(async () => {
  const browser = await puppeteer.launch({headless: false})
  const page = await browser.newPage()

  app.use('/', async (req, res) => {
    await page.goto('url-to-the-website')
    const data = {} // extract whatever you need from the page here
    return res.json(data)
  })
})()

app.listen(3000)
And learn more about puppeteer: https://pptr.dev/.
Your client code should then connect to this server and send scraping requests to it like this:
(async () => {
  const res = await fetch('http://localhost:3000')
  const json = await res.json() // res.json() already parses the JSON body
  console.log(json)
})()
Note: we wrap the code in an anonymous async function in order to use the async/await syntax. More information: https://javascript.info/async-await.

How to access returned values from the client side and display them

I was experimenting with Puppeteer and built a simple scraper that gets information from YouTube, and it works fine. What I'm trying to add is displaying that scraped information on my web page with <p> tags. Is there any way to do this? Where I'm stuck is that my name and avatarURL variables are local to my scrape function, so how can I get those values and insert them into my <p> tags? As a rough sketch of what I tried, I did document.getElementById('nameId')=name; after importing my JS script (on the HTML side), but this won't work because name is a local variable and can't be accessed outside its scope. Any help is appreciated. Thanks in advance.
const puppeteer = require('puppeteer');

async function scrapeChannel(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const [el] = await page.$x('/html/body/ytd-app/div/ytd-page-manager/ytd-browse/div[3]/ytd-c4-tabbed-header-renderer/tp-yt-app-header-layout/div/tp-yt-app-header/div[2]/div[2]/div/div[1]/div/div[1]/ytd-channel-name/div/div/yt-formatted-string');
  const text = await el.getProperty('textContent');
  const name = await text.jsonValue();
  const [el2] = await page.$x('//*[@id="img"]');
  const src = await el2.getProperty('src');
  const avatarURL = await src.jsonValue();
  await browser.close();
  console.log({
    name,
    avatarURL
  })
  return {
    name,
    avatarURL
  }
}

scrapeChannel('https://www.youtube.com/channel/UCQOtt1RZbIbBqXhRa9-RB5g')

module.exports = {
  scrapeChannel,
}
<body onload="scrapeChannel()">
  <p id="nameId">'put the scraped name here'</p>
  <p id="avatarUrlId">'put the scraped avatar url here'</p>
  <!--
    document.getElementById('nameId')=name;
    document.getElementById('avatartUrlId')=avatarURL;
  -->
</body>
I have used cheerio in one of my projects and this is what I did in the backend and in the front end.
Node & Express JS Backend
In order to access your backend from the frontend, you need to set up routes in your backend. All your frontend requests are redirected to these routes. For more information, read Express Routes.
E.g. Route.js code:
const router = require("express").Router();
const { callscrapeChannel } = require("../scrape-code/scrape");

router.route("/scrapedata").get(async (req, res) => {
  const Result = await callscrapeChannel();
  return res.json(Result);
});

module.exports = router;
scrapeChannel.js file
const puppeteer = require('puppeteer');

async function scrapeChannel(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const [el] = await page.$x('/html/body/ytd-app/div/ytd-page-manager/ytd-browse/div[3]/ytd-c4-tabbed-header-renderer/tp-yt-app-header-layout/div/tp-yt-app-header/div[2]/div[2]/div/div[1]/div/div[1]/ytd-channel-name/div/div/yt-formatted-string');
  const text = await el.getProperty('textContent');
  const name = await text.jsonValue();
  const [el2] = await page.$x('//*[@id="img"]');
  const src = await el2.getProperty('src');
  const avatarURL = await src.jsonValue();
  await browser.close();
  console.log({
    name,
    avatarURL
  })
  return {
    name,
    avatarURL
  }
}

async function callscrapeChannel() {
  const data = await scrapeChannel('https://www.youtube.com/channel/UCQOtt1RZbIbBqXhRa9-RB5g')
  return data
}

module.exports = {
  callscrapeChannel,
}
In your server.js file:
const express = require("express");
const cors = require("cors");
const scrapeRoute = require("./Routes/routes");
require("dotenv").config({ debug: process.env.DEBUG });

const port = process.env.PORT || 5000;
const app = express();

app.use(cors());
app.use(express.json());
app.use("/api", scrapeRoute);

app.listen(port, () => {
  console.log(`server is running on port: http://localhost:${port}`);
});
Dependencies you need (package.json):
"dependencies": {
  "axios": "^0.21.1",
  "body-parser": "^1.19.0",
  "cors": "^2.8.5",
  "cross-env": "^7.0.3",
  "dotenv": "^8.2.0",
  "esm": "^3.2.25",
  "express": "^4.17.1",
  "nodemon": "^2.0.7",
  "puppeteer": "^8.0.0"
}
Frontend
In the front-end, I have used fetch. You need to send a GET request to your backend. All you have to do is:
<html>
<head>
  <script>
    async function callScrapeData() {
      await fetch(`http://localhost:5000/api/scrapedata`)
        .then((res) => {
          return new Promise((resolve, reject) => {
            setTimeout(() => {
              resolve(res.json())
            }, 1000)
          })
        })
        .then((response) => {
          console.log(response)
          document.getElementById("nameId").innerHTML = response.name
          document.getElementById("avatartUrlId").innerHTML = response.avatarURL
        })
    }
  </script>
</head>
<body>
  <div>
    <h1>scrape</h1>
    <p id="nameId"></p>
    <p id="avatartUrlId"></p>
    <button onclick="callScrapeData()">click</button>
  </div>
</body>
</html>
Remember, my backend server is running on port 5000
The above code is just an example and I have modified it to fit your question. I hope this helps you to some extent. It's straightforward. Let me know if you have any questions.
Note: I assume you have a server.js file in your backend and it is configured properly.

How to use debugger in express for async functions?

I'm using express and I have the following end-point:
router.get('/', async (_req, res) => {
  const customers = await Customer.find().sort({ name: 1 });
  res.status(200).json(customers);
});
I would like to set a debugger here:
router.get('/', async (_req, res) => {
  debugger // <--- Adding this here
  const customers = await Customer.find().sort({ name: 1 });
  res.status(200).json(customers);
});
Now I run my app like so:
node inspect app.js
Once in the debugger, I try to use the REPL and/or the inspector console in Chrome, but when I try to interact with my Customer model using the await keyword, I get:
SyntaxError: await is only valid in async function
I would like to play with this query in real time while debugging the current state of the app. Is there a way to achieve this in Node.js?

Reconnect To Previous Puppeteer Session In Hosted Environment

I'm currently working on a project that uses Puppeteer to control headless Chrome. Right now I'm hosting my app using Firebase Functions. This works well if I have to do all my browsing in one session, but if I have to come back at a later time, I have trouble reestablishing a connection and resuming where I left off.
Here is my current script.
const express = require('express');
const functions = require('firebase-functions');
const puppeteer = require('puppeteer');
const admin = require('firebase-admin');

admin.initializeApp(functions.config().firebase);
const db = admin.database();
const app = express();

app.get('/openpage', async (req, res) => {
  try {
    const browser = await puppeteer.launch({ args: ['--no-sandbox'] });
    const page = await browser.newPage();
    const url = 'https://www.reddit.com/';
    await page.goto(url, { waitUntil: 'networkidle2' });
    await page.evaluate(() => {
      document.querySelector('input[name="q"]').value = 'dog';
      document.querySelector('[action="/search"]').submit();
    });
    // Here I want to save the current state of the browser
    const endpoint = browser.wsEndpoint();
    console.log('Endpoint', endpoint);
    db.ref('test/').update({ endpoint });
    await browser.close();
  } catch (err) {
    console.log(err);
  }
  res.send('Finished');
});

app.get('/screenshot', async (req, res) => {
  try {
    const endpoint = await db.ref('test/endpoint').once('value').then(snap => snap.val());
    const browser = await puppeteer.connect({ browserWSEndpoint: endpoint }); // This is where it fails
    const pages = await browser.pages();
    const page = pages[0];
    await page.screenshot({ path: 'picture' });
    await browser.close();
  } catch (err) {
    console.error(err);
  }
  res.send('Finished');
});

exports.test = functions.runWith({ memory: '2GB', timeoutSeconds: 60 }).https.onRequest(app);
With this setup, I can make a request to the /openpage endpoint and everything works fine, and I store the browser endpoint in the Firebase Realtime Database. But when I try to resume the session by calling /screenshot, I get an error saying the connection is refused in the puppeteer.connect() call. Is there a different way I should be going about this? Is this a Firebase limitation, or am I missing something about how connections are reestablished in Puppeteer?
Error message: Error: connect ECONNREFUSED 127.0.0.1:62222
On a side note, you have to add "engines": { "node": "8" } to your package.json to be able to run Puppeteer with Firebase Functions.
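For reference, that fragment sits at the top level of package.json, roughly like this:
{
  "engines": {
    "node": "8"
  }
}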
This is because you are closing your browser with this line: await browser.close();. That disconnects and closes the browser, and you won't be able to connect to it again.
You should use browser.disconnect() instead.
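Concretely, a sketch of the change in the /openpage handler above: save the endpoint as before, then disconnect instead of closing.
// Save the endpoint as before...
const endpoint = browser.wsEndpoint();
db.ref('test/').update({ endpoint });
// ...then sever only this script's connection; the Chrome process stays alive
// so /screenshot can later reconnect with puppeteer.connect().
browser.disconnect(); // instead of: await browser.close();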
