Puppeteer: Remove links from page

I am converting a web page into a PDF file with Node.js and Puppeteer.
This works fine, but I want to remove all links from the page before converting it, because otherwise the PDF includes these links, and they can't be opened in my app when someone clicks on them. Is there a way to do so?
The page is an .aspx page that uses JavaScript. The links all start with "javascript:__". It is an intranet page that shows our meals, and I just want to display the meal plan as a PDF.
What I have in my .js file looks like this:
const puppeteer = require('puppeteer');
const url = 'http://my-url.de/meals.aspx';
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });
  await page.pdf({
    format: 'A4',
    path: files[0], // output path; `files` is not defined in this snippet
    displayHeaderFooter: false,
    printBackground: true
  });
  await browser.close();
})();
In my app it says "URL can't be opened", which is why I want these links removed.

It seems that these are not proper links; at least, they are not <a> tags with an href pointing to a website.
Instead, you are dealing with links that require JavaScript to navigate, and that's why they don't work in the PDF.
What you could do is change all these invalid hrefs to something valid for a PDF before capturing the page.
Check my attempt below. It's possible that you'll need to modify it a bit to suit your case, since I don't have access to the actual website you are trying to parse.
const puppeteer = require('puppeteer');
// The semicolon matters here: without it, the "(" on the next line
// would be parsed as a call on the string.
const url = 'http://my-url.de/meals.aspx';
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, {
    waitUntil: 'networkidle2'
  });
  // Modify the page here
  await page.evaluate(() => {
    // Select all links whose href starts with "javascript"
    // and change the href to "#" instead.
    document.querySelectorAll('a[href^="javascript"]')
      .forEach(a => {
        a.href = '#';
      });
  });
  await page.pdf({
    format: 'A4',
    path: files[0], // output path from the question; `files` is not defined in this snippet
    displayHeaderFooter: false,
    printBackground: true
  });
  await browser.close();
})();
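If the "#" links still end up as clickable areas in the PDF, one variation is to drop the href attribute altogether, so the anchors keep their text but stop being links. This is only a minimal sketch under that assumption: the helper names stripJavascriptLinks and pdfWithoutLinks and the outputPath parameter are made up for illustration and are not part of Puppeteer.

```javascript
// Runs inside the page (via page.evaluate). The default parameter lets the
// same function be exercised in Node with a stand-in document object.
function stripJavascriptLinks(doc = document) {
  const links = doc.querySelectorAll('a[href^="javascript:"]');
  links.forEach(a => a.removeAttribute('href'));
  return links.length; // number of anchors that were changed
}

// Hypothetical wrapper: `page` is a Puppeteer Page that has already
// navigated to the target URL; `outputPath` is where the PDF is written.
async function pdfWithoutLinks(page, outputPath) {
  const removed = await page.evaluate(stripJavascriptLinks);
  await page.pdf({
    format: 'A4',
    path: outputPath,
    displayHeaderFooter: false,
    printBackground: true
  });
  return removed;
}
```

In the code above, a call like await pdfWithoutLinks(page, files[0]) would take the place of the page.evaluate(...) and page.pdf(...) calls.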

Related

Puppeteer to save image open in the browser

I have a link to a (GIF) image, obtained manually via 'open in new tab'. I want Puppeteer to open the image and then save it to a file. In a normal browser I would right-click and choose 'Save' from the context menu. Is there a simple way to perform this action in Puppeteer?

The code below saves the Wikipedia logo image as logo.png:
import * as fs from 'fs'
import puppeteer from 'puppeteer'
;(async () => {
  const wikipedia = 'https://www.wikipedia.org/'
  const browser = await puppeteer.launch()
  const page = (await browser.pages())[0]
  await page.goto(wikipedia)
  const image = await page.waitForSelector('img[src][alt="Wikipedia"]')
  const imgURL = await image.evaluate(img => img.getAttribute('src'))
  // Resolve the (possibly relative) src against the page URL
  const absoluteURL = new URL(imgURL, wikipedia).href
  const pageNew = await browser.newPage()
  const response = await pageNew.goto(absoluteURL, {timeout: 0, waitUntil: 'networkidle0'})
  const imageBuffer = await response.buffer()
  await fs.promises.writeFile('./logo.png', imageBuffer)
  await page.close()
  await pageNew.close()
  await browser.close()
})()
Please mark this as the accepted answer if it helps you.

In Puppeteer it's possible to right-click, but it's not possible to automate navigating the "Save as" menu. However, there is a solution outlined in the top answer to this question:
How can I download images on a page using puppeteer?
You can write the images to disk directly from the page response.
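Writing images from the page response could look roughly like the sketch below. It listens for Puppeteer's 'response' event and saves every image that arrives. The helper names fileNameFromUrl and saveImageResponses and the outputDir parameter are assumptions for illustration only, and images that share a file name simply overwrite each other.

```javascript
const fs = require('fs');
const path = require('path');

// Derive a local file name from an image URL, ignoring the query string.
function fileNameFromUrl(imageUrl, fallback = 'image') {
  const base = path.posix.basename(new URL(imageUrl).pathname);
  return base || fallback;
}

// Hypothetical helper: call this before page.goto() so that every image
// response of the page load is written into `outputDir`.
function saveImageResponses(page, outputDir) {
  page.on('response', async response => {
    if (!response.url().startsWith('http')) return; // skip data: URLs etc.
    if (response.request().resourceType() !== 'image' || !response.ok()) return;
    try {
      const buffer = await response.buffer();
      const target = path.join(outputDir, fileNameFromUrl(response.url()));
      await fs.promises.writeFile(target, buffer);
    } catch (err) {
      // Some response bodies cannot be read (e.g. after a redirect); skip them.
      console.error(`Could not save ${response.url()}: ${err.message}`);
    }
  });
}
```

Usage would be along the lines of saveImageResponses(page, './images') followed by await page.goto(url, { waitUntil: 'networkidle0' }), assuming the ./images directory already exists.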

Get current page url with Playwright Automation tool?

How can I retrieve the current URL of the page in Playwright?
Something similar to browser.getCurrentUrl() in Protractor?

const { browser } = this.helpers.Playwright;
await browser.pages(); // list pages in the browser
// get current page
const { page } = this.helpers.Playwright;
const url = await page.url(); // get the url of the current page

To get the URL of the current page as a string (no await needed):
page.url()
Here "page" is an instance of the Page class. You should already have a Page object; there are various ways to obtain one, depending on how your framework is set up: https://playwright.dev/docs/api/class-page
In TypeScript, the Page type can be imported with
import { Page } from '@playwright/test';
or a page can be created like this:
const { webkit } = require('playwright');
(async () => {
  const browser = await webkit.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  console.log(page.url()); // URL of the current page
  await browser.close();
})();
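Since page.url() returns a plain string synchronously, it can be passed straight to the standard URL class when only part of the address is of interest. A small sketch, where currentPath is a hypothetical helper name and page is assumed to be a Playwright Page:

```javascript
// `page` is a Playwright Page; page.url() returns a string, no await needed.
function currentPath(page) {
  const { pathname, search } = new URL(page.url());
  return pathname + search;
}
```

This can be handy in a test that only cares about the route, for example comparing currentPath(page) against an expected path after a click.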

Puppeteer is not considering the value of localStorage if I set the value using addEventListener

I am trying to convert an HTML web page into a PDF file using Puppeteer. I store a value in localStorage and read it back to change the font size of the h1. The problem is that if I store the value in localStorage inside an event listener, Puppeteer seems to ignore it and converts the page with the default font size. But if I call setItem outside of any event listener, Puppeteer picks up the localStorage value and converts the page with the new font size. I want it to work when I call setItem inside an event listener.
I have tried changing the event listener to 'beforeprint', but I got the same results.
let link = document.querySelector('a');
let heading = document.querySelector('h1');
window.addEventListener('DOMContentLoaded', () => {
  let fSize = localStorage.getItem('size');
  heading.style.fontSize = `${fSize}px`
})
localStorage.setItem('size', 500); // If I call it here, Puppeteer picks up the localStorage value
link.addEventListener('click', (e) => {
  localStorage.setItem('size', 500); // but if I call it here, it does not pick up the localStorage value
})
<h1>A Heading</h1>
download
//puppeteer code snippet
let printPDF = async () => {
  const filePath = path.resolve('./file.pdf')
  const fileUrl = 'http://127.0.0.1:3000'
  const browser = await puppeteer.launch({
    args: ['--no-sandbox'],
    headless: true
  });
  const page = await browser.newPage();
  try {
    await page.goto(fileUrl)
    await page.pdf({
      format: 'A4',
      path: filePath,
      printBackground: true
    });
    await page.close();
    await browser.close();
  } catch (error) {
    await page.close();
    await browser.close();
  }
}
app
  .route('/')
  .get(getIndexPage);
app.route('/download').get((req, res) => {
  printPDF().then(() => {
    res.sendFile('./downloadPage.html', {
      root: __dirname
    })
  }).catch(e => console.log(e))
})

Puppeteer opens an entirely new browser when you run it, and that browser doesn't have the same localStorage data as the browser in which you clicked your link. Your browser and the browser Puppeteer spins up each have their own localStorage.
The reason it worked in the first case is that that code runs every time the page is loaded, including when the Puppeteer browser loads it.
Any changes you make to a web page after it loads (like those made in a click handler) won't be there when Puppeteer loads the page on its own a few seconds later. It's like a refresh.
Could you just pass the data you need from the client to the server?
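One way to pass the data along could be to send the size with the download request (say, as a query parameter) and have Puppeteer seed its own localStorage before the page scripts run. The sketch below assumes exactly that; parseFontSize, printPdfWithSize and the ?size= parameter are hypothetical names, while page.evaluateOnNewDocument is the Puppeteer method that runs a function in the page before any of the page's own scripts.

```javascript
// Validate a client-supplied font size (e.g. from ?size=500); fall back to a default.
function parseFontSize(raw, fallback = 16) {
  const size = Number.parseInt(raw, 10);
  return Number.isInteger(size) && size > 0 && size <= 1000 ? size : fallback;
}

// Hypothetical variant of printPDF: `page` is a fresh Puppeteer Page.
async function printPdfWithSize(page, fileUrl, filePath, size) {
  // Registered before navigation, so it runs ahead of the page's own scripts.
  await page.evaluateOnNewDocument(value => {
    localStorage.setItem('size', String(value));
  }, size);
  await page.goto(fileUrl);
  await page.pdf({ format: 'A4', path: filePath, printBackground: true });
}
```

In the /download route this might be used as printPdfWithSize(page, fileUrl, filePath, parseFontSize(req.query.size)), with the link on the client pointing to /download?size=500.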

Puppeteer creates bad pdf

I am using Puppeteer to create a PDF from a static local HTML file. The PDF is created, but it's corrupted: Adobe Reader can't open the file and says 'Bad file handle'. Any suggestions?
I am using the standard code below:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('local_html_file', {waitUntil: 'networkidle2'});
  await page.pdf({path: 'hn.pdf', format: 'A4'});
  await browser.close();
})();
I also tried setContent(), with the same result. The page.screenshot() function works, however.

Your code probably throws an exception. Check that the PDF file size is not zero, and try reading the file with the less or cat command. Sometimes PDF-generating software writes error messages at the top of the file content.
const puppeteer = require('puppeteer');
(async () => {
  try {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('local_html_file', {waitUntil: 'networkidle2'});
    await page.pdf({path: 'hn.pdf', format: 'A4'});
    await browser.close();
  } catch (e) {
    console.log(e);
  }
})();
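The size and content check suggested above can also be done from Node instead of with less or cat. A minimal sketch, assuming the file was written to disk; looksLikePdf and describePdfFile are made-up helper names, and the check relies on the fact that a PDF normally begins with the bytes "%PDF-":

```javascript
const fs = require('fs');

// A PDF normally starts with "%PDF-"; anything else hints at a failed write.
function looksLikePdf(buffer) {
  return buffer.length >= 5 && buffer.subarray(0, 5).toString('latin1') === '%PDF-';
}

// Hypothetical helper: read the file Puppeteer wrote and describe what is in it.
function describePdfFile(filePath) {
  const buffer = fs.readFileSync(filePath);
  if (buffer.length === 0) return 'empty file';
  return looksLikePdf(buffer)
    ? 'PDF header found'
    : 'not a PDF, starts with: ' + buffer.subarray(0, 40).toString('latin1');
}
```

Calling console.log(describePdfFile('hn.pdf')) after page.pdf() would then show whether the output at least starts like a PDF.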

The issue was the PDF file name I gave: 'con.pdf'.
This seems to be a reserved name in Windows, hence the bad file handle. :D
What a coincidence!
Thanks, everyone.
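For anyone who wants to guard against this up front, the output name can be checked before it is handed to page.pdf(). This is a rough sketch only: isReservedWindowsName is a hypothetical helper, and the list covers the commonly documented reserved device names (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9), so it should not be taken as exhaustive.

```javascript
const path = require('path');

// Device names that Windows reserves, with or without a file extension.
const RESERVED = /^(con|prn|aux|nul|com[1-9]|lpt[1-9])$/i;

function isReservedWindowsName(filePath) {
  // Use the win32 flavour so backslash-separated paths are split correctly.
  const base = path.win32.basename(filePath);
  const stem = base.split('.')[0].trim();
  return RESERVED.test(stem);
}
```

With this in place, isReservedWindowsName('con.pdf') flags the name from this question, while a name such as 'hn.pdf' passes.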

Extract text from a font tag on nodeJs

I'm using Cheerio to extract information from the HTML of different web pages.
However, there is a website in which the text that I want to extract is included in a script tag; therefore that piece of code wasn't accessible to Cheerio's methods.
So, looking for a solution, I found that it is possible to run that script using Puppeteer, a Node API for controlling a Chrome instance.
Using this, though probably not in the best way since I only discovered it a few days ago, I finally obtained the HTML that I need.
Unfortunately, I am not able to extract the information that I need.
This is the HTML from which I want to extract the data:
<h2 class="property-price">
  <a href="blablabla">
    <strong>
      <font style="vertical-align: inherit;">
        <font style="vertical-align: inherit;">Text that I wanna extract</font>
      </font>
      <small></small>
    </strong>
  </a>
</h2>
And this is the code that I used to try to extract the text, without success:
var cheerio = require("cheerio");
const puppeteer = require('puppeteer');
var $;
const POST_LINK_SELECTOR = 'div.property-title';
(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('myUrl', {
    timeout: 0
  });
  $ = cheerio.load(renderedContent); // renderedContent is not defined in this snippet
  console.log($('h2.property-price').find('font').children().text());
  await browser.close();
})();
I'm sure that this is not the best way to obtain the text that I need, so if you have any suggestions I will happily accept them.
Furthermore, I would like to know whether it is possible to extract what I need using the Puppeteer API directly, or whether I need to use Cheerio (as I did in my case, which doesn't work anyway).
Thank you

You can get the data directly with Puppeteer, using the page.evaluate method:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('myUrl', {waitUntil: "networkidle0"});
  const text = await page.evaluate(() => document.querySelector("h2.property-price a").textContent.trim())
  console.log(text);
  await browser.close();
})();
If you'd like to keep using Cheerio's jQuery-like syntax, that can be done too: just add jQuery to the page (if the site doesn't use it already):
await page.goto(...);
await page.addScriptTag({url: 'https://code.jquery.com/jquery-3.2.1.min.js'});