Puppeteer getting response from pdf download link - javascript

I'm automating regression testing for a website and one of the tasks is to verify pdf downloads. I'm using Puppeteer and Chromium for this. I've found that it's rather difficult to download files in headless mode. Instead of downloading the file, I thought it might be prudent to look for a response from the page and the size of the file. My issue: when I try to navigate to the page, nothing seems to happen. I receive a timeout error. Here is the code I'm attempting to use:
const filename = new RegExp('\S*(\.pdf)');
await page.waitForSelector('#download-pdf', {timeout: timeout});
console.log('Clicking on "Download PDF" button');
const link = await page.$eval('#download-pdf', el => el.href);
await Promise.all([
  page.goto(link),
  page.on('response', response => {
    if (response._headers['content-disposition'] === `attachment;filename=${filename}`) {
      console.log('Size: ', response._headers['content-length']);
    }
  })
]);
EDIT
If anyone understands why page.goto() refuses to navigate to .pdf pages in headless mode, that would be very useful to me.
Let me define the problem better. Upon clicking the download PDF button on the webpage, an event is triggered that generates the PDF file and sends the user to a unique URL. This URL expires after a short period. To get to that point, I believe I must use page.click() to trigger the event and generate the URL. However, page.click() also attempts to navigate to the PDF URL, which is rejected in headless mode. What I need to do is get the URL and test for a response from it.

I figured out a solution. I'll post it here for anyone else who runs into a similar problem. The idea is to register an event listener that listens for all responses; since I only care about responses from URLs ending in .pdf, I only act on those.
page.on('response', intercept => {
  if (intercept.url().endsWith('.pdf')) {
    console.log(intercept.url());
    console.log('HTTP status code: %d', intercept.status());
    console.log(intercept.headers());
  }
});
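To tie this back to the original goal of checking the file size without downloading anything, a sketch along these lines should work. The #download-pdf selector comes from the question; the rest of the wiring is an assumption about how the click and the listener fit together, and content-length is only present if the server sends it:

// Register the listener before triggering the download so the response isn't missed.
page.on('response', response => {
  if (response.url().endsWith('.pdf')) {
    const headers = response.headers();
    console.log('PDF URL:', response.url());
    console.log('HTTP status code: %d', response.status());
    console.log('Size: %s bytes', headers['content-length']);
  }
});

// Trigger the event that generates the unique PDF URL.
await page.click('#download-pdf');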

Related

How to get a screenshot/preview of another website

Is there a way to get a screenshot of another website's pages?
e.g.: you enter a URL in an input, hit enter, and a script gives you a screenshot of the site you entered. I managed to do it with headless browsers, but I fear that launching, say, PhantomJS each time the input is used could take too many resources and too much time, since the headless browser would need to fetch the data fresh every time. I looked into HotJar, which does something similar to what I'm looking for, but it gives you a script that you must put into the page header (which is fine by me) and afterwards you get a preview. How does it work, and how can one replicate it?
Do you want a screenshot of your own page or someone else's?
Own page
Use Puppeteer or PhantomJS on every build of your site; that way you only run it when the site changes and have a screenshot ready at any time.
Foreign page
You have access to it (the owner runs your script)
Either try to get into their build pipeline and use the solution from above,
or use this solution: Using HTML5/Canvas/JavaScript to take in-browser screenshots.
You don't have any access
Use some long-running process that will give you a screenshot when asked.
Imagine a server with one URL endpoint: screenshot.example.com?facebook.com.
The long-running server has a Puppeteer/PhantomJS instance ready to go; when given a URL, it will load that page, take the screenshot and send it back. The browser will simply treat it as a slow image request.
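A minimal sketch of such an endpoint, assuming Express and Puppeteer (the route, port and query handling here are illustrative, not a finished service):

const express = require('express');
const puppeteer = require('puppeteer');

const app = express();

// e.g. GET /screenshot?url=https://example.com
app.get('/screenshot', async (req, res) => {
  const url = req.query.url;
  if (!url) return res.status(400).send('Missing url parameter');

  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    const png = await page.screenshot({ type: 'png' });
    res.type('png').send(png);
  } catch (err) {
    res.status(500).send(err.message);
  } finally {
    await browser.close();
  }
});

app.listen(3000);

In a real service you would keep a single browser instance alive and reuse it instead of launching one per request.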
You can make this with puppeteer
install with: npm i puppeteer
save the following code to example.js
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
and run it with:
node example.js

Load a SPA webpage via AJAX

I'm trying to fetch an entire webpage using JavaScript by plugging in the URL. However, the website is built as a Single Page Application (SPA) that uses JavaScript / Backbone.js to dynamically load most of its contents after rendering the initial response.
So for example, when I route to the following address:
https://connect.garmin.com/modern/activity/1915361012
And then enter this into the console (after the page has loaded):
var $page = $("html")
console.log("%c✔: ", "color:green;", $page.find(".inline-edit-target.page-title-overflow").text().trim());
console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
Then I'll get the dynamically loaded activity title as well as the statically loaded page footer:
However, when I try to load the webpage via an AJAX call with either $.get() or .load(), I only get the initial response delivered (the same as the content seen via view-source):
view-source:https://connect.garmin.com/modern/activity/1915361012
So if I use either of the following AJAX calls:
// jQuery.get()
var url = "https://connect.garmin.com/modern/activity/1915361012";
jQuery.get(url, function(data) {
  var $page = $("<div>").html(data);
  console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim());
  console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});

// jQuery.load()
var url = "https://connect.garmin.com/modern/activity/1915361012";
var $page = $("<div>");
$page.load(url, function(data) {
  console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim());
  console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});
I'll still get the initial footer, but won't get any of the other page contents:
I've tried the solution here to eval() the contents of every script tag, but that doesn't appear robust enough to actually load the page:
jQuery.get(url, function(data) {
  var $page = $("<div>").html(data);
  $page.find("script").each(function() {
    var scriptContent = $(this).html(); // Grab the content of this tag
    eval(scriptContent); // Execute the content
  });
  console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim());
  console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});
Q: Are there any options to fully load a webpage so that it is scrapable via JavaScript?
You will never be able to fully replicate by yourself what an arbitrary (SPA) page does.
The only way I see is using a headless browser such as PhantomJS, Headless Chrome, or Headless Firefox.
I wanted to try Headless Chrome so let's see what it can do with your page:
Quick check using internal REPL
Load that page with Chrome Headless (you'll need Chrome 59 on Mac/Linux, Chrome 60 on Windows), and find page title with JavaScript from the REPL:
% chrome --headless --disable-gpu --repl https://connect.garmin.com/modern/activity/1915361012
[0830/171405.025582:INFO:headless_shell.cc(303)] Type a Javascript expression to evaluate or "quit" to exit.
>>> $('body').find('.page-title').text().trim()
{"result":{"type":"string","value":"Daily Mile - Round 2 - Day 27"}}
NB: to get chrome command line working on a Mac I did this beforehand:
alias chrome="'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'"
Using programmatically with Node & Puppeteer
Puppeteer is a Node library (by Google Chrome developers) which provides a high-level API to control headless Chrome over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome.
(Step 0 : Install Node & Yarn if you don't have them)
In a new directory:
yarn init
yarn add puppeteer
Create index.js with this:
const puppeteer = require('puppeteer');

(async () => {
  const url = 'https://connect.garmin.com/modern/activity/1915361012';

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Go to URL and wait for the page to load
  await page.goto(url, {waitUntil: 'networkidle2'});

  // Wait for the results to show up
  await page.waitForSelector('.page-title');

  // Extract the results from the page
  const text = await page.evaluate(() => {
    const title = document.querySelector('.page-title');
    return title.innerText.trim();
  });

  console.log(`Found: ${text}`);

  await browser.close();
})();
Result:
$ node index.js
Found: Daily Mile - Round 2 - Day 27
First off: avoid eval - your content security policy should block it and it leaves you open to easy XSS attacks. Scraping bots definitely won't run it.
The problem you're describing is common to all SPAs - when a person visits they get your app shell script, which then loads in the rest of the content - all good. When a bot visits they ignore the scripts and return the empty shell.
The solution is server-side rendering. One way to do this: if you're using a JS renderer (say React) and Node.js on the server, you can fairly easily build the JS and serve it statically.
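As a rough illustration of that first option, here is a minimal server-side rendering sketch with Express and React; the App component and route are placeholders, not part of the original answer:

const express = require('express');
const React = require('react');
const ReactDOMServer = require('react-dom/server');

// Placeholder component standing in for the real app shell.
const App = (props) =>
  React.createElement('div', null, `Activity: ${props.title}`);

const app = express();

app.get('/activity/:id', (req, res) => {
  // Render on the server the same markup a browser would build,
  // so bots and $.get() receive complete HTML.
  const html = ReactDOMServer.renderToString(
    React.createElement(App, { title: `Activity ${req.params.id}` })
  );
  res.send(`<!DOCTYPE html><html><body><div id="root">${html}</div></body></html>`);
});

app.listen(3000);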
However, if you aren't then you'll need to run a headless browser on your server that executes all the JS a user would and then serves up the result to the bot.
Fortunately someone else has already done all the work here. They've put a demo online that you can try out with your site:
I think you should know the concept of an SPA.
An SPA (Single Page Application) is served as a single static HTML file; when the route changes, the page creates or modifies DOM nodes dynamically with JavaScript to achieve the effect of switching pages.
Therefore, if you use $.get(), the server responds with that static HTML file, which contains only the initial shell, so you won't get what you want.
If you want to use $.get(), there are two ways. The first is using a headless browser, for example headless Chrome, PhantomJS, etc.; it will load the page for you and you can then read the DOM nodes of the loaded page. The second is SSR (Server Side Rendering); with SSR you get the full HTML of the page directly from $.get(), because the server responds with the rendered HTML of the corresponding page for each route.
Reference:
SSR
the SSR framework for Vue: Nuxt.js
PhantomJS
Node API of Headless Chrome

Page loading takes time

My website uses AJAX to fetch and display data often. Every time a request is made, it displays the data quickly but keeps on loading, and I'm unable to click on anything else; only once the page has fully loaded can I interact with it.
I checked the console and I would like to understand what causes the following :
When I clicked on one of the links marked in red above, I also got this in the console. I don't have any Facebook share link or like button, and I would like to understand what causes this error, please.
(function () {
  if (window.g_clrDimensionsSent) return;
  window.g_clrDimensionsSent = true;

  var data = new FormData();
  data.append('windowWidth', window.innerWidth);
  data.append('windowHeight', window.innerHeight);
  data.append('headHtml', window.document.head.outerHTML);
  data.append('bodyHtml', window.document.body.outerHTML);

  var xhr = new XMLHttpRequest();
  xhr.open('POST', document.location.protocol + '//__fake__.com');
  xhr.send(data);
})()
It looks like you are connecting from your local machine and not the website for which your Facebook page is set up. Have you checked the settings of your Facebook app?
Usually you should use your hosts file to mimic your live URL, or you must set the Canvas, Public etc. URLs to "localhost" rather than your live website.
As for the parts in red, your site is making a POST to "__fake__.com", which is most likely malware in your browser, unless you specifically coded calls to post to that URL.
Run Malwarebytes to confirm, or disable all the plugins in your browser. The "VM" prefix means it's the browser throwing the error, not the website. Check whether you also get those errors on other pages.

How can I handle HTTP error responses when using an iframe file upload?

I am using the following dirty workaround code to simulate an ajax file upload. This works fine, but when I set maxAllowedContentLength in web.config, my iframe loads 'normally' but with an error message as content:
dataAccess.submitAjaxPostFileRequest = function (completeFunction) {
  $("#userProfileForm").get(0).setAttribute("action", $.acme.resource.links.editProfilePictureUrl);

  var hasUploaded = false;

  function uploadImageComplete() {
    if (hasUploaded === true) {
      return;
    }
    var responseObject = JSON.parse($("#upload_iframe").contents().find("pre")[0].innerText);
    completeFunction(responseObject);
    hasUploaded = true;
  }

  $("#upload_iframe").load(function() {
    uploadImageComplete();
  });

  $("#userProfileForm")[0].submit();
};
In my Chrome console, I can see
POST http://acmeHost:57810/Profile/UploadProfilePicture/ 404 (Not Found)
I would much prefer to detect this error response in my code over the risky business of parsing the iframe content and guessing there was an error. For 'closer-to-home' errors, I have code that sends a JSON response, but for maxAllowedContentLength, IIS sends a 404.13 long before my code is ever hit.
There is not much you can do if you have no control over the error. If the submission target is in the same domain as the submitter and you are not limited by the same-origin policy (SOP), you can try to access the content of the iframe and figure out whether it is showing a success or an error message. However, this is a very bad strategy.
Why an IFRAME? It is a pain.
If you want to upload files without the page flickering or transitioning, you can use the JS File API: File API File Upload - Read XMLHttpRequest in ASP.NET MVC
The support is very good: http://caniuse.com/#feat=filereader
For old browsers that do not support the File API, just provide a normal form POST. Not pretty... but OK for old browsers.
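As a rough sketch of that approach, using XMLHttpRequest with FormData (the /Profile/UploadProfilePicture/ URL is borrowed from the question's console output; the element id and callback wiring are illustrative):

function uploadProfilePicture(file, completeFunction) {
  var data = new FormData();
  data.append('file', file);

  var xhr = new XMLHttpRequest();
  xhr.open('POST', '/Profile/UploadProfilePicture/');

  xhr.onload = function () {
    if (xhr.status >= 200 && xhr.status < 300) {
      completeFunction(JSON.parse(xhr.responseText));
    } else {
      // Unlike the iframe trick, the real status code is available here,
      // including IIS's 404 rejection (substatus 404.13) for oversized uploads.
      console.error('Upload failed with HTTP status', xhr.status);
    }
  };
  xhr.onerror = function () {
    console.error('Network error during upload');
  };

  xhr.send(data);
}

// Usage, wired to <input type="file" id="profilePicture">:
document.getElementById('profilePicture').addEventListener('change', function () {
  uploadProfilePicture(this.files[0], function (response) {
    console.log('Upload complete', response);
  });
});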
UPDATE
Since there is no chance for you to use that API... Years ago I was in the same situation and the outcome was not straightforward. Basically, I created an upload ticket system where, to upload a file, you had to:
create a ticket via POST /newupload/, which would return a GUID.
create an iframe pointing to /newupload/dialog/<guid>, which would show the file submission form posting to POST /newupload/<guid>/file.
serve the upload status at GET /newupload/<guid>/status.
check the status of the upload from the submitter (the iframe's outer container) every 500ms, as sketched below the list.
when the upload has started, hide the iframe or show something fancy like an endless progress bar.
when the upload operation has completed or faulted, remove the iframe and let the user know how it went.
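A rough sketch of that status polling from the outer container, assuming the hypothetical endpoints above and jQuery (which the rest of the question already uses):

function pollUploadStatus(guid, onDone) {
  var timer = setInterval(function () {
    // GET /newupload/<guid>/status is the hypothetical status endpoint above.
    $.get('/newupload/' + guid + '/status', function (status) {
      if (status.state === 'completed' || status.state === 'faulted') {
        clearInterval(timer);
        $('#upload_iframe').remove(); // tear down the hidden iframe
        onDone(status);
      }
      // otherwise keep polling; optionally update a progress indicator here
    });
  }, 500);
}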
When I moved to the FileReader API... it was a good day.

Zombie.js - Downloading files support

I'm trying to handle download prompts in Zombie.js, looking through the API I don't see anything indicating how to do so.
Basically what I'm trying to do is navigate through an authentication required website, then click a button on the site (no href) that then automatically engages a download. The downloaded file will then be renamed and sent to a specified folder.
Is there a way to achieve this?
Zombie.js doesn't seem to provide a method to directly do what you want, but internally it uses request to download files, then emits a response event which you can listen for (see resources.coffee):
var browser = new Zombie();

browser.on('response', function(request, response) {
  browser.response = response;
});

browser.visit('http://test.com/', function() {
  browser.clickLink('Download the file', function() {
    // the 'response' handler should have run by now
    var fileContents = browser.response.body;
  });
});
This seems to work pretty well for me.
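To then rename the file and write it to a specific folder, as the question asks, something along these lines should work; the target folder and filename are just examples, and the captured body may be a Buffer or a string depending on the response:

var fs = require('fs');
var path = require('path');

browser.visit('http://test.com/', function() {
  browser.clickLink('Download the file', function() {
    // Write the captured response body to the desired folder under a new name.
    var target = path.join('downloads', 'renamed-file.pdf');
    fs.writeFileSync(target, browser.response.body);
  });
});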
Possibly try PhantomJS:
http://phantomjs.org
You should be able to manipulate the DOM to trigger the download:
https://github.com/ariya/phantomjs/wiki/Page-Automation
You might have to write a separate script to do the file renaming.
As far as I know, and from taking a detailed look at the API of Zombie.js, I'd say no, this is not possible.
I know it's not the answer you hoped for, but truth is not always nice.
