I am using wkhtmltopdf on Ubuntu 18.04 LTS from the command line to create a PDF from a webpage. The page's content is updated once a day and populated with current information by JavaScript.
When I use wkhtmltopdf it renders all the "bare"/"empty" HTML, but without the current information added by the JavaScript, as if the JavaScript had not yet run for the PDF or was simply ignored.
I am using wkhtmltopdf --javascript-delay 60000 --no-stop-slow-scripts --enable-javascript https://example.com/ test.pdf
Do you know how to create the PDF only after the JavaScript has run and filled the page with current information?
My solution:
After debugging the JavaScript via wkhtmltopdf (with the --debug-javascript option added), as @Nitin suggested, I found that I had a SyntaxError (parse error).
I changed parts of the JavaScript from a "for..of" to a "for..in" loop. This solved the problem and the PDF was created correctly. Thanks @Daphoque for explaining that this is probably due to the JavaScript version used by wkhtmltopdf.
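The original script isn't shown in the question, so here is a hypothetical sketch of the kind of rewrite described. Note that the two loop forms are not drop-in equivalents: "for..of" (ES2015) iterates over values and triggers a SyntaxError in old engines such as the QtWebKit engine bundled with wkhtmltopdf, while "for..in" iterates over keys (indices), so the value has to be looked up explicitly.

```javascript
// Hypothetical example — the variable names are made up for illustration.
var rows = ['a', 'b', 'c'];

// ES2015 syntax that an old engine rejects at parse time:
//   for (var row of rows) { rendered.push(row); }

// Pre-ES2015 rewrite: "for..in" yields the KEYS, so index into the array.
var rendered = [];
for (var i in rows) {
  rendered.push(rows[i]);
}
// rendered now holds ['a', 'b', 'c']
```

A plain counting loop (`for (var i = 0; i < rows.length; i++)`) is an even safer rewrite, since "for..in" can also pick up inherited enumerable properties on some objects.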
I had an issue like this with jQuery, but it was working fine with vanilla JS.
Are you using a JavaScript library?
Related
I have a requirement to create a PDF. For me the best way is to render an HTML template and create the PDF with a third-party library. I came across a solution that renders the HTML with ejs and creates the PDF with html-pdf. It works fine, but I had a problem with page breaks.
There is a popular module, pdfkit, but it uses its own concepts and procedures to render a PDF. The Node version does not render HTML files, although the Python library of the same name can render HTML templates.
Please tell me how I can render an HTML template to PDF using pdfkit, and also what the best way is to render HTML and convert it to PDF?
Thanks
Puppeteer is the best way to convert HTML to PDF, and also works well for web scraping.
A short guide to generating a PDF from HTML can be seen here.
Also, the Chrome DevTools team maintains the library, so it is a well-supported solution.
About the page break: this issue can be solved in the HTML code with a style rule:
style="page-break-after:always;"
The problem with using the PDF converter libraries available on npm, like pdfkit, is that you have to recreate the page structure in your HTML templates to get the desired output.
One of the best approaches to rendering HTML and converting it to PDF is to use Puppeteer on Node.js. Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. It can be used to generate screenshots and PDFs of HTML pages, as in your case.
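As a minimal sketch of the Puppeteer approach: the snippet below assumes `npm install puppeteer` has been run, and the URL and output path are placeholders. `puppeteer.launch`, `page.goto`, and `page.pdf` are the library's real API; the helper function itself is just an illustration.

```javascript
// PDF options passed to page.pdf(); Chromium honors CSS page-break rules
// (e.g. page-break-after: always) when printing, which addresses the
// page-break problem mentioned above.
const pdfOptions = {
  path: 'output.pdf',     // where the PDF is written
  format: 'A4',
  printBackground: true,  // include CSS background colors/images
};

async function htmlToPdf(url) {
  // Required lazily so this sketch stays self-contained until it is run.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // 'networkidle0' waits until the page has stopped making network
  // requests, so JavaScript-generated content ends up in the PDF.
  await page.goto(url, { waitUntil: 'networkidle0' });
  await page.pdf(pdfOptions);
  await browser.close();
}

// htmlToPdf('https://example.com');
```

Because a full Chromium renders the page, this sidesteps the "recreate the page structure in templates" problem: whatever the browser displays is what the PDF contains.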
Can anyone suggest how to use wkhtmltopdf in JS to generate PDF files from static html files?
wkhtmltopdf - http://code.google.com/p/wkhtmltopdf/
All the answers say to run "wkhtmltopdf http://www.google.com google.pdf"
But where do I run that? On a server? In a browser? On a command line?
Any code samples would be helpful.
As per the project's homepage: it's a command line tool.
What is it?
wkhtmltopdf and wkhtmltoimage are open source (LGPL) command line tools to render HTML into PDF and various image formats using the QT Webkit rendering engine. These run entirely "headless" and do not require a display or display service.
In your server-side application, you can call the binary and wait for its output (for PHP, it's basically $output = exec("wkhtmltopdf $url $options");). There used to be an example wrapper on the project's Google Code wiki, but it seems to be gone.
If you're using PHP and don't want to write a wrapper yourself then you can use this one. If there's no wrapper for your chosen server-side language, then try making one yourself. It's not hard.
Your front-end JavaScript will have to call (probably using AJAX) a script that runs on your server to generate the PDF.
I need to scrape a site with Python. I obtain the source HTML with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (which is included in the HTML source). What this function does on the site is output some HTML when you press a button. How can I "press" this button with Python code? Can Scrapy help me? I captured the POST request with Firebug, but when I try to send it to the URL I get a 403 error. Any suggestions?
In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.
You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
Since there is no comprehensive answer here, I'll go ahead and write one.
To scrape JS-rendered pages, we need a browser with a JavaScript engine (i.e., one that supports JavaScript rendering).
Options like Mechanize and urllib2 will not work, since they do NOT support JavaScript.
So here's what you do:
Set up PhantomJS to run with Selenium. After installing the dependencies for both of them (refer to this), you can use the following code as an example to fetch the fully rendered website.
from selenium import webdriver
from bs4 import BeautifulSoup  # this import was missing from the original snippet

# Note: PhantomJS is no longer maintained and newer Selenium releases have
# dropped support for it; a headless Chrome or Firefox driver works the same way.
driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
# page_source fetches the page after JavaScript rendering is complete
soupFromJokesCC = BeautifulSoup(driver.page_source, 'html.parser')
driver.save_screenshot('screen.png')  # save a screenshot to disk
driver.quit()
I have had to do this before (in .NET) and you are basically going to have to host a browser, get it to click the button, and then interrogate the DOM (document object model) of the browser to get at the generated HTML.
This is definitely one of the downsides of web apps moving towards an Ajax/JavaScript approach to generating HTML client-side.
I use webkit, which is the browser renderer behind Chrome and Safari. There are Python bindings to webkit through Qt. And here is a full example to execute JavaScript and extract the final HTML.
For Scrapy (a great Python scraping framework) there is scrapyjs: an additional downloader/middleware handler able to scrape JavaScript-generated content.
It's based on the WebKit engine via pygtk, python-webkit, and python-jswebkit, and it's quite simple to use.
I am trying to generate PDF files in a cakePHP app, but so far only the static HTML ends up in the file. The problem is that the main content of the page (a calendar) is produced by JavaScript, which is completely ignored when generating the PDF. What is the best solution in this case?
I really appreciate your help.
If you use something like wkhtmltopdf it should work, as it renders the page with a real browser engine (Qt WebKit), so the JavaScript is executed.
There is a ready-made plugin that works out of the box (after installing wkhtmltopdf).
I'm trying to use urllib2 to fetch a webpage from a website. After I managed to log in and retrieve the page, I found out the page has some <script>.....</script> inside. How can I save the rendered output (the complete content of the webpage, not the script)?
JavaScript can't easily be handled if you are using urllib.
What you need is a headless browser, for example one based on WebKit.
A simple example can be found here.
If you don't want to limit yourself to Python, try PhantomJS.
I'd also like to mention pywebkitgtk (which I've been using a lot lately as an embedded browser) and Selenium.