Extract all CSS with Puppeteer? - javascript

I am performing some analysis on website complexity. What is the best way to extract all CSS (external stylesheets, <style> tags, and inline CSS), for all nodes in a web page, using headless Chrome/Puppeteer?
I'm ideally looking for the compiled CSS, in a format similar to the "Styles" tab in the Chrome DevTools.

You ask for two different things:
Scraping
For scraping static HTML in Node.js, the cheerio package is usually the better fit.
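For example, a minimal sketch of parsing markup you already have as a string (the markup and selector here are made up for illustration):

const cheerio = require('cheerio');

const $ = cheerio.load('<h2 class="title">Hello world</h2>');
console.log($('h2.title').text()); // "Hello world"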
Sniffing network requests
If you want to capture the CSS files the page requests, you can do something like this:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Log the body of every stylesheet response as it arrives
  page.on('response', async response => {
    if (response.request().resourceType() === 'stylesheet') {
      console.log(await response.text());
    }
  });

  await page.goto('https://myurl.com');
  await browser.close();
})();
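That only covers external stylesheet responses, though. To also pick up <style> tags and inline style attributes, as the question asks, one option is to read them out of the rendered DOM with page.evaluate. This is a rough, untested sketch (wrapping each inline style in its element's tag name is just for readability):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://myurl.com');

  const embeddedCss = await page.evaluate(() => {
    // Contents of every <style> tag in the document
    const styleTags = [...document.querySelectorAll('style')].map(s => s.textContent);
    // Inline style="" attributes, wrapped in the element's tag name for readability
    const inlineStyles = [...document.querySelectorAll('[style]')].map(
      el => `${el.tagName.toLowerCase()} { ${el.getAttribute('style')} }`
    );
    return styleTags.concat(inlineStyles).join('\n');
  });

  console.log(embeddedCss);
  await browser.close();
})();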

Related

chrome extract coverage report as part of build process

In Chrome you can generate a coverage report of the CSS and JS actually used. How can I get this report during my build process in order to automatically split and load my files accordingly? (A coverage report without any interactions with the website.)
I don't think this practice is useful at all; I'll explain why:
If you are using dynamic content, this approach will break the styling of your dynamic elements, or the dynamic elements themselves.
The best way I've found to deal with this is to use Parcel as the bundler and wire the JS and SCSS for each component into the .html of each view. This creates a "virtual tree" of styles for each view and injects the corresponding JS and CSS into each one. Of course you may still end up with some extra CSS and JS, depending on your needs and your practices.
For dynamic content: if a view can create an element dynamically, you'll need to preload that element's JS and CSS alongside the rest, even if the user never interacts with the element that triggers the dynamic one.
For static content: if you're working with static content, then Parcel with CSS and JS written specifically for each view is the way to go.
There's not much point in scraping the coverage report and extracting what you need after the fact, since you can follow these practices at development time.
If you want to do it anyway, you can with Puppeteer (npm i puppeteer --save):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Raw DevTools Protocol commands are sent using `client.send()`.
  // First off, enable the "domains" needed for the commands we care about.
  const client = await page.target().createCDPSession();
  await client.send('Page.enable');
  await client.send('DOM.enable');
  await client.send('CSS.enable');

  // Remember which stylesheets are inline so they can be skipped later
  const inlineStylesheetIndex = new Set();
  client.on('CSS.styleSheetAdded', stylesheet => {
    const { header } = stylesheet;
    if (header.isInline || header.sourceURL === '' || header.sourceURL.startsWith('blob:')) {
      inlineStylesheetIndex.add(header.styleSheetId);
    }
  });

  // Start tracking CSS coverage
  await client.send('CSS.startRuleUsageTracking');
  await page.goto(`http://localhost`);
  // const content = await page.content();
  // console.log(content);

  const rules = await client.send('CSS.takeCoverageDelta');
  const usedRules = rules.coverage.filter(rule => rule.used);

  const slices = [];
  for (const usedRule of usedRules) {
    // console.log(usedRule.styleSheetId)
    if (inlineStylesheetIndex.has(usedRule.styleSheetId)) {
      continue;
    }
    const stylesheet = await client.send('CSS.getStyleSheetText', {
      styleSheetId: usedRule.styleSheetId
    });
    slices.push(stylesheet.text.slice(usedRule.startOffset, usedRule.endOffset));
  }

  console.log(slices.join(''));
  await page.close();
  await browser.close();
})();
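If you want this as part of a build step, you would typically write the used CSS to a file instead of logging it. A small, assumed addition (the output file name is arbitrary):

// at the top of the script:
const fs = require('fs');

// and instead of console.log(slices.join('')):
fs.writeFileSync('used.css', slices.join(''));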

Capture HTML canvas of third-party site as image, from command line

I know one can use tools such as wget or curl to perform HTTP requests from the command line, or use HTTP client requests from various programming languages. These tools also support fetching images or other files that are referenced in the HTML code.
What I'm searching for is a mechanism that also executes the JavaScript of that web page which renders an image into an HTML canvas. I then want to extract that rendered image as an image file. The goal is to grab a time series of those images, e.g. weather maps or other diagrams that plot time-variant data onto a constant DOM object, via a cron job.
I'd prefer a solution that works from a script. How could this be done?
You can use Puppeteer to load the page inside a headless Chrome instance:
Open the page and wait for it to load.
Using page.evaluate, return the data URL of the canvas.
Convert the data URL to a buffer and write the result to a file.
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://games.novatoz.com/jigsaw-puzzle');

  // Give the page some time to draw, then read the canvas as a data URL
  const dataUrl = await page.evaluate(async () => {
    const sleep = (time) => new Promise((resolve) => setTimeout(resolve, time));
    await sleep(5000);
    return document.getElementById('canvas').toDataURL();
  });

  // Strip the "data:image/png;base64," prefix and decode the rest
  const data = Buffer.from(dataUrl.split(',').pop(), 'base64');
  fs.writeFileSync('image.png', data);
  await browser.close();
})();
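The fixed five-second sleep is fragile. If the canvas takes a different amount of time to draw, an alternative is to poll with page.waitForFunction until the canvas looks ready; the 'canvas' id and the readiness check below are assumptions about the target page:

// Alternative to the fixed sleep, placed after page.goto():
await page.waitForFunction(() => {
  const canvas = document.getElementById('canvas');
  return canvas && canvas.width > 0 && canvas.height > 0;
}, { timeout: 30000 });
const dataUrl = await page.evaluate(() => document.getElementById('canvas').toDataURL());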

Way to scrape a JS-Rendered page?

I'm currently scraping a list of URLs on my site using the request-promise npm module.
This works well for what I need; however, I'm noticing that not all of my divs are appearing, because some are rendered after the fact with JS. I know I can't run that JS code remotely to force the render, but is there any way to scrape the pages only after those elements have been added?
I'm doing this currently with Node, and would prefer to keep using Node if possible.
Here is what I have:
const urls = ['fake.com/link-1', 'fake.com/link-2', 'fake.com/link-3'];

urls.forEach(url => {
  request(url)
    .then(function (html) {
      // get dummy dom
      const d_dom = new JSDOM(html);
      // ....
    });
});
Any thoughts on how to accomplish this? Or is there currently an alternative to Selenium as an npm module?
You will want to use Puppeteer, a headless Chrome browser driver (maintained by the Chrome team at Google), for loading and parsing dynamic web pages.
Use page.goto() to go to a specific page, then use page.content() to get the HTML content of the rendered page.
Here is an example of how to use it:
const { JSDOM } = require("jsdom");
const puppeteer = require('puppeteer');

const urls = ['fake.com/link-1', 'fake.com/link-2', 'fake.com/link-3'];

urls.forEach(async url => {
  let dom = new JSDOM(await makeRequest(url));
  console.log(dom.window.document.title);
});

async function makeRequest(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  let html = await page.content();
  await browser.close();
  return html;
}
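Note that the forEach version launches a separate browser for every URL and doesn't process them in any particular order. If that becomes a problem, a variation that reuses one browser and walks the URLs sequentially could look like this (same made-up URLs as above):

const { JSDOM } = require("jsdom");
const puppeteer = require('puppeteer');

(async () => {
  const urls = ['fake.com/link-1', 'fake.com/link-2', 'fake.com/link-3'];
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Visit each URL in turn, reusing the same browser and page
  for (const url of urls) {
    await page.goto(url);
    const dom = new JSDOM(await page.content());
    console.log(dom.window.document.title);
  }

  await browser.close();
})();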

How can you run an html file with its javascript content on Linux terminal?

I am working on a website crawler bot which extracts a specific information from them.
I need to run at least the "on document ready" JavaScript of an HTML file, so that the content is generated and I can get it.
How can I do this? I've seen a command called "rhino", but it seems to be for .js files only, and the file is an HTML file. It includes both HTML and JS inside, as you can guess.
The plan is:
Download HTML files, edit their "on document ready" JS functions, get the output, pass on to the next one, repeat.
You can try a library that drives a headless browser.
Here is an example of how something similar can be done with GoogleChrome/puppeteer. If this does not work for you, please elaborate on your task and the issues you run into.
'use strict';

const puppeteer = require('puppeteer');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();
    await page.goto('https://example.org/', { waitUntil: 'domcontentloaded' });

    const data = await page.evaluate(() => {
      return document.title;
    });

    console.log(data);
    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
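Since you are working with downloaded HTML files rather than live URLs, you can point page.goto at a file:// URL (or feed the markup to page.setContent). A rough sketch, where 'downloaded.html' is a placeholder for one of your files:

'use strict';

const path = require('path');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const [page] = await browser.pages();

  // 'downloaded.html' is a placeholder; resolve it to an absolute file:// URL
  await page.goto('file://' + path.resolve('downloaded.html'), { waitUntil: 'domcontentloaded' });

  // Whatever the "on document ready" scripts generated is now in the DOM
  console.log(await page.content());

  await browser.close();
})();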

Extract text from a font tag on nodeJs

I'm using Cheerio to extract information from the HTML code of different web pages.
However, there is a website where the text that I want to extract is included in a script tag; therefore that piece of content wasn't accessible with Cheerio's methods.
So, looking for a solution, I found that it is possible to run that script using Puppeteer, a Node API for driving a Chrome instance.
Using it (probably not in the best way, since I only discovered it a few days ago), I finally obtained the HTML code that I need.
Unfortunately I am not able to extract the information that I need.
This is the HTML code from which I want to extract the data:
<h2 class="property-price">
  <a href="blablabla">
    <strong>
      <font style="vertical-align: inherit;">
        <font style="vertical-align: inherit;">Text that I wanna extract</font>
      </font>
      <small></small>
    </strong>
  </a>
</h2>
This, instead, is the code that I used to try to extract the text, without success:
var cheerio = require("cheerio");
const puppeteer = require('puppeteer');
var $;

const POST_LINK_SELECTOR = 'div.property-title';

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('myUrl', {
    timeout: 0
  });

  // Grab the rendered HTML and hand it to Cheerio
  const renderedContent = await page.content();
  $ = cheerio.load(renderedContent);
  console.log($('h2.property-price').find('font').children().text());

  await browser.close();
})();
I'm sure that this is not the best way to obtain the text data that I need, so if you have any suggestions I will happily accept them.
Furthermore, I would like to know whether it is possible to extract what I need using the Puppeteer API directly, or whether I need Cheerio (as in my case, which doesn't work anyway).
Thank you
You can get the needed data with Puppeteer alone, with the help of the page.evaluate method:
(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('myUrl', { waitUntil: "networkidle0" });

  // Read the link text directly from the live DOM
  const text = await page.evaluate(() => document.querySelector("h2.property-price a").textContent.trim());
  console.log(text);

  await browser.close();
})();
If you'd like to keep using Cheerio's jQuery-like syntax, that can be done too: just add jQuery to the page (if the site doesn't use it already):
await page.goto(...);
await page.addScriptTag({url: 'https://code.jquery.com/jquery-3.2.1.min.js'});
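Once jQuery is injected, the familiar $ syntax is available inside page.evaluate, so a selector like the one from the question can be reused there; a small sketch:

// After injecting jQuery, $ works inside page.evaluate much like Cheerio's $:
const text = await page.evaluate(() => $('h2.property-price a').text().trim());
console.log(text);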
