in chrome you can generate a coverage report of the css and js used, how can i get this report during my build process in order to automatically split and load my files accordingly? (coverage report without any interactions with the website).
Don't mean on this practice as usefull at all, i'll explain:
if you are using dynamic content this will break the styling of your dynamic elements or your dynamic elements itself.
The best way i found to deal with this is using parcel js as package manager and properly set your js and scss into each .html for each component on your view. This will create a "vitual tree" of styles for each view and will create an insert a js and css on each view. Of course you may get some extra css and js, depending on your needs and your best practices.
For dynamic content: if you are in a view that can create dynamically an element, you'll need to preload its js and css alongside with the rest of css and js even if the user does not interact with the element that will create this other element dynamically.
For static content: If you're working with static content, then parcel.js and write your css and js specifically for each view will be the way to go.
There's no much point on scrape the coverage and take what you need now, as you can work with best practices in development time.
If you want to do that anyway, you can do this with puppeteer (npm i puppeteer --save).
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage()
//Start sending raw DevTools Protocol commands are sent using `client.send()`
//First off enable the necessary "Domains" for the DevTools commands we care about
const client = await page.target().createCDPSession()
await client.send('Page.enable')
await client.send('DOM.enable')
await client.send('CSS.enable')
const inlineStylesheetIndex = new Set();
client.on('CSS.styleSheetAdded', stylesheet => {
const { header } = stylesheet
if (header.isInline || header.sourceURL === '' || header.sourceURL.startsWith('blob:')) {
inlineStylesheetIndex.add(header.styleSheetId);
}
});
//Start tracking CSS coverage
await client.send('CSS.startRuleUsageTracking')
await page.goto(`http://localhost`)
// const content = await page.content();
// console.log(content);
const rules = await client.send('CSS.takeCoverageDelta')
const usedRules = rules.coverage.filter(rule => {
return rule.used
})
const slices = [];
for (const usedRule of usedRules) {
// console.log(usedRule.styleSheetId)
if (inlineStylesheetIndex.has(usedRule.styleSheetId)) {
continue;
}
const stylesheet = await client.send('CSS.getStyleSheetText', {
styleSheetId: usedRule.styleSheetId
});
slices.push(stylesheet.text.slice(usedRule.startOffset, usedRule.endOffset));
}
console.log(slices.join(''));
await page.close();
await browser.close();
})();
Source
Related
I'm trying to find the numbers of users on this website, using node.js, but I'm not sure of what process I can do to get it. The value is subclassed under multiple divs with different class names. Right now I will just be console.logging it but I'm going to do more with the data later on.
I was suggested to use the package "puppeteer", but I'm not sure how this would fetch it or what to do with it to fetch it.
Thank you in advance
You can use Puppeteer and obtain the value via a query string. The typical code looks like this:
const pt= require('puppeteer')
async function getText(){
//launch browser in headless mode
const browser = await pt.launch()
//browser new page
const page = await browser.newPage()
//launch URL
await page.goto('https://feds.lol/')
//identify element
const f = await page.$("QUERY_STRIG")
//obtain text
const text = await (await f.getProperty('textContent')).jsonValue()
console.log("Text is: " + text)
}
getText()
However, I see this website is protected by Captcha
Well, I would like a way to use the puppeteer and the for loop to get all the links on the site and add them to an array, in this case the links I want are not links that are in the html tags, they are links that are directly in the source code, javascript file links etc... I want something like this:
array = [ ]
for(L in links){
array.push(L)
//The code should take all the links and add these links to the array
}
But how can I get all references to javascript style files and all URLs that are in the source code of a website?
I just find a post and a question that teaches or shows how it gets the links from the tag and not all the links from the source code.
Supposing you want to get all the tags on this page for example:
view-source:https://www.nike.com/
How can I get all script tags and return to console? I put view-source:https://nike.com because you can get the script tags, I don't know if you can do it without displaying the source code, but I thought about displaying and getting the script tag because that was the idea I had, however I do not know how to do it
It is possible to get all links from a URL using only node.js, without puppeteer:
There are two main steps:
Get the source code for the URL.
Parse the source code for links.
Simple implementation in node.js:
// get-links.js
///
/// Step 1: Request the URL's html source.
///
axios = require('axios');
promise = axios.get('https://www.nike.com');
// Extract html source from response, then process it:
promise.then(function(response) {
htmlSource = response.data
getLinksFromHtml(htmlSource);
});
///
/// Step 2: Find links in HTML source.
///
// This function inputs HTML (as a string) and output all the links within.
function getLinksFromHtml(htmlString) {
// Regular expression that matches syntax for a link (https://stackoverflow.com/a/3809435/117030):
LINK_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&//=]*)/gi;
// Use the regular expression from above to find all the links:
matches = htmlString.match(LINK_REGEX);
// Output to console:
console.log(matches);
// Alternatively, return the array of links for further processing:
return matches;
}
Sample usage:
$ node get-links.js
[
'http://www.w3.org/2000/svg',
...
'https://s3.nikecdn.com/unite/scripts/unite.min.js',
'https://www.nike.com/android-icon-192x192.png',
...
'https://connect.facebook.net/',
... 658 more items
]
Notes:
I used the axios library for simplicity and to avoid "access denied" errors from nike.com. It is possible to use any other method to get the HTML source, like:
Native node.js http/https libraries
Puppeteer (Get complete web page source html with puppeteer - but some part always missing)
Although the other answers are applicable in many situations, they will not work for client-side rendered sites. For instance, if you just do an Axios request to Reddit, all you'll get is a couple of divs with some metadata. As Puppeteer actually gets the page and parses all JavaScript in a real browser, the websites' choice of document rendering becomes irrelevant for extracting page data.
Puppeteer has an evaluate method on the page object which allows you to run JavaScript directly on the page. Using that, you easily extract all links as follows:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
const pageUrls = await page.evaluate(() => {
const urlArray = Array.from(document.links).map((link) => link.href);
const uniqueUrlArray = [...new Set(urlArray)];
return uniqueUrlArray;
});
console.log(pageUrls);
await browser.close();
})();
yes you can get all the script tags and their links without opening view source.
You need to add dependency for jsdom library in your project and then pass the HTML response to its instance like below
here is the code:
const axios = require('axios');
const jsdom = require("jsdom");
// hit simple HTTP request using axios or node-fetch as you wish
const nikePageResponse = await axios.get('https://www.nike.com');
// now parse this response into a HTML document using jsdom library
const dom = new jsdom.JSDOM(nikePageResponse.data);
const nikePage = dom.window.document
// now get all the script tags by querying this page
let scriptLinks = []
nikePage.querySelectorAll('script[src]').forEach( script => scriptLinks.push(script.src.trim()));
console.debug('%o', scriptLinks)
Here I have made CSS selector for <script> tags that have src attribute inside them.
You can write same code in using puppeteer, but it will take time opening the browser and everything and then getting its pageSource.
you can use this to find the links and then do whatever you want to use with them using puppeteer or anything.
I'm trying pull down price and quantity data from a Chiefs game on ticketmaster to see how quantities have changed with COVID but there's no easy way to get the data off the screen (API not an option since I need seat/row breakout).
I tried using puppeteer and a headless browser (code below) but was blocked by the site (they recognized it's a bot). I stumbled across the option to right click and "view page source" and can see everything in json -- which is great because I can parse this into some structured format.
I'd like to get some more of this data with regularity and without the ability to pull by linking into the html through puppeteer, would pulling from the "page source" be my best option? And is there a way I could automate a copy/paste from the "page source" if puppeteer is already blocked?
I'm relatively new with javascript so any help would be appreciated!!
const puppeteer = require('puppeteer')
async function getTix(){
const browser = await puppeteer.launch({
headless: false,
defaultViewport: null
});
const page = await browser.newPage();
const url = 'https://www.ticketmaster.com/kansas-city-chiefs-vs-las-vegas-kansas-city-missouri-10-11-2020/event/060058FEB7590E10';
await page.goto(url);
await page.waitForSelector(".quick-picks__list-item");
const results = await page.$$eval(".quick-picks__list-item", rows => {
return rows.map(row => {
const properties = {};
const titleElement = row.querySelector(".result-title");
properties.title = titleElement.innerText;
return properties;
});
})
console.log(results);
}
getTix();
I'm currently scraping a list of URLs on my site using the request-promise npm module.
This works well for what I need, however, I'm noticing that not all of my divs are appearing because some are rendered after the fact with JS. I know I can't run that JS code remotely to force the render, but is there any ways to be able to scrape the pages only after those elements are added in?
I'm doing this currently with Node, and would prefer to keep using Node if possible.
Here is what I have:
const urls ['fake.com/link-1', 'fake.com/link-2', 'fake.com/link-3']
urls.forEach(url => {
request(url)
.then(function(html){
//get dummy dom
const d_dom = new JSDOM(html);
....
}
});
Any thoughts on how to accomplish this? Or if there is currently an alternative to Selenium as an npm module?
You will want to use puppeteer which is a Chrome headless browser (owned and maintained by Chrome/Google) for loading and parsing dynamic web pages.
Use page.goto() to goto a specific page, then use page.content() to load the html content from the rendered page.
Here is an example of how to use it:
const { JSDOM } = require("jsdom");
const puppeteer = require('puppeteer')
const urls = ['fake.com/link-1', 'fake.com/link-2', 'fake.com/link-3']
urls.forEach(async url => {
let dom = new JSDOM(await makeRequest(url))
console.log(dom.window.document.title)
});
async function makeRequest(url) {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
let html = await page.content()
await browser.close();
return html
}
I am performing some analysis on website complexity. What is the best way to extract all CSS (external stylesheets, <style> tags, and inline CSS), for all nodes in a web page, using headless Chrome/Puppeteer?
I'm ideally looking for compiled CSS, in similar format to the "Styles" tab in the Chrome dev-tools.
You ask for two different things:
Scraping
For web scraping in nodejs better use cheerio package.
Sniffing network requests
If you want to get css files requested you can go something like:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
page.on('response',async response => {
if(response.request().resourceType() === 'stylesheet') {
console.log(await response.text());
}
});
await page.goto('https://myurl.com');
await browser.close();
})();