Way to scrape a JS-Rendered page? - javascript

I'm currently scraping a list of URLs on my site using the request-promise npm module.
This works well for what I need; however, I'm noticing that not all of my divs are appearing because some are rendered after the fact with JS. I know I can't run that JS code remotely to force the render, but is there any way to scrape the pages only after those elements are added in?
I'm doing this currently with Node, and would prefer to keep using Node if possible.
Here is what I have:
const request = require('request-promise');
const { JSDOM } = require('jsdom');

const urls = ['fake.com/link-1', 'fake.com/link-2', 'fake.com/link-3'];

urls.forEach(url => {
  request(url)
    .then(function (html) {
      // get dummy dom
      const d_dom = new JSDOM(html);
      // ...
    });
});
Any thoughts on how to accomplish this? Or is there currently an alternative to Selenium as an npm module?

You will want to use puppeteer, a headless Chrome library for Node (maintained by the Chrome DevTools team at Google), to load and parse dynamic web pages.
Use page.goto() to navigate to a specific page, then use page.content() to get the HTML content of the rendered page.
Here is an example of how to use it:
const { JSDOM } = require("jsdom");
const puppeteer = require('puppeteer');

const urls = ['fake.com/link-1', 'fake.com/link-2', 'fake.com/link-3'];

urls.forEach(async url => {
  let dom = new JSDOM(await makeRequest(url));
  console.log(dom.window.document.title);
});

async function makeRequest(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  let html = await page.content();
  await browser.close();
  return html;
}
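One thing to note: forEach does not await its async callback, so the requests above all run in parallel, and each one launches its own browser. A minimal variant (same placeholder URLs) that reuses a single browser instance and walks the URLs sequentially:

const { JSDOM } = require("jsdom");
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  for (const url of ['fake.com/link-1', 'fake.com/link-2', 'fake.com/link-3']) {
    await page.goto(url);
    // Parse the fully rendered HTML with JSDOM, as before
    const dom = new JSDOM(await page.content());
    console.log(dom.window.document.title);
  }
  await browser.close();
})();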

Related

How to get the content of a div tag when scraping with puppeteer and NodeJs

I heard of this library called puppeteer and its usefulness in scraping web pages, so I decided to scrape a gaming site's content so I can store its data and go through it later.
But after I copied the XPath of the div tag whose content I want puppeteer to scrape, it's returning an empty string. What am I doing wrong?
The URL I'm trying to scrape is the one passed to scrapeData below.
I want to scrape the div tag where the results of the 6 different colored balls are displayed, so I can get the numbers of those colors every 45 seconds.
const puppeteer = require("puppeteer");
async function scrapeData(url){
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
const [dataReceived] = await page.$x('/html/body/div[1]/div/div/div/footer/div[2]/div[1]/div/div[1]/div[2]/div/div');
const elContent = await dataReceived.getProperty('textContent');
const elValue = await elContent.jsonValue();
console.log({elValue});
//console.log(elContent);
//console.log(dataReceived)
browser.close();
}
scrapeData("https://logigames.bet9ja.com/Games/Launcher?gameId=11000&provider=0&sid=&pff=1&skin=201");
console.log("just testing");
Rather than using page.$x here, you could use a simpler selector, which would be less brittle. Try page.$('.ball-value'), or possibly page.waitForSelector('.ball-value') to deal with transition times. Testing on that page, a simpler selector seems to work. If you want to get all the ball values rather than just the first one, there's page.$$ (the handle-based equivalent of document.querySelectorAll, so it returns an array of element handles).
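For example, a minimal sketch of that approach, assuming the .ball-value selector above really does match the rendered balls:

const puppeteer = require("puppeteer");

async function scrapeData(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Wait for the dynamically rendered balls before reading them
  await page.waitForSelector('.ball-value');

  // $$eval runs querySelectorAll in the page and maps over the matches
  const values = await page.$$eval('.ball-value', els => els.map(el => el.textContent.trim()));
  console.log(values);

  await browser.close();
}

scrapeData("https://logigames.bet9ja.com/Games/Launcher?gameId=11000&provider=0&sid=&pff=1&skin=201");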

Get current page url with Playwright Automation tool?

How can I retrieve the current URL of the page in Playwright?
Something similar to browser.getCurrentUrl() in Protractor?
// e.g. inside a CodeceptJS test, where Playwright is the configured helper
const { browser } = this.helpers.Playwright;
await browser.pages(); // list pages in the browser

// get current page
const { page } = this.helpers.Playwright;
const url = await page.url(); // get the url of the current page
To get the URL of the current page as a string (no await needed):
page.url()
Where "page" is an object of the Page class. You should already have a Page object, and there are various ways to instantiate it, depending on how your framework is set up: https://playwright.dev/docs/api/class-page
In TypeScript, the Page type can be imported with
import { Page } from '@playwright/test';
or a Page can be created directly:
const { webkit } = require('playwright');

(async () => {
  const browser = await webkit.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto('https://playwright.dev'); // any navigation
  console.log(page.url()); // the current URL, as a string
  await browser.close();
})();

Chrome: extract coverage report as part of build process

In Chrome you can generate a coverage report of the CSS and JS used. How can I get this report during my build process, in order to automatically split and load my files accordingly? (A coverage report without any interactions with the website.)
I don't see this practice as useful at all; let me explain:
If you are using dynamic content, this will break the styling of your dynamic elements, or the dynamic elements themselves.
The best way I have found to deal with this is to use Parcel as a bundler and reference your JS and SCSS properly in the .html of each component in your view. This creates a "virtual tree" of styles for each view, and creates and inserts a JS and a CSS file for each one. Of course you may end up with some extra CSS and JS, depending on your needs and your best practices.
For dynamic content: if you are in a view that can create an element dynamically, you'll need to preload that element's JS and CSS alongside the rest of the CSS and JS, even if the user never interacts with the element that triggers it.
For static content: if you're working with static content, then Parcel plus writing your CSS and JS specifically for each view is the way to go.
There's not much point in scraping the coverage report and extracting what you need, since you can apply these practices at development time.
If you want to do that anyway, you can do this with puppeteer (npm i puppeteer --save).
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Raw DevTools Protocol commands are sent using `client.send()`.
  // First, enable the "domains" for the DevTools commands we care about.
  const client = await page.target().createCDPSession();
  await client.send('Page.enable');
  await client.send('DOM.enable');
  await client.send('CSS.enable');

  // Track inline stylesheets so we can skip them later
  const inlineStylesheetIndex = new Set();
  client.on('CSS.styleSheetAdded', stylesheet => {
    const { header } = stylesheet;
    if (header.isInline || header.sourceURL === '' || header.sourceURL.startsWith('blob:')) {
      inlineStylesheetIndex.add(header.styleSheetId);
    }
  });

  // Start tracking CSS coverage
  await client.send('CSS.startRuleUsageTracking');
  await page.goto(`http://localhost`);

  const rules = await client.send('CSS.takeCoverageDelta');
  const usedRules = rules.coverage.filter(rule => rule.used);

  // Collect the text of every used rule from its (non-inline) stylesheet
  const slices = [];
  for (const usedRule of usedRules) {
    if (inlineStylesheetIndex.has(usedRule.styleSheetId)) {
      continue;
    }
    const stylesheet = await client.send('CSS.getStyleSheetText', {
      styleSheetId: usedRule.styleSheetId
    });
    slices.push(stylesheet.text.slice(usedRule.startOffset, usedRule.endOffset));
  }

  console.log(slices.join(''));

  await page.close();
  await browser.close();
})();
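Puppeteer also exposes a higher-level Coverage API that handles both CSS and JS without raw CDP calls. A minimal sketch using it (http://localhost is a placeholder, as above):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Start both JS and CSS coverage before navigating
  await Promise.all([
    page.coverage.startJSCoverage(),
    page.coverage.startCSSCoverage()
  ]);
  await page.goto('http://localhost');
  const [jsCoverage, cssCoverage] = await Promise.all([
    page.coverage.stopJSCoverage(),
    page.coverage.stopCSSCoverage()
  ]);

  // Each entry has { url, text, ranges: [{ start, end }] }, where ranges are the used offsets
  for (const entry of [...jsCoverage, ...cssCoverage]) {
    const usedBytes = entry.ranges.reduce((sum, r) => sum + r.end - r.start, 0);
    console.log(`${entry.url}: ${usedBytes}/${entry.text.length} bytes used`);
  }

  await browser.close();
})();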

Capture HTML canvas of third-party site as image, from command line

I know one can use tools such as wget or curl to perform HTTP requests from the command line, or use HTTP client requests from various programming languages. These tools also support fetching images or other files that are referenced in the HTML code.
What I'm searching for is a mechanism that also executes the JavaScript of that web page, where that JavaScript renders an image into an HTML canvas. I then want to extract the rendered image as an image file. The goal is to grab a time series of those images, e.g. weather maps or other diagrams that plot time-variant data into a constant DOM object, via a cron job.
I'd prefer a solution that works from a script. How could this be done?
You can use puppeteer to load the page inside a headless Chrome instance:
Open the page and wait for it to load.
Using page.evaluate, return the data URL of the canvas.
Convert the data URL to a buffer and write the result to a file.
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://games.novatoz.com/jigsaw-puzzle');

  // Give the page time to draw, then read the canvas back as a data URL
  const dataUrl = await page.evaluate(async () => {
    const sleep = (time) => new Promise((resolve) => setTimeout(resolve, time));
    await sleep(5000);
    return document.getElementById('canvas').toDataURL();
  });

  // Strip the "data:image/png;base64," prefix and decode
  const data = Buffer.from(dataUrl.split(',').pop(), 'base64');
  fs.writeFileSync('image.png', data);

  await browser.close();
})();
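As a side note: if a screenshot of the canvas element is good enough (rather than the canvas's own backing image), Puppeteer can screenshot a single element directly. A shorter sketch, assuming the same #canvas id as above:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://games.novatoz.com/jigsaw-puzzle');
  await page.waitForSelector('#canvas');

  // Capture just the canvas element's bounding box
  const canvas = await page.$('#canvas');
  await canvas.screenshot({ path: 'image.png' });

  await browser.close();
})();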

Extract all CSS with Puppeteer?

I am performing some analysis on website complexity. What is the best way to extract all CSS (external stylesheets, <style> tags, and inline CSS), for all nodes in a web page, using headless Chrome/Puppeteer?
I'm ideally looking for compiled CSS, in similar format to the "Styles" tab in the Chrome dev-tools.
You ask for two different things:
Scraping
For web scraping in Node.js, the cheerio package is usually the better fit.
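For instance, a minimal cheerio sketch (the title selector is just illustrative; fetch is built into Node 18+):

const cheerio = require('cheerio');

(async () => {
  const res = await fetch('https://myurl.com');
  const $ = cheerio.load(await res.text());

  // Query the parsed document with jQuery-style selectors
  console.log($('title').text());
})();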
Sniffing network requests
If you want to capture the CSS files the page requests, you can do something like:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Log the body of every stylesheet response as it arrives
  page.on('response', async response => {
    if (response.request().resourceType() === 'stylesheet') {
      console.log(await response.text());
    }
  });

  await page.goto('https://myurl.com');
  await browser.close();
})();
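That captures external stylesheets fetched over the network, but not <style> tags or inline style attributes. A complementary sketch (no particular page structure assumed) that collects those with page.evaluate:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://myurl.com');

  const inlineCss = await page.evaluate(() => {
    // Contents of every <style> tag
    const styleTags = [...document.querySelectorAll('style')].map(s => s.textContent);
    // Every inline style="" attribute on any element
    const inlineStyles = [...document.querySelectorAll('[style]')].map(el => el.getAttribute('style'));
    return { styleTags, inlineStyles };
  });

  console.log(inlineCss);
  await browser.close();
})();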
