Load any url content and follow XPATH in JS - javascript

What I would like to do is load a page, get the content of some element through an XPath, a selector, or a JS path, and then use that value in my program. How could I do that?
For instance, on this page, making a request to the page's URL and following this path (while also targeting the type somehow; here it is the id):
//*[@id="question-header"]/h1/a
Would give me 'Load any url content and follow XPATH in JS'
As I am getting the text inside this:
Load any url content and follow XPATH in JS

If you need the most reliable way to get data from a web page, including data generated by JavaScript running on the client side, you can drive a headless browser. For example, the task described can be done with Node.js and puppeteer using the script below. Selectors and XPath are both supported, as is the whole Web API, by evaluating code fragments in the browser context and passing data between the Node.js and browser contexts:
'use strict';
const puppeteer = require('puppeteer');
(async function main() {
try {
const browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto('https://stackoverflow.com/questions/54847748/load-any-url-content-and-follow-xpath-in-js');
const data = await page.evaluate(() => {
return document.querySelector('#question-header > h1 > a').innerText;
});
console.log(data);
await browser.close();
} catch (err) {
console.error(err);
}
})();
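Since the question asks about XPath specifically, here is a minimal sketch of the same lookup done with an XPath expression instead of a CSS selector. It assumes the standard DOM `document.evaluate` API is run inside the page through `page.evaluate`; the helper name `textByXPath` is made up for illustration, and `puppeteer` is required inside `main` only so the helper can be loaded without it.

```javascript
'use strict';

// Hypothetical helper: returns the trimmed text of the first node matching
// `xpath` in the page, or null when nothing matches.
async function textByXPath(page, xpath) {
  return page.evaluate((xp) => {
    // This callback runs in the browser context, where `document` and
    // `XPathResult` are available.
    const result = document.evaluate(
      xp, document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null);
    const node = result.singleNodeValue;
    return node ? node.textContent.trim() : null;
  }, xpath);
}

async function main() {
  const puppeteer = require('puppeteer'); // assumes puppeteer is installed
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://stackoverflow.com/questions/54847748/load-any-url-content-and-follow-xpath-in-js');
    console.log(await textByXPath(page, '//*[@id="question-header"]/h1/a'));
  } finally {
    await browser.close(); // close the browser even if navigation fails
  }
}

// To run against the live page: main().catch(console.error);
```

Calling `main()` should print the question title, provided the page still uses that markup.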

Well, you could use something like
document.getElementById('question-header').children[0].children[0].href;
It's not as dynamic as XPath (the children lookups are repeated), but it should do the trick if you're facing a static structure. For Node.js there are several libraries that could do this as well, such as libxmljs or parse5 - more on this here.

Related

Await, catch and assign to variables - fetch/xhr responses node js, puppeteer

I am using Puppeteer to get page data, but unfortunately I cannot reproduce all of the requests myself.
So the question is: after opening the site, how do I get the JSON contained in the responses of all Fetch/XHR requests named v2?
As I understand it, this requires waiting for those responses.
Inspecting a request and its body and then replaying a similar request is not an option, because the body contains a code that is randomly generated each time. That is why I simply need to output the JSON responses of all requests named v2.
I am attaching a screenshot and my code. Please point me in the right direction; I will be grateful for any help!
// puppeteer-extra is a drop-in replacement for puppeteer,
// it augments the installed puppeteer with plugin functionality
import puppeteer from "puppeteer-extra";
// add stealth plugin and use defaults (all evasion techniques)
import StealthPlugin from 'puppeteer-extra-plugin-stealth'
export async function ProductAPI() {
try {
puppeteer.use(StealthPlugin())
const browser = await puppeteer.launch({headless: true});
const page = await browser.newPage();
await page.goto('here goes link for website');
const pdata = await page.content() // this just prints HTML
console.log(pdata)
await browser.close();
} catch (err) {
throw err
}
}
ProductAPI();
link for image: https://i.stack.imgur.com/ZR6T1.png
I know that the code I wrote just returns html. I'm just trying to figure out how to get the data I need, I googled for a very long time, but could not find the answer I needed.
It is very important that this runs on Node.js (JavaScript); it doesn’t matter whether it uses Puppeteer or something else.
This works!
import puppeteer from "puppeteer-extra";
import StealthPlugin from 'puppeteer-extra-plugin-stealth'
async function SomeFunction () {
puppeteer.use(StealthPlugin())
const browser = await puppeteer.launch({headless: true});
const page = await browser.newPage();
page.on('response', async (response) => {
if (response.url().includes('write_link_here')) {
console.log('XHR response received');
const HTMLdata = await response.text();
console.log(HTMLdata);
}
});
await page.goto('some_website_link');
}
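To get parsed JSON rather than raw text, and only from Fetch/XHR calls, the listener above could be narrowed along these lines. This is a sketch under assumptions: the names `isMatchingApiResponse` and `collectJson` are hypothetical, `'/v2'` is a guess at how the v2 requests appear in the URL, and `networkidle0` is just one possible way to decide when to stop waiting.

```javascript
'use strict';

// Hypothetical predicate: true for Fetch/XHR responses whose URL contains `needle`.
function isMatchingApiResponse(response, needle) {
  const type = response.request().resourceType(); // e.g. 'xhr', 'fetch', 'document'
  return (type === 'xhr' || type === 'fetch') && response.url().includes(needle);
}

// Hypothetical helper: registers a listener that parses the JSON body of
// every matching response. Attach it before navigating.
function collectJson(page, needle) {
  const results = [];
  const pending = [];
  page.on('response', (response) => {
    if (!isMatchingApiResponse(response, needle)) return;
    pending.push(
      response.json()
        .then((body) => results.push({ url: response.url(), body }))
        // The body may not be valid JSON, or may no longer be available.
        .catch((err) => console.warn(`Skipping ${response.url()}: ${err.message}`))
    );
  });
  // Resolves with the collected bodies once every body seen so far is read.
  return { settle: () => Promise.all(pending).then(() => results) };
}

async function main(url) {
  const puppeteer = require('puppeteer'); // assumes puppeteer is installed
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    const collector = collectJson(page, '/v2'); // '/v2' is an assumed URL fragment
    await page.goto(url, { waitUntil: 'networkidle0' });
    return await collector.settle();
  } finally {
    await browser.close();
  }
}
```

Requests that fire after the network has gone idle (for example on a later user action) would be missed by `main`, so the waiting condition may need adjusting for the site in question.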

Target an iFrame using Node.js/Puppeteer

Is it possible to target an iFrame when using the GUI Workflow builder in AWS Cloudwatch Synthetics?
I've set up the canary to log in to a website and redirect the page which has run successfully, but one of the elements I need to check with Node.js is within an iFrame which isn't being recognised.
This is the iframe code. It loads from Javascript, but all content is from the same domain:
<iframe id="paramsFrame" src="empty.htm" frameborder="0" ppTabId="-1"
onload="paramsDocumentLoaded('paramsFrame', true);"></iframe>
This is the code I'm using for this section, but it's just returning a timeout error:
await synthetics.executeStep('verifyText', async function() {
const elementHandle = await page.waitForSelector('#paramsFrame');
const frame = await elementHandle.contentFrame();
await frame.waitForXPath("//div[@class='css7'][contains(text(),'Specificity')]", { timeout: 30000 });
})
This code is trying to target a div with class css7 found within an iframe with id paramsFrame
Edit: I did a null check on frame and it came back as not null, not sure if that is relevant.
I also tried to target an element directly:
const next = await frame.waitForSelector('.protocol-name-link');
but I got the error message:
TimeoutError: waiting for selector .protocol-name-link
If the iframe is on a different origin (e.g. a different domain), you may not be able to access it through Puppeteer.
You can try disabling some of the browser's security features, although this is not advised.
Specifically, you'd probably want to add these args to puppeteer.launch:
--disable-web-security
--disable-features=IsolateOrigins,site-per-process
I tried running similar code on a website that had a YouTube iframe, and I didn't need the Puppeteer launch args,
i.e.
args: [
"--disable-web-security",
"--disable-features=IsolateOrigins,site-per-process",
],
First, though, I would suggest confirming that the iframe you get is the one you need, for example by logging, debugging, or checking in the dev console.
Second, use the full XPath of the element in the frame.
Here is my code which I tried running.
const page = await browser.newPage();
console.log("open page");
await page.goto("https://captioncrusher.com/");
console.log("page opened");
// use this if you want to wait for all the requests to be done.
// await page.waitForNetworkIdle();
const elementHandle = await page.waitForSelector("iframe.yt");
const frame = await elementHandle.contentFrame();
//These both work for me
const aLink = await frame.waitForXPath("/html/body/div/div/a");
const classLink = await frame.waitForSelector(".ytp-impression-link");
await browser.close();
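One way to do the logging suggested above is to list every frame Puppeteer knows about and check that the one you expect is there. A minimal sketch; `describeFrames` and `findFrameByUrl` are hypothetical helper names, and `'empty.htm'` in the usage comment is taken from the question's iframe markup.

```javascript
'use strict';

// Hypothetical helper: summarizes every frame currently attached to the page.
function describeFrames(page) {
  return page.frames().map((frame) => ({
    url: frame.url(),
    isMain: frame.parentFrame() === null, // the main frame has no parent
  }));
}

// Hypothetical helper: first frame whose URL contains `fragment`, or null.
function findFrameByUrl(page, fragment) {
  return page.frames().find((frame) => frame.url().includes(fragment)) || null;
}

// Usage inside an existing script, after the page has loaded:
//   console.table(describeFrames(page));
//   const frame = findFrameByUrl(page, 'empty.htm');
```

If the frame's content is swapped in by JavaScript after load, the URL listed here may differ from the `src` attribute in the markup, which is itself a useful thing to find out.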

Multiple separate browser with one tab each - simultaneous interaction with elements on pages (puppeteer headless)

Using Node.js, Chrome and puppeteer in headless mode on an Ubuntu server, I'm scraping a few different websites. One of the occasional tasks is to interact with the loaded page (click on a link to open another page and then possibly do another click to accept the terms and such).
I can do all this just fine, but I'm trying to understand how it will work if I have multiple pages open simultaneously and am trying to interact with different loaded pages at the same time (overlapping times).
To visualize this, I'm thinking how a user will do the same job. They'll have to open multiple browser windows, open the page and switch between them to see and then click on links.
But using puppeteer, we have a separate browser object, so we don't need to see the window or page to know where to click. We can traverse it through the browser object and then click on the desired element without looking (headless).
I'm thinking I should be able to do multiple pages at the same time as long as I have CPU and memory available to handle them.
Does anyone have any experience with puppeteer interacting with multiple websites simultaneously? Anything I need to watch out for?
This is the problem the library puppeteer-cluster (I'm the author) is addressing. It allows you to build a pool of pages (or browsers) to use and run tasks inside.
You can find several general code samples in the repository (and also on Stack Overflow). Let me address your specific use case of running different tasks with an example.
Code Sample
The following code creates two tasks:
crawl: Opens the page and extracts a URL to then start the second task
screenshot: Takes a screenshot of the extracted URL
The process is started by queuing the crawl task with the URLs.
const { Cluster } = require('puppeteer-cluster');
(async () => {
const cluster = await Cluster.launch({ // use four pages in parallel
concurrency: Cluster.CONCURRENCY_PAGE,
maxConcurrency: 4,
});
// We define two tasks
const crawl = async ({ page, data: url }) => {
await page.goto(url);
const extractedURL = /* ... */; // extract an URL (or multiple) from the document somehow
cluster.queue(extractedURL, screenshot);
};
const screenshot = async ({ page, data: url }) => {
await page.goto(url);
await page.screenshot();
};
// Crawl some pages
cluster.queue('https://www.google.com/', crawl);
cluster.queue('https://github.com/', crawl);
// Wait until everything is done and close the cluster
await cluster.idle();
await cluster.close();
})();
This is a minimal example. I left out error handling, monitoring and the setup options.
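For the error handling left out above, one possible starting point is the cluster's `taskerror` event together with the `retryLimit` and `timeout` launch options. This is only a sketch: `trackTaskErrors` is a hypothetical helper, and the option values are arbitrary.

```javascript
'use strict';

// Hypothetical helper: records failed tasks instead of letting them pass silently.
function trackTaskErrors(cluster, failures = []) {
  // puppeteer-cluster emits 'taskerror' when a queued task throws.
  cluster.on('taskerror', (err, data) => {
    failures.push({ data, message: err.message });
    console.error(`Task failed for ${data}: ${err.message}`);
  });
  return failures;
}

async function main() {
  const { Cluster } = require('puppeteer-cluster'); // assumes it is installed
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_PAGE,
    maxConcurrency: 4,
    retryLimit: 1,  // retry a failed task once (arbitrary value)
    timeout: 30000, // per-task timeout in milliseconds (arbitrary value)
  });
  const failures = trackTaskErrors(cluster);

  // Default task for every queued URL.
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    await page.screenshot();
  });

  cluster.queue('https://github.com/');
  await cluster.idle();
  await cluster.close();
  return failures; // inspect this to see which URLs failed
}
```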
I can usually get 5 or so browsers going on a 4GB server. If you're just popping URLs off a queue, it's pretty straightforward:
const puppeteer = require('puppeteer');
let queue = [
'http://www.amazon.com',
'http://www.google.com',
'http://www.facebook.com',
'http://www.reddit.com',
]
const doQueue = async () => {
const browser = await puppeteer.launch()
const page = await browser.newPage()
let url
while(url = queue.shift()){
await page.goto(url)
console.log(await page.title())
}
await browser.close()
}
[1,2,3].map(() => doQueue())
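Note that the snippet above starts the three workers without waiting for them. A variant that waits for all workers and keeps going after a failed URL might look like this. It is a sketch, not the answerer's code: `drainQueue` is a hypothetical helper, and for brevity it uses one browser with several pages rather than several browsers.

```javascript
'use strict';

// Runs `workerCount` workers that pull items off the shared queue until it
// is empty. `handler` is called once per item; a thrown error is recorded
// for that item instead of stopping the worker.
async function drainQueue(queue, workerCount, handler) {
  const results = [];
  const worker = async () => {
    let item;
    while ((item = queue.shift()) !== undefined) {
      try {
        results.push({ item, value: await handler(item) });
      } catch (err) {
        results.push({ item, error: err.message });
      }
    }
  };
  // Start all workers and wait until every one has finished.
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results;
}

async function main(urls) {
  const puppeteer = require('puppeteer'); // assumes puppeteer is installed
  const browser = await puppeteer.launch();
  try {
    return await drainQueue([...urls], 3, async (url) => {
      const page = await browser.newPage();
      try {
        await page.goto(url);
        return await page.title();
      } finally {
        await page.close();
      }
    });
  } finally {
    await browser.close();
  }
}
```

The number of workers is the knob to turn when memory or CPU becomes the limit.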

uploading a file using puppeteer browserWSEndpoint

I am trying to upload a file using puppeteer and browserWSEndpoint, the error message I am getting is
"Uncaught (in promise) Error: File chooser handling does not work with multiple connections to the same page".
Here is my code:
const puppeteer = require('puppeteer');
async function getTest() {
const browser = await puppeteer.connect({
browserWSEndpoint: 'wss://chrome.browserless.io'
});
const page = (await browser.pages())[0];
await page.goto('https://someWebSite');
//DO STUFF
console.log("before upload"); //code runs until here
const [fileChooser] = await Promise.all([page.waitForFileChooser(),page.click('#uploadTrigger'),]);
await fileChooser.accept(['C:\\myProgram\\pic.jpg']);
await page.click('#edit-submit');
}
getTest().then(console.log);
I must mention that if I don't use browserWSEndpoint, and use this code at the beginning instead, everything works fine.
const browser = await puppeteer.launch({headless: false, defaultViewport:null});
Honestly, I am pretty lost with browserWSEndpoint. I used info from the post "How to run Puppeteer code in any web browser?",
which led me to browserless.io; I copied the code and it works.
Now this is my precise question: my error says it does not work with multiple connections to the same page. How exactly am I making multiple connections? Maybe I can resolve this issue and then I could use const [fileChooser].
My main issue is that I need to upload a file using browserless.
Others seem to have the same problem according to https://github.com/GoogleChrome/puppeteer/issues/4783, but using Chromium is not an option if I want to use browserless.
If you are the only client connected to that browser, you must be connected to a browser that doesn't support the file chooser. You should connect to Chromium 77.0.3844.0 (r674921) or higher.

Using page.getMetrics() to get page load time in puppeteer

I am trying to use puppeteer to measure how fast a set of web sites loads in my environment. My focus is on the quality of the network connection and network speed, so I am happy to know the time taken for a page to load, for a layman's definition of load: when all images and HTML have been downloaded by the browser.
By using puppeteer I can run the test repeatedly and measure the difference in load times precisely.
I can see that in 64.0.3240.0 (r508693) page.getMetrics and event: 'metrics' have landed, which should help me in getting what I am looking for.
But being a newbie in node and js I am not sure how to read the page.getMetrics and which of the different key/value pairs give a useful information in my context.
My current pathetic attempt at reading metrics is as follows:
const puppeteer = require('puppeteer');
async function run() {
const browser = await puppeteer.launch({args: ['--no-sandbox', '--disable-setuid-sandbox']});
const page = await browser.newPage();
page.on('load', () => console.log("Loaded: " + page.url()));
await page.goto('https://google.com');
const metrics = page.getMetrics();
console.log(metrics.Documents, metrics.Frames, metrics.JSEventListeners);
await page.goto('https://yahoo.com');
await page.goto('https://bing.com');
await page.goto('https://github.com/login');
browser.close();
}
run();
Any help in getting this code to something more respectable is much appreciated :)
In recent versions you have page.metrics() available.
It returns an object with a number of values, including:
The timestamp when the metrics sample was taken
Combined durations of all page layouts
Combined duration of all tasks performed by the browser.
Check out the docs for the full list
You can use it like this:
await page.goto('https://github.com/login');
const gitMetrics = await page.metrics();
console.log(gitMetrics.Timestamp)
console.log(gitMetrics.TaskDuration)
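If all you need is a layman's load time, a plain wall-clock measurement around `page.goto` may be enough. The sketch below is an assumption-laden alternative to reading `page.metrics()`: `timeLoad` and `summarize` are hypothetical helpers, and the timing includes Puppeteer's own overhead and whatever the browser has cached.

```javascript
'use strict';

// Wall-clock time in milliseconds for page.goto() to resolve on the 'load' event.
async function timeLoad(page, url) {
  const start = Date.now();
  await page.goto(url, { waitUntil: 'load' });
  return Date.now() - start;
}

// Summarizes repeated samples so that runs can be compared.
function summarize(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  return { min: sorted[0], max: sorted[sorted.length - 1], mean };
}

async function main(urls, runs = 3) {
  const puppeteer = require('puppeteer'); // assumes puppeteer is installed
  const browser = await puppeteer.launch();
  try {
    for (const url of urls) {
      const samples = [];
      for (let i = 0; i < runs; i++) {
        const page = await browser.newPage(); // fresh page for each run
        samples.push(await timeLoad(page, url));
        await page.close();
      }
      console.log(url, summarize(samples));
    }
  } finally {
    await browser.close();
  }
}
```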
