CefSharp does not marshal DOM nodes - javascript

When I execute js in CEFSharp using EvaluateScriptAsync(), I can return primitive types like string or array. For example, the following works:
var result = await Browser.EvaluateScriptAsync("Array.from(document.getElementsByTagName('input')).map(element => element.value)");
if (result.Success && result.Result != null)
{
dynamic values = result.Result;
foreach (dynamic value in values)
{
MessageBox.Show($"Value is: {value}");
}
}
But once I try to get a DOM element, either one or a list of, I get null:
var result = await Browser.EvaluateScriptAsync("Array.from(document.getElementsByTagName('input'))");
// `result.Success` is `true`, `result.Result` is `null`
I thought that CEFSharp only knows how to marshal primitive types, but object literals also work:
var result = await Browser.EvaluateScriptAsync("({ a: 1, b: 'hello' })");
if (result.Success && result.Result != null)
{
dynamic obj = result.Result;
MessageBox.Show($"{{ a: {obj.a}, b: {obj.b} }}");
}
So it turns out that CEFSharp only doesn't know how to marshal DOM objects.
Why? Is there a solution or workaround out there?

Firstly, it's important to understand that JavaScript is executed in the render process. The result of EvaluateScriptAsync is effectively a DTO: we create an object that represents the result of executing the script.
It's not currently possible to return an HTMLElement or any object that has a cyclic reference.
Taking HTMLElement as a specific example, it has a parentElement/parentNode, and the parent has children which include the node itself. Serializing it would also mean walking the whole DOM tree.
CEF has very limited type support for its CefV8Value type, so it's hard to do anything too fancy. See this.
We could potentially add an extension method that wraps the user script in an IIFE and does some instanceof HTMLElement style type checking to return a trimmed-down representation of the HTML element. See this for an example of how I'm fudging support for returning a Promise.
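As a rough illustration of that idea (a sketch only, not CefSharp's API): the script passed to EvaluateScriptAsync could map each element to a small plain object before returning it. The helper name describeElement and the fields it copies are assumptions chosen for this example.

```javascript
// Hypothetical helper (not part of CefSharp): copy a few serializable
// fields from an element into a plain object with no cyclic references.
function describeElement(el) {
  if (el === null || el === undefined) return null;
  const attributes = {};
  // el.attributes is array-like in the DOM; each entry has name/value.
  for (const attr of Array.from(el.attributes || [])) {
    attributes[attr.name] = attr.value;
  }
  return {
    tagName: el.tagName,
    id: el.id || null,
    value: 'value' in el ? el.value : null,
    attributes,
  };
}

// In the page this would be used roughly as:
//   Array.from(document.getElementsByTagName('input')).map(describeElement)
```

The function has to be included in the script string that gets evaluated, since it must exist in the page for the call to work.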

As an alternative to using JavaScript you can now use CefSharp.Dom which is an asynchronous library for accessing the DOM.
It's freely available on
// Add using CefSharp.Dom to access CreateDevToolsContextAsync and related extension methods.
await using var devToolsContext = await chromiumWebBrowser.CreateDevToolsContextAsync();
// Get element by Id
// https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector
var element = await devToolsContext.QuerySelectorAsync<HtmlElement>("#myElementId");
//Strongly typed element types (this is only a subset of the types mapped)
var htmlDivElement = await devToolsContext.QuerySelectorAsync<HtmlDivElement>("#myDivElementId");
var htmlSpanElement = await devToolsContext.QuerySelectorAsync<HtmlSpanElement>("#mySpanElementId");
var htmlSelectElement = await devToolsContext.QuerySelectorAsync<HtmlSelectElement>("#mySelectElementId");
var htmlInputElement = await devToolsContext.QuerySelectorAsync<HtmlInputElement>("#myInputElementId");
var htmlFormElement = await devToolsContext.QuerySelectorAsync<HtmlFormElement>("#myFormElementId");
var htmlAnchorElement = await devToolsContext.QuerySelectorAsync<HtmlAnchorElement>("#myAnchorElementId");
var htmlImageElement = await devToolsContext.QuerySelectorAsync<HtmlImageElement>("#myImageElementId");
var htmlTextAreaElement = await devToolsContext.QuerySelectorAsync<HtmlTextAreaElement>("#myTextAreaElementId");
var htmlButtonElement = await devToolsContext.QuerySelectorAsync<HtmlButtonElement>("#myButtonElementId");
var htmlParagraphElement = await devToolsContext.QuerySelectorAsync<HtmlParagraphElement>("#myParagraphElementId");
var htmlTableElement = await devToolsContext.QuerySelectorAsync<HtmlTableElement>("#myTableElementId");
// Get a custom attribute value
var customAttribute = await element.GetAttributeAsync<string>("data-customAttribute");
//Set innerText property for the element
await element.SetInnerTextAsync("Welcome!");
//Get innerText property for the element
var innerText = await element.GetInnerTextAsync();
//Get all child elements
var childElements = await element.QuerySelectorAllAsync("div");
//Change CSS style background colour
await element.EvaluateFunctionAsync("e => e.style.backgroundColor = 'yellow'");
//Type text in an input field
await element.TypeAsync("Welcome to my Website!");
//Click The element
await element.ClickAsync();
// Simple way of chaining method calls together when you don't need a handle to the HtmlElement
var htmlButtonElementInnerText = await devToolsContext.QuerySelectorAsync<HtmlButtonElement>("#myButtonElementId")
.AndThen(x => x.GetInnerTextAsync());
//Event Handler
//Expose a function to javascript, functions persist across navigations
//So only need to do this once
await devToolsContext.ExposeFunctionAsync("jsAlertButtonClick", () =>
{
_ = devToolsContext.EvaluateExpressionAsync("window.alert('Hello! You invoked window.alert()');");
});
var jsAlertButton = await devToolsContext.QuerySelectorAsync<HtmlButtonElement>("#jsAlertButton");
//Wire up the click event listener to call our exposed function
_ = jsAlertButton.AddEventListenerAsync("click", "jsAlertButtonClick");
//Get a collection of HtmlElements
var divElements = await devToolsContext.QuerySelectorAllAsync<HtmlDivElement>("div");
foreach (var div in divElements)
{
// Get a reference to the CSSStyleDeclaration
var style = await div.GetStyleAsync();
//Set the border to 1px solid red
await style.SetPropertyAsync("border", "1px solid red", important: true);
await div.SetAttributeAsync("data-customAttribute", "123");
await div.SetInnerTextAsync("Updated Div innerText");
}
//Using standard array
var tableRows = await htmlTableElement.GetRowsAsync().ToArrayAsync();
foreach (var row in tableRows)
{
var cells = await row.GetCellsAsync().ToArrayAsync();
foreach (var cell in cells)
{
var newDiv = await devToolsContext.CreateHtmlElementAsync<HtmlDivElement>("div");
await newDiv.SetInnerTextAsync("New Div Added!");
await cell.AppendChildAsync(newDiv);
}
}
//Get a reference to the HtmlCollection and use async enumerable
//Requires Net Core 3.1 or higher
var tableRowsHtmlCollection = await htmlTableElement.GetRowsAsync();
await foreach (var row in tableRowsHtmlCollection)
{
var cells = await row.GetCellsAsync();
await foreach (var cell in cells)
{
var newDiv = await devToolsContext.CreateHtmlElementAsync<HtmlDivElement>("div");
await newDiv.SetInnerTextAsync("New Div Added!");
await cell.AppendChildAsync(newDiv);
}
}

Related

How do I print just the first and second element of an array?

I have an array called tagline that looks like this:
[" Leger Poll", " Web survey of 2", "test", "test", "test", "test"]
It is pulled from an external CSV file, and I have assigned it to a variable named tagline.
I want to print the first and second elements using document.getElementById so that I can style the text. I am not sure why this is not working. I tried moving the variable outside of the main function so that it would be global, but it still does not work. I am a beginner coder. Here is what I have. Please help.
var tagline = [];
async function getData() {
// const response = await fetch('testdata.csv');
var response = await fetch('data/test3.csv');
var data = await response.text();
data = data.replace(/"/g, "");
var years = [];
var vals = [];
var rows = data.split('\n').slice(1);
rows = rows.slice(0, rows.length - 1);
rows = rows.filter(row => row.length !== 0)
rows.forEach(row => {
var cols = row.split(",");
years.push(cols[0]);
vals.push(0 + parseFloat(cols[1]));
tagline.push(cols[2]);
});
console.log(years, vals, tagline);
return { years, vals, tagline };
}
var res = tagline.slice(1);
document.getElementById("demo1").innerHTML = res;
var res2 = tagline.slice(2);
document.getElementById("demo2").innerHTML = res2;
It seems you defined the function getData() but never called it.
Since it is an async function, I am using then() to wait for it to finish.
var tagline = [];
async function getData() { ...// your function }
getData().then(() => {
const res = tagline[0];
document.getElementById("demo1").innerHTML = res;
const res2 = tagline[1];
document.getElementById("demo2").innerHTML = res2;
});
To access a specific index of an array use:
array[index];
In your case:
tagline[0]; //first element
tagline[1]; //second element
Since getData is async, you must await it so that tagline is filled before you read from it:
await getData(); //call it before you use the tagline array
If you are using an older version of JS which does not support async/await, you need to wait for the promise to resolve with .then().
Also, be aware:
The slice() method returns a shallow copy of a portion of an array
into a new array object selected from start to end (end not included)
where start and end represent the index of items in that array.
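To make the difference concrete, here is a small sketch using the array from the question (the trim() calls are an addition, since the sample values start with a space):

```javascript
const tagline = [" Leger Poll", " Web survey of 2", "test", "test", "test", "test"];

// Index access returns a single element.
const first = tagline[0].trim();
const second = tagline[1].trim();

// slice(1) drops the first element and returns everything after it.
const rest = tagline.slice(1);

// slice(0, 2) returns a new array holding only the first two elements.
const firstTwo = tagline.slice(0, 2);

console.log(first, "|", second);
console.log(rest.length, firstTwo.length);
```

So tagline.slice(1) in the question does not give the first element; it gives an array of everything except the first element.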

Extracting childNodes from Table.getProperties('childNodes/children')

I'm having an issue scraping a web table. I'm able to extract single table nodes by using 'firstChild' and 'lastElementChild'. My problem is that I want to extract all the child nodes (rows/cells) into a map or array so I can iterate over them and extract data in a loop.
NOTE: I'm using Puppeteer, hence the async functions.
Here is a code snippet:
const [table] = await page.$x(xpath);
const tbody = await table.getProperty('lastElementChild'); //<-- in this case tbody is lastchild
const rows = Array.from(await tbody.getProperties('childNodes')); // <-- LINE OF THE PROBLEM
const cell = await rows.getProperty('firstChild') // <-- using firstChild for testing (ideally 'childNodes' with forEach())
const data = await cell.getProperty('innerText');
const txt = await data.jsonValue();
console.log(txt);
I found another way.
Here is the solution:
const row = await page.evaluate(() => {
let row = document.querySelector('.fluid-table__row'); //<-- this refers to a HTML class
let cells = [];
row.childNodes.forEach(function(cell){
cells.push(cell.textContent)
})
return cells;
})
console.log(row);
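If you need every row rather than a single one, the same approach extends naturally. The sketch below rests on some assumptions: extractRows is a made-up name, and it relies on the standard rows and cells collections of table elements. Because it only touches its argument, it could be passed directly as the page function, e.g. page.$eval('table', extractRows), where 'table' is a placeholder selector.

```javascript
// Hypothetical helper: turn a table (or tbody) into an array of rows,
// each row being an array of trimmed cell texts.
function extractRows(table) {
  return Array.from(table.rows).map(row =>
    Array.from(row.cells).map(cell => cell.textContent.trim())
  );
}
```

The result is a plain array of string arrays, so it can be returned from the browser to Node without any handles.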

Check if element class contains string using playwright

I am trying to get the element of day 18, and check if it has disabled on its class.
<div class="react-datepicker__day react-datepicker__day--tue" aria-label="day-16" role="option">16</div>
<div class="react-datepicker__day react-datepicker__day--wed react-datepicker__day--today" aria-label="day-17" role="option">17</div>
<div class="react-datepicker__day react-datepicker__day--thu react-datepicker__day--disabled" aria-label="day-18" role="option">18</div>
this is my code, assume
this.xpath = 'xpath=.//*[contains(@class, "react-datepicker__day") and not (contains(@class, "outside-month")) and ./text()="18"]'
async isDateAvailable () {
const dayElt = await this.page.$(this.xpath)
console.log(dayElt.classList.contains('disabled')) // this should return true
I can't seem to make it work. Error says TypeError: Cannot read property 'contains' of undefined. Can you help point what I am doing wrong here?
Looks like you can just write
await expect(page.locator('.selector-name')).toHaveClass(/target-class/)
The slashes are required because /target-class/ is a RegExp.
To check several classes in one call I use this helper (the API approach doesn't work for me: https://playwright.dev/docs/test-assertions#locator-assertions-to-have-class):
async function expectHaveClasses(locator: Locator, className: string) {
// get current classes of element
const attrClass = await locator.getAttribute('class')
const elementClasses: string[] = attrClass ? attrClass.split(' ') : []
const targetClasses: string[] = className.split(' ')
// Every class should be present in the current class list
const isValid = targetClasses.every(classItem => elementClasses.includes(classItem))
expect(isValid).toBeTruthy()
}
In className you can pass several classes separated by spaces:
await expectHaveClasses(page.locator('.item'), 'class-a class-b')
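The same check can be sketched in plain JavaScript as a pure function over the class attribute string. Splitting on any whitespace (rather than a single space) is a change from the helper above, and hasAllClasses is a hypothetical name. Note that class names are matched as whole tokens, so for the markup in the question the token to look for is react-datepicker__day--disabled, not disabled.

```javascript
// Hypothetical helper: true when every class in `wanted` appears as a
// whole token in the element's class attribute string.
function hasAllClasses(classAttr, wanted) {
  const present = new Set((classAttr || '').split(/\s+/).filter(Boolean));
  return wanted.split(/\s+/).filter(Boolean).every(name => present.has(name));
}

// Class attribute of the "day 18" element from the question.
const day18 = 'react-datepicker__day react-datepicker__day--thu react-datepicker__day--disabled';

console.log(hasAllClasses(day18, 'react-datepicker__day--disabled')); // true
console.log(hasAllClasses(day18, 'disabled')); // false
```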
You have to evaluate it inside the browser. $ returns an ElementHandle, which is a wrapper around the browser DOM element, so you have to call e.g. evaluate on it. Or simply use $eval, which looks up the element and passes it into a callback that is executed inside the browser's JavaScript engine. This means something like the following would work:
// @ts-check
const playwright = require("playwright");
(async () => {
const browser = await playwright.chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();
await page.setContent(`
<div id="a1" class="foo"></div>
`)
console.log(
await page.$eval("#a1", el => el.classList.contains("foo1"))
)
await browser.close();
})();

Puppeteer: Converting circular structure to JSON Are you passing a nested JSHandle?

I am trying to scrape a one-page website. There are multiple selection combinations that result in different search redirects. I wrote a for loop in page.evaluate's callback function to click through the different selections and click the search button for each one. However, I got this error: Converting circular structure to JSON Are you passing a nested JSHandle?
Please help!
My current version of code looks like this:
const res = await page.evaluate(async (i, courseCountArr, page) => {
for (let j = 1; j < courseCountArr[i]; j++) {
await document.querySelectorAll('.btn-group > button, .bootstrap-select > button')['1'].click() // click on school drop down
await document.querySelectorAll('div.bs-container > div.dropdown-menu > ul > li > a')[`${j}`].click() // click on each school option
await document.querySelectorAll('.btn-group > button, .bootstrap-select > button')['2'].click() // click on subject drop down
const subjectLen = document.querySelectorAll('div.bs-container > div.dropdown-menu > ul > li > a').length // length of the subject drop down
for (let k = 1; k < subjectLen; k++) {
await document.querySelectorAll('div.bs-container > div.dropdown-menu > ul > li > a')[`${k}`].click() // click on each subject option
document.getElementById('buttonSearch').click() //click on search button
page.waitForSelector('.strong, .section-body')
return document.querySelectorAll('.strong, .section-body').length
}
}
}, i, courseCountArr, page);
Why the error happens
While you haven't shown enough code to reproduce the problem (is courseCountArr an array of ElementHandles? Passing page to evaluate won't work either, that's a Node object), here's a minimal reproduction that shows the likely pattern:
const puppeteer = require("puppeteer");
let browser;
(async () => {
const html = `<ul><li>a</li><li>b</li><li>c</li></ul>`;
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setContent(html);
// ...
const nestedHandle = await page.$$("li"); // $$ selects all matches
await page.evaluate(els => {}, nestedHandle); // throws
// ...
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
The output is
TypeError: Converting circular structure to JSON
--> starting at object with constructor 'BrowserContext'
| property '_browser' -> object with constructor 'Browser'
--- property '_defaultContext' closes the circle Are you passing a nested JSHandle?
at JSON.stringify (<anonymous>)
Why is this happening? All code inside of the callback to page.evaluate (and family: evaluateHandle, $eval, $$eval) is executed inside the browser console programmatically by Puppeteer. The browser console is a distinct environment from Node, where Puppeteer and the ElementHandles live. To bridge the inter-process gap, the callback to evaluate, parameters and return value are serialized and deserialized.
The consequence of this is that you can't access any Node state like you're attempting with page.waitForSelector('.strong, .section-body') inside the browser. page is in a totally different process from the browser. (As an aside, document.querySelectorAll is purely synchronous, so there's no point in awaiting it.)
Puppeteer ElementHandles are complex structures used to hook into the page's DOM that can't be serialized and passed to the page as you're trying to do. Puppeteer has to perform the translation under the hood. Any ElementHandles passed to evaluate (or have .evaluate() called on them) are followed to the DOM node in the browser that they represent, and that DOM node is what your evaluate's callback is invoked with. Puppeteer can't do this with nested ElementHandles, as of the time of writing.
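The serialization failure itself is easy to reproduce without Puppeteer. The objects below are stand-ins (not real Puppeteer classes) that mimic the _browser / _defaultContext cycle shown in the error output:

```javascript
// Stand-in objects that reference each other, forming a cycle.
const browserLike = { name: 'browser' };
const contextLike = { _browser: browserLike };
browserLike._defaultContext = contextLike;

let message = null;
try {
  // JSON.stringify cannot represent a cycle, so it throws a TypeError.
  JSON.stringify(contextLike);
} catch (err) {
  message = err.message;
}

console.log(message.split('\n')[0]);
```

Anything that has to cross the Node/browser boundary as JSON hits the same limitation, which is why only plain, acyclic values can be passed directly.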
Possible fixes
In the above code, if you change .$$ to .$, you'll retrieve only the first <li>. This singular, non-nested ElementHandle can be converted to an element:
// ...
const handle = await page.$("li");
const val = await page.evaluate(el => el.innerText, handle);
console.log(val); // => a
// ...
Or:
const handle = await page.$("li");
const val = await handle.evaluate(el => el.innerText);
console.log(val); // => a
Making this work on your example is a matter of either swapping the loop and the evaluate call so that you access courseCountArr[i] in Puppeteer land, unpacking the nested ElementHandles into separate parameters to evaluate, or moving most of your browser console calls to click on things back to Puppeteer (depending on your use case and goals with the code).
You could apply the evaluate call to each ElementHandle:
const nestedHandles = await page.$$("li");
for (const handle of nestedHandles) {
const val = await handle.evaluate(el => el.innerText);
console.log(val); // a b c
}
To get an array of results, you could do:
const nestedHandles = await page.$$("li");
const vals = await Promise.all(
nestedHandles.map(el => el.evaluate(el => el.innerText))
);
console.log(vals); // [ 'a', 'b', 'c' ]
You can also unpack the ElementHandles into arguments for evaluate and use the (...els) parameter list in the callback:
const nestedHandles = await page.$$("li");
const vals = await page.evaluate((...els) =>
els.map(e => e.innerText),
...nestedHandles
);
console.log(vals); // => [ 'a', 'b', 'c' ]
If you have other arguments in addition to the handles you can do:
const nestedHandle = await page.$$("li");
const vals = await page.evaluate((foo, bar, ...els) =>
els.map(e => e.innerText + foo + bar)
, 1, 2, ...nestedHandle);
console.log(vals); // => [ 'a12', 'b12', 'c12' ]
or:
const nestedHandle = await page.$$("li");
const vals = await page.evaluate(({foo, bar}, ...els) =>
els.map(e => e.innerText + foo + bar)
, {foo: 1, bar: 2}, ...nestedHandle);
console.log(vals); // => [ 'a12', 'b12', 'c12' ]
Another option may be to use $$eval, which selects multiple handles, then runs a callback in browser context with the array of selected elements as its parameter:
const vals = await page.$$eval("li", els =>
els.map(e => e.innerText)
);
console.log(vals); // => [ 'a', 'b', 'c' ]
This is probably cleanest if you're not doing anything else with the handles in Node.
Similarly, you can totally bypass Puppeteer and do the entire selection and manipulation in browser context:
const vals = await page.evaluate(() =>
[...document.querySelectorAll("li")].map(e => e.innerText)
);
console.log(vals); // => [ 'a', 'b', 'c' ]
(note that getting the inner text throughout is just a placeholder for whatever browser code of arbitrary complexity you might have)
I wrote a little utility to solve this problem
const jsHandleToJSON = (jsHandle) => {
if (jsHandle.length > 0) {
let json = []
for (let i = 0; i < jsHandle.length; i++) {
json.push(jsHandleToJSON(jsHandle[i]))
}
return json
} else {
let json = {}
const keys = Object.keys(jsHandle)
for (let i = 0; i < keys.length; i++) {
if (typeof jsHandle[keys[i]] !== 'object') {
json[keys[i]] = jsHandle[keys[i]]
} else if (['elements', 'element'].includes(keys[i])) {
json[keys[i]] = jsHandleToJSON(jsHandle[keys[i]])
} else {
console.log(`skipping field ${keys[i]}`)
}
}
return json
}
}
It creates a new object with all the primitive fields of the jsHandle (recursively), parses the extra jsHandle properties ['elements', 'element'], and skips the others.
You could add more properties in there if you need them (but adding all of them will result in an infinite loop).
To see the browser's console.log output in Puppeteer, add the following line before the evaluate call:
page.on('console', message => console.log(`${message.type()}: ${message.text()}`))

Continue on Null Value of Result (Nodejs, Puppeteer)

I'm just starting to play around with Puppeteer (Headless Chrome) and Nodejs. I'm scraping some test sites, and things work great when all the values are present, but if the value is missing I get an error like:
Cannot read property 'src' of null (so in the code below, the first two passes might have all values, but the third pass, there is no picture, so it just errors out).
Before I was using if(!picture) continue; but I think it's not working now because of the for loop.
Any help would be greatly appreciated, thanks!
for (let i = 1; i <= 3; i++) {
//...Getting to correct page and scraping it three times
const result = await page.evaluate(() => {
let title = document.querySelector('h1').innerText;
let article = document.querySelector('.c-entry-content').innerText;
let picture = document.querySelector('.c-picture img').src;
if (!document.querySelector('.c-picture img').src) {
let picture = 'No Link'; } //throws error
let source = "The Verge";
let categories = "Tech";
if (!picture)
continue; //throws error
return {
title,
article,
picture,
source,
categories
}
});
}
let picture = document.querySelector('.c-picture img').src;
if (!document.querySelector('.c-picture img').src) {
let picture = 'No Link'; } //throws error
If there is no picture, then document.querySelector() returns null, which does not have a src property. You need to check that your query found an element before trying to read the src property.
Moving the null-check to the top of the function has the added benefit of saving unnecessary calculations when you are just going to bail out anyway.
async function scrape3() {
// ...
for (let i = 1; i <= 3; i++) {
//...Getting to correct page and scraping it three times
const result = await page.evaluate(() => {
const pictureElement = document.querySelector('.c-picture img');
if (!pictureElement) return null;
const picture = pictureElement.src;
const title = document.querySelector('h1').innerText;
const article = document.querySelector('.c-entry-content').innerText;
const source = "The Verge";
const categories = "Tech";
return {
title,
article,
picture,
source,
categories
}
});
if (!result) continue;
// ... do stuff with result
}
}
Answering comment question: "Is there a way just to skip anything blank, and return the rest?"
Yes. You just need to check the existence of each element that could be missing before trying to read a property off of it. In this case we can omit the early return since you're always interested in all the results.
async function scrape3() {
// ...
for (let i = 1; i <= 3; i++) {
const result = await page.evaluate(() => {
const img = document.querySelector('.c-picture img');
const h1 = document.querySelector('h1');
const content = document.querySelector('.c-entry-content');
const picture = img ? img.src : '';
const title = h1 ? h1.innerText : '';
const article = content ? content.innerText : '';
const source = "The Verge";
const categories = "Tech";
return {
title,
article,
picture,
source,
categories
}
});
// ...
}
}
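If your environment supports optional chaining (?.) and nullish coalescing (??), the same guards can be written more compactly. This is only a sketch: pickFields is a made-up helper, and inside page.evaluate you would pass it the results of the three document.querySelector calls.

```javascript
// Hypothetical helper: read a property from each element if the element
// exists, otherwise fall back to an empty string.
function pickFields(img, h1, content) {
  return {
    picture: img?.src ?? '',
    title: h1?.innerText ?? '',
    article: content?.innerText ?? '',
  };
}

// Example with a missing picture element (null, as querySelector returns).
console.log(pickFields(null, { innerText: 'Headline' }, { innerText: 'Body' }));
```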
Further thoughts
Since I'm still on this question, let me take this one step further, and refactor it a bit with some higher level techniques you might be interested in. Not sure if this is exactly what you are after, but it should give you some ideas about writing more maintainable code.
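// Generic reusable helper to return an object property
// if object exists and has property, else a default value
//
// This is a curried function accepting one argument at a
// time and capturing each parameter in a closure.
//
// Note: `default` is a reserved word, so the parameter is named
// defaultValue. `key in object` is used because DOM properties such
// as innerText live on the prototype, not on the element itself.
// page.evaluate runs its callback in the browser, so these helpers
// must also be defined there to be callable from that callback.
//
const maybeGetProp = defaultValue => key => object =>
  (object && key in object) ? object[key] : defaultValue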
// Pass in empty string as the default value
//
const getPropOrEmptyString = maybeGetProp('')
// Apply the second parameter, the property name, making 2
// slightly different functions which have a default value
// and a property name pre-loaded. Both functions only need
// an object passed in to return either the property if it
// exists or an empty string.
//
const maybeText = getPropOrEmptyString('innerText')
const maybeSrc = getPropOrEmptyString('src')
async function scrape3() {
// ...
// The _ parameter name acknowledges that we expect an
// argument to be passed in but plan to ignore it.
//
const evaluate = _ => page.evaluate(() => {
// Attempt to retrieve the desired elements
//
const img = document.querySelector('.c-picture img');
const h1 = document.querySelector('h1')
const content = document.querySelector('.c-entry-content')
// Return the results, with empty string in
// place of any missing properties.
//
return {
title: maybeText(h1),
article: maybeText(content),
picture: maybeSrc(img),
source: 'The Verge',
categories: 'Tech'
}
})
// Start with an empty array of length 3
//
const evaluations = Array(3).fill()
// Then map over that array ignoring the undefined
// input and return a promise for a page evaluation
//
.map(evaluate)
// All 3 scrapes are occurring concurrently. We'll
// wait for all of them to finish.
//
const results = await Promise.all(evaluations)
// Now we have an array of results, so we can
// continue using array methods to iterate over them
// or otherwise manipulate or transform them
//
results
.filter(result => result.title && result.picture)
.forEach(result => {
//
// Do something with each result
//
})
}
Try-catch worked for me:
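// $eval throws if no element matches the selector, so fall back to ''.
// Reading innerText here is only an example of a page function.
let name = ''
try {
  name = await page.$eval('element', el => el.innerText)
} catch (error) {
  name = ''
}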
