HtmlAgilityPack: downloading a webpage that is loaded asynchronously by JavaScript - javascript

I am using HtmlAgilityPack and trying to load some webpages. Some of them are JavaScript-based and load their content asynchronously. Is there any way to load a web page after x seconds, or after making sure the page has completely loaded?

Html Agility Pack does not execute JavaScript, so it does not reproduce the client-side calls that dynamically load content into the DOM. It only downloads and parses the static page returned by the server; if you want the dynamic content, you will have to reproduce the requests made by the client browser yourself. If you do not want to emulate the calls a browser would make, you can use something like Selenium to drive a real browser for you. The downside is that, unless the browser is run in headless mode, a browser window will be opened on the host machine.
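If you do drive a real browser, a fixed "wait x seconds" delay is usually less reliable than polling for a condition with a timeout. The following is only a minimal sketch of that idea in plain JavaScript; `waitFor` and the `isReady` callback are hypothetical names, not part of Html Agility Pack or Selenium, and the readiness check itself (for example, "does the element I need exist yet?") is left to whatever tool you use.

```javascript
// Poll `isReady` until it returns a truthy value or `timeoutMs` elapses.
// `isReady` is a hypothetical callback supplied by the caller; it may be
// synchronous or return a Promise.
async function waitFor(isReady, { timeoutMs = 10000, intervalMs = 250 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await isReady()) return true;           // condition met: stop waiting
    if (Date.now() >= deadline) return false;   // gave up: caller decides what to do
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

The function returns `true` as soon as the check passes and `false` on timeout, so the caller can choose between scraping whatever is there and reporting a failure.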

Related

How to manipulate DOM using Node js?

I am new to JavaScript and Node.js, but I am trying to figure out whether there is an alternative to document.getElementById() in Node that has the same function. If it cannot be done in Node, is it possible to create a pure JS file to manipulate the DOM and a separate Node file? For extra information, what I am trying to do is convert CSV lines into a JSON object and then update the webpage with the new information, which is why I want to use document.getElementById().
document.getElementById() is a function that exists in a browser. There is no such function in nodejs.
It is possible to get a 3rd party module that will parse an HTML web page, create a DOM and then allow you to access the DOM programmatically to see what's in the web page. Cheerio and Puppeteer are two such 3rd party modules, each with differing levels of features. Puppeteer actually uses the Chromium browser engine and can even run Javascript in the page and generate screenshots. Cheerio parses the HTML and lets you access just what it creates (without Javascript running).
It sounds like maybe you're a bit confused about how web pages work. A browser running on the end user's computer loads a web page. Once the page is loaded, at that point the server's job is done. The web page exists only in the browser on the user's computer. The server can't directly, on its own, change that web page.
To change that web page (without reloading it), you would have to have supporting Javascript code in the web page (that runs in the user's browser). For example, you could have your Javascript make an Ajax call from the web page that would request certain data from the server. When the server gets that request, it could generate the data and return JSON back to the browser. The Javascript in the browser would then receive that JSON, parse it into a Javascript object and then use the DOM to insert new objects into the existing web page based on the data it received.
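As a rough illustration of that split, here is a minimal sketch of the server-side half for the CSV case from the question. The function name `csvToObjects`, the sample data, the `/data` route and the `first-name` element id are all made up for the example, and the CSV parsing is deliberately naive (it does not handle quoted fields that contain commas).

```javascript
// Convert CSV text (first line = header row) into an array of plain objects.
// Naive comma split: quoted fields containing commas are NOT supported.
function csvToObjects(csvText) {
  const lines = csvText.split(/\r?\n/).filter((line) => line.trim() !== '');
  if (lines.length === 0) return [];
  const headers = lines[0].split(',').map((h) => h.trim());
  return lines.slice(1).map((line) => {
    const cells = line.split(',').map((c) => c.trim());
    // Missing trailing cells become empty strings.
    return Object.fromEntries(headers.map((h, i) => [h, cells[i] ?? '']));
  });
}

// Server side (Node): this string is what would be sent as the response body.
const json = JSON.stringify(csvToObjects('name,score\nAda,10\nLin,7'));

// Browser side, shown as comments because `document` does not exist in Node:
//   const rows = await (await fetch('/data')).json();  // '/data' is a hypothetical route
//   document.getElementById('first-name').textContent = rows[0].name;
```

The Node code only produces data; the call to document.getElementById() stays in the script that runs inside the page.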
Note that all changes to the existing web page in the browser are made by the Javascript running in the web page in the user's browser, not directly by the server. The server can supply data, but cannot directly change the user's web page itself. Of course, the user could request an updated page, in which case the browser would request a new version of the whole page and the server could then supply a page that had different data in it, but that would involve reloading the whole page.
There are also template engines for nodejs, so that when your server is generating a web page, the template engine can help you create the HTML for that page with dynamic data incorporated. This doesn't dynamically change a web page that is already being displayed in a browser. Instead, it helps you generate a web page from scratch that incorporates dynamic data when it is first downloaded. Examples of template engines that work with Express in nodejs are Pug, EJS, Nunjucks, Handlebars and many others.
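To make the idea concrete, the sketch below does by hand what a template engine automates: it merges data into HTML on the server before the page is sent. It uses only JavaScript template literals; `escapeHtml` and `renderPage` are hypothetical helper names, and this is not the API of Pug, EJS, Nunjucks or Handlebars.

```javascript
// Escape characters that have special meaning in HTML so that data values
// cannot inject markup into the generated page.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Build a complete HTML page from a title and an array of { name, score } rows.
function renderPage(title, rows) {
  const items = rows
    .map((row) => `<li>${escapeHtml(row.name)}: ${escapeHtml(row.score)}</li>`)
    .join('');
  return `<!DOCTYPE html><html><head><title>${escapeHtml(title)}</title></head>` +
    `<body><ul>${items}</ul></body></html>`;
}
```

A real template engine adds features such as layouts, partials and caching, but the output is the same kind of thing: a finished HTML string that the server sends as the response.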

How to load JavaScript enabled response using axios or fetch API in JavaScript?

I am working on a personal project in which I want to read the whole HTML of a JavaScript-dependent webpage. For example, if I load this URL in a JavaScript-enabled web browser, this is what I get:
However, if I disable JavaScript in the browser, and load the same URL now, I get this:
This is pretty normal I know.
Now I am trying to load the HTML of the same link in JavaScript code using axios HTTP client, and obviously I am getting the HTML of JavaScript disabled webpage as the HTTP response.
I want to get the HTML(+JS) source as the response of the same link (in which JavaScript is enabled). I don't know how to mimic a JavaScript enabled Web Browser when working with HTTP clients like axios or fetch API.
If you're trying to do this in the browser, you basically can't unless the site you're loading lets you do so (via CORS or similar). You'd have to load it into a window or iframe, wait for its JavaScript to run, and then access the resulting DOM. But accessing the DOM of a cross-origin page is disabled by default.
The only browser-based way I can think of is to write and install a browser extension, since when a user installs an extension, they can grant greater power to the extension than a web page normally has.
If you're trying to do this in a non-browser environment, you can use a headless browser like headless Chromium or similar. The browser-based restrictions don't apply.

Web browser Control loading html with angular.js script

I'm trying to create a C# Windows Forms application with the WebBrowser control. The intention is to load a local HTML file that is built with angular.js. When I run the application, I get a script error: "Do you want to continue running the script?"
The files are all local. What's the best way to enable the WebBrowser control to load local HTML files that contain JavaScript?

How to load only html from web pages in selenium

How to load only html from web pages in selenium?
I need only html of requested page without css and javascript.
If you need Selenium for web scraping, strictly speaking, you would still need the JavaScript and CSS files, since they can play a significant part in page load and rendering. For example, several parts of a page can be loaded with additional Ajax calls, or inserted via custom JavaScript logic.
Also, if you want only HTML part of a page, why do you need to involve a real browser?
If you still want to prevent js and css files from loading, you can configure certain permissions in Firefox through tweaking FirefoxProfile preferences, see:
Do not want images to load and CSS to render on Firefox in Selenium WebDriver tests with Python
FirefoxDriver: how to disable javascript,css and make sendKeys type instantly?

Append javascript/html to page when navigating from a different page?

Alright, first off this is not a malicious question I'm asking. I have no intentions of using any info for ill gains.
I have an application that contains an embedded browser. This browser runs within the application's process, so I can't access it via Selenium WebDriver or anything like that. I know that it's possible to dynamically append scripts and html to loaded web pages via WebDriver, because I've done it.
In the embedded browser, I don't have access to the pages that get loaded. Instead, I can create my own html/javascript pages and execute them, to manipulate the application that houses the browser. I'm having trouble manipulating the existing pages within the browser.
Is there a way to dynamically add javascript to a page when you navigate to it and have it execute right after the page loads?
Something like
page1.navigateToUrl(executeThisScriptOnLoad)
page2 then executes the passed script.
I guess it is not possible to do this without knowledge of the destination site. However, you can send data to the site and then use the eval() function on the destination page to evaluate the data that was sent.
