Process web page in JavaScript/browser and inserting into database? - javascript

I'm processing URLs with my program.
Basically, it calls file_get_contents($url) to get the web content and appends JavaScript code at the bottom, which scans the HTML source for the biggest image and inserts it into the database via Ajax.
The problem is that I have to launch a Firefox instance for the processing. I don't need Firefox to render the page visually or do anything else; all I want is for my script to do its job.
So is there a way to use Firefox's HTML/CSS and JavaScript engine without having to call the entire browser?

Related

How to manipulate DOM using Node js?

I am new to JavaScript and Node.js, and I am trying to figure out whether Node has an alternative to document.getElementById() that does the same job. If it cannot be done in Node, is it possible to create a pure JS file to manipulate the DOM and a separate Node file? For extra information, what I am trying to do is convert CSV lines into a JSON object and then update the webpage with the new information, which is why I want to use document.getElementById().
document.getElementById() is a function that exists in a browser. There is no such function in nodejs.
It is possible to get a 3rd party module that will parse an HTML web page, create a DOM and then allow you to access the DOM programmatically to see what's in the web page. Cheerio and Puppeteer are two such 3rd party modules, each with differing levels of features. Puppeteer actually uses the Chromium browser engine and can even run Javascript in the page and generate screenshots. Cheerio parses the HTML and lets you access just what it creates (without Javascript running).
It sounds like maybe you're a bit confused about how web pages work. A browser running on the end user's computer loads a web page. Once the page is loaded, at that point the server's job is done. The web page exists only in the browser on the user's computer. The server can't directly, on its own, change that web page.
To change that web page (without reloading it), you would have to have supporting Javascript code in the web page (that runs in the user's browser). For example, you could have your Javascript make an Ajax call from the web page that would request certain data from the server. When the server gets that request, it could generate the data and return JSON back to the browser. The Javascript in the browser would then receive that JSON, parse it into a Javascript object and then use the DOM to insert new objects into the existing web page based on the data it received.
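As a rough sketch of the browser-side half of that flow, the code below requests JSON from the server and inserts it into the page. The endpoint '/api/rows', the element id 'rows', and the row fields name and value are all hypothetical names chosen for illustration; none of them come from the question.

```javascript
// Build one <li> per row and put them in the given list element.
// `doc` is the DOM document; passing it in keeps the function easy to reuse.
function renderRows(doc, listElement, rows) {
  // Remove whatever the list currently shows.
  while (listElement.firstChild) {
    listElement.removeChild(listElement.firstChild);
  }
  for (const row of rows) {
    const item = doc.createElement('li');
    // textContent (not innerHTML) so the data is never interpreted as markup.
    item.textContent = `${row.name}: ${row.value}`;
    listElement.appendChild(item);
  }
  return listElement;
}

// Ask the server for fresh data and update the page without reloading it.
// '/api/rows' and the id 'rows' are hypothetical names for illustration.
async function refreshRows() {
  const response = await fetch('/api/rows');
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const rows = await response.json(); // parse the JSON body into objects
  renderRows(document, document.getElementById('rows'), rows);
}
```

In a page, refreshRows() could be called from a button click handler or a timer. The server only has to answer the request with JSON; every change to the page happens in this browser-side code.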
Note that all changes to the existing web page in the browser are made by the Javascript running in the web page in the user's browser, not directly by the server. The server can supply data, but cannot directly change the user's web page itself. Of course, the user could request an updated page, the browser would then request a new version of the whole page, and the server could supply a page with different data in it, but that would involve reloading the whole page.
There are also template engines for nodejs, so that when your server is generating a web page, the template engine can help you create the HTML for that page with dynamic data incorporated. This doesn't dynamically change a web page that is already sitting in a browser being displayed. Instead, it helps you generate a web page from scratch that incorporates dynamic data when it is first downloaded. Examples of template engines that work with Express in nodejs are Pug, EJS, Nunjucks, Handlebars and many others.
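To illustrate the idea only, here is a tiny hand-rolled placeholder substitution. It is not the API of Pug, EJS, Nunjucks or Handlebars; the {{key}} syntax and the function names renderTemplate and escapeHtml are assumptions made for this sketch, and a real engine does far more.

```javascript
// Escape characters that have special meaning in HTML.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Replace each {{key}} placeholder with the escaped value from `data`.
// Keys that are missing from `data` become an empty string.
function renderTemplate(template, data) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(data, key) ? escapeHtml(data[key]) : ''
  );
}

// The server would build the page like this before sending it to the browser.
const page = renderTemplate(
  '<h1>{{title}}</h1><p>Rows loaded: {{count}}</p>',
  { title: 'Sales & Returns', count: 3 }
);
console.log(page); // <h1>Sales &amp; Returns</h1><p>Rows loaded: 3</p>
```

The substitution happens once, on the server, when the page is generated. Nothing here touches a page that a browser has already loaded.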

Web scraping of modal window(dialogue box) using jsoup

I am working on a project in which I have to extract data from a website. The project is in Java and the website uses JavaScript. I am using Jsoup to extract the data, but there are some modal windows (dialog boxes, pop-up windows) on the web page. Is it possible to extract the data in the modal windows using Jsoup?
If the answer is yes, how could I do it? Please provide links. If not, what are the other best ways to do it?
Thanks for your help. I really appreciate it.
I assume that the modal is generated by Javascript.
Jsoup is just a parser. This means that it will make an HTTP request (GET or POST, whatever you tell it to do) and the server (website) will respond with the initial html. By saying initial, I mean the html before any javascript is executed.
Javascript can generate html (like the modal in question), but this is not visible to Jsoup because a parser can only read, it cannot execute code. The browser is able to generate the modal because it includes a Javascript execution engine that parses and executes Javascript.
When you visit a web page, you can't tell at a glance what is dynamic (generated by Javascript) and what is static (sent by the server as is).
A little trick to check what is dynamic and what is static (static is visible to Jsoup) is to do the following:
Visit the web page you want to parse (with Chrome if possible; Firefox will work too, I think).
Press Ctrl + U. This will open the page source in a new tab.
The new tab will contain a mix of html, css and js. This is what the server sends to the browser, and it is also what is visible to Jsoup.
If the modal is in there, then great, it is visible to Jsoup. If not, then you have to use a library that acts as a headless browser.
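The same check can be scripted. The sketch below fetches the raw HTML without running any of the page's JavaScript and looks for a marker string, such as a piece of the modal's markup. The function names are hypothetical, and it assumes a runtime with a global fetch (for example Node.js 18 or later).

```javascript
// True if `marker` occurs in the raw HTML text (case-insensitive).
function appearsInStaticHtml(rawHtml, marker) {
  return rawHtml.toLowerCase().includes(marker.toLowerCase());
}

// Fetch the page the way a plain HTTP client would: no JavaScript is run,
// so the result is the same initial HTML that a parser such as Jsoup sees.
async function isServedStatically(url, marker) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return appearsInStaticHtml(await response.text(), marker);
}
```

For example, isServedStatically('https://example.com/page', 'id="my-modal"') would resolve to true only if that markup is already in the server's response; both the URL and the marker here are placeholders.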
A headless browser is essentially a browser without the graphical interface. It can parse and execute Javascript. It "sees" what a normal browser sees.
The most common library used is Selenium WebDriver. Be careful: Selenium is a testing framework that has a lot of parts. What you need is the WebDriver.
There are a lot of examples out there with ready-made code to get you started.

Pure HTML Solution for a Dynamic Widget Using Forms

I need to create an HTML code snippet that I will distribute to third-party websites. This snippet talks to a PHP file on my server and contains logic to update the content (an image) at specified time intervals. The reason I cannot use JavaScript is that it is not search engine friendly.
The way I have it now is HTML plus JavaScript code that uses an XMLHttpRequest (Ajax) to call a PHP file, which in turn reads a CSV file and updates the banner image on the third-party site. But it is not crawlable by search engines.
Any other way of getting this to work using HTML? Probably using forms?
HTML is not active. If you want it to do something, you need some sort of scripting language. You can do this without using Ajax (XMLHttpRequest), though. Before Ajax, it was common practice to relay information to the server using dynamic image loading. Of course, the dynamic image loading itself required a script. It can be rather simple:
<img id='myimg' src='temp.jpg'
onload="this.onload=null; this.src='myscript.php?width='+window.innerWidth;"
>
Your server-side script responds with whatever image you like, and the query string of the GET request carries information from the web page to your server. I originally saw this used extensively to deliver rotating ads. With it, you can record which ads are shown, along with information that would otherwise be known only to the web browser.

How can I get the HTML generated with javascript?

I want to get the HTML content of a web page but most of the content is generated by javascript.
Is it possible to get this generated HTML (with Python if possible)?
The only way I know of to do this from your server is to run the page in an actual browser engine that will parse the HTML, build the normal DOM environment, run the javascript in the page and then reach into that DOM engine and get the innerHTML from the body tag.
This could be done by firing up Chrome with the appropriate URL from Python and then using a Chrome plugin to fetch the dynamically generated HTML after the page was done initializing itself and communicate back to your Python.
Check out Selenium. It has a Python driver, which might be what you're looking for.
If most of the content is generated by Javascript then the Javascript may be doing ajax calls to retrieve the content. You may be able to call those server side scripts from your Python app.
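A minimal sketch of that last idea, written in JavaScript for illustration (the same approach carries over to a Python HTTP client): find the data endpoint in the browser's network tab, then request it directly. The function names are hypothetical, the endpoint and its parameters depend entirely on the site, and the sketch assumes the endpoint returns JSON and that a global fetch is available.

```javascript
// Build the request URL for a data endpoint, adding query parameters.
function buildEndpointUrl(baseUrl, params) {
  const url = new URL(baseUrl);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}

// Request the data directly, skipping the page and its JavaScript entirely.
// Assumes the endpoint answers with a JSON body.
async function fetchContent(baseUrl, params) {
  const response = await fetch(buildEndpointUrl(baseUrl, params), {
    headers: { Accept: 'application/json' },
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}
```

A call such as fetchContent('https://example.com/api/items', { page: 2 }) uses a placeholder URL; the real endpoint may also expect cookies, headers or tokens that the page normally supplies.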
Do check that it doesn't violate the website's terms though and get permission.

Append javascript/html to page when navigating from a different page?

Alright, first off this is not a malicious question I'm asking. I have no intentions of using any info for ill gains.
I have an application that contains an embedded browser. This browser runs within the application's process, so I can't access it via Selenium WebDriver or anything like that. I know that it's possible to dynamically append scripts and html to loaded web pages via WebDriver, because I've done it.
In the embedded browser, I don't have access to the pages that get loaded. Instead, I can create my own html/javascript pages and execute them, to manipulate the application that houses the browser. I'm having trouble manipulating the existing pages within the browser.
Is there a way to dynamically add javascript to a page when you navigate to it and have it execute right after the page loads?
Something like
page1.navigateToUrl(executeThisScriptOnLoad)
page2 then executes the passed script.
I guess it is not possible to do this without knowledge of the destination site. However, you can send data to that site and have the destination page evaluate it with the eval() function.
