I want to convert some web pages with JavaScript to plain HTML, and I found several ways (please tell me if I'm wrong):
Use Jython, an example: http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/
Use Java together with HtmlUnit
Use a proxy, an example: http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/
Use Python together with Qt or PyV8
I want to make a tiny tool for this, and installing V8 and Qt seemed somewhat complicated, although Python is my first choice.
So I tried to make a proxy with Gecko, but it seems to need a DISPLAY, which I can't provide on a remote Linux server.
Now I am trying Jython, but there seems to be no simple way to just convert a whole page to plain HTML.
What I actually want to ask is: is there a way to convert a web page that contains JavaScript to plain HTML, just like the browser does? Can Node.js do this job?
I've recently built a server on top of PhantomJS that does this. I highly recommend this route.
http://phantomjs.org/
Basically, you write a quick script that has PhantomJS run the page, and configure a trigger method that lets you know the page is finished and sends the data off. My version used the built-in HTTP server, so PhantomJS easily served up the results on its own. This takes about 15 lines of code to do. (Sorry, can't paste it here... wrote it on work time. But, check out the example on their home page. It's almost complete!)
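In that spirit, here is a minimal sketch of the idea (not the original code; the port, target URL, and fixed two-second wait are placeholder assumptions, and a real version would use a smarter "page is finished" trigger):

// PhantomJS script: serve the rendered, post-JavaScript HTML over HTTP.
// Run with: phantomjs render-server.js
var webpage = require('webpage');
var server = require('webserver').create();

server.listen(8080, function (request, response) {
    var page = webpage.create();
    page.open('http://example.com/', function (status) {
        // Give the in-page JavaScript a moment to finish before grabbing the DOM.
        window.setTimeout(function () {
            response.statusCode = (status === 'success') ? 200 : 500;
            response.write(page.content); // the DOM serialized back to HTML
            response.close();
            page.close();
        }, 2000);
    });
});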
A little info:
When 'inspected' (Google Chrome), the website displays the information I need (namely, a simple link to a .pdf).
When I cURL the website, only part of it gets saved. This, coupled with the fact that there are functions and <script> tags, leads me to believe that JavaScript is the culprit (I'm honestly not 100% sure, as I'm pretty new at this).
I need to pull this link periodically, and it changes each time.
The question:
Is there a way for me, in bash, to run this JavaScript and save the new HTML code it generates to a file?
Not trivially.
Typically, for that approach, you need to:
Construct a DOM from the HTML
Execute the JavaScript in the context of that DOM while resolving URLs relative to the URL you fetched the HTML from
There are tools which can help with this, such as Puppeteer, PhantomJS, and Selenium, but they generally lend themselves to being driven with beefier programming languages than bash.
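For example, here is a minimal Node.js sketch using Puppeteer (the URL is a placeholder); from bash it then reduces to a single node render.js > page.html call:

// Render a page with Puppeteer and print the post-JavaScript HTML to stdout.
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // 'networkidle0' waits until the page has stopped making network requests,
    // a rough proxy for "the scripts have finished running".
    await page.goto('https://example.com/', { waitUntil: 'networkidle0' });
    console.log(await page.content()); // the DOM serialized back to HTML
    await browser.close();
})();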
As an alternative, you can look at reverse engineering the page. It gets the data from somewhere. You can probably work out the URLs (the Network tab of a browser's developer tools is helpful there) and access them directly.
If you want to download a web page that generates its content with JavaScript, you need to execute that JavaScript in order to render the page. You can do this with libraries such as Puppeteer for Node.js. There are plenty of other libraries, but that's the most popular.
If you're wondering why this happens: web developers often use frameworks like React, Vue, or Angular (to name the most popular), which generate their output client-side in JavaScript that common HTTP request libraries never execute.
I want to write to a text file using javascript. I know that it is possible but there are some problems.
I am running a JavaScript program that calculates the location of an object (its latitude and longitude), which changes every 5 seconds; I want to write this information to a text file. The JavaScript program will soon run on a server, and I'll use the information written to the text file to communicate with an Android app on my phone.
So, my question really is:
How can it be done properly?
I know that there may be some permission issues, but considering it won't be online and available to others, will that be a problem? And if it is, should I go with PHP to do what I want? I know ASP is more Microsoft-oriented, and I work on a Mac, so PHP would be the preference here.
When writing a file, is it possible to write to an existing file or does the process simply destroy and recreate the same file?
I would use PHP
http://www.php.net/
This has a good code example:
http://www.tizag.com/phpT/filewrite.php
Also, you can make the request using jQuery's AJAX function; this lets you effectively trigger that PHP code from JavaScript:
http://api.jquery.com/jQuery.ajax/
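A minimal sketch of the JavaScript side, assuming a PHP endpoint (save_position.php here is a made-up name) that appends the posted values to a text file with fwrite as in the tutorial above:

// Post the current coordinates to the server every 5 seconds.
setInterval(function () {
    $.ajax({
        url: 'save_position.php',  // hypothetical PHP script that does the fwrite
        type: 'POST',
        data: { lat: currentLat, lng: currentLng } // your computed values
    });
}, 5000); // matches the 5-second update interval from the question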
Using HTML5 JavaScript, you can use the FileSystem APIs to read, write, and append to text files. There is a good tutorial here: Exploring the FileSystem APIs.
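For what it's worth, a short sketch of appending with that API; note that it only ever worked in Chrome (the calls are webkit-prefixed and the API was never standardized), and the file lives in a sandboxed filesystem rather than at an arbitrary path on disk:

// Request a sandboxed filesystem and append a line to a text file in it.
window.webkitRequestFileSystem(window.TEMPORARY, 1024 * 1024, function (fs) {
    fs.root.getFile('positions.txt', { create: true }, function (fileEntry) {
        fileEntry.createWriter(function (writer) {
            writer.seek(writer.length);            // move to the end, so we append
            writer.write(new Blob(['lat,lng\n'])); // rather than overwrite
        });
    });
});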
I have a program written in Python, and I would like to make it easy to enter parameter values for this program through a GUI. I realise that I could create a GUI using Python tools, but I am interested in using an HTML/JavaScript page and having the JavaScript code call my Python script when the user clicks a button to run. Something like:
var xmlhttp = new XMLHttpRequest();
xmlhttp.open("GET", "../scripts/python_script.py", true);
xmlhttp.send();
Currently, when I do that, I just get back the text of the Python script; it doesn't actually run. Ideally the Python script would run in the background without blocking further input to the web page, and as the script produces its various result files (PNG images), these would be displayed in the browser. Clearly, I could do this using a web server (and I may end up doing that eventually anyway, hence the HTML interface), but I am wondering if it is possible without one. That way I could package the HTML page and the Python script together and give them to someone who could then run the program on their computer without needing to start a web server. Is this possible?
If it is not, is there an alternative way to achieve a similar result? Could I embed a small server into a Python script that displays the HTML page when it starts up, and then responds to an XMLHttpRequest to start the Python script? If I did this, would the user have to start the script and then go to the specified address in their browser as a separate action?
EDIT: I got a quick solution working using SimpleHTTPServer, but I had a look at bottle and I'll probably try something using that as well. Thanks for your help.
First of all, with something like bottle it is pretty simple to make a web server that runs your script. Look at http://bottlepy.org/docs/dev/
A good starting point is the code at http://bottlepy.org/docs/dev/tutorial.html#http-request-methods but you would put up a form asking for parameters rather than a login form. Then just run your Python script, capture the output and send it back in the return statement.
This question Capture subprocess output shows you two ways to run your main script depending on whether you want to show the output progressively or all at the end.
You'd need to bundle some kind of web server with the application. If it is not intended for deployment, I would go for something like bottle.py. It's a micro web framework that has its own development server. Other micro/mini frameworks probably also pack their own web server for development purposes (web2py, Flask, ...).
If you want something more serious you'd probably need some better web server. If that's the case - have a look at this reddit discussion.
This is kind of tricky. There is this webpage which, I am guessing, uses some kind of AJAX to pull in content based on the search query. When I fetch the page using get in Perl, it fetches the script code behind the PHP/HTML, but not the results which are displayed when the query is run manually. I need to be able to fetch the content of the results page. Is there any way to do this in Perl?
Take a look at Selenium RC and the WWW::Selenium module in Perl. With them you can control a real web browser.
Another option is WWW::HtmlUnit which uses the HtmlUnit Java library to execute the JavaScript without a web browser. WWW::HtmlUnit uses Inline::Java to give Perl access to the library. I have found that when installing, it is best to say No to the question "Do you wish to build the JNI extension?".
If you are writing tests that need to check the rendered page, you can have a look at Schwern's javascript-tap-harness, which works with Selenium and handles all the scaffolding.
I also found Using WWW::Selenium To Test Or Automate An Ajax Website pretty useful.
I'm writing a web crawler (web spider) that crawls all the links on a website.
My application is a Win32 app, written in C# with .NET Framework 3.5.
Now I'm using HttpWebRequest and HttpWebResponse to communicate with the web server.
I also built my own HTTP parser that can parse anything I want.
I find all links ("href", "src", "action", ...) during parsing.
But I cannot solve one problem: simulating client script in the page (like JS and VBS).
For example, a link like:
<a href="javascript:buildLink(1)">
... where buildLink(parameter) is a JavaScript function that builds a custom link based on the parameter.
Please help me solve this problem. How can I simulate JavaScript in this app? I can parse the HTML source code and extract all the JavaScript into another file, but how do I simulate a function from it?
Thanks.
Your only real option is to automate a browser. As other answers have said, you cannot reliably simulate browser javascript without having a complete DOM.
Fortunately, there are ways to automate the browser; check out Selenium.
It has a C# API, so you can control the browser from C#.
Use your .NET web crawler code to crawl the site. Whenever you encounter an href="javascript:... link, handle the page containing the link in Selenium (see the sketch after these steps):
Use the Selenium API to tell the browser to load the page.
Use the Selenium API to find all links on the page.
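As a sketch of those two steps, here they are with Selenium's JavaScript bindings (the C# API has direct equivalents, e.g. FindElements(By.TagName("a"))); the URL is a placeholder:

// Load a page in a real browser and collect every link's href after scripts run.
const { Builder, By } = require('selenium-webdriver');

(async () => {
    const driver = await new Builder().forBrowser('chrome').build();
    try {
        await driver.get('http://example.com/page-with-js-links');
        const anchors = await driver.findElements(By.tagName('a'));
        for (const a of anchors) {
            console.log(await a.getAttribute('href'));
        }
    } finally {
        await driver.quit();
    }
})();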
This way, your spider only uses Selenium when necessary (pages without JavaScript links can be handled by the browser-less spider code you already have). And since this is an embarrassingly parallel workload, you could easily have multiple Selenium processes running at the same time (either on one computer or on several).
But remember that href="javascript is hardly the only way a page can have dynamic links. The more common case is probably that an onload or $(document).ready() script manipulates the DOM and adds links that way.
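For instance, a page whose raw HTML contains no <a> tags at all can still end up full of links once its scripts run, with something like:

// Hypothetical example of the pattern: the HTML a crawler downloads has no
// links; this script adds them to the DOM after the page loads.
$(document).ready(function () {
    for (var i = 1; i <= 10; i++) {
        $('#nav').append('<a href="/page/' + i + '">Page ' + i + '</a>');
    }
});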
To catch that case (and others), the spider probably will have to use Selenium for all pages that have a <script> tag.
You are basically pretending to be a browser, except that HttpWebRequest only does the networking stuff for you.
I would recommend using the IE web browser control and interop'ing into it from your C# application. That will allow you to run JavaScript, set variables, post, etc.
Here are some basic links I found after a search for "ie web browser control":
http://www.c-sharpcorner.com/UploadFile/mahesh/WebBrowserInCSMDB12022005001524AM/WebBrowserInCSMDB.aspx
http://support.microsoft.com/kb/313068
This is a problem which is not easily solved. You could consider taking one of the existing JavaScript implementations and porting or interfacing with it somehow.
If I were tackling this problem, I'd probably build a small side application in Java on top of Rhino, with some sort of RPC framework layered on top of that so that I could communicate with it from my primary application.
Unfortunately, without having a complete DOM implementation on top of that, you would be limited to only very simple javascript.
You could execute the JavaScript by using the MS JScript engine or something similar.
MSDN Reference
Eric Lippert's blog on using Eval (part 1) (part 2) (part 3)
This isn't guaranteed to work, especially if the JavaScript tries to access the DOM or some such... But for simple scripts, it might be enough.