How to save a website made with javascript to a file - javascript

A little info:
When 'inspected' (Google Chrome), the website displays the information I need (namely, a simple link to a .pdf).
When I cURL the website, only part of it gets saved. This, coupled with the fact that there are functions and <script> tags, leads me to believe that JavaScript is the culprit (I'm honestly not 100% sure, as I'm pretty new at this).
I need to pull this link periodically, and it changes each time.
The question:
Is there a way for me, in bash, to run this javascript and save the new HTML code it generates to a file?

Not trivially.
Typically, for that approach, you need to:
Construct a DOM from the HTML
Execute the JavaScript in the context of that DOM while resolving URLs relative to the URL you fetched the HTML from
There are tools which can help with this, such as Puppeteer, PhantomJS, and Selenium, but they generally lend themselves to being driven with beefier programming languages than bash.
As an alternative, you can look at reverse engineering the page. It gets the data from somewhere. You can probably work out the URLs (the Network tab of a browser's developer tools is helpful there) and access them directly.
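If you go the rendering route, here is a minimal sketch using Selenium's Python bindings with headless Chrome (this assumes Selenium and a matching chromedriver are installed; the URL and output filename are placeholders). It loads the page, lets its JavaScript run, and writes the resulting HTML to a file, so it can be invoked from bash or a cron job:
# render_page.py -- sketch: render a JS-driven page and save the resulting HTML.
# Assumes Selenium and chromedriver are installed; the URL below is a placeholder.
import sys
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = sys.argv[1] if len(sys.argv) > 1 else "https://example.com/page-with-pdf-link"

options = Options()
options.add_argument("--headless")        # no display needed on a server
driver = webdriver.Chrome(options=options)
try:
    driver.get(url)                       # loads the page and runs its JavaScript
    with open("rendered.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)       # the DOM after the scripts have executed
finally:
    driver.quit()
From bash you could then run something like python render_page.py "$URL" and grep rendered.html for the .pdf link.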

If you want to download a web page that generates itself with JavaScript, you'll need to execute that JavaScript in order to load the page. To achieve this you can use a library such as Puppeteer with Node.js. There are lots of other libraries, but that's the most popular.
If you're wondering why this happens, it's because web developers often use frameworks like React, Vue or Angular (to name the most popular ones), whose output is generated by JavaScript in the browser and therefore isn't rendered by plain HTTP request libraries.

Related

How to use a contentScriptFile from an external URI in a Firefox addon

My latest update to a Firefox addon has been rejected because I've used a custom jquery-ui (generated by their site with just the widgets I wanted) and it fails their checksum check.
Your add-on includes a JavaScript library file that doesn't match our checksums for known release versions. We require all add-ons to use unmodified release versions, obtained directly from the developer's website.
We accept JQuery/JQuery-UI libraries downloaded from 'ajax.googleapis.com', 'jquery.com' or 'jqueryui.com'; and used without any modification. (file-name change does not matter) I'm sorry, but we cannot accept modified, re-configured or customized libraries.
Fair enough, I could just download the full one and resubmit, but I was wondering if it is possible to link to one instead?
If I try this:
contentScriptFile: [self.data.url("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"), self.data.url("https://ajax.googleapis.com/ajax/libs/jqueryui/1.11.3/jquery-ui.min.js"), self.data.url("api.js")],
I get an error at runtime telling me that content scripts must be local. Both Google and the API docs are proving elusive in giving me an answer.
Does anyone know if this is possible and how?
Cheers
Rich
self.data.url("https://...")
It seems like you haven't read the documentation on data.url(). It clearly states that:
The data.url() method returns a resource:// url that points at an embedded data file.
Which means you cannot link to an external resource.
Does anyone know if this is possible and how?
No. contentScriptFile runs with (slightly) elevated privileges compared to regular web content, which is why you are not allowed to load scripts from sources that might change and could theoretically inject malicious code in the future.
If you want to rely on external libraries and keep them up to date you could simply write a little build script that always downloads the newest version when building a new XPI.
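A minimal sketch of such a pre-build step, assuming the add-on bundles its libraries under data/ (the version numbers simply mirror the URLs from the question; a real script might resolve the latest release differently):
# fetch_libs.py -- sketch: download unmodified jQuery / jQuery UI release files into
# data/ before packaging the XPI, so AMO's checksum check sees pristine copies.
import os
import urllib.request

LIBS = {
    "jquery.min.js": "https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js",
    "jquery-ui.min.js": "https://ajax.googleapis.com/ajax/libs/jqueryui/1.11.3/jquery-ui.min.js",
}

os.makedirs("data", exist_ok=True)
for filename, url in LIBS.items():
    urllib.request.urlretrieve(url, os.path.join("data", filename))
    print("fetched", filename)
contentScriptFile can then point at self.data.url("jquery.min.js") and self.data.url("jquery-ui.min.js"), which are local, unmodified copies.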
In principle you could just load the script via privileged XHR and then pass it as string, but that's probably not gonna pass AMO review.
And a piece of personal opinion: since you're targeting a specific browser, you don't really need jQuery for its cross-browser logic; modern web APIs provide enough convenience methods that you can get pretty far with vanilla ES6 JavaScript and state-of-the-art DOM APIs. Iterators and arrow functions also make bulk operations fairly concise.

How can I convert web page with javascript to plain html?

I want to convert some web pages with JavaScript to plain HTML, and I found there are several ways (please tell me if I'm wrong):
Use Jython, an example: http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/
Use Java together with htmlunit
Use a proxy, an example: http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/
Use python together with qt or PyV8
I want to make a tiny tool to meet my needs, and I think it is somewhat complicated to install V8 and Qt, although Python is my first choice.
So I tried to make a proxy with Gecko, but it seems to need a DISPLAY, which I cannot provide on a remote Linux server.
Now I am trying to use Jython, but it seems there is no simple way to just convert a whole page to plain HTML.
Actually, what I want to ask is: is there a way to convert a web page that contains JavaScript to plain HTML, just like the browser does? Can Node.js do this job?
I've recently built a server on top of PhantomJS that does this. I highly recommend this route.
http://phantomjs.org/
Basically, you write a quick script that has PhantomJS run the page, and configure a trigger method that lets you know the page is finished and sends the data off. My version used the built-in HTTP server, so PhantomJS easily served up the results on its own. This takes about 15 lines of code to do. (Sorry, can't paste it here... wrote it on work time. But, check out the example on their home page. It's almost complete!)

How to get the source code of webbrowser with python

I am writing a spider with Scrapy; however, I have come across some websites which are rendered with JS, so urllib2.urlopen does not work. I have found that I can open the browser with webbrowser.open_new(url), but I did not find how to get the source code of the page with webbrowser. Is there any way I could do this with webbrowser, or are there other solutions that don't use webbrowser to deal with JS sites?
You can use one of the scrapers with a WebKit engine that are available out there.
One of them is dryscrape.
Example:
import dryscrape
search_term = 'dryscrape'
# set up a web scraping session
sess = dryscrape.Session(base_url = 'http://google.com')
# we don't need images
sess.set_attribute('auto_load_images', False)
# visit homepage and search for a term
sess.visit('/')
q = sess.at_xpath('//*[@name="q"]')
q.set(search_term)
q.form().submit()
# extract all links
for link in sess.xpath('//a[@href]'):
print link['href']
# save a screenshot of the web page
sess.render('google.png')
print "Screenshot written to 'google.png'"
See more info at:
https://github.com/niklasb/dryscrape
https://dryscrape.readthedocs.org/en/latest/index.html
If you need a full JS engine, there are a number of ways you can drive WebKit from Python. Until recently, this sort of thing was done with Selenium. Selenium drives an entire browser.
More recently there are newer and simpler ways to run a webkit engine (which includes the v8 javascript engine) from Python. See this SO question:
Headless Browser for Python (Javascript support REQUIRED!)
It references this blog post as an example: Scraping Javascript Webpages with Webkit. It looks to do more or less just what you need.
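The pattern from that blog post looks roughly like the following sketch (written from memory using PyQt4's QtWebKit bindings; treat the details as an approximation rather than a drop-in implementation):
# Sketch of the classic QtWebKit "render the page, then grab the HTML" trick (PyQt4).
import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    # Load a URL, let its JavaScript run, then capture the resulting HTML.
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()                 # blocks until _load_finished quits the loop

    def _load_finished(self, result):
        self.html = self.mainFrame().toHtml()   # the DOM serialized after scripts ran
        self.app.quit()

page = Render('http://example.com/some-js-page')   # hypothetical URL
print(page.html)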
I've been trying to find an answer to the same problem for a few days now.
I suggest you try the Qt framework with WebKit.
There are two python bindings. One is PyQt and the other one is PySide. You can use them directly if you want to create something more complex or you want to have 100% control over your code.
For trivial stuff like executing JavaScript in a browser environment you can use Ghost.py. It has some sort of documentation and some problems when using it from the command line but otherwise it's just great.
If you need to process JavaScript, you'll need to integrate a JavaScript engine. This makes your spider much more complex, mainly because JavaScript almost always modifies the DOM based on time or an action taken by the user, which makes it extremely challenging to process JS in a crawler.
If you really need to process JavaScript in your spider you can have a look at the JavaScript engine by Mozilla: https://developer.mozilla.org/en/docs/SpiderMonkey

blogengine without php or asp.net etc

Is there a way to have a blog directly integrated into my HTML/javascript-only website, without having to have something like a SQL-database and a dynamic engine like PHP or MySQL?
Maybe there is some service on the web that offers this (hopefully without ads :) ). Or maybe I can have a blog engine entirely written in JavaScript?
Entirely written in JavaScript? Surely that defeats the entire point of having a "blog-engine" in the first place? The point being that the data is stored somewhere and dynamically retrieved. To avoid using anything server-side (which seems to be your intent), and only use HTML/JavaScript, you'd have to store all the data for the blog in files that are served up to each visitor, and then retrieve the data from the particular, local, locations using JavaScript.
Sorry if I'm misunderstanding the point here... but this seems to be an utterly useless way of trying to go about things. Blogs are, in general, either written statically (in HTML [even though this is rare]), or are dynamically generated from a database by a server-side scripting language (most common).
Edit: As an additional point, I suppose you could include some third-party blog feed, or service, in your page, via use of JavaScript... but I'm unsure as to which (if any) blogging services would directly support this method of working. Additionally, this is quite an unreliable way of including third-party data in a page...
Here's a thought. It's not really a blog engine - but a wiki.
Entirely javascript/html/css. All lives in a single html file:
http://www.tiddlywiki.com/
Not sure how it would work on a real live site, but their site is using it:
* A personal notebook
* A GTD ("Getting Things Done") productivity tool
* A collaboration tool
* For building websites (this site is a TiddlyWiki file!)
* For rapid prototyping
* ...and much more!
You could use github pages. You will get a generated blog with version control.
Another option is to use a desktop blog tool and then update your site.
You can use iWeb if you have a Mac or CityDesk on Windows, or you may try this open source tool.
Edit: Today I came across this tool, Zeta Producer, which may help.
http://code.google.com/p/showdown-blog/
Blog engine written in just JS and XML [v0.6] {JavaScript, XML}
So, what you want is to have a blog where your website provider doesn't provide a way to serve dynamic content?
The only way I see that you can do it in that case is writing html-files (or text-files if you prefer) and adding them to the site. After that you can have some JavaScript to add them to your "blog-page".
You of course need to upload them to the website in the same way as you do for the other files, and then have a way for the JavaScript to know which pages it should fetch.
I am not aware of any JavaScript blog engines, but you can have a look at the templating functions in, for instance, Prototype.
Of course, that means that you will have to fetch both the template and the content through Ajax and let the client do all the processing (could be slow and possibly insecure), and you still need to have a place to upload the content and update it.
Your best bet is going to be using a generator to create the HTML/CSS/JS to upload to your server. Take a look at Webby: http://webby.rubyforge.org/
If you really need to, you can use a public API for a service that lets you post small bits of info and retrieve it using JavaScript.
For example, if you only need small posts, you can make a blog in HTML/JavaScript that uses Twitter as the engine. Of course, you will be limited to 140 characters. I am sure there are other services that allow a similar idea but with fewer restrictions.
And of course the best option - get blog software or host your blog with a service provider and link to it from your site.
Good luck
One solution would be to use an application that generates the static web pages of your blog and uploads them to your web server. This way you'd have a blog with static content that can be managed alongside your existing HTML/JavaScript site, without needing to install a database, daemon software, or additional dynamic web programming languages on your server. The static content generation could happen directly on your server if possible, or you could run the HTML generation tool locally and upload the output.
Movable Type has a tool like this. You still need somewhere to store the content of your blog, and for this Movable Type uses MySQL by default, so you'd still need to install a database somewhere, but the database could simply be one on your local desktop.
Movable Type also has support, via plugins or older versions, for retrieving data from SQLite or other databases. The advantage of SQLite is that it doesn't require installing a daemon like MySQL does; you can just put an SQLite file on disk somewhere, give Movable Type the path to the file, and run the script to generate your static content.
There are likely other tools like Movable Type, and I have in the past generated blog-like web pages simply by writing small scripts to generate HTML. The main issue is just that you need somewhere for these scripts to fetch data from.
Another option might be to develop your blog using XSLT: you'd put the content of your pages in XML files, and then write a template in XSL that converts the XML to HTML.
If you google for 'static blog site generation' you might find other ideas/options, including Jekyll/github mentioned in one of the other responses.

How to simulate JavaScript in a client C# application

I'm writing a web crawler (web spider) that crawls all the links on a website.
My application is a Win32 app, written in C# with .NET Framework 3.5.
Now I'm using HttpWebRequest and HttpWebResponse to communicate with the web server.
I also built my own HTTP parser that can parse anything I want.
It finds all links like "href", "src", "action", and so on, in the parsed page.
But I cannot solve one problem: simulating client script in the page (like JS and VBS).
For example, if a link like:
a href = "javascript:buildLink(1)"
... where buildLink(parameter) is a JavaScript function that builds a custom link based on the parameter.
Please help me solve this problem. How can I simulate JavaScript in this app? I can parse the HTML source code and extract all the JavaScript into another file, but how do I simulate a function from it?
Thanks.
Your only real option is to automate a browser. As other answers have said, you cannot reliably simulate browser javascript without having a complete DOM.
There are fortunately ways to automate the browser, check out Selenium.
It has a C# API, so you can control the browser from C#.
Use your .NET web crawler code to crawl the site. Whenever you encounter a href="javascript:... link, handle the page containing the link in Selenium:
Use the Selenium API to tell the browser to load the page.
Use the Selenium API to find all links on the page.
This way, your spider only uses Selenium when necessary (pages without JavaScript links can be handled by the browser-less spider code you already have). And since this is an embarrassingly parallel workload, you could easily have multiple Selenium processes running at the same time (either on one computer or on other computers).
But remember that href="javascript is hardly the only way a page can have dynamic links. The more common case is probably that an onload or $(document).ready() script manipulates the DOM and adds links that way.
To catch that case (and others), the spider probably will have to use Selenium for all pages that have a <script> tag.
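As a rough illustration of that hybrid flow (sketched here in Python for brevity; Selenium's C# bindings expose equivalent calls through IWebDriver, and the URL is a placeholder):
# Sketch: hand a page with script-generated links to a real browser via Selenium
# and collect the hrefs only after the page's JavaScript has run.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()                            # any Selenium-supported browser
driver.get("http://example.com/page-with-js-links")     # placeholder URL

# Every anchor the browser knows about after scripts have modified the DOM.
for anchor in driver.find_elements(By.TAG_NAME, "a"):
    href = anchor.get_attribute("href")
    if href:                                            # some anchors carry no href
        print(href)

driver.quit()
The rest of the crawl (fetching and parsing pages without script links) stays in your existing HttpWebRequest-based code.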
You are basically pretending to be a browser, except that HttpWebRequest only does the networking stuff for you.
I would recommend using the IE WebBrowser control and interop'ing with it from your C# application. That will allow you to run JavaScript, set variables, post, etc.
Here are some basic links I found after a search for "ie web browser control":
http://www.c-sharpcorner.com/UploadFile/mahesh/WebBrowserInCSMDB12022005001524AM/WebBrowserInCSMDB.aspx
http://support.microsoft.com/kb/313068
This is a problem which is not easily solved. You could consider taking one of the existing JavaScript implementations and porting or interfacing with it somehow.
If I were tackling this problem, I'd probably build a small side application in Java on top of Rhino, with some sort of RPC framework layered on top of that so that I could communicate with it from my primary application.
Unfortunately, without having a complete DOM implementation on top of that, you would be limited to only very simple javascript.
You could execute the javascript by using the MS JScript engine or something similar.
MSDN Reference
Eric Lippert's blog on using Eval (part 1) (part 2) (part 3)
This isn't guaranteed to work, especially if the javascript tries to access the DOM, or somesuch... But for simple scripts, it might be enough.
