Download complete website (including assets) for offline use - javascript

I'm thinking of writing a Cordova app which downloads websites so they can be read offline (like HTTrack for Windows). The main reason is the lack of a good offline RSS reader for Windows 10 tablets.
I know in general what I would have to do, but is there some framework which could simplify some of it?
So far I think I would need to do the following:
Download the HTML of a site.
Get a list of all assets (CSS, JS, images, videos).
Download those assets.
Replace asset URLs with new local ones.
The biggest problem is downloading the assets. It's not as straightforward as parsing the HTML for link, script, and img tags, since the CSS could have @imports and the JS could make Ajax calls.
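For the statically declared assets, a DOM parse covers the common cases. A minimal sketch, assuming the HTML has already been fetched as a string (the CSS @import and runtime-request cases would still need a second pass over the downloaded files):

// Collect the asset URLs declared directly in the markup.
function collectAssetUrls(html, baseUrl) {
    var doc = new DOMParser().parseFromString(html, "text/html");
    var urls = new Set();
    function add(value) {
        if (value) urls.add(new URL(value, baseUrl).href); // resolve relative URLs
    }
    doc.querySelectorAll("link[rel=stylesheet]").forEach(function (el) { add(el.getAttribute("href")); });
    doc.querySelectorAll("script[src]").forEach(function (el) { add(el.getAttribute("src")); });
    doc.querySelectorAll("img[src], source[src], video[src], audio[src]").forEach(function (el) { add(el.getAttribute("src")); });
    return Array.from(urls);
}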
Also, how do I decide which assets to download? I wouldn't want to waste time downloading ads.
Also there are some specific questions:
How should I display a downloaded page? My first thought is in an iframe, to prevent collisions.
Are there any legal problems? Especially if I were to publish the app?
How could I save the assets so they have a URL for including in the HTML?
Might it be better to use a server to do the heavy lifting (parsing, rewriting, fetching URLs, etc.)? Are there tools for this already?
Does anyone have any pointers? Or do you think it's impractical?

Check out https://archivebox.io; it's an open-source, self-hosted tool that creates a local, static, browsable HTML clone of websites (it saves HTML, JS, media files, PDFs, screenshots, static assets, and more).
It does most of what you want, including saving assets and media files with youtube-dl, wget, and headless Chrome.
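For reference, the basic command-line flow is short (assuming a current ArchiveBox install):

archivebox init                        # create a new archive collection in the current folder
archivebox add 'https://example.com'   # snapshot a URL: page, assets, media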

Related

Protect contents of Electron app from being stolen

I'm planning to create an Electron app. The app contains multiple mp3/png/svg files which are going to be rendered inside the app. I don't want anyone to open an audio or image file outside of the app. I researched a lot to find a solution to protect these files, so that someone who installs the app won't be able to open and use the files outside of my app, but it seems that it's not possible...
Note: a simple protection solution with minimal security is accepted too...
Note 2: Could we store the files (mp3/png/svg) without any extension (that way users cannot open them directly on their machine) and then open them via the Electron app? We have the name and extension stored in our app, so we can attach the extension to the preloaded file and open it. I mean, can we preload the audio file without an extension and then open the preloaded audio in mp3 format as a workaround?
I saw a clever "solution" for protecting multimedia assets back in the days of interactive CD-ROMs.
The trick was to write some junk text to the end of the .mp3 or video file in order to corrupt it and make it unplayable. Then, to play the file, the app would write a copy to a temporary directory with the junk text removed.
Not foolproof by any means, but it prevents casually copying assets out of the app bundle.
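A minimal sketch of that trick in an Electron main process (the JUNK marker, the file paths, and the build-time/runtime split are my assumptions, not the original app's):

var fs = require("fs");
var os = require("os");
var path = require("path");

// Arbitrary marker appended at build time to make the raw file unplayable.
var JUNK = Buffer.from("NOT_A_VALID_MP3_TAIL");

// Build step: corrupt the shipped asset so double-clicking it fails.
function corrupt(assetPath) {
    fs.appendFileSync(assetPath, JUNK);
}

// Runtime: strip the junk into a temp copy and hand that path to the player.
function restoreToTemp(assetPath) {
    var data = fs.readFileSync(assetPath);
    var clean = data.subarray(0, data.length - JUNK.length);
    var tmp = path.join(os.tmpdir(), path.basename(assetPath));
    fs.writeFileSync(tmp, clean);
    return tmp;
}

Note that the temp copy is a clean, playable file while it sits on disk, which is exactly the "not foolproof" part.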

Can a Chrome Extension Dynamically Add JS Files to Local Extension Directory? User Upload or Saved From the Web?

I'm wondering if it's possible for certain JS files to be added to the web extension directory later.
Say I have an app where users can select certain settings from within the app, and those files (JS and HTML files, images or blobs) are somehow added into the extension from the web, like some sort of on-demand updater without using any native apps. But it seems that upgrades are done by the app stores automatically.
I'm reading the files using Ajax and adding them to IndexedDB, but because it could be more than one file, that's getting messy.
Say a user wants a certain feature in the extension and there's an HTML page, JS files and images; then these get downloaded to a certain folder inside the installed extension.
function download() { // only saves to the downloads directory
    var imgurl = "https://www.google.com.hk/images/srpr/logo11w.png";
    console.log('download');
    chrome.downloads.download({ url: imgurl }, function (downloadId) {
        console.log("download begun, the downloadId is: " + downloadId);
    });
}
I also tried the Chrome download function above, but that only works for the downloads folder, not the extension folder.
Is there any way to make a custom updater? I know we can't save to disk, but is there any leniency or workaround for the extension folder? Even something silly like making a shell call to some DOS (and Linux/Mac) utility that saves the file to the extension folder. I can fetch the files, just not save them.
OK, so I'll put it as an answer. This is the solution I'm leaning on, which works for my scenario; I've listed some alternatives below.
Have the other files as separate extensions and give the user an install link instead, where they can install that extension. The child extensions then talk to the mother extension: they know the addresses of the resources in their own extension folders, so the mother gets just the file locations from the children and loads those assets from there. The child extensions are like bundles of the HTML and JS, with a background script which sends the addresses of these items to the mother (see the sketch below).
https://developer.chrome.com/extensions/messaging#external
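A minimal sketch of the handshake (the extension ID and message shape are hypothetical; the child's files also need to be listed under web_accessible_resources so the mother can actually load them):

// Child extension's background script: advertise where our assets live.
var MOTHER_ID = "aaaabbbbccccddddeeeeffffgggghhhh"; // hypothetical mother extension ID
chrome.runtime.sendMessage(MOTHER_ID, {
    feature: "lesson-pack",
    baseUrl: chrome.runtime.getURL("assets/") // absolute chrome-extension:// URL
});

// Mother extension's background script: collect the advertised locations.
chrome.runtime.onMessageExternal.addListener(function (msg, sender) {
    console.log("feature " + msg.feature + " is served from " + msg.baseUrl);
});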
The drawback is that I'll have to see how that affects the URLs. If I inject the HTML page from the child extension folder into the main interface using Ajax, I can't use relative URLs for any images in it, because those URLs would resolve relative to the mother extension folder. I'll have to rewrite the relative URLs in the child extension's HTML to absolute paths so the injected page still loads its images and JS from the child extension.
Pros:
Cleaner and more persistent than IndexedDB.
Files can be loaded normally from disk.
Cons:
The user has to install separate extensions.
The URL structure might be a bit confusing, and I need to rewrite URLs when loading HTML from a child. However, this only affects image src attributes and where the JavaScript is loaded from, so it's not such a big deal.
Other Possible Solutions:
IndexedDB, which I'm already using, seems to be the preferred way of doing this, but I really do not want to store every HTML asset in IndexedDB. The upside is that, while extensions need to be installed, this method can fetch and add files silently without user interaction, and IndexedDB seems to be somewhat persistent. I might still end up using this because it is silent, but having to load each asset from a database sounds like a nightmare.
The FileHandle API might have worked if I were working on Firefox only: https://wiki.mozilla.org/WebAPI/FileHandleAPI
I haven't tried the shell copy; maybe I could fetch with Ajax and then save to disk using some DOS function, with a different save routine for each operating system.
The FileSystem API only saves to the downloads folder and doesn't work for extensions anyway, so that's useless.
UPDATE
On Windows there isn't any sudo, but this worked without admin privileges for a subfolder (not on the C:\ root, though). It would work very nicely for a Linux-only app. If I just wanted to save a file to a Windows machine, this might work.
The shell copy method would be to grab the contents of the file with Ajax from the local or remote location and output it through a DOS command as a stream, saving it to a file on Windows, then do the same for every operating system with a shell exec command (or detect the OS and run the matching command). This way I could even put the files in the exact folder location.
Say I build this sort of command from the contents:
// To append, use >> instead of >
// A subfolder seems necessary; you can't save to the root without admin rights
echo the content I want to save > C:\folder\textfile.txt
I thought of calling it using shell exec, which only works in Node.js, so I dug through the other answers on
How to execute shell command in Javascript
// Full code to save a file using JavaScript on Windows (Windows Script Host / IE ActiveX only)
var shell = WScript.CreateObject("WScript.Shell");
shell.Run('cmd /c echo content to save > C:\\folder\\textfile.txt'); // cmd /c is needed: echo and > are cmd built-ins
That shell command doesn't seem to work in a browser, and I can't find where it's supposed to run. There doesn't seem to be a shell command in regular JavaScript on Windows; it appears to require IE ActiveX and doesn't work with Firefox or Chrome.
Extensions can't modify their sources because the browser verifies them and resets/disables the extension if they change. Also, in Firefox the extensions aren't even unpacked.
The solution is actually quite trivial: save the code in any storage (localStorage, chrome.storage.local, IndexedDB) as a string and then add it in your extension page as a standard DOM script element. You'll have to relax the standard CSP a bit for that.
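A minimal sketch of that approach (the function names are mine, and the blob: trick assumes the page's CSP has been relaxed to allow blob: script sources, which is the "relax the CSP a bit" part):

// Fetch the code once and keep it as a string in extension storage.
function fetchAndStore(url) {
    fetch(url)
        .then(function (res) { return res.text(); })
        .then(function (code) {
            var entry = {};
            entry[url] = code;
            chrome.storage.local.set(entry);
        });
}

// Later: pull the string back out and run it via a standard DOM script element.
function runStored(url) {
    chrome.storage.local.get(url, function (items) {
        var blob = new Blob([items[url]], { type: "text/javascript" });
        var script = document.createElement("script");
        script.src = URL.createObjectURL(blob); // the CSP must allow blob: scripts
        document.head.appendChild(script);
    });
}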

CORS issue with HTML5 canvas, javascript

I have two HTML5 widgets, both made with Phaser.js and containing images and audio, which are loaded on the fly by the Phaser library.
One of the widgets (an HTML5 file) works from the local file system without XAMPP, while the other only works when served through the XAMPP server.
I want to know why some HTML5 canvas files work without a server while most of the time we need a server for canvas files.
It's quite confusing to me. Please help.
There's a very good explanation of why you need a web server on the getting started page for Phaser.
What it boils down to is you need to use a web server because:
It's to do with the protocol used to access the files. When you request anything over the web you're using http, and the server level security is enough to ensure you can only access files you're meant to. But when you drag a file in it's loaded via the local file system (technically file://) and that is massively restricted, for obvious reasons. Under file:// there's no concept of domains, no server level security, just a raw file system.
...
Your game is going to need to load resources: images, audio files, JSON data, maybe other JavaScript files. And in order to do this it needs to run unhindered by the browser security shackles. It needs http:// access to the game files. And for that we need a web server.
Technically, none of your Phaser applications should run without a web server; it's quite odd that you got one of them to.
Set game.load.crossOrigin = true in your preload code and it should work.
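For example (a Phaser 2 style preload; the asset name and URL are placeholders), the flag has to be set before the asset is queued:

function preload() {
    game.load.crossOrigin = true; // request assets with CORS so the canvas isn't tainted
    game.load.image('logo', 'https://example.com/assets/logo.png');
}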

Is it possible to serve user created content offline in javascript?

My application serves user-created bundles of HTML pages for e-learning, also known as SCORM packages, and I'm trying to make that as fast as possible.
Loading page-by-page in iframes is quite slow, as pages may include high-resolution graphics, animations, audio, video and so on.
Unfortunately pre-loading these pages is quite difficult, as they usually react to onLoad() events to start animations and interactions.
Without using applets or extensions, would it be possible to download the user bundle and serve it "in-browser" to the application?
This is a common-enough task with the advent of fat clients built on Backbone.js, Angular, Ember, etc. Clients request data (usually JSON), media, etc. from the server as opposed to pre-rendered HTML, and do rendering and resource management client-side. If you want to go this way so that you can support a flexible offline mode the way you specified, you usually need a set of generic loaders and tools in your app cache manifest that will load the more specific (user-specific, lesson-specific, etc.) resources on page load.
The first time your user opens your app, it should be in online mode, and your app will need to request the specific resources it needs to work well offline and store them in client-side storage (localStorage, IndexedDB or its predecessor WebSQL, and the FileSystem API; there are many resources on the web on how to use each of these). This step can also be incremental, rather than one huge download of megabytes of data.
The next time your user opens your page, your app can attempt to load all the resources it needs from client-side storage before even calling the server. It will only need to call the server if it's missing some resources, if it needs a fresher version of a resource, or of course if it needs to write to the server. If you did a good job of loading everything it needs into client-side storage the first time, it can work decently in offline mode.
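A minimal cache-then-network sketch of that flow, using localStorage for small text resources (the URL keying is my choice; large media would go into IndexedDB or the FileSystem API instead):

function loadResource(url) {
    var cached = localStorage.getItem(url);
    if (cached !== null) {
        return Promise.resolve(cached); // offline: no server round-trip needed
    }
    return fetch(url)
        .then(function (res) { return res.text(); })
        .then(function (body) {
            localStorage.setItem(url, body); // available offline on the next visit
            return body;
        });
}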
If your users are running modern browsers, you could use the HTML5 cache manifest.
Creating a manifest file will get the browser to download and store the site locally, and the user may then even visit it offline:
http://en.wikipedia.org/wiki/Cache_manifest_in_HTML5
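A minimal manifest sketch (the file names are hypothetical; the manifest has to be served with the text/cache-manifest MIME type and referenced from the page as <html manifest="offline.appcache">):

CACHE MANIFEST
# v1 - files listed here are downloaded on the first visit and served locally afterwards
index.html
lesson.js
styles.css
media/intro.mp3

NETWORK:
*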

Loading CSS and JS files from another domain and serving resources from consistent URL

I have a website that has grown somewhat large and is built on a super-restrictive platform (SBI). There you have to follow their file structure, put everything in the appropriate folder, and then upload each and every file manually through their interface. I have a cool HTML5 template and some JavaScript with a lot of little files and images, so it was just way easier to upload all this stuff to my OTHER DOMAIN, hosted by HostGator, using FileZilla, and then point the CSS and JS references on my SBI site to their location on my HostGator domain.
Are there any potential issues with this method?
The reason I am asking is that yesterday I came across Google's article on serving resources from a consistent URL: https://developers.google.com/speed/docs/best-practices/payload#duplicate_resources However, I might be misunderstanding what it means. When I test my actual URL with Google's PageSpeed Insights (https://developers.google.com/speed/pagespeed/insights), it advises me to serve resources from a consistent URL, but in the details it doesn't complain about my CSS and JS files; it complains only about Facebook, like this:
Suggestions for this page:
The following resources have identical contents, but are served from different URLs. Serve these resources from a consistent URL to save 1 request(s) and 24.3 KiB.
http://static.ak.facebook.com/.../xd_arbiter.php?...
https://s-static.ak.facebook.com/.../xd_arbiter.php?...
I appreciate you reading this. Thanks in advance!
Serving static content from a different domain is common practice; I don't see any issues there. It's as safe and reliable as the server you are using to serve it.
The Facebook warning could mean you are loading the same FB API script twice, or it may just be some black magic done by the FB devs.
You should not have any problems with hosting your files on a different site. Your users may experience a slightly slower page load because their machines have to do more DNS lookups; on the other hand, most web browsers only download a maximum of two files from a host simultaneously, so doubling your hosts can double your simultaneous downloads. That warning about Facebook appears because the same script is being downloaded twice from two different places, which is not ideal, but I'm not familiar with the Facebook API, so I'm not sure if that can be helped.
