I'm working on an application that needs to download the source of a web page from a link, along with all the files it references internally, like images, CSS, and JavaScript.
Afterwards, I will need to open this HTML in a WebView in offline mode, which is why I need to download everything from the page.
I can download the images using Jsoup, but I have no idea how to link them into the downloaded HTML.
Could you give me some examples, or starting points on where to look, to get started?
Thanks in advance
Essentially, what you'll need to do (and what my app mentioned below does) is go over all the references (links) to additional assets / images / scripts and so on, download them, and then change the HTML document to point to the local downloaded copies. Something like this, with Jsoup:
Find all the img elements on the page,
Get the location / URL of the image file from the src attribute of the img elements (with .attr("abs:src")),
Download all of those images to a local directory
Change each of the image elements' src attribute values to point to the location of the downloaded image file, relative to where the main HTML file will be stored, e.g. with .attr("src", "assets/imagefilename.png").
Do this for all other assets required by the page, e.g. CSS, scripts, HTML5 video, and others. I also did some regex on the CSS (both linked and inline) to extract, download, and rewrite things like background-image references in the CSS. Webpages also have other linked things, like favicons or RSS feeds, which you might want too.
Save your Jsoup document (with the modified URLs pointing to your downloaded versions of the assets) by calling .toString() on it and writing the result to a file.
You can then open the local HTML file in a WebView, and, assuming you have done everything right, it will show with all images and assets, even offline.
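As a rough illustration of the image steps above, here's a minimal sketch; the PageSaver class, the "assets" directory, and the file-naming scheme are just assumptions for the example, and real code would need proper error handling and file-name sanitizing:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.io.File;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class PageSaver {

        public static void savePage(String pageUrl, File outputDir) throws Exception {
            // Fetch and parse the page
            Document doc = Jsoup.connect(pageUrl).get();

            // Local directory for the downloaded assets
            Path assetsDir = outputDir.toPath().resolve("assets");
            Files.createDirectories(assetsDir);

            int counter = 0;
            for (Element img : doc.select("img[src]")) {
                // "abs:src" resolves the src value against the page's base URL
                String imageUrl = img.attr("abs:src");
                if (imageUrl.isEmpty()) continue;

                // Derive a local file name (simplified; duplicates and
                // query strings need handling in real code)
                String fileName = (counter++) + "-"
                        + imageUrl.substring(imageUrl.lastIndexOf('/') + 1);
                Path localFile = assetsDir.resolve(fileName);

                // Download the image
                try (InputStream in = new URL(imageUrl).openStream()) {
                    Files.copy(in, localFile, StandardCopyOption.REPLACE_EXISTING);
                }

                // Point the element at the downloaded copy, relative to the HTML file
                img.attr("src", "assets/" + fileName);
            }

            // Serialize the modified document to disk
            Files.write(outputDir.toPath().resolve("index.html"),
                    doc.toString().getBytes(StandardCharsets.UTF_8));
        }
    }

The same pattern applies to the other assets; only the selector and attribute change (e.g. link[href] for stylesheets, script[src] for scripts).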
I actually wrote an Android app which does exactly this: save a complete HTML file and all of the CSS / images / other assets to a local file / directory, using Jsoup.
See https://github.com/JonasCz/SaveForOffline/ for the source, specifically SaveService.java for the actual HTML page saving / downloading code.
Beware that it's GPL licensed, so you have to comply with the GPL license if you use (parts of) it.
Also beware that it does a lot of things and is quite messy as a result (there are no comments or documentation either...), but it may help you.
You can do it with Jsoup. IMO, it's a lot of work. On the other hand, you can consider Crawler4j.
There is a tutorial on their website. Have a look at the example for crawling images.
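For reference, a condensed sketch along the lines of that image-crawling example might look like this (the ImageCrawler class, the /tmp paths, and the seed URL are placeholders, and the exact API varies a bit between Crawler4j versions):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.regex.Pattern;

    public class ImageCrawler extends WebCrawler {

        private static final Pattern IMAGE_EXTENSIONS =
                Pattern.compile(".*\\.(bmp|gif|jpe?g|png)$");

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            // Stay on one site; image URLs on that site pass this check too
            return url.getURL().startsWith("https://www.example.com/");
        }

        @Override
        public void visit(Page page) {
            String url = page.getWebURL().getURL();
            if (IMAGE_EXTENSIONS.matcher(url.toLowerCase()).matches()) {
                try {
                    Path dir = Paths.get("/tmp/images");
                    Files.createDirectories(dir);
                    // getContentData() holds the raw bytes of the response
                    String name = url.substring(url.lastIndexOf('/') + 1);
                    Files.write(dir.resolve(name), page.getContentData());
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawl");
            // Without this, Crawler4j skips binary content such as images
            config.setIncludeBinaryContentInCrawling(true);

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtServer robotstxtServer =
                    new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
            CrawlController controller =
                    new CrawlController(config, pageFetcher, robotstxtServer);

            controller.addSeed("https://www.example.com/");
            controller.start(ImageCrawler.class, 2 /* crawler threads */);
        }
    }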
I frequently use SharePoint 2010 Content Editor web parts to display custom HTML. Within my HTML files I also link to external CSS and JavaScript files. All of these files are stored in document libraries, organized by folders. Each folder contains a single HTML, CSS, and JS file.
When I edit these files, I use offline copies saved on my desktop, and then I upload them to SharePoint and overwrite the previous versions.
The issue I have relates to the paths in the HTML file for the CSS and JS files. When I edit them offline, I use only the filename, since they are stored in the same folder on my desktop. When I upload them to SharePoint, the path no longer works, even though the CSS and JS are in the same folder. The only way I have been able to make it work is to change the path to the full path of each CSS and JS file, i.e. "https://SharePointSite.com/Full_Path_to_JS_and_Css".
I would like to reference a path to the file on SharePoint without having to use the full path.
Any assistance would be greatly appreciated. Thank you
Take the full path and replace each directory with ../. You might have to do this a bunch of times; SharePoint paths can be pretty deep.
So
https://SharePointSite.com/dir1/dira/Full_Path_to_JS_and_Css
would become
../../Full_Path_to_JS_and_Css
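If you'd rather compute the number of ../ hops than count them by hand, java.nio.file.Path can illustrate the idea; the paths below are hypothetical:

    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class RelativePathDemo {
        public static void main(String[] args) {
            // Hypothetical server-relative locations within the same site:
            // the folder containing the HTML page, and the asset it links to
            Path pageFolder = Paths.get("/dir1/dira");
            Path asset = Paths.get("/Full_Path_to_JS_and_Css");

            // relativize() inserts one ".." per directory level to climb
            System.out.println(pageFolder.relativize(asset));
            // prints: ../../Full_Path_to_JS_and_Css ("\" separators on Windows)
        }
    }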
I'm trying to copy a whole HTML page, but the CSS, image, and JavaScript files are external. If there were only a few of them I could copy them manually, but what if there are many? The links in the HTML page refer to those files as local. Is there a way I can copy all of the files exactly as they are referenced in the HTML page? Is there a tool for that? I can't do it in the Chrome console.
You can save a website in the MHTML format (short for MIME Encapsulation of Aggregate HTML Documents), which is an HTML document along with its assets, like styles or images, in one single document.
Some browsers do support that format (e.g. Chrome's “Save complete website”); for other clients you'll need to install a plugin.
See: https://en.wikipedia.org/wiki/MHTML
The build process for my CSS takes a really long time. I want to make an extension that lets me edit my stylesheets in the browser and persist them. It will show the diff between the edited version and the original version. I can copy the diff and apply it to my source CSS (Sass).
I think I will have to download the CSS, let the user modify it in their editor of choice, and redirect requests for the original CSS to the modified CSS. I got the CSS to save using chrome.downloads.download, but it's saved in the downloads folder. I don't think the extension can access the downloads folder. One way is to have the user manually save the file in the extension folder, but that's too troublesome.
Is there a way to let the Chrome extension show the edited version of a file?
I want to be able to show PDF files within my Chrome app using PDF.js but the documentation is non-existent. I've been unable to find any simple examples or tutorials that show the code to load a PDF from a relative URL, show the page, and navigate through the PDF. They have very complex examples where 95% of the code does other things and it's very difficult to parse these and find the relevant functions. I would like to:
Include the relevant code in my app (is this the "pdf.js" created by "node make generic" and nothing else, or do I need to include other JS files as well?)
Be able to show PDF files that are inside my myapp.crx file
Does pdf.js require "LocalStorage"? Will localStorage continue to be allowed in Chrome extensions/apps or is it deprecated?
Can someone tell me if #2 is possible and how to find some example code or documentation on the proper classes/functions to call and files to include/build?
node make generic outputs to the build/generic directory. This directory contains two subdirectories, "build" and "web".
"build" contains "pdf.js", which is the actual PDF engine.
"web" contains a viewer, similar to the one at http://mozilla.github.io/pdf.js/web/viewer.html.
After copying both of those directories to your app, you should be able to load the PDF file via chrome-extension://<your app ID>/web/viewer.html?file=path%2Fto%2Ffile.pdf
PDF.js does not require localStorage. It's used, if available, for persisting settings such as the scroll position, but if unavailable, PDF.js just continues to work without it.
There is one significant issue though: PDF.js loads the localization files using synchronous XMLHttpRequest, which is not allowed in a Chrome app. You could solve this issue by serializing all of the locale files, putting them into a single JavaScript file, loading that in viewer.html, and simplifying l10n.js to read the translations from the file I just described.
Just to clarify: normally you should be able to access a file baked into your CRX by providing a relative or absolute path to it within the CRX's internal directory structure, e.g.:
'myfiles/pdfs/example.pdf'
With PDF.js, I guess that's what the "path%2Fto%2Ffile.pdf" part should be in Rob's answer above, URL-encoded.
Using JavaScript, I want to detect how many JavaScript, CSS, and image files are being used by the site. On top of that, I want to get the file size of each asset. Is this possible, and if so, how would you do it?
As to detecting the number of external JavaScript, CSS, and image files used on a website: it's doable by getting all the script, image, and style tags and then checking whether they're external by comparing the document's domain to each asset's domain.
As to checking external resource sizes with JavaScript, that could be hard. Even though you could try, for example, refetching the script and CSS files using Ajax to get the contents of each file (to estimate its size), there would be problems estimating image sizes. I guess the HTML5 File API or canvas may be used to get some idea of how large an image is.