I'm facing an interesting problem right now: I want to modify a PDF file generated on the server (using TCPDF and Symfony2) on the client before displaying it.
Why? The PDF will contain some semi-sensitive information and our customers would probably be happy to hear that this info never leaves their machines. (I'm aware that HTTPS is considered pretty much safe enough, this is more of a luxury issue to assure our customers that their data is safe.)
This is how far I've got: generate the PDF with placeholders, send it to the client as a string via AJAX, replace the placeholders with the local data, and Base64-encode the result. The real issue is getting the file to the user.
A lot of people recommend using the HTML5 "download" attribute and programmatically clicking a hidden link, but that method doesn't work in Safari or older versions of IE. Data URIs are another option, but beyond a certain size the Base64 string simply gets too long and the browser freezes trying to display it in the address bar. I've also looked at libraries like Downloadify.js, FileSaver.js, etc., but none of these seem widely supported (or they rely on Flash, which I'd love to avoid).
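For concreteness, the client side I'm experimenting with looks roughly like this (a minimal sketch; the endpoint, the placeholder token, and the filename are all made up):

// Minimal sketch: fetch the PDF (containing placeholders) as text,
// patch it locally, and hand the result to the browser as a Blob.
// '/generate-pdf', '%%NAME%%' and 'report.pdf' are invented examples.
// Caveats: the PDF bytes must survive the text round-trip (e.g. by
// base64-encoding on the server and decoding with atob() here), and
// the replacement must be exactly as long as the placeholder, or the
// PDF's cross-reference offsets no longer match.
var xhr = new XMLHttpRequest();
xhr.open('GET', '/generate-pdf', true);
xhr.onload = function () {
  var patched = xhr.responseText.replace(/%%NAME%%/g, 'Jane Doe'); // same length as the token

  var blob = new Blob([patched], { type: 'application/pdf' });

  if (navigator.msSaveBlob) {
    navigator.msSaveBlob(blob, 'report.pdf');   // IE 10+
  } else {
    var a = document.createElement('a');
    a.href = URL.createObjectURL(blob);
    a.download = 'report.pdf';                  // ignored by Safari, as noted
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);
  }
};
xhr.send();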
I'm very open to suggestions as to what else I could try, or even someone telling me why what I'm trying to do is wholly unnecessary.
Related
Look, this might be a dupe question, and apologies if it is... but honestly, everything I've found on the subject seems to be from 2007, or calls out special caveats for IE6 and the like.
The setup:
A web page using math markup and MathJax to render the math (working fine).
The user(s) need to be able to export this to some sort of document (Word, PDF, etc.) for distribution to proof-readers who are not permitted (or desired) to be "in the system" where the pages are served.
The issue:
Everything I've tried thus far to get the rendered final product out to some sort of document (OTHER than a user-initiated browser print) shows the unrendered markup and not the final product.
This is obviously due to the way the MathJax library renders the page in the browser once it's fully loaded, since it's just a JS script inclusion. No surprises there.
I can get close by making an AJAX call to a page that renders, and sending that whole blob of HTML to a third page that writes it out to disk and re-serves it with MIME and Content-Disposition headers for MS Word, saving it to disk, etc. But the rendering is not correct, presumably due to how it gets packaged up in the POST call. And that's a lot of steps to end up with a not-quite-right solution, anyway.
I'm guessing the answer is going to be "you can't do that", at least not without using one of the HUGE installs of TeX Live or MiKTeX, etc., and doing it in the back end with shell calls... but I don't have the ability to install on these hosts anyway.
Am I stuck with users doing a print-to-PDF solution? Is there something I'm missing?
Thanks, happy to flesh out where needed, but I can't be the first trying to do this.
For PDF there are a couple of options and it mostly depends on how much work you want to put in.
The quick and dirty solution might be wkhtmltopdf, but you'll have to specify a wait time for the JavaScript rendering to finish (e.g., via its --javascript-delay option) -- not ideal.
PhantomJS requires slightly more work but allows you to listen in on the page; e.g., this discussion links to a simple example. (There are actually lots of PhantomJS-based tools out there.)
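For instance, a rough sketch of that approach (the URL and output name are placeholders, and it assumes MathJax v2 is already loaded by the time the open callback fires):

// phantomjs render-math.js
// Load the page, ask MathJax to flip a flag once its typesetting
// queue drains, then render the finished page to PDF.
var page = require('webpage').create();

page.open('http://example.com/math-page.html', function (status) {
  if (status !== 'success') { phantom.exit(1); }

  page.evaluate(function () {
    window.__mathJaxDone = false;
    MathJax.Hub.Queue(function () { window.__mathJaxDone = true; });
  });

  var timer = setInterval(function () {
    if (page.evaluate(function () { return window.__mathJaxDone; })) {
      clearInterval(timer);
      page.render('math-page.pdf');   // PhantomJS picks PDF from the extension
      phantom.exit();
    }
  }, 250);
});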
Another way would be to first pre-process using MathJax-node and then pass the result to wkhtmltopdf (then you don't have to wait for MathJax).
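A sketch of that pre-processing step with the mathjax-node package (a single example expression; in practice you would run every expression on the page through it, or use one of the page-level wrappers built on top of it):

// Node.js: typeset one TeX expression to static HTML + CSS ahead of
// time, so wkhtmltopdf never has to execute MathJax itself.
var mjAPI = require('mathjax-node');
mjAPI.start();

mjAPI.typeset({
  math: 'E = mc^2',   // example input
  format: 'TeX',
  html: true,         // request an HTML rendering
  css: true           // plus the CSS needed to display it
}, function (data) {
  if (data.errors) {
    console.error(data.errors);
    process.exit(1);
  }
  console.log('<style>' + data.css + '</style>');
  console.log(data.html);
});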
For doc/docx I don't think there is any way right now. The natural route would be to use MathJax-node to generate MathML, since Word can import MathML, but Word does not seem to support MathML when it's imported from HTML. The same holds for generating SVG with MathJax-node (and with SVG you would lose the ability to edit the equations, so that might be prohibitive anyway).
Pandoc might eventually help. It can apparently convert mathematics to the MS Office format (see demo #30), but from a quick test this doesn't seem to work for HTML input right now.
If you are considering commercial solutions, have a look at pdfChip from callas software (warning: I'm heavily affiliated with this solution).
It does HTML to PDF and will actually convert MathML, using MathJax, into a proper PDF file (which can even be a PDF/X or PDF/A file if you so desire). I'll be happy to provide more details offline.
I'm wondering if, with the new File API exposed in Chrome (I'm not concerned with cross-browser support at this time), it would be possible to write back to files opened via a file input.
You can see an example of what I'm trying to accomplish here: http://www.grehz.com/ide.
I know I can use server side scripts to dynamically create the files and allow the user to download them normally. I'm hoping that there's a way to accomplish this purely client side. I had read somewhere that you can write to files opened via a file input. I haven't been able to find any examples of this, though I have seen passing references to a FileWriter class.
I would be completely unsurprised if this weren't possible, though (it seems likely that there are security issues with it). Just looking for some guidance or resources.
UPDATE:
I was reading here: http://dev.w3.org/2009/dap/file-system/file-writer.html
As I was playing around in Chrome, it looks like FileSaver and FileWriter are not implemented, but BlobBuilder is. I can call getBlob() on the BB object, is there any way I can then save that without FileSaver or FileWriter?
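The closest I've come is handing the user an object URL (assuming a Chrome build that already has createObjectURL; the textarea id and the MIME type are just my test setup):

// Build a Blob from the editor contents and navigate to an object URL.
// 'application/octet-stream' makes Chrome download it instead of
// displaying it, but there's no way to suggest a filename this way.
var BB = window.BlobBuilder || window.WebKitBlobBuilder;
var bb = new BB();
bb.append(document.getElementById('editor').value);  // my textarea

var blob = bb.getBlob('application/octet-stream');
window.location.href = (window.URL || window.webkitURL).createObjectURL(blob);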
UPDATE2:
I found this issue in the Chromium project: http://code.google.com/p/chromium/issues/detail?id=65615&q=FileSaver&colspec=ID%20Stars%20Pri%20Area%20Feature%20Type%20Status%20Summary%20Modified%20Owner%20Mstone%20OS
So it's clear that it hasn't been implemented in any version yet (there's no mention of FileWriter, however, although I believe FileWriter depends on FileSaver).
Moving away from that, I'm considering a server-side solution. When a user clicks save, the contents of the textarea are posted to a script that sends them straight back with an appropriate MIME type and a Content-Disposition header, so the browser offers them for download. This solution is fine for a "save as", but it's a little clunky as a general-purpose save button. Any other suggestions?
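For concreteness, the client half of what I have in mind (the /save endpoint is hypothetical; it would just echo the body back with a Content-Disposition: attachment header):

// Build and submit a throwaway form; the server echoes the content
// back with Content-Disposition: attachment, so the browser shows a
// download dialog. '/save' and 'editor' are placeholders.
function saveAs() {
  var form = document.createElement('form');
  form.method = 'POST';
  form.action = '/save';

  var field = document.createElement('input');
  field.type = 'hidden';
  field.name = 'content';
  field.value = document.getElementById('editor').value;

  form.appendChild(field);
  document.body.appendChild(form);
  form.submit();                      // the response triggers the download
  document.body.removeChild(form);
}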
From:
http://code.google.com/p/chromium/issues/detail?id=58985#c7
FileSystem is really the right place to store big files (which is what it sounds like you're doing) and is available in Chrome 9. I suggest you look at these alternatives.
Note the not-extensions label at the top left. It sounds like this may just be for Chromium OS. I emailed Jeremy, the developer who made this comment, for clarification.
Update:
Jeremy replied that extensions actually will get access to the File API, including writes, but that it will be confined to a sandbox. He also linked to some not-yet-deployed docs on the matter:
http://code.google.com/p/html5rocks/source/browse/www.html5rocks.com/content/tutorials/file/filesystem/index.html?spec=svn1cbb2aab2d6954a56f3067d2d3b9e997215be441&r=1cbb2aab2d6954a56f3067d2d3b9e997215be441
No way that I know of to save until those apis are implemented - which may be some time off.
On a webpage, is it possible to split large files into chunks before the file is uploaded to the server? For example, split a 10MB file into 1MB chunks, and upload one chunk at a time while showing a progress bar?
It sounds like JavaScript doesn't have any file manipulation abilities, but what about Flash and Java applets?
This would need to work in IE6+, Firefox and Chrome. Update: forgot to mention that (a) we are using Grails and (b) this needs to run over https.
You can try Plupload. It can be configured to check whatever runtimes are available on the user's side (Flash, Silverlight, HTML5, Gears, etc.) and use the first one that satisfies the required features. Among other things it supports image resizing (on the user's side, preserving EXIF data!), stream and multipart upload, and chunking. Files can be chunked on the user's side and sent to a server-side handler chunk by chunk (this requires some additional care on the server), so that, for example, big files can be uploaded to a server whose max filesize limit is set much lower than their size. And more.
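For example, a minimal chunked setup looks roughly like this (paths, element IDs, and sizes are placeholders):

// Runtimes are tried left to right until one supports the features
// you asked for; with chunk_size set, each chunk arrives at the
// server as its own request carrying "chunk" and "chunks" fields.
var uploader = new plupload.Uploader({
  runtimes: 'html5,gears,flash,silverlight',
  browse_button: 'pick-files',          // id of your "browse" element
  url: '/upload',                       // your server-side chunk handler
  chunk_size: '1mb',
  max_file_size: '100mb',
  flash_swf_url: '/js/plupload.flash.swf',
  silverlight_xap_url: '/js/plupload.silverlight.xap'
});

uploader.init();

// Drive the progress bar; call uploader.start() (e.g. from an
// "Upload" button) to kick the queue off.
uploader.bind('UploadProgress', function (up, file) {
  document.getElementById('progress').textContent = file.percent + '%';
});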
Some runtimes support HTTPS, I believe; some need testing. Anyway, the developers over there are quite responsive these days, so you might at least give it a try ;)
The only option I know of that would allow this would be a signed Java applet.
Unsigned applets and Flash movies have no filesystem access, so they wouldn't be able to read the file data. Flash is able to upload files, but most of that is handled by the built-in Flash implementation and from what I remember the file contents would never be exposed to your code.
There is no JavaScript solution for that selection of browsers. There is the File API but whilst it works in newer Firefox and Chrome versions it's not going to happen in IE (no sign of it in IE9 betas yet either).
In any case, reading the file locally and uploading it via XMLHttpRequest is inefficient, because XMLHttpRequest does not have the ability to send pure binary, only Unicode text. You can encode the binary into text using base-64 (or, if you are really dedicated, a custom 7-bit encoding of your own), but this will be less efficient than a normal file upload.
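For illustration, this is roughly the kind of thing you'd be stuck doing in the browsers that do have the File API (the /upload endpoint is hypothetical, and the payload grows by about a third):

// Read the chosen file locally, then upload it as base-64 text,
// since XMLHttpRequest can only send text. The server has to decode
// it again at the other end.
var input = document.getElementById('file');   // <input type="file" id="file">
var reader = new FileReader();

reader.onload = function () {
  var base64 = reader.result.split(',')[1];    // strip the "data:...;base64," prefix

  var xhr = new XMLHttpRequest();
  xhr.open('POST', '/upload', true);
  xhr.setRequestHeader('Content-Type', 'text/plain');
  xhr.send(base64);
};

reader.readAsDataURL(input.files[0]);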
You can certainly do uploads with Flash (see SWFUpload et al), or even Java if you must (Jumploader... I wouldn't bother, these days, though, as Flash prevalence is very high and the Java plugin continues to decline). You won't necessarily get the low-level control to split into chunks, but do you really need that? What for?
Another possible approach is to use a standard HTML file upload field, and when the submit occurs, set an interval call to poll the server with XMLHttpRequest, asking how far the upload has come along. This requires a bit of work on the server end to store the current upload progress in the session or database, so another request can read it. It also means using a form-parsing library that gives you a progress callback, which most languages' standard built-in parsers (like PHP's) don't.
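The client half of that is simple enough; a sketch (the /upload-progress endpoint and its plain-number reply are things you'd have to implement on the server):

// While the normal <form> upload is in flight (e.g. targeted at a
// hidden iframe), poll the server for the progress your form parser
// recorded in the session.
function pollProgress(uploadId) {
  var timer = setInterval(function () {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', '/upload-progress?id=' + encodeURIComponent(uploadId), true);
    xhr.onreadystatechange = function () {
      if (xhr.readyState !== 4) { return; }
      var pct = parseInt(xhr.responseText, 10);        // server replies "0".."100"
      document.getElementById('bar').style.width = pct + '%';
      if (pct >= 100) { clearInterval(timer); }
    };
    xhr.send();
  }, 1000);
}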
Whatever you do, take a ‘progressive enhancement’ approach, allowing browsers with no support to fall back to a plain HTML upload. Browsers do typically have an upload progress bar for HTML file uploads, it just tends to be small and easily missed.
Do you specifically need it to be in X chunks? Or are you trying to solve the problems caused by uploading large files (e.g., the client can't restart an upload, or the server crashes when the entire file is uploaded and held in memory all at once)?
Search for streaming upload components. Which component you'll prefer depends on the technologies you're working with (JSP, ASP.NET, etc.).
http://krystalware.com/Products/SlickUpload/ is a server-side product.
Here are some more pointers to various uploaders: http://weblogs.asp.net/jgalloway/archive/2008/01/08/large-file-uploads-in-asp-net.aspx
Some try to manage memory on the server (e.g., so the entire huge file isn't in memory at one time); some try to manage the client-side experience.
The project is a servlet to which people can upload files via, at present, HTTP POST. This is accompanied by Web page(s) providing a front-end to trigger the upload. We have more or less complete control over the servlet, and the Web pages, but don't want to impose any restrictions on the client beyond being a reasonably modern browser with Javascript. No Java applets etc.
Files may potentially be large, and a possible use case is mobile devices on less reliable networks. Some people on the project are demanding the ability to resume an upload if the network connection goes down. I don't think this is possible with plain HTTP and Javascript in a browser, but I'd love to be proved wrong.
Any suggestions?
Not with plain ol' JS. It doesn't have access to the file system, not even to a file added to an input type=file control, so it cannot manipulate the data and upload it via XHR instead.
You would have to look into a Flash or Java based alternative.
With your current restrictions, no.
(There may be a tiny chance that the HTML5 File API is capable of doing this. Maybe someone more knowledgeable can comment, because I usually can't make heads or tails of the W3C's technical specifications: http://www.w3.org/TR/file-upload/ )
Firefox 3.6 implements a FileReader interface, but it doesn't seem to support any form of skipping ahead. Therefore, you would need to read the whole file and split it yourself wherever you need to resume.
This would not be especially useful for large files, since you would probably crash the browser anyway because of the memory allocation it would require.
https://developer.mozilla.org/en/DOM/FileReader
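If you want to experiment anyway, something along these lines might work (entirely untested; /upload-chunk is a hypothetical endpoint, and the whole file still sits in memory, which is exactly the allocation problem described above):

// Read the whole file as a binary string, then re-send it in fixed
// slices starting from the offset where the previous attempt died.
// XHR can't send raw binary (see the answer above), so each slice is
// base-64 encoded and the server decodes and reassembles by offset.
var CHUNK = 1024 * 1024;  // 1 MB

function resumeUpload(file, startOffset) {
  var reader = new FileReader();
  reader.onload = function () {
    var data = reader.result;  // binary string, entire file in memory
    for (var offset = startOffset; offset < data.length; offset += CHUNK) {
      var xhr = new XMLHttpRequest();
      xhr.open('POST', '/upload-chunk?offset=' + offset, false);  // synchronous, to keep ordering simple
      xhr.send(btoa(data.substring(offset, offset + CHUNK)));
    }
  };
  reader.readAsBinaryString(file);
}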
For completely non-nefarious purposes (machine learning, specifically), I'd like to download a huge dataset of CAPTCHA images. However, CAPTCHAs are always implemented using some obfuscated JavaScript that makes getting at the actual images without a browser a non-trivial task, at least to me, a JavaScript novice.
So, can anyone give me some helpful pointers on how to download the image of the obscured word using a script completely outside of a browser? And please don't point me to a dataset of already collected obscured words - I need to collect the images from a specific website for this particular experiment.
Thanks!
Edit: Another way to ask this question is very simple. When you click "view source" on a website with complicated JavaScript, you see the script references, but that's all you see. However, if you click "Save Page As..." (in Firefox) and then view the source of the saved page, the JavaScript has been resolved, and the new HTML and the images (at least in the case of ASIRRA and reCAPTCHA) are in the source. How can I mimic this "Save Page As..." behavior from a script? This is an important web-coding question in general, so please stop questioning my motives! This is knowledge I can use from now on in all web development involving scripting, and I'm sure other Stack Overflow visitors can as well!
While waiting for an answer here I kept digging and eventually figured out a sort of hacked way of getting done what I wanted.
First off, the reason this is a somewhat complicated problem (at least to a JavaScript novice like me) is that the images on ASIRRA are loaded into the page via JavaScript, which is a client-side technology. This is a problem when you download the page with something like wget or curl, because they don't actually run the JavaScript, they just download the source HTML. Therefore, you don't get the images.
However, I realized that Firefox's "Save Page As..." did exactly what I needed: it ran the JavaScript, which loaded the images, and then saved everything into the well-known directory structure on my hard drive. That's exactly what I wanted to automate. So... I found a Firefox add-on called "iMacros" and wrote this macro:
VERSION BUILD=6240709 RECORDER=FX
TAB T=1
URL GOTO=http://www.asirra.com/examples/ExampleService.html
SAVEAS TYPE=CPL FOLDER=C:\Cat-Dog\Downloads FILE=*
Set to loop 10,000 times, it worked perfectly. In fact, since it was always saving to the same folder, duplicate images were overwritten (which is what I wanted).
Why not just set up a CAPTCHA yourself and generate the images? reCAPTCHA's free, too.
http://www.captcha.net/
Update: I see you want it from a specific site but if you get your own you can tweak it to give the same kind of images as the site you're targeting.
Get in contact with the people who run the site and ask for the dataset. If you try to download many images in any suspicious way, you'll end up on their kill list rather quickly which means that you won't get anything from them anymore.
CAPTCHAs are meant to protect people against abuse and what you do will look like abuse from their point of view.