View XML from PDF in Embedded - javascript

I'm working on a browser extension that needs to read the data in a pdf file that pops up.
When the popup comes up and I go to inspect, I only find the following information:
<embed id="plugin" type="application/x-google-chrome-pdf" src="https://thisisnottherealurl-soignorethis part......../something.aspx" stream-url="chrome-extension://xxxxxx/xxxx" headers="cache-control: no-cache, no-store,must-revalidate
content-type: application/pdf
date: Wed, 03 Mar 1999 15:31:26 GMT
expires: -1
pragma: no-cache
server: Microsoft-IIS/10.0
x-aspnet-version: 4.0.30319
x-powered-by: ASP.NET
" background-color="0xFF525659" top-toolbar-height="0" javascript="allow" full-frame="" pdf-viewer-update-enabled="">
I know for a fact that the information is in XML format, and I am certain that it is found in the embed tag. I can view it by changing the settings to 'save' the file rather than view it. What I cannot find, in either the Network information or the Source, is where that information lives or how I can have the browser extension read through it for me.

For anyone else interested in this method, I found some interesting workarounds.
Apparently the PDF viewer is created as a dynamic extension that uses Chrome APIs inside the browser, and that extension appears to run the code that renders the PDF.
This makes it somewhat more difficult than usual to get a look at the network traffic and the processes.
One workaround, aside from the above comment, is that the PDF document can be selected and cut/pasted into the clipboard, or even into a variable.
After some testing, I found that my browser extension does work in the new PDF window, so I was able to extract the information that way.
This isn't exactly what I had been looking for, but I found it to be quite interesting and thought someone else could use it.
Remember to take into account asynchronous running of the code.
The code for select/copy that I generally use is:
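// Select the whole document, read the selected text, then clear the selection.
const sel = window.getSelection();
const range = document.createRange();
range.selectNodeContents(document.documentElement);
sel.removeAllRanges();
sel.addRange(range);
const textStuff = sel.toString(); // addRange() returns undefined, so read the text from the selection
sel.removeAllRanges();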
The problem, however, is that the PDF document may actually be embedded in the CSS, which defeats the usual method of copying from the DOM.
If the copy/paste doesn't work for you, I also found a method of simulating copy/paste at:
How to implement ctrl click behavior to copy text from an embedded pdf in a webapp?

Related

How get raw response body inside a Web Extension for Firefox 55?

I am trying to get the raw response body inside a Web Extension using Firefox 55.0.3.
The only "solutions" I have seen so far:
Repeat the request (I absolutely don't want to repeat the request)
Use JavaScript to get the innerHTML of HTML tags such as head and body (tell me if I'm wrong, but with a solution like that I will not always have the whole content; for example, I will get nothing for a response without HTML. So it will never be the real raw response, and in some cases it will simply not work.)
Also, I saw this answer for Chrome (from 2015) using the debugger, but I wasn't able to do the same with Firefox. This kind of solution is interesting; I read the Mozilla documentation about devtools, but I didn't find a way to use the Network tab of the developer tools from JavaScript inside a Web Extension.
To give you more details, my goal is to intercept the full request and response from the server (headers and body). This is not a problem, except for the response body.
Here is an example of code to get the request body (background script):
browser.webRequest.onBeforeRequest.addListener(
function (e) {
console.log(e);
},
{urls: ["http://*/*", "https://*/*"]},
["requestBody"]
)
Here is some documentation that I used (there is more, but these links are all official):
Mozilla documentation about Web Extension
Intercept HTTP requests
webRequest
webRequest.onHeadersReceived
webRequest.onBeforeRequest
webRequest.onBeforeSendHeaders
Here are some examples of Web Extensions.
Any ideas, solutions or even explanations of "why this is not possible" are welcome. Thank you in advance for your time!
Cheers++
This is now available, as of Firefox 57:
browser.webRequest.filterResponseData allows you to add a listener via browser.webRequest.onBeforeRequest which receives, and allows you to modify the response.
You can see an example in the Mozilla github webextensions-examples repo
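As a rough sketch of how this might look in a background script: the response is passed through unchanged while a copy of each chunk is kept. This assumes the manifest requests the webRequest and webRequestBlocking permissions plus matching host permissions. The helper name captureResponseBody and its callback are made up for illustration and are not part of the API.

```javascript
// Hypothetical helper: passes the response through untouched and hands
// the complete raw body (as a Uint8Array) to onBody when the stream ends.
function captureResponseBody(webRequest, requestId, onBody) {
  const filter = webRequest.filterResponseData(requestId);
  const chunks = [];

  filter.ondata = (event) => {
    // event.data is an ArrayBuffer; keep a copy, then forward the original.
    chunks.push(new Uint8Array(event.data).slice());
    filter.write(event.data);
  };

  filter.onstop = () => {
    filter.close();
    // Concatenate all chunks into one buffer.
    const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
    const body = new Uint8Array(total);
    let offset = 0;
    for (const chunk of chunks) {
      body.set(chunk, offset);
      offset += chunk.length;
    }
    onBody(body);
  };

  return filter;
}

// Only register the listener when running inside a WebExtension.
if (typeof browser !== "undefined" && browser.webRequest) {
  browser.webRequest.onBeforeRequest.addListener(
    (details) => {
      captureResponseBody(browser.webRequest, details.requestId, (body) => {
        console.log(details.url, "response bytes:", body.length);
      });
    },
    { urls: ["http://*/*", "https://*/*"] },
    ["blocking"]
  );
}
```

The body arrives as raw bytes; decoding it to text (for example with TextDecoder) depends on the charset the server actually used.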
Firefox 57 is going to provide the API browser.webRequest.filterResponseData. This doesn't seem to be documented yet, but you can look through bug 1255894 for details.
Why is this not possible?
For the simple reason that WebRequest was ported over from Chrome extensions, where this is explicitly impossible.
Requests for such functionality (to edit, or just to read) have been around for a very long time (since 2011 and 2015 respectively). They are challenging from both a security and a technical perspective; however, there is agreement in principle that read access is a good idea.
However, it's simply not yet implemented. Rob W has been doing some work in this direction but it's not done yet.
Perhaps Firefox has a different implementation?
A cursory glance at the Mozilla bug tracker doesn't find any bugs on providing this functionality. So, it's not likely that the implementation will diverge anytime soon.
Any workarounds?
Well, only the debugger-level access can touch actual response data.
Since debugger is not implemented in the WebExtension platform, only a devtools.network-using extension can access it - and only while Dev Tools are open for the tab making said request, which is the main limitation of devtools.* APIs.

Website is being redirected with document referrer and it's ruining search engine results

If you search on Google 'new york state beach cleanup', you'll see that the first result is for the website http://najomawi.com, but the title doesn't look quite right for such a site. You'll also notice that if you click this link it instead takes you to a website for Nike shoes. It only happens if you use the Google results link though (and I believe it happens in Bing, Yahoo and others). If you put http://najomawi.com directly into your browser bar, it takes you to the correct site. Confused, I checked the page source code (both with 'View Page Source' and Chrome's inspector) and found this...
<script>
var s=document.referrer;
if(s.indexOf("google")>0 || s.indexOf("bing")>0 || s.indexOf("aol")>0 || s.indexOf("yahoo")>0)
{
self.location='http://www.theredkicks.com';
}
</script>
I have no idea how this got there. It appears in the head tags of the home page, which is index.html. There is no PHP code, no other JS, nothing other than CSS stylesheets that I am aware of. The entire site is pretty much static HTML and CSS sheets. So how did this get there? And how can I get rid of it?
The JavaScript code is very simple. It just checks if document.referrer contains the name of the most relevant search engines and, if so, redirects the load to another page, in this case, http://www.theredkicks.com.
Your site certainly was hacked somehow or your host provider is not very honest.
Notice that there's nothing attached to the query string in this redirect, so this is not an "affiliate" (wrong) way to make money. The only person that is gaining something with this is the redirect target.
Also, it's very interesting that your page is apparently being processed through ASP. That is strange, given that you say your site is made only of static HTML and CSS.
Look at the cookie; it is something like this:
ASPSESSIONIDSATCSAAC=INMLBOADDKNKMPACCK
And also the headers:
HTTP/1.1 200 OK
Date: Fri, 10 Jan 2014 01:30:49 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Content-Length: 15168
Content-Type: text/html
Cache-control: private
I don't know where you are hosting your site, but you should demand an urgent fix for this problem from your host.
No. This is an injection being done conditionally, pointing to your DNS server/records being compromised.
Your DNS primary and secondary records are being routed through siteprotect.com. I have no idea if you have chosen siteprotect as your DNS handlers, but siteprotect.com doesn't actually resolve at the moment. I also have no idea who "siteprotect" are.
If your actual host is not "siteprotect" and you have not heard of them, reset your DNS records to those of your host and change your passwords etc. If your host is "siteprotect" they may be aware of the problem and working on it.

Chrome (and possibly other browsers) caching my script include

PROBLEM:
I am hosting a widget on a client's website that will be different for each page on the site.
To render the widget, the client includes a script tag on their pages. This script tag is loaded for every page on the site and the code that it returns depends on the page.
So, if this script gets cached, the end result is that we serve a widget for the wrong page.
Right now, when we serve the script, we set in the response headers
Cache-Control: max-age=0
Expires : 24 hours in the past
yet sometimes browsers still cache the script.
QUESTION:
Is there a way to use http headers to stop caching in all cases or are we going to have to take a completely different approach?
UPDATE:
The headers that topek recommended greatly improved the non-cacheability of the scripts. However (again in Chrome, which seems to be the most cache-aggressive), when using the back, forward, or reload buttons the script is still served from cache. If you actually CLICK on anything, it will be fetched from the server.
It seems that the only foolproof way to stop caching will be to set script sources that are guaranteed to be different for each page load (as suggested by esilija and tejs).
Those two headers should do the trick:
response.setHeader("Cache-Control", "no-cache, must-revalidate");
response.setHeader("Expires", "Sat, 26 Jul 1997 05:00:00 GMT");
Or you can set the script name according to the current page, e.g. when the user requests the page http://domain/posts/1, the script name could be http://domain/script/scriptname/posts/1. With this approach the script would still be cacheable per page.
Do not append a query string on the script like script.js?random_string. Proxies don't play well with this approach. If you want to place a random string in the name, then put it before .js like this script-0934234234.js and rewrite the request on your server.
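A minimal sketch of both naming schemes, assuming the server is set up to route or rewrite these URLs back to the real script. The function names pageScopedScriptUrl and versionedScriptName are made up for illustration.

```javascript
// Build a script URL that is unique per page, so a cached copy can only
// ever be the widget belonging to that same page.
function pageScopedScriptUrl(scriptBase, pagePath) {
  const cleanBase = scriptBase.replace(/\/+$/, ""); // drop trailing slashes
  const cleanPath = pagePath.replace(/^\/+/, "");   // drop leading slashes
  return cleanPath ? `${cleanBase}/${cleanPath}` : cleanBase;
}

// Put a token before ".js" rather than appending a query string.
function versionedScriptName(fileName, token) {
  return fileName.replace(/\.js$/, `-${token}.js`);
}

console.log(pageScopedScriptUrl("http://domain/script/scriptname", "/posts/1"));
// prints "http://domain/script/scriptname/posts/1"
console.log(versionedScriptName("script.js", "0934234234"));
// prints "script-0934234234.js"
```

In the browser, the page path would typically come from window.location.pathname; it is passed in as a parameter here so the helper stays easy to test.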

PDF files do not open in Internet Explorer with Adobe Reader 10.0 - users get an empty gray screen. How can I fix this for my users?

There is a known issue with opening a PDF in Internet Explorer (v 6, 7, 8, 9) with Adobe Reader X (version 10.0.*). The browser window loads with an empty gray screen (and doesn't even have a Reader toolbar). It works perfectly fine with Firefox, Chrome, or with Adobe Reader 10.1.*.
I have discovered several workarounds. For example, hitting "Refresh" will load the document properly. Upgrading to Adobe Reader 10.1.*, or downgrading to 9.*, fixes the issue too.
However, all of these solutions require the user to figure it out. Most of my users get very confused at seeing this gray screen, and end up blaming the PDF file and blaming the website for being broken. Honestly, until I researched the issue, I blamed the PDF too!
So, I am trying to figure out a way to fix this issue for my users.
I've considered providing a "Download PDF" link (that sets the Content-Disposition header to attachment instead of inline), but my company does not like that solution at all, because we really want these PDF files to display in the browser.
Has anyone else experienced this issue?
What are some possible solutions or workarounds?
I'm really hoping for a solution that is seamless to the end-user, because I can't rely on them to know how to change their Adobe Reader settings, or to automatically install updates.
Here's the dreaded Gray Screen:
Edit: screenshot was deleted from file server! Sorry!
The image was a browser window, with the regular toolbar, but a solid gray background, no UI whatsoever.
Background info:
Although I don't think the following information is related to my issue, I'll include it for reference:
This is an ASP.NET MVC application, and has jQuery available.
The link to the PDF file has target=_blank so that it opens in a new window.
The PDF file is being generated on-the-fly, and all the content headers are being set appropriately.
The URL does NOT include the .pdf extension, but we do set the content-disposition header with a valid .pdf filename and the inline setting.
Edit: Here is the source code that I'm using to serve up the PDF files.
First, the Controller Action:
public ActionResult ComplianceCertificate(int id){
byte[] pdfBytes = ComplianceBusiness.GetCertificate(id);
return new PdfResult(pdfBytes, false, "Compliance Certificate {0}.pdf", id);
}
And here is the ActionResult (PdfResult, inherits System.Web.Mvc.FileContentResult):
using System.Net.Mime;
using System.Web.Mvc;
/// <summary>
/// Returns the proper Response Headers and "Content-Disposition" for a PDF file,
/// and allows you to specify the filename and whether it will be downloaded by the browser.
/// </summary>
public class PdfResult : FileContentResult
{
public ContentDisposition ContentDisposition { get; private set; }
/// <summary>
/// Returns a PDF FileResult.
/// </summary>
/// <param name="pdfFileContents">The data for the PDF file</param>
/// <param name="download">Determines if the file should be shown in the browser or downloaded as a file</param>
/// <param name="filename">The filename that will be shown if the file is downloaded or saved.</param>
/// <param name="filenameArgs">A list of arguments to be formatted into the filename.</param>
/// <returns></returns>
[JetBrains.Annotations.StringFormatMethod("filename")]
public PdfResult(byte[] pdfFileContents, bool download, string filename, params object[] filenameArgs)
: base(pdfFileContents, "application/pdf")
{
// Format the filename:
if (filenameArgs != null && filenameArgs.Length > 0)
{
filename = string.Format(filename, filenameArgs);
}
// Add the filename to the Content-Disposition
ContentDisposition = new ContentDisposition
{
Inline = !download,
FileName = filename,
Size = pdfFileContents.Length,
};
}
protected override void WriteFile(System.Web.HttpResponseBase response)
{
// Add the filename to the Content-Disposition
response.AddHeader("Content-Disposition", ContentDisposition.ToString());
base.WriteFile(response);
}
}
It's been 4 months since asking this question, and I still haven't found a good solution.
However, I did find a decent workaround, which I will share in case others have the same issue.
I will try to update this answer, too, if I make further progress.
First of all, my research has shown that there are several possible combinations of user-settings and site settings that cause a variety of PDF display issues. These include:
Broken version of Adobe Reader (10.0.*)
HTTPS site with Internet Explorer and the default setting "Don't save encrypted files to disk"
Adobe Reader setting - disable "Display PDF files in my browser"
Slow hardware (thanks #ahochhaus)
I spent some time researching PDF display options at pdfobject.com, which is an EXCELLENT resource and I learned a lot.
The workaround I came up with is to embed the PDF file inside an empty HTML page. It is very simple: See some similar examples at pdfobject.com.
<html>
<head>...</head>
<body>
<object data="/pdf/sample.pdf" type="application/pdf" height="100%" width="100%"></object>
</body>
</html>
However, here's a list of caveats:
This ignores all user-preferences for PDFs - for example, I personally like PDFs to open in a stand-alone Adobe Reader, but that is ignored
This doesn't work if you don't have the Adobe Reader plugin installed/enabled, so I added a "Get Adobe Reader" section to the html, and a link to download the file, which usually gets completely hidden by the <object /> tag, ... but ...
In Internet Explorer, if the plugin fails to load, the empty object will still hide the "Get Adobe Reader" section, so I had to set the z-index to show it ... but ...
Google Chrome's built-in PDF viewer also displays the "Get Adobe Reader" section on top of the PDF, so I had to do browser detection to determine whether to show the "Get Reader".
This is a huge list of caveats. I believe it covers all the bases, but I am definitely not comfortable applying this to EVERY user (most of whom do not have an issue).
Therefore, we decided to ONLY do this embedded option if the user opts-in for it. On our PDF page, we have a section that says "Having trouble viewing PDFs?", which lets you change your setting to "embedded", and we store that setting in a cookie.
In our GetPDF Action, we look for the embed=true cookie. This determines whether we return the PDF file, or if we return a View of HTML with the embedded PDF.
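The actual action is ASP.NET MVC code, but the decision itself is simple. As a language-neutral illustration, here is a minimal JavaScript sketch of the cookie check; the function name wantsEmbeddedPdf is hypothetical, and only the embed=true cookie comes from the description above.

```javascript
// Returns true when the Cookie header contains exactly embed=true,
// i.e. the user opted in to the embedded-PDF page.
function wantsEmbeddedPdf(cookieHeader) {
  return (cookieHeader || "")
    .split(";")
    .map((pair) => pair.trim().split("="))
    .some(([name, value]) => name === "embed" && value === "true");
}

// Usage: pick which kind of response to produce.
const responseKind = wantsEmbeddedPdf("theme=dark; embed=true")
  ? "html-with-embedded-pdf"
  : "raw-pdf-file";
console.log(responseKind);
```

Users who never opt in have no embed cookie, so they keep receiving the PDF file directly.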
Ugh. This was even less fun than writing IE6-compatible JavaScript.
I hope that others with the same problem can find comfort knowing that they're not alone!
I don't have an exact solution, but I'll post my experiences with this in case they help anyone else.
From my testing, the gray screen is only triggered on slower machines [1]. To date, I have not been able to recreate it on newer hardware [2]. All of my tests have been in IE8 with Adobe Reader 10.1.2. For my tests I turned off SSL and removed all headers that could have disabled caching.
To recreate the gray screen, I followed the following steps:
1) Navigate to a page that links to a PDF
2) Open the PDF in a new window or tab (either via the context menu or target="_blank")
3) In my tests, this PDF will open without error (however I have received user reports indicating failure on the first PDF load)
4) Close the newly opened window or tab
5) Open the PDF (again) in a new window or tab
6) This PDF will not open, but instead only show the "gray screen" mentioned by the first user (all subsequent PDFs that are loaded will also not display -- until all browser windows are closed)
I performed the above test with several different PDF files (both static and dynamic) generated from different sources and the gray screen issue always occurs when following the above steps (on the "slow" computer).
To mitigate the problem in my application, I "tore down" the page that links to the PDF (removed parts piece by piece until the gray screen no longer occurred). In my particular application (built on closure-library) removing all references to goog.userAgent.adobeReader [3] appears to have fixed the issue. This exact solution won't work with jquery or .net MVC but maybe the process can help you isolate the source of the issue. I have not yet taken the time to isolate which particular portion of goog.userAgent.adobeReader triggers the bug in Adobe Reader, but it is likely that jquery might have similar plugin detection code to that used in closure-library.
[1] Machine experiencing gray screen:
Win Server '03 SP3
AMD Sempron 2400+ at 1.6GHz
256MB memory
[2] Machine not experiencing gray screen:
Win XP x64 SP2
AMD Athlon II X4 620 at 2.6 GHz
4GB memory
[3] http://closure-library.googlecode.com/svn/docs/closure_goog_useragent_adobereader.js.source.html
I ran into this issue around the time MVC1 was first released. See Generating PDF, error with IE and HTTPS regarding the Cache-Control header.
For Win7 Acrobat Pro X
Since I did all of these without rechecking whether the problem still existed afterwards, I am not sure which one of these actually fixed the problem, but one of them did. In fact, after doing #3 and rebooting, it worked perfectly.
FYI: Below is the order in which I stepped through the repair.
Go to Control Panel > folders options under each of the General, View and Search Tabs
click the Restore Defaults button and the Reset Folders button
Go to Internet Explorer, Tools > Options > Advanced > Reset ( I did not need to delete personal settings)
Open Acrobat Pro X, under Edit > Preferences > General.
At the bottom of page select Default PDF Handler. I chose Adobe Pro X, and click Apply.
You may be asked to reboot (I did).
Best Wishes
In my case the solution was quite simple.
I added this header and the browsers opened the file in every test.
header('Content-Disposition: attachment; filename="filename.pdf"');
I had this problem. Reinstalling the latest version of Adobe Reader did nothing. Adobe Reader worked in Chrome but not in IE. This worked for me ...
1) Go to IE's Tools-->Compatibility View menu.
2) Enter a website that has the PDF you wish to see. Click OK.
3) Restart IE
4) Go to the website you entered and select the PDF. It should come up.
5) Go back to Compatibility View and delete the entry you made.
6) Adobe Reader works OK now in IE on all websites.
It's a strange fix, but it worked for me. I needed to go through an Adobe acceptance screen after reinstall that only appeared after I did the Compatibility View trick. Once accepted, it seemed to work everywhere. Pretty flaky stuff. Hope this helps someone.
Hm, would it be possible to simply do this:
The first time your user opens a pdf, using Javascript you make a popout that basically says "If you cannot see your document, please click HERE". Make "HERE" a big button where it will explain to your user what's the problem. Also make another button "everything's fine". If the user clicks on this one, you remember it, so it isn't displayed in the future.
I'm trying to be practical. Going to great lengths trying to solve this kind of problem "properly" for a small subset of Adobe Reader versions doesn't sound very productive to me.
Experimenting more, the underlying cause in my app (calling goog.userAgent.adobeReader) was accessing Adobe Reader via an ActiveXObject on the page with the link to the PDF. This minimal test case causes the gray screen for me (however removing the ActiveXObject causes no gray screen).
<!DOCTYPE html>
<html lang="en">
<head>
<title>hi</title>
<meta charset="utf-8">
</head>
<body>
<script>
new ActiveXObject('AcroPDF.PDF.1');
</script>
<a target="_blank" href="http://partners.adobe.com/public/developer/en/xml/AdobeXMLFormsSamples.pdf">link</a>
</body>
</html>
I'm very interested if others are able to reproduce the problem with this test case and following the steps from my other post ("I don't have an exact solution...") on a "slow" computer.
Sorry for posting a new answer, but I couldn't figure out how to add a code block in a comment on my previous post.
For a video example of this minimal test case, see: http://youtu.be/IgEcxzM6Kck
I realize this is a rather late post but still a possible solution for the OP. I use IE9 on Win 7 and have been having Adobe Reader's grey screen issues for several months when trying to open pdf bank and credit card statements online. I could open everything in Firefox or Opera but not IE. I finally tried PDF-Viewer, set it as the default pdf viewer in its preferences and no more problems. I'm sure there are other free viewers out there, like Foxit, PDF-Xchange, etc., that will give better results than Reader with less headaches. Adobe is like some of the other big companies that develop software on a take it or leave it basis ... so I left it.
We were getting this issue even after updating to the latest Adobe Reader version.
Two different methods solved this issue for us:
Using the free version of Foxit Reader application in place of Adobe Reader
But since most of our clients use Adobe Reader, instead of requiring users to use Foxit Reader, we started using window.open(url) to open the PDF instead of window.location.href = url. For some reason, Adobe was losing the file handle in different iframes when the PDF was opened using the window.location.href method.

How to force IE reload script element?

I'm using YQL in my webpage (it should run on IE6-IE9).
I'm creating a dynamic script element and setting its source to a YQL query URL.
When I load my webpage for the first time, it works great and IE retrieves the latest data.
However, when I delete the element and recreate it (with the exact same URL), IE uses its local cache and fails to deliver the latest data.
When I look in Fiddler, I don't see any HTTP response (no 200, 304... nothing), which means that the response is retrieved from the local cache.
The common solution is using "cachebusting", such as suggested at: How to force IE to reload javascript?
However, according to YQL Blog's article, cachebusting (adding a "&rand=1234" at the end of the URL) is not recommended.
Does anyone know how to avoid cachebusting and still force IE to reload the script element?
thanks,
I've had luck setting these two headers:
Pragma: no-cache
If-Modified-Since: Thu, 1 Jan 1970 00:00:00 GMT
