Trigger prefetching after page is fully loaded - javascript

My scenario is:
user visits domain.com (home page)
domain.com/products page contains large image library and quite large CSS and JS libraries
when the user visits domain.com and the home page has fully loaded, we start prefetching the resources and, if possible, at least some percentage of the images from the library.
Currently, on some pages the JS "eats" quite a lot of resources, so triggering the prefetch during page load is not always the best answer: it causes a small lag when the user interacts with JS-created events and elements.
My questions are:
Is it even possible (will it work) to trigger <link rel="prefetch" href="image.png"> or CSS file to be added to <head> so it can prefetch data from another page after current page is fully loaded?
Should I do it the same way as loading an additional stylesheet with JS, where I add a new tag to <head> as a stylesheet file so that it then gets rendered, or is there another way?

You might use Cache Storage to prefetch (precache) assets. I work on an open-source project which uses this approach. Although, to serve precached assets you need a service worker. The logic of finding assets in my project looks like this.
The demo of this project is here. Also, I wrote an article which explains technical details of the project.
Assets get prefetched once the lib is loaded, so I don't wait for the entire page load. Maybe I should use requestIdleCallback to wait until the browser is idle.
Hopefully, it gives you some inspiration.

Just note that you can add an additional stylesheet after the page has loaded completely, or whenever you want, with something like this:
window.addEventListener("load", function () {
  var link = document.createElement("link");
  link.rel = "stylesheet";
  link.href = "stylesOfAnotherPage2.css";
  document.head.appendChild(link); // or document.body
});
When page1 loads, stylesOfAnotherPage2.css is downloaded and cached, so when page2 requests the same file it is already in the cache.
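If the goal is only to warm the cache for the next page, and not to apply its styles to the current one, a prefetch hint may be a better fit than a stylesheet link. The following is a minimal sketch under assumptions: addPrefetchHints and the two paths are hypothetical names for illustration, and it assumes the browser honours rel="prefetch" on links added from script.

```javascript
// Hypothetical helper: appends one <link rel="prefetch"> per URL to <head>.
// `doc` is passed in so the function can also be exercised outside a browser.
function addPrefetchHints(doc, urls) {
  var added = [];
  urls.forEach(function (url) {
    var link = doc.createElement("link");
    link.rel = "prefetch";
    link.href = url;
    doc.head.appendChild(link);
    added.push(link);
  });
  return added;
}

// In a browser, wait for the load event so the hints do not compete
// with the current page's own resources.
if (typeof window !== "undefined") {
  window.addEventListener("load", function () {
    addPrefetchHints(document, [
      "/css/products.css", // assumed paths, replace with your own
      "/js/products.js"
    ]);
  });
}
```

Passing the document in as a parameter is just a convenience for testing; in a page you could equally call document.createElement directly.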

You might use HTTP Caching and Link prefetching to use your browser idle time to download or prefetch documents that the user might visit in the near future.
Prefetching hints
The browser observes all of these hints and queues up each unique
request to be prefetched when the browser is idle. There can be
multiple hints per page, as it might make sense to prefetch multiple
documents. For example, the next document might contain several large
images.
<link rel="prefetch alternate stylesheet" title="Designed for Mozilla" href="mozspecific.css">
<link rel="next" href="2.html">
Also, you can read this thread:
Preload, Prefetch And Priorities in Chrome
There you can read the different states and priorities about the execution, load and preload times, some tips to improve them.

preload of CSS and JS using https://developer.mozilla.org/en-US/docs/Web/HTML/Preloading_content might be a good fit and has good support in most modern browsers: https://caniuse.com/#search=preload
There are probably better solutions, such as the one suggested by @soulshined, but another crude way to do this, if the other page sends ETags or cache-control headers, is to use AJAX to send requests for the resources you expect to load. This causes the browser to request those resources and prefill the user agent cache, so that when the user requests them on the other page there is a higher chance the cache already contains them and the page loads faster than if it had to fetch everything for the first time.

To prefetch assets there are several resource hints, such as dns-prefetch, preconnect, prerender and prefetch. Use whichever fits your requirement. Each one has its own purpose, so it is useful to know what each does specifically.

Related

How to preload js workers and wasm scripts for offline use (pwa)?

I would like to force the browser to preload a JavaScript worker as well as a WebAssembly script. When loaded, I have a ServiceWorker that puts these scripts in the CacheStorage.
For the images, I use the following <link> tag and this works well:
<link rel="prefetch" href="/img/foo.png" as="image" type="image/png" importance="low" />
So I tried the same things for my wasm and js scripts:
<link rel="prefetch" href="/wasm/bar.wasm" type="application/wasm" importance="low" />
<link rel="prefetch" href="/js/baz.worker.js" as="script" type="text/javascript" importance="low" />
However, it looks like the Browser (Chrome) does not load these scripts with the prefetch rel. So I tried to use rel="preload".
But I don't know what to fill for the as property for the wasm file and for the js worker file, I get the following warning: The resource http://localhost:3000/wasm/bar.wasm was preloaded using link preload but not used within a few seconds from the window's load event. Please make sure it has an appropriate `as` value and it is preloaded intentionally..
If the user executes an action that requires these files, the browser loads them well and then we can use them when offline.
These scripts are not needed before the load event. I could send a fetch request when the browser emits the load event? But I think it is better to let the browser manages such things with a link tag? In my case, it is quite likely that the user will need this script (90%+), so a preload would make sense, but theoretically, I want to make a prefetch of the content only.
How would you recommend preloading these files for offline use (pwa)?
It looks like sending an HTTP request with the fetch API does put the script into the CacheStorage.
/**
* Preload resources for offline pwa use
*/
window.addEventListener('load', () => {
fetch('/js/baz.worker.js');
fetch('/wasm/bar.wasm');
});
Because these requests are sent after the load event, they do not affect the page load time.
To keep the priority hint, it is possible to set the importance option to low, like this:
fetch('/wasm/bar.wasm', { importance: "low" });
(warning: browser support was limited when this answer was published; the option has since been renamed to priority)

How to circumvent browser caching? [duplicate]

Is there a way I can put some code on my page so when someone visits a site, it clears the browser cache, so they can view the changes?
Languages used: ASP.NET, VB.NET, and of course HTML, CSS, and jQuery.
If this is about .css and .js changes, then one way is "cache busting" by appending something like "_versionNo" to the file name for each release. For example:
script_1.0.css // This is the URL for release 1.0
script_1.1.css // This is the URL for release 1.1
script_1.2.css // etc.
or after the file name:
script.css?v=1.0 // This is the URL for release 1.0
script.css?v=1.1 // This is the URL for release 1.1
script.css?v=1.2 // etc.
You can check this link to see how it could work.
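As a small illustration of the second form, a helper can append the version while leaving any existing query string or fragment intact. This is only a sketch: withVersion and the parameter name v are assumptions made for the example, not part of any library.

```javascript
// Hypothetical helper: returns `url` with a version query parameter added,
// keeping any existing query string and fragment intact.
function withVersion(url, version) {
  var hashIndex = url.indexOf("#");
  var fragment = hashIndex === -1 ? "" : url.slice(hashIndex);
  var base = hashIndex === -1 ? url : url.slice(0, hashIndex);
  var separator = base.indexOf("?") === -1 ? "?" : "&";
  return base + separator + "v=" + encodeURIComponent(version) + fragment;
}

// withVersion("script.css", "1.1")      -> "script.css?v=1.1"
// withVersion("main.js?lang=en", "1.2") -> "main.js?lang=en&v=1.2"
```

Bumping the version string on each release then changes every URL built this way, so the browser treats them as new resources.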
Look into the cache-control and the expires META Tag.
<META HTTP-EQUIV="CACHE-CONTROL" CONTENT="NO-CACHE">
<META HTTP-EQUIV="EXPIRES" CONTENT="Mon, 22 Jul 2002 11:12:01 GMT">
Another common practice is to append a constantly changing string to the end of the requested files. For instance:
<script type="text/javascript" src="main.js?v=12392823"></script>
Update 2012
This is an old question but I think it needs a more up to date answer because now there is a way to have more control of website caching.
In Offline Web Applications (which is really any HTML5 website) applicationCache.swapCache() can be used to update the cached version of your website without the need for manually reloading the page.
This is a code example from the Beginner's Guide to Using the Application Cache on HTML5 Rocks explaining how to update users to the newest version of your site:
// Check if a new cache is available on page load.
window.addEventListener('load', function(e) {
window.applicationCache.addEventListener('updateready', function(e) {
if (window.applicationCache.status == window.applicationCache.UPDATEREADY) {
// Browser downloaded a new app cache.
// Swap it in and reload the page to get the new hotness.
window.applicationCache.swapCache();
if (confirm('A new version of this site is available. Load it?')) {
window.location.reload();
}
} else {
// Manifest didn't change. Nothing new to serve.
}
}, false);
}, false);
See also Using the application cache on Mozilla Developer Network for more info.
Update 2016
Things change quickly on the Web.
This question was asked in 2009 and in 2012 I posted an update about a new way to handle the problem described in the question. Another 4 years passed and now it seems that it is already deprecated. Thanks to cgaldiolo for pointing it out in the comments.
Currently, as of July 2016, the HTML Standard, Section 7.9, Offline Web applications includes a deprecation warning:
This feature is in the process of being removed from the Web platform.
(This is a long process that takes many years.) Using any of the
offline Web application features at this time is highly discouraged.
Use service workers instead.
So does Using the application cache on Mozilla Developer Network that I referenced in 2012:
Deprecated This feature has been removed from the Web standards.
Though some browsers may still support it, it is in the process of
being dropped. Do not use it in old or new projects. Pages or Web apps
using it may break at any time.
See also Bug 1204581 - Add a deprecation notice for AppCache if service worker fetch interception is enabled.
Not as such. One method is to send the appropriate headers when delivering content to force the browser to reload:
Making sure a web page is not cached, across all browsers.
If you search for "cache header" or something similar here on SO, you'll find ASP.NET-specific examples.
Another way, less clean but sometimes the only one if you can't control the headers on the server side, is to add a random GET parameter to the resource being requested:
myimage.gif?random=1923849839
I had a similar problem, and this is how I solved it:
1. In the index.html file I added the manifest:
<html manifest="cache.manifest">
2. In the <head> section I included the script that updates the cache:
<script type="text/javascript" src="update_cache.js"></script>
3. In the <body> section I added an onload handler:
<body onload="checkForUpdate()">
4. In cache.manifest I listed all the files I want to cache. In my case (Apache) it is enough to update the "version" comment each time. Naming files with "?ver=001" or something similar at the end is also an option, but it isn't needed. Changing just # version 1.01 triggers the cache update event.
CACHE MANIFEST
# version 1.01
style.css
imgs/logo.png
#all other files
It's important to apply points 1, 2 and 3 only to index.html. Otherwise
GET http://foo.bar/resource.ext net::ERR_FAILED
occurs because every "child" file tries to cache the page while the page is already cached.
In update_cache.js file I've put this code:
function checkForUpdate()
{
if (window.applicationCache != undefined && window.applicationCache != null)
{
window.applicationCache.addEventListener('updateready', updateApplication);
}
}
function updateApplication(event)
{
if (window.applicationCache.status != 4) return;
window.applicationCache.removeEventListener('updateready', updateApplication);
window.applicationCache.swapCache();
window.location.reload();
}
Now you just change the files and update the version comment in the manifest. Visiting index.html will then update the cache.
The parts of this solution aren't mine; I found them on the internet and put them together so that it works.
For static resources, the right approach is to add a query parameter whose value changes with each deployment or file version. This has the effect of clearing the cache after each deployment.
/Content/css/Site.css?version={FileVersionNumber}
Here is ASP.NET MVC example.
<link href="@Url.Content("~/Content/Css/Reset.css")?version=@this.GetType().Assembly.GetName().Version" rel="stylesheet" type="text/css" />
Don't forget to update assembly version.
I had a case where I would take photos of clients online and needed to update the div when a photo changed. The browser was still showing the old photo, so I used the hack of adding a random GET variable, which is unique every time. Here it is, in case it helps anybody:
<img src="/photos/userid_73.jpg?random=<?php echo rand() ?>" ...
EDIT
As pointed out by others, the following is a much more efficient solution, since it reloads images only when they have changed, identifying the change by the file's modification time:
<img src="/photos/userid_73.jpg?modified=<?php echo filemtime("/photos/userid_73.jpg") ?>"
A lot of answers are missing the point - most developers are well aware that turning off the cache is inefficient. However, there are many common circumstances where efficiency is unimportant and default cache behavior is badly broken.
These include nested, iterative script testing (the big one!) and broken third party software workarounds. None of the solutions given here are adequate to address such common scenarios. Most web browsers are far too aggressive caching and provide no sensible means to avoid these problems.
Updating the URL to the following works for me:
/custom.js?id=1
By adding a unique number after ?id= and incrementing it for new changes, users do not have to press CTRL + F5 to refresh the cache. Alternatively, you can append hash or string version of the current time or Epoch after ?id=
Something like ?id=1520606295
<meta http-equiv="pragma" content="no-cache" />
Also see https://stackoverflow.com/questions/126772/how-to-force-a-web-browser-not-to-cache-images
Here is the MSDN page on setting caching in ASP.NET.
Response.Cache.SetExpires(DateTime.Now.AddSeconds(60))
Response.Cache.SetCacheability(HttpCacheability.Public)
Response.Cache.SetValidUntilExpires(False)
Response.Cache.VaryByParams("Category") = True
If Response.Cache.VaryByParams("Category") Then
'...
End If
I'm not sure this will really help you, but this is how caching should work in any browser. When the browser requests a file, it should always send a request to the server unless it is in "offline" mode. The server reads the validators sent with the request, such as the last-modified date or the ETag.
If nothing has changed, the server returns a 304 Not Modified response and the browser uses its cached copy. If the ETag doesn't validate on the server side, or the date sent by the browser is older than the file's current modification date, the server should return the new content with the new modification date, ETag, or both.
If no caching information is sent to the browser, I guess the behavior is undefined: the browser may or may not cache files that don't say how they should be cached. If you set caching parameters in the response, the browser will cache your files correctly, and the server may then choose to return either a 304 or the new content.
This is how it should be done. Using random parameters or version numbers in URLs is more of a hack than anything.
http://www.checkupdown.com/status/E304.html
http://en.wikipedia.org/wiki/HTTP_ETag
http://www.xpertdeveloper.com/2011/03/last-modified-header-vs-expire-header-vs-etag/
After reading these I saw that there is also an expiry date. If you have a problem, it might be that you have an expiry date set. When the browser caches your file with an expiry date, it doesn't have to request it again before that date. In other words, it will never ask the server for the file and will never receive a 304 Not Modified. It will simply use the cache until the expiry date is reached or the cache is cleared.
So my guess is that you have some sort of expiry date. You should use Last-Modified, ETags, or a mix of both, and make sure that there is no expiry date.
If people tend to refresh a lot and the file doesn't change much, then it might be wise to set a long expiry date.
My 2 cents!
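The validation flow described above can be sketched as a small decision function. This is a simplified illustration, not a complete implementation of the HTTP rules: respond and the shape of the resource object are hypothetical, and lists of ETags, weak validators and "*" are deliberately ignored.

```javascript
// Sketch of the server-side decision. `resource` is a hypothetical object:
// { etag: string, lastModified: Date, body: string }.
// `requestHeaders` uses lower-case header names, as Node.js does.
function respond(requestHeaders, resource) {
  var responseHeaders = {
    "etag": resource.etag,
    "last-modified": resource.lastModified.toUTCString()
  };

  var ifNoneMatch = requestHeaders["if-none-match"];
  var ifModifiedSince = requestHeaders["if-modified-since"];
  var notModified = false;

  if (ifNoneMatch !== undefined) {
    // When an ETag is sent, it takes precedence over the date.
    notModified = ifNoneMatch === resource.etag;
  } else if (ifModifiedSince !== undefined) {
    var since = Date.parse(ifModifiedSince);
    // HTTP dates have one-second resolution, so drop the milliseconds.
    var modified = Math.floor(resource.lastModified.getTime() / 1000) * 1000;
    notModified = !isNaN(since) && modified <= since;
  }

  return notModified
    ? { status: 304, headers: responseHeaders, body: "" }
    : { status: 200, headers: responseHeaders, body: resource.body };
}
```

A browser that already holds the file sends the validators back, and only gets the body again when they no longer match.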
I implemented this simple solution that works for me (not yet on production environment):
function verificarNovaVersio() {
var sVersio = localStorage['gcf_versio'+ location.pathname] || 'v00.0.0000';
$.ajax({
url: "./versio.txt"
, dataType: 'text'
, cache: false
, contentType: false
, processData: false
, type: 'post'
}).done(function(sVersioFitxer) {
console.log('Versió App: '+ sVersioFitxer +', Versió Caché: '+ sVersio);
if (sVersio < (sVersioFitxer || 'v00.0.0000')) {
localStorage['gcf_versio'+ location.pathname] = sVersioFitxer;
location.reload(true);
}
});
}
I have a small file located in the same place as the HTML files:
"versio.txt":
v00.5.0014
This function is called in all of my pages, so when loading it checks if the localStorage's version value is lower than the current version and does a
location.reload(true);
...to force reload from server instead from cache.
(obviously, instead of localStorage you can use cookies or other persistent client storage)
I opted for this solution for its simplicity: maintaining a single file, "versio.txt", is enough to force the full site to reload.
The query string method is hard to implement, and its results are cached too (if you go back from v1.1 to a previous version, it loads from the cache, which means the cache is never flushed and keeps all previous versions).
I'm a bit of a newbie, and I'd appreciate your professional review to check that my method is a good approach.
Hope it helps.
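One detail worth checking in the function above: the < operator on strings compares character by character, so it only orders versions correctly while every segment is zero-padded to the same width. A numeric comparison avoids that constraint. The following is a sketch; compareVersions is a hypothetical helper, and it assumes versions look like "v00.5.0014".

```javascript
// Hypothetical helper: compares version strings such as "v00.5.0014"
// segment by segment as numbers. Returns -1, 0 or 1.
function compareVersions(a, b) {
  var pa = a.replace(/^v/, "").split(".").map(Number);
  var pb = b.replace(/^v/, "").split(".").map(Number);
  var length = Math.max(pa.length, pb.length);
  for (var i = 0; i < length; i++) {
    var x = pa[i] || 0; // missing or non-numeric segments count as 0
    var y = pb[i] || 0;
    if (x !== y) return x < y ? -1 : 1;
  }
  return 0;
}

// compareVersions("v00.5.0014", "v00.10.0001") -> -1
// "v00.5.0014" < "v00.10.0001"                 -> false (string comparison)
```

With this helper, the condition in the done() callback could be written as compareVersions(sVersio, sVersioFitxer) < 0.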
In addition to setting Cache-control: no-cache, you should also set the Expires header to -1 if you would like the local copy to be refreshed each time (some versions of IE seem to require this).
See HTTP Cache - check with the server, always sending If-Modified-Since
There is one trick that can be used: append a parameter/string to the file name in the script tag and change it whenever your file changes.
<script src="myfile.js?version=1.0.0"></script>
The browser treats the whole string as the file's URL, even though what comes after the "?" are parameters. So the next time you update your file, just change the number in the script tag on your website (for example <script src="myfile.js?version=1.0.1"></script>) and each user's browser will see that the file has changed and grab a new copy.
Force browsers to clear cache or reload correct data? I have tried most of the solutions described in stackoverflow, some work, but after a little while, it does cache eventually and display the previous loaded script or file. Is there another way that would clear the cache (css, js, etc) and actually work on all browsers?
So far I have found that specific resources can be reloaded individually if you change the date and time of the files on the server. "Clearing the cache" is not as easy as it should be. Instead of clearing the cache in my browsers, I realized that "touching" the cached files on the server changes the source file's date and time (tested on Edge, Chrome and Firefox), and most browsers will then automatically download the fresh copy of what's on your server (code, graphics, any multimedia too). I suggest you copy the most current scripts to the server and apply the "touch" solution before your program runs, so that all the problem files get a current date and time and a fresh copy is downloaded to your browser:
<?php
touch('/www/sample/file1.css');
touch('/www/sample/file2.js');
?>
then ... the rest of your program...
It took me some time to resolve this issue, as many browsers react differently to different commands. But they all check the file time and compare it with the copy downloaded in your browser; if the date and time differ, they refresh. If you can't go the supposedly right way, there is always another usable solution. Best regards and happy camping. By the way, touch() or an equivalent is available in many languages, including JavaScript, bash, sh and PHP, and you can include or call it from your HTML.
For webpack users:-
I added the time to the chunkhash in my webpack config. This solved my problem of invalidating the cache on each deployment. We also need to make sure that index.html / asset.manifest is not cached, either in the CDN or in the browser. The chunk name setting in the webpack config looks like this:
filename: `[chunkhash]-${Date.now()}.js`
or, if you are using contenthash:
filename: `[contenthash]-${Date.now()}.js`
This is the simple solution I used to solve in one of my applications using PHP.
All JS and CSS files are placed in a folder with version name. Example : "1.0.01"
root\1.0.01\JS
root\1.0.01\CSS
Created a Helper and Defined the version Number there
<?php
function system_version()
{
return '1.0.07';
}
And linked the JS and CSS files like below
<script src="<?= base_url(); ?>/<?= system_version();?>/js/generators.js" type="text/javascript"></script>
<link rel="stylesheet" type="text/css" href="<?= base_url(); ?>/<?= system_version(); ?>/css/view-checklist.css" />
Whenever I make changes to any JS or CSS file, I change the system version in the helper, rename the folder and deploy it.
I had the same problem. All I did was change the names of the files linked from my index.html and then update those names in index.html. Not the best practice, but if it works, it works. The browser sees them as new files, so they get downloaded again onto the user's device.
Example:
I want to update a CSS file named styles.css, so I rename it to styless.css.
Then I go into index.html and change the reference from styles.css to styless.css.
In case anyone is interested, here is my solution for getting browsers to refresh .css and .js files in the context of .NET MVC (.NET Framework 4.8) with bundles.
I wanted browsers to refresh cached files only after a new assembly is deployed.
Building on Paulius Zaliaduonis' response, my solution is as follows:
Store your application base URL in the web config app settings (HttpContext is not yet available at runtime during RegisterBundles), then make this parameter change according to the configuration (debug, staging, release...) through the XML transform.
In BundleConfig.RegisterBundles, get the assembly version by means of reflection, and...
...change the default tag format of both styles and scripts so that the bundling system generates link and script tags with a query string parameter appended to them.
Here is the code:
public static void RegisterBundles(BundleCollection bundles)
{
string baseUrl = System.Configuration.ConfigurationManager.AppSettings["by.app.base.url"].ToString();
string assemblyVersion = Assembly.GetExecutingAssembly().GetName().Version.ToString();
Styles.DefaultTagFormat = $"<link href='{baseUrl}{{0}}?v={assemblyVersion}' rel='stylesheet'/>";
Scripts.DefaultTagFormat = $"<script src='{baseUrl}{{0}}?v={assemblyVersion}'></script>";
}
You'll get tags like
<script src="https://example.org/myscriptfilepath/script.js?v={myassemblyversion}"></script>
You just need to remember to build a new version before deploying.
Ciao
Do you want to clear the cache, or just make sure your current (changed?) page is not cached?
If the latter, it should be as simple as
<META HTTP-EQUIV="Pragma" CONTENT="no-cache">

how should my site handle occasionally missing javascript files gracefully?

Say I've got this script tag on my site (borrowed from SO).
<script type="text/javascript" async=""
src="http://edge.quantserve.com/quant.js"></script>
If edge.quantserve.com goes down or stops responding without returning a 404, won't SO have to wait for the timeout before the rest of the page loads? I'm thinking Chaos Monkey shows up and blasts a server that my site is depending on, a server that isn't part of a CDN and has a poor failover.
What's the industry standard way to handle this issue? I couldn't find a dupe on SO, maybe I'm searching for the wrong terms.
Update: I should have looked a bit more closely at the SO code, there's this at the bottom:
<script type="text/javascript">var _gaq=_gaq||[];_gaq.push(['_setAccount','UA-5620270-1']);
_gaq.push(['_setCustomVar', 2, 'accountid', '14882',2]);
_gaq.push(['_trackPageview']);
var _qevents = _qevents || [];
(function(){
var s=document.getElementsByTagName('script')[0];
var ga=document.createElement('script');
ga.type='text/javascript';
ga.async=true;
ga.src='http://www.google-analytics.com/ga.js';
s.parentNode.insertBefore(ga,s);
var sc=document.createElement('script');
sc.type='text/javascript';
sc.async=true;
sc.src='http://edge.quantserve.com/quant.js';
s.parentNode.insertBefore(sc,s);
})();
</script>
OK, so if the quant.js file fails to load, it's creating a script tag with ga.async=true;. Maybe that's the trick.
Possible answer: https://stackoverflow.com/a/1834129/30946
Generally, it's tricky to do it well and cross-browser.
Some proposals:
1. Move the script to the very bottom of the HTML page (so that almost everything is displayed before you request that script).
2. Move it to the bottom and wrap it in <script>document.write("<scr"+"ipt src='http://example.org/script.js'></scr"+"ipt>")</script>, or inject it the way shown in your update (document.createElement('script')).
3. A last option is to load it via XHR (this works only for same-domain requests, or cross-domain if CORS is enabled on the third-party server); you can then use the timeout property of the XHR (for IE and Fx12+), and in other browsers use setTimeout and check the XHR's readyState. It's kind of convoluted and very non-cross-browser for now, so option 2 looks the best.
Make a copy of the file on your own server and use the following. It will load your copy only if the one from the remote server has failed to load:
<script src="http://edge.quantserve.com/quant.js"></script>
<script>window.quant || document.write('<script src="js/quant.js"><\/script>')</script>
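A non-blocking variation on the same idea is to inject the script asynchronously and switch to the local copy from its error handler. This is a sketch under assumptions: loadScriptWithFallback is a hypothetical helper, and the error event only fires once the browser gives up on the request (for example on a DNS failure or a 404), so a server that hangs without responding will still take until the browser's own timeout.

```javascript
// Hypothetical helper: tries `primaryUrl` first and, if the browser reports
// a load error, requests `fallbackUrl` (for example a local copy) instead.
// `doc` is passed in so the function can be exercised outside a browser.
function loadScriptWithFallback(doc, primaryUrl, fallbackUrl) {
  function inject(url, onError) {
    var script = doc.createElement("script");
    script.async = true; // do not block parsing while the file downloads
    script.src = url;
    script.onerror = onError;
    doc.head.appendChild(script);
    return script;
  }
  return inject(primaryUrl, function () {
    inject(fallbackUrl, null);
  });
}

// Browser usage, with the URLs from the snippet above:
// loadScriptWithFallback(document, "http://edge.quantserve.com/quant.js", "js/quant.js");
```

Because both script elements are async, the rest of the page keeps rendering whichever copy ends up being used.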
To answer your question about the browser having to wait for the script to load before the rest of the page loads, the answer to that would typically be no. Typical browsers will have multiple threads processing the download of the page and linked content (CSS, images, js). So the rest of the page should be loaded, though the user's browser indicator will still show the page trying to load until the final request is fulfilled or timed out.
Depending on the nature of the resource you are trying to load, this will obviously affect your page differently. Typically, if you are worried about this, you can host all your files on a common CDN (or on your own website if it is not that highly trafficked); that way, if one thing fails, chances are everything is failing and you have a bigger issue to contend with :)

Find jQuery cache hit/miss from CDN

If you include jQuery from a CDN, is there a way to determine whether a user fetched the content from the CDN or retrieved it from their cache?
Obviously a cache hit doesn't make an HTTP request, but could you test for that in JavaScript and report the data back to your own server?
Why not just use CHARLES or a similar debugging proxy to determine loading speed?
If you want to know the speed from a client's perspective from multiple locations, use http://www.webpagetest.org/ with two differing versions of your website (one with CDN, one with self-hosted static location) and compare the loading speeds. Personally, unless you have a lot of custom javascript code, it makes sense to use a CDN for jQuery, especially since lots of sites use the Google Libraries API for jQuery.
If you have logging on your CDN (we don't seem to?) you could change the CDN url for test runs and also use a pingback url on your server. Over a period of time, compare the ping back url hits and times with your CDN url hits and the unique visitors counts.
You should be able to get an idea of how many unique hits you get on your CDN URL versus unique hits on your page. The difference should be bots, scrapers, and cached or failed loading of resources. Bots you can eliminate, scrapers probably as well, so your percentages should be representative over a long enough period.
Would this work for you?
We do this on non-cdn resources to see if people are downloading the latest CSS files or not to force a name change only on those IPs that seem to have cached a resource after a change was made to the css file.
Testing for a 304 (not modified) would be difficult without using ajax. And using ajax will be very difficult unless you get around the same origin policy on the CDN.
I assume you want to test the actual time the scripts loads and becomes available on the clients, and would like to compare this data using CDN vs. something local. If so, wouldn’t it be better to test the actual time instead of doing some cache test?
It’s fairly easy to set up an A/B test of the actual time the scripts are loading.
For the A test you could do the CDN/local separation
<script>var _time = new Date().getTime();</script>
<script src="http://code.jquery.com/jquery-1.7.1.min.js"></script>
<script src="project.js"></script>
<script>_time = new Date().getTime() - _time;</script>
And the B test a local script merge or whatever:
<script>var _time = new Date().getTime();</script>
<script src="project.includingjquery.min.js"></script>
<script>_time = new Date().getTime() - _time;</script>
Then report the _time variable into analytics or your own database using ajax. If the B users have lower _time reported, you know it’s the right way to go...
If you are willing to add some bulk to the page to test this, you could send an ajax request to the CDN for jQuery and check for a 304 response. You could then send a second ajax request back to your own server reporting whether jQuery was cached or not, i.e. whether it was a 200 or a 304 response. I haven't tried this, but it should work. It adds some extra requests for your users, but since they are asynchronous it probably wouldn't have any noticeable impact.
<script>var s=new Date().getTime();</script>
<script src="cdncontent"></script>
<script>
var s = new Date().getTime() - s;
if (s < 100) {
//likely from cache
} else {
//likely from CDN
}
</script>
It is always worth loading resources from a CDN. The only drawback is that the CDN may not be geographically close to the user. You need to check your target audience and decide whether it's better to use one or not. In the case of jQuery, I always use the Google CDN, which I think is reliable and robust.

Is there a way to mitigate downloading of resources (images/css and js files) with Javascript?

I have a html page on my localhost - get_description.html.
The snippet below is part of the code:
<input type="text" id="url"/>
<button id="get_description_button">Get description</button>
<iframe id="description_container" src="#"/>
When the button is clicked the src of the iframe is set to the url entered in the textbox. The pages fetched this way are very big with lots of linked files. What I am interested in the page is a block of text contained in a <div id="description"> element.
Is there a way to mitigate downloading of resources linked in the page that loads into the iframe?
I don't want to use curl because the data is only available to logged in users and the steps to take with curl to get the content is too complicated. The iframe is simple as I use this on a box which sends the right cookies to identify the request as coming from a logged in user, but the problem is that it is very wasteful to get nearly 1 MB of data to keep 1 KB of it and throw out the rest.
Edit
If the proposed method just works in Firefox it is fine, so I added Firefox tag. Also, it is possible that the answer actually is from the realm of Firefox add-on techniques, so I added that tag as well.
The problem is not that I cannot get at what I'm looking for, rather, the problem is the easy iframe method is wasteful.
I know that Firefox does allow loading only the text of a page. If you open a page and press Ctrl+U you are taken to 'view page source' window, There links behave as normal and are clickable, if you click on a link in source view, the source of the new page is loaded into the view source window, without the linked resources being downloaded, exactly what I'm trying to get. But I don't know how to access this behaviour.
Another example is the Adblock add-on. It somehow kills elements before they get loaded. With plain Javascript this is not possible. Because it only is triggered too late to intervene in good time.
The Same Origin Policy forbids any web page to access contents of any other web page in a different domain so basically you cannot do that.
However it seems that with some browsers it is allowed to access web pages content if you are trying to access it from a local web page which seems to be your case.
Safari and IE 6/7/8 are browsers that allow a local web page to do so via XMLHttpRequest (source: Google Browser Security Handbook), so you may want to use one of those browsers to do what you need (note that future versions of those browsers may no longer allow it).
Apart from this solution I only see two possibilities:
If the web pages you need to fetch content from are somehow controlled by you, you can create a simpler interface to let other web pages get the content you need (for example by allowing JSONP requests).
If the web pages you need to fetch content from are not controlled by you, the only solution I see is to fetch the content server side, logging in from the server directly (I know you don't want to do that, but I don't see any other possibility if the previous ones are not practicable).
Hope it helps.
Actually I've seen Cross Domain jQuery .load request before, here: http://james.padolsey.com/javascript/cross-domain-requests-with-jquery/
The author claims that codes like these found on that page
$('#container').load('http://google.com'); // SERIOUSLY!
$.ajax({
url: 'http://news.bbc.co.uk',
type: 'GET',
success: function(res) {
var headline = $(res.responseText).find('a.tsh').text();
alert(headline);
}
});
// Works with $.get too!
would work. (The BBC code might not work because of the recent redesign, but you get the idea)
Apparently it is using YQL wrapped into a jQuery plugin to do the trick. Now I cannot say I fully understand what he is doing there but it appears to work, and fits the bill. Once you load the data I suppose it is a simple matter of filtering out the data that you need.
If you prefer something that works at the browser level, may I suggest Mozilla's Jetpack framework for lightweight extensions. I've not yet read the documentations in its entirety but it should contain the APIs needed for this to work.
There are various ways to go about this in AJAX, I'm going to show the jQuery way for brevity as one option, though you could do this in vanilla JavaScript as well.
Instead of an <iframe> you can just use a container, let's say a <div> like this:
<div id="description_container"></div>
Then to load it:
$(function() {
$("#get_description_button").click(function() {
$("#description_container").load($("input").val() + " #description");
});
});
This uses the .load() method, which takes a string in the format .load("url selector"), then takes that element from the fetched page and places its content inside the container you're loading into, in this case #description_container.
This is just the jQuery route, mainly to illustrate that yes, you can do what you want, but you don't have to do it exactly like this, just showing the concept is getting what you want from an AJAX request, rather than in an <iframe>.
Your description sounds like you are fetching pages from the same domain (you said that you need to be logged in and have session credentials), so have you tried an async request via XMLHttpRequest? It might complain if the HTML on a page is particularly messed up, but you should still be able to get the raw text via .responseText and extract what you need with a regex.
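As a rough illustration of the regex step, the description block could be pulled out of the response text along these lines. This is a sketch, not a general HTML parser: extractDescription is a hypothetical helper, and it assumes the block contains no nested <div>, since the pattern stops at the first closing </div>.

```javascript
// Hypothetical helper: returns the inner HTML of <div id="description">,
// or null if it is not found. Assumes the block contains no nested <div>,
// because the lazy match ends at the first closing </div>.
function extractDescription(html) {
  var pattern = /<div\b[^>]*\bid=["']description["'][^>]*>([\s\S]*?)<\/div>/i;
  var match = pattern.exec(html);
  return match ? match[1].trim() : null;
}
```

Since only the raw text is requested, the images, stylesheets and scripts linked from the page are never downloaded; in a browser you would pass xhr.responseText to the function once the request completes.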
