Mirror a website with httrack while executing javascript

I want to save a mirror of www.youtube.com/tv. I obviously do not want to save the videos; I want a local copy of the code that runs the site, and everything else can stay remote. The code I want is mainly contained in 2 files: live.js and app-prod.js.
I tried using httrack, but I have issues parsing the JavaScript to load anything past the first file, live.js. The %P parameter does not help.
httrack www.youtube.com/tv +* -r6 --mirror -%P -j
It doesn't go further than live.js because some javascript needs to be executed to load the next file.
I know I can do this manually with any browser. I want to automate the process.
Is httrack able to do this by itself?
If yes, how?
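Note that httrack on its own does not execute JavaScript, which is exactly why the crawl stops at live.js. As a hedged sketch of how the manual browser step could be automated outside httrack, here is a hypothetical Node.js script using Puppeteer; the mirror/ output directory and the .js-only filter are illustrative assumptions, not part of the original question:
// Let the page's JavaScript run in a headless browser, then save the rendered
// HTML plus every .js file the page actually requested (e.g. live.js, app-prod.js).
const fs = require('fs');
const path = require('path');
const puppeteer = require('puppeteer'); // npm install puppeteer

(async () => {
  fs.mkdirSync('mirror', { recursive: true }); // local output directory (assumption)
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Save every JavaScript response loaded while the page's own code executes.
  page.on('response', async (response) => {
    if (response.url().split('?')[0].endsWith('.js')) {
      try {
        const name = path.basename(new URL(response.url()).pathname);
        fs.writeFileSync(path.join('mirror', name), await response.text());
      } catch (e) { /* some responses (e.g. redirects) have no readable body */ }
    }
  });

  await page.goto('https://www.youtube.com/tv', { waitUntil: 'networkidle0' });
  fs.writeFileSync(path.join('mirror', 'index.html'), await page.content());
  await browser.close();
})();
This does not answer whether httrack can do it by itself, but it automates the manual browser step described above.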

Related

Is it possible to use javascript to set the return value of an html file called from a batch file?

In Win7, I want to pass an arbitrary integer as the return code or Errorlevel from an HTML/Javascript file back to the batch file that started it.
Is there some way I can use javascript in the HTML program to set the return or Errorlevel code?
Edit:
WooHoo! Thanks! Yeah, my IE executes the sample HTA code.
That leaves the question of how I can manipulate the Errorlevel code from within JavaScript in an HTA program. The Wikipedia blurb says I can control the system from an HTA file, but JavaScript doesn't seem to have any commands to do so, even though HTA allows it.
I still feel like one of those paralyzed 'locked-in' patients who have to communicate by trying to blink their eyes.
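One common workaround, and only a hedged sketch rather than anything this thread confirms: since an HTA can create ActiveX objects, its JScript can write the desired code to a small file that the calling batch script reads after mshta exits. The file path below is purely illustrative.
// Hypothetical JScript inside the HTA: persist the "return code" for the caller.
function exitWithCode(code) {
    var fso = new ActiveXObject("Scripting.FileSystemObject");
    var f = fso.CreateTextFile("C:\\temp\\hta_exit_code.txt", true); // overwrite if present
    f.WriteLine(code);
    f.Close();
    window.close(); // end the HTA; the batch file then reads the file
}
On the batch side, something like set /p RC=<C:\temp\hta_exit_code.txt followed by exit /b %RC% would pick the value up (again, an assumption about how the caller is written).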

Is there a way to clear/remove a specific resources from browser cache rather than clearing entire cache?

For instance, in Chrome: I'm working on a web app (which is heavy, taking ~5 seconds to load) that has a lot of static resources (JS files and CSS) to load the first time. To reflect changes to one JS file, I need to reload the web page with "Empty Cache and Hard Reload".
If there were a way to remove only specific resources (JS files) from the cache, to force a refetch from the server, my testing time could be reduced to a great extent.
One technique is to add a random parameter to the URL of assets you don't want cached.
Depending on what your server-side language is, you might be able to do something like the following:
<script src="my.js?_=<%= encode(new Date().toString()) %>"></script>
If you have a PHP backend, you could tack on a random number to the JS file URL:
<script type="text/javascript" src="/script.js?<?php echo time(); ?>"></script>
This will cause the browser to re-fetch the file from the server every time, because the URL differs on each load.
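If the script tag is injected from JavaScript at runtime rather than emitted by the server, the same idea can be applied on the client; a minimal sketch, with the file name being just an example:
// Append a timestamp so the URL differs on every load and the browser refetches it.
var s = document.createElement('script');
s.src = '/script.js?_=' + Date.now();
document.getElementsByTagName('head')[0].appendChild(s);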
Specific resources can be reloaded individually if you change the date and time of the files on the server. "Clearing cache" is not as easy as it should be. Instead of clearing the cache in my browsers, I realized that "touching" the cached files on the server changes the date and time of the source files (tested on Edge, Chrome and Firefox), and most browsers will then automatically download a fresh copy of whatever is on your server (code, graphics, any multimedia too). I suggest you copy the most current scripts to the server and "do the touch thing" before your program runs; that changes the date and time of all your problem files to the current date and time, and the browser then downloads a fresh copy:
<?php
// Update the files' modification times so cached copies are treated as stale.
touch('/www/sample/file1.css');
touch('/www/sample/file2.css');
touch('/www/sample/file3.css');
?>
then ... the rest of your program...
It took me some time to resolve this issue, as browsers respond differently to different commands, but they all check the timestamps of the files and compare them to the copy downloaded in your browser; if the date and time differ, they do the refresh. If you can't go the supposedly right way, there is always another usable solution. Best regards and happy camping. By the way, touch() or an equivalent exists in many programming languages, including JavaScript, bash, sh and PHP, and you can include or call it from HTML.

Javascript parsing time vs http request

I'm working on a large project which is extensible with modules. Every module can have its own JavaScript file, which may be needed only on one page, on multiple pages that use the module, or even on all pages if it is a global extension.
Right now I combine all .js files into one file whenever they get updated or a new module gets installed. The client only has to load one "big" .js file, but has to parse it on every page. Let's assume someone has installed a lot of modules and the .js file grows to 1MB-2MB. Does it make sense to continue down this route, or should I include each .js file only when it is needed?
That would result in maybe 10-15 more HTTP requests per page. At the same time, the parsing time for the .js files would be reduced, since only a small portion needs to be loaded for each page, and the browser wouldn't try to execute JS code that isn't required for the current page or can't even run there.
Comparing both scenarios is rather difficult for me, since I would have to rewrite a lot of code. Before I continue, I would like to know if someone has encountered a similar problem and how he/she solved it. My biggest concern is that the parsing time of the JS files grows too much. Usually network latency is the biggest concern, but I've never had to deal with so many possible modules/extensions -> JS files.
If these 2 conditions are true, then it doesn't really matter which path you take, as long as you follow the Requirement (below).
Condition 1:
The javascript files are being run inside of a standard browser, meaning they are not going to be run inside of an apple ios uiWebView app (html5 iphone/ipad app)
Condition 2:
The initial page load time does not matter so much. In other words, this is more of a web application than a web page. So users log in each day, stay logged in for a long time, do lots of stuff... log out... come back the next day...
Requirement:
Put the JavaScript file(s), CSS files and all images under a /cache directory on the web server. Tell the web server to send a max-age of 1 year in the Cache-Control header (only for this directory and its sub-directories). Then, once the browser has downloaded a file, it will never again waste a round trip to the web server asking whether it has the most recent version.
You will then need to implement JavaScript versioning; usually this is done by adding "?jsver=1" to the JS include line and incrementing the version with each change.
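As a rough illustration of that requirement (the answer does not name a web server, so Express here is purely an assumption): everything under /cache gets a one-year max-age, and bumping ?jsver=N in the include line is what actually forces a refetch.
// Hypothetical Express setup serving /cache with a far-future Cache-Control header.
var express = require('express');
var app = express();

app.use('/cache', express.static('cache', { maxAge: '365d', immutable: true }));

app.listen(3000);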
Use the Chrome inspector to make sure this is set up correctly. After the first request, the browser should never send an ETag or ask the web server for the file again for 1 year. (Hard reloads will download the file again, so test using links and a standard navigation path a user would normally take. Also watch the web server log to see which requests are being served.)
Good browsers will compile the JavaScript to machine code, and the compiled code will sit in the browser's cache waiting for execution. That's why Condition #1 is important. Today, the only browser which will not JIT-compile JS code is Apple's Safari inside of uiWebView, which only applies if you are running HTML/JS inside an Apple app (downloaded from the App Store).
Hope this makes sense. I've done these things and have reduced network round trips considerably. Read up on ETags and how browsers make round trips to determine whether they are using the current version of JS/CSS/images.
On the other hand, if you're building a web site and you want to optimize for the first-time visitor, then less is better. Only have the browser download what is absolutely needed for the first page view.
You really REALLY should be using on-demand JavaScript. Only load what 90% of users will use. For things most people won't use, keep them separate and load them on demand. Also, you should seriously reconsider what you're doing if you've got upwards of two megabytes of JavaScript after compression.
// Load scripts/<url>.js on demand. f is the name of a function that the
// script is expected to define; it doubles as the "has it loaded yet?" check.
function ondemand(url, f, exe)
{
  // Already loaded on an earlier call? Just run the function.
  if (eval('typeof ' + f) == 'function') {eval(f + '();');}
  else
  {
    var h = document.getElementsByTagName('head')[0];
    var js = document.createElement('script');
    js.setAttribute('defer', 'defer');
    js.setAttribute('src', 'scripts/' + url + '.js');
    js.setAttribute('type', document.getElementsByTagName('script')[0].getAttribute('type'));
    h.appendChild(js);
    // Poll until the expected function shows up (or give up after ~10 seconds).
    ondemand_poll(f, 0, exe);
    h.appendChild(document.createTextNode('\n'));
  }
}

// Check every 50 ms (up to 200 times) whether the script has defined f;
// if exe is 1, call f as soon as it exists.
function ondemand_poll(f, i, exe)
{
  if (i < 200)
  {
    setTimeout(function() {
      if (eval('typeof ' + f) == 'function') {if (exe == 1) {eval(f + '();');}}
      else {i++; ondemand_poll(f, i, exe);}
    }, 50);
  }
  else {alert('Error: could not load \'' + f + '\', certain features on this page may not work as intended.\n\nReloading the page may correct the problem; if it does not, check your internet connection.');}
}
Example usage: load example.js (first parameter), poll for the function example_init1() (second parameter), and pass 1 (third parameter) to execute that function once the polling finds it:
function example() {ondemand('example','example_init1',1);}

How to mirror a site with a JavaScript menu?

I’m trying to mirror a site that uses a crazy JavaScript menu generated on the client. Both wget and httrack fail to download the whole site, because the links are simply not there until the JS code runs. What can I do?
I have tried loading the main index page into the browser. That runs the JS code, the menu gets constructed, and I can dump the resulting DOM into an HTML file and mirror from this file onward. That downloads more files, as the links are now present in the source. But obviously the mirroring soon breaks on other, freshly downloaded pages that still contain the uninterpreted JS menu.
I thought about replacing the menu part of every downloaded page with a static version of the menu, but I can’t find any wget or httrack flags that would let me run the downloaded files through an external command. I could write a simple filtering proxy, but that starts to sound extreme. Other ideas?
I've used HtmlUnit to great success even on sites where things are obfuscated by dynamic elements.
In my case it won’t help, but maybe it will be useful to somebody; this is how a simple filtering proxy looks in Perl:
#!/usr/bin/env perl
use HTTP::Proxy;
use HTTP::Proxy::BodyFilter::simple;

my $proxy = HTTP::Proxy->new(port => 3128);

# Rewrite every text/html response body on the fly;
# the s/foo/bar/ substitution is only a placeholder for the real menu fix.
$proxy->push_filter(
    mime     => 'text/html',
    response => HTTP::Proxy::BodyFilter::simple->new(
        sub { ${ $_[1] } =~ s/foo/bar/g }
    )
);

$proxy->start;

Retrieving a csv file from web page

I would like to save a CSV file from a web page. However, the link on the page does not lead directly to the file; it calls some kind of JavaScript, which leads to the opening of the file. In other words, there is no explicit URL for the file I want to download, or at least I don't know what it should be.
I found a way to download the file by activating Internet Explorer, going to the web page, pressing the link button and then saving the file through the dialog box.
This is pretty ugly, and I am wondering if there is a more elegant (and fast) method to retrieve the file without using Internet Explorer (e.g. by using the urllib.urlretrieve method).
The JavaScript is of the following form (see the comment; it does not let me publish the source code...):
"CSV"
Any ideas?
Sasha
You can look at what the javascript function is doing, and it should tell you exactly where it's downloading from.
I had exactly this sort of problem a year or two back; I ended up installing the Rhino JavaScript engine, grepping the JavaScript out of the target document, evaluating the URL within Rhino, and then fetching the result.
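In the same spirit, a hedged sketch of that approach with Node's built-in vm module standing in for Rhino; the regex, the sandbox, and the csvUrl name are all assumptions about what the page's script might look like:
// Pull the inline script out of the fetched page, evaluate it in a sandbox,
// and read back whatever URL it builds; the CSV can then be fetched directly.
var vm = require('vm');

function extractCsvUrl(pageHtml) {
    // Assumption: the link-building code lives in the first inline <script> block.
    var match = pageHtml.match(/<script[^>]*>([\s\S]*?)<\/script>/i);
    if (!match) { throw new Error('no inline script found'); }

    var sandbox = {};
    vm.createContext(sandbox);
    vm.runInContext(match[1], sandbox);

    return sandbox.csvUrl; // hypothetical name; depends on the real page script
}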
