How to mirror a site with a JavaScript menu?

I’m trying to mirror a site that uses a crazy JavaScript menu generated on the client. Both wget and httrack fail to download the whole site, because the links are simply not there until the JS code runs. What can I do?
I have tried loading the main index page into the browser. That runs the JS code, the menu gets constructed and I can dump the resulting DOM into an HTML file & mirror from this file on. That downloads more files, as the links are already in the source. But obviously the mirroring soon breaks on other, freshly downloaded pages that contain the uninterpreted JS menu.
I thought about replacing the menu part of every downloaded page with a static version of the menu, but I can’t find any wget or httrack flags that would let me run the downloaded files through an external command. I could write a simple filtering proxy, but that starts to sound extreme. Other ideas?

I've used HtmlUnit to great success even on sites where things are obfuscated by dynamic elements.

In my case it won’t help, but maybe it will be useful to somebody; this is how a simple filtering proxy looks in Perl:
#!/usr/bin/env perl
use HTTP::Proxy;
use HTTP::Proxy::BodyFilter::simple;
my $proxy = HTTP::Proxy->new(port => 3128);
$proxy->push_filter(
    mime     => 'text/html',
    response => HTTP::Proxy::BodyFilter::simple->new(
        sub { ${ $_[1] } =~ s/foo/bar/g }
    )
);
$proxy->start;
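For the menu case, the s/foo/bar/ placeholder in the filter would become the menu substitution itself. Here is a minimal sketch of that substitution in JavaScript, assuming the menu is injected by a single external script tag. The file name menu.js, the function name inlineStaticMenu and the static markup are all hypothetical; the real page may build its menu differently.

```javascript
// Sketch: swap the <script> tag that builds the menu for a static copy of the menu.
// Assumption: the menu comes from one external script (called "menu.js" here).
function inlineStaticMenu(pageHtml, staticMenuHtml, menuScriptName) {
  // Escape regex metacharacters in the file name (e.g. the dot in "menu.js").
  const escapedName = menuScriptName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const menuScriptTag = new RegExp(
    '<script\\b[^>]*\\bsrc=["\'][^"\']*' + escapedName + '["\'][^>]*>\\s*</script>',
    'gi'
  );
  // A function replacement keeps any "$" in the menu markup literal.
  return pageHtml.replace(menuScriptTag, () => staticMenuHtml);
}
```

The static markup would be the menu dumped once from the browser DOM, as described in the question.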


How to read javascript file in frontend javascript/browser

Please read carefully before marking as dupe.
I want to read a javascript file on frontend. The javascript file is obviously being used as a script on the webpage. I want to read that javascript file as text, and verify if correct version of it is being loaded in the browser. From different chunks of text in the js file, I can identify what version is actually being used in the end user's browser. The js file is main.js which is generated by angular build.
I know we can do something like creating a global variable for version or some mature version management. But currently, on production site, that will mean a new release, which is couple of months from now. Only option I have right now is html/js page, which can be directly served from production site, without waiting for new release.
So my question is: is it possible to read a JavaScript file as text from html/js code in the browser?
One idea:
Use the Fetch API to request the script file asynchronously:
https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API
Then use the text() method, which returns a promise that resolves to the text content:
fetch('http://localhost:8100/scripts.js').then((res) => res.text()).then(scriptContent => {
  // scriptContent contains the content of your script
  // if (scriptContent.includes('version : 1.1.1')) { ... }
  console.log(scriptContent);
});
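If each release contains some text that the others lack, the check on scriptContent can be a simple lookup. This is only a sketch: the marker strings and version labels below are made up for illustration and would have to come from your own builds.

```javascript
// Hypothetical markers: a snippet of text known to appear only in that build.
const versionMarkers = {
  '1.0.0': 'legacyCheckout(',
  '1.1.0': 'newCheckoutFlow(',
};

// Return the first version whose marker occurs in the script text, or null.
function detectVersion(scriptContent, markers) {
  for (const [version, marker] of Object.entries(markers)) {
    if (scriptContent.includes(marker)) return version;
  }
  return null;
}
```

Inside the fetch callback it would be called as detectVersion(scriptContent, versionMarkers).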
This is definitely not an efficient way, but since you want to work without a new release or a version management system, here is the idea.
Assume the checksum of the V1.0 file (MD5, SHA-256, or anything else) equals X; you calculate X before writing the code.
// checksum() is a placeholder for whatever hash function you use
if (checksum(loadedScriptText) !== X) {
  location.reload();
}
This would help as a security feature too, since it's an important project.
Another way to handle this situation is to change the name of the main.js file, if that is possible.

Make a small HTML application update a JSON file

I want to make a local HTML application read and update a JSON file and use its content to display HTML content. Alas, I'm stuck at the very first step, as I can't seem to setup any sort of test file that simply notices and reads a JSON file. From what I see online, I need to use other libraries. I attempted to use require.js but I can't make it work and the documentation doesn't help me.
I imported the require.js with a tag and attempt to launch something out of what I got from the documentation, but there's nothing to do. It doesn't look like it's willing to take .json files.
requirejs([
'example'
], function(example) {
const config = require('./config.json')
});
My issue is to get the program to read the file. From there I believe I can make the display of it, but this JS thing is all alien to me.
The recommended way would be to run a web server, or to use something like Electron and build a desktop app (as @chrisG points out in the comments). But if you want to do this in the browser without a web server, you could do something like:
Run Chrome with the --allow-file-access-from-files (or however you allow local file access in your browser of choice)
Put your JSON in a js file and load it (to just do this you don't need the flag, but if you want to use absolute path you'll need it)
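The second option could look roughly like this. The file names data.js and app.js, the variable appData and its fields are assumptions for illustration. Both files would be loaded with ordinary script tags placed at the end of the body, data.js first.

```javascript
// data.js (hypothetical name): the JSON content assigned to a global variable.
var appData = { "title": "My notes", "items": ["first", "second"] };

// app.js (hypothetical name): build the HTML from that data.
function renderList(data) {
  var items = data.items.map(function (item) {
    return '<li>' + item + '</li>';
  });
  return '<h1>' + data.title + '</h1><ul>' + items.join('') + '</ul>';
}

// Only touch the page when running in a browser.
if (typeof document !== 'undefined') {
  document.body.innerHTML = renderList(appData);
}
```

This covers reading and displaying only; a page opened from the file system cannot write the changed data back to data.js by itself.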

Using the `runScript` function to run a JXA script does not allow parameters

I use JXA to script workflows for Alfred 2 and recently tried to run a script from within another script. I need to pass some text between the scripts, so I decided to use parameters, but whenever I try to pass a string, a number, an array or anything else that isn't an object to it, it gives the error "Error on line 4: Error: An error occurred.". If I do pass an object, the second script (the one being run by the first script) receives an empty object rather than the one passed to it. The same happens when the first script is an AppleScript, but if the second script is an AppleScript, it all works perfectly. Passing arguments through osascript from the command line also works. Is the API broken or is there something that I'm doing wrong?
First script:
var app = Application.currentApplication();
app.includeStandardAdditions = true;
app.runScript(new Path("/path/to/second/script.scpt"), { withParameters: "Hello World!" });
Second script:
function run(args) {
  return args;
}
Edit:
If the second script is edited as below, the dialogue is displayed but the runScript method of the first script still returns an error.
function run(args) {
  var app = Application.currentApplication();
  app.includeStandardAdditions = true;
  app.displayDialog(args.toString());
  return args;
}
Edit 2:
The runScript function actually seems to be working perfectly apart from the problem with the parameters. The error isn't actually being thrown, just displayed by the Script Editor, and execution continues after the call to runScript as if nothing had happened. The returned value also works perfectly, despite the parameters not working.
A note about Alfred 2 workflows
To run some code in Alfred 2 (triggered by a search, keyboard command, etc.), it must be typed into a box in the app, not in a file.
The box to enter code in is very small and does not provide syntax highlighting, and this makes editing code difficult and annoying. For smaller files, it is okay, but for larger files it is easier to use a short script to run a script file. I've tried Bash, which would be the simplest option, but Alfred 2 does not provide an option to escape single quotes. I also cannot use script libraries (to my knowledge, correct me if I'm wrong), as the code is not in a script bundle and all of the required files need to be within the same folder (for exportation reasons).
I don't know how to avoid the runScript error, but I can suggest an alternative approach: load the script as a script library.
Using a script library
Turning a script into a library can be as simple as saving the script to ~/Library/Script Libraries. If your script file is named script.scpt and has a run handler, and you save it to the Script Libraries folder, you can then invoke it from another script like so:
Library("script").run(["Hello, world!"])
Script libraries are documented in the JXA release notes for OS X 10.10, in the WWDC 2014 session video introducing JXA, and in the AppleScript Language Guide.
Embedding a script library inside of a script bundle
According to the AppleScript Language Guide documentation for script libraries, there is a search policy for finding Script Libraries folders. The first place it searches is:
If the script that references the library is a bundle, the script’s bundle Resources directory. This means that scripts may be packaged and distributed with the libraries they use.
To apply this to the example given in the question, you would need to re-save the first script as a script bundle, and then embed the second script inside of the first script.
For example, if you re-save the first script as script.scptd, then you could save the second script embedded.scpt to script.scptd/Resources/Script Libraries/embedded.scpt. You should then be able to use Library('embedded') to access the script library.
To re-save an existing script as a script bundle, you can either use the File > Export... menu item in Script Editor, or you can hold down option while selecting the File menu to reveal the File > Save As... menu item. The File Format pop-up menu lets you choose the Script bundle format.
Once you have a script bundle open, you can reveal the bundle content panel by using the Show Bundle Contents menu item or toolbar button. You can then use the gear menu to create the Script Libraries folder inside of the Resources folder, and then you can drag a script into that folder.

Use browser to run custom JavaScript on page (client side) to simulate clicking? How to do?

I want to automatically grab some content from a page.
I wonder if it is possible to:
1. Run my own JavaScript on the page after the page has loaded (I use Firefox. I cannot change the content of the page; I just want to run JS in my browser). The script will use getElementById or a similar method to get the link to the next page.
2. Run JavaScript to collect the content I am interested in (some URLs) on that page and store those URLs in a local file.
3. Go to the next page (the next page will really get loaded in my browser, but I do not need to intervene at all) and repeat step 1 and step 2 until there is no next page.
The classic way to do this is to write a Perl script using LWP or PHP script using CURL, etc. But that is all server side. I wonder if I can do it client side.
I do something rather similar, actually.
By using GreaseMonkey, you can write a user-script that will interact with the pages however you need. You can get the next page link and scroll things as you like.
You can also store any data locally, within Firefox, through some functions called GM_getValue and GM_setValue.
I take the lazy way out. I just generate a long list of the URLs that I find when navigating the pages. I use a crude "document.write" method and dump out my list of URLs as a batch file that runs wget.
At that point I copy-and-paste the batch file then run it.
If you need to run this often enough that it should be automated, there used to be a way to turn GreaseMonkey scripts into Firefox extensions, that have access to more power.
Another option is, AFAIK, currently Chrome-only. You can collect whatever information you need and build a large file from it, then use the download attribute of a link to get a single-click way to save things.
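A rough sketch of the download-attribute approach follows. The function names and the link text are hypothetical; the list is packed into a data: URI so that no server is needed.

```javascript
// Turn the collected URLs into a data: URI holding one URL per line.
function buildDownloadHref(urls) {
  return 'data:text/plain;charset=utf-8,' + encodeURIComponent(urls.join('\n'));
}

// Browser-only part: add a link that saves the list when clicked.
function addDownloadLink(urls, fileName) {
  const a = document.createElement('a');
  a.href = buildDownloadHref(urls);
  a.download = fileName; // the download attribute suggests a file name
  a.textContent = 'Save ' + urls.length + ' URLs';
  document.body.appendChild(a);
  return a;
}
```

Calling addDownloadLink(collectedUrls, 'urls.txt') on the last page would leave one link to click.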
Update
I was going to share the full code for what I was doing, but it was so tied to a particular website that it wouldn't really have helped, so I'll go for a more "general" solution.
Warning: this code was typed on the fly and may not actually be correct.
// Define the container
// If you are crawling multiple pages, you'd want to load this from
// localStorage.
var savedLinks = [];
// Walk through the document and build the links.
for (var i = 0; i < document.links.length; i++) {
  var link = document.links[i];
  var data = {
    url: link.href,         // anchors expose their address as href
    desc: link.textContent  // visible text of the link
  };
  savedLinks.push(data);
}
// Here you'd want to save your data via localStorage.
// If not on the last page, find the 'next' button and load the next page
// [load next page here]
// If we *are* on the last page, use document.write to output our list.
//
// Note: document.write totally destroys the current document. It really is quite
// an ugly way to do it, but in this case it works.
document.write(JSON.stringify(savedLinks, null, 2));
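The localStorage step that the comments mention could be sketched as follows. The storage key and the helper name mergeLinks are assumptions; localStorage is shared per site, so this only carries the list across pages of the same origin.

```javascript
// Hypothetical storage key; any unique string works.
const STORAGE_KEY = 'savedLinks';

// Combine previously stored links (a JSON string or null) with this page's
// links, skipping URLs that are already recorded.
function mergeLinks(storedJson, pageLinks) {
  const merged = storedJson ? JSON.parse(storedJson) : [];
  const seen = new Set(merged.map((item) => item.url));
  for (const item of pageLinks) {
    if (!seen.has(item.url)) {
      seen.add(item.url);
      merged.push(item);
    }
  }
  return merged;
}

// Browser-only part: load the stored list, merge this page's links, save it back.
if (typeof document !== 'undefined' && typeof localStorage !== 'undefined') {
  const pageLinks = Array.from(document.links, (l) => ({ url: l.href, desc: l.textContent }));
  const all = mergeLinks(localStorage.getItem(STORAGE_KEY), pageLinks);
  localStorage.setItem(STORAGE_KEY, JSON.stringify(all));
}
```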
Selenium/WebDriver will let you write a simple Java/Ruby/PHP app that launches Firefox and uses its JavaScript engine to interact with the page in the browser.
Or, if the web page does not require JavaScript to make the content you are interested in available, you could use an HTML parser in your favourite language and leave the browser out of it.
If you want to do it in JavaScript in Firefox, you could probably do it in a Greasemonkey script.

Retrieving a csv file from web page

I would like to save a CSV file from a web page. However, the link on the page does not lead directly to the file; instead it calls some kind of JavaScript, which leads to the opening of the file. In other words, there is no explicit URL for the file I want to download, or at least I don't know what it should be.
I found a way to download the file by launching Internet Explorer, going to the web page, pressing the link button and then saving the file through the dialog box.
This is pretty ugly, and I am wondering if there is a more elegant (and faster) method to retrieve the file without using Internet Explorer (e.g. by using urllib's urlretrieve method).
The javascript is of the following form (see the comment, it does not let publish the source code...):
"CSV"
Any ideas?
Sasha
You can look at what the javascript function is doing, and it should tell you exactly where it's downloading from.
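As a sketch of that idea: if the link turns out to be something like javascript:openFile('/export/data.csv') (a made-up example, since the real source was not shown), the path can be pulled out of the call and resolved against the page address. The function name and the URLs in the usage note are hypothetical.

```javascript
// Pull the first quoted string out of a javascript: link or onclick handler
// and resolve it against the address of the page it was found on.
function extractTargetUrl(handlerCode, pageUrl) {
  const match = handlerCode.match(/["']([^"']+)["']/);
  if (!match) return null;
  return new URL(match[1], pageUrl).href;
}
```

For example, extractTargetUrl("javascript:openFile('/export/data.csv')", 'http://example.com/reports/index.html') gives http://example.com/export/data.csv, which can then be fetched directly without a browser. If the function builds the address from several pieces, it has to be read more carefully than this.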
I had exactly this sort of problem a year or two back. I ended up installing the Rhino JavaScript engine, grepping the JavaScript out of the target document, evaluating the URL within Rhino, and then fetching the result.
