How to rewrite URLs referenced by Javascript code? - javascript

This is a follow-up to my popular and technically challenging question, HTML injection into someone else's website?
To recap: I'd like to demo my technology to customers without actually modifying their live website (e.g. demoing the idea of Stackoverflow financial bounties, without modifying the live site). Essentially, I'm trying to create a server-side version of Greasemonkey.
I've implemented the mirror as follows:
1. A request comes in to http://myserver.com/forward?uri=[remote]
2. My server opens a connection to [remote], pulls down the data, and returns its body/headers in response to the request from step 1.
I chose this syntax because I needed to handle requests from multiple domains (meaning, if stackoverflow.com links to meta.stackoverflow.com I need to handle both domains from the same forwarding server).
I have managed to rewrite links in the HTML and CSS files so they are relative to my server. The final hurdle is rewriting URLs referenced by Javascript files.
What is the best way to programmatically rewrite URLs referenced by someone else's Javascript code? Is this even technically doable?
Discussion
I'll give you an example of the technical hurdle I am facing. Take http://www.honda.com/ for example. They embed a Flash element on the page, but instead of embedding <object> directly, they use Javascript to document.write() the <object> tag containing the URL.
First attempt
Use https://stackoverflow.com/a/14570614/14731 to listen for DOM change events. Instead of trying to rewrite URLs in the Javascript code, wait for it to modify the DOM and rewrite the URLs on the fly.
Intercept all XMLHttpRequest requests using https://stackoverflow.com/a/629782/14731
Ideally we want to intercept DOM changes before they render, so the browser does not request URLs before we have a chance to rewrite them; a sketch of this idea follows the resource list below.
Related resources:
http://blog.vjeux.com/2011/javascript/intercept-and-alter-script-include.html
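To make the first attempt concrete, here is a minimal client-side sketch (not production code) that wraps document.write so absolute URLs are rewritten through the forwarding proxy before the browser ever requests them; the proxy address is the one from the question:

    // wrap document.write: rewrite quoted absolute URLs to go through the proxy
    (function () {
        var originalWrite = document.write;
        document.write = function (markup) {
            var rewritten = markup.replace(
                /(["'])(https?:\/\/[^"']+)\1/g,
                function (match, quote, url) {
                    return quote + 'http://myserver.com/forward?uri=' +
                        encodeURIComponent(url) + quote;
                }
            );
            originalWrite.call(document, rewritten);
        };
    })();

This catches the Honda-style document.write() case, but not URLs that scripts assemble later or pass straight into plugins.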
Second attempt
A server-side solution will not work. Even if I can rewrite all DOM URLs, I've seen an example where an embedded Flash application references URLs stored in Javascript variables. There is no programmatic way to detect that these variables represent URLs because the Flash application is opaque.
Next, I plan on trying a client-side solution. I will load the original website in one frame, and manipulate its contents using Javascript in a second (hidden) frame. I hope to be able to inject new DOM elements (to demo my product) without having to rewrite the existing elements.
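A minimal sketch of that plan (the frame name and the injected element are hypothetical); it relies on the proxy serving both frames from the same origin:

    // run from the hidden frame: inject a demo element into the visible
    // frame's document without touching its existing elements
    var target = parent.frames['site'].document;
    var banner = target.createElement('div');
    banner.textContent = 'Demo: financial bounties';
    target.body.insertBefore(banner, target.body.firstChild);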

Very challenging and interesting task. I would start by saving the javascript files on my server and referencing them from the HTML served. Then I would find the URLs in the files (using a regex or something) and replace them with the wanted values. I know it is not very fast or very dynamic, but I believe it would be the easiest approach to implement.
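A rough sketch of that replacement step (the regex is deliberately naive, and the proxy URL is the one from the question):

    // naive rewrite pass over saved javascript source: point every
    // absolute URL at the forwarding server
    function rewriteUrls(source) {
        return source.replace(/https?:\/\/[^\s"')]+/g, function (url) {
            return 'http://myserver.com/forward?uri=' + encodeURIComponent(url);
        });
    }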
Hope I helped!

Answering my own question.
After much research, I find this technique works best: https://stackoverflow.com/a/23231268/14731
In other words, there doesn't seem to be a general algorithm to rewrite links. Patching them by hand isn't as much work as you'd expect.

Related

Should Javascript be used to modify HTML?

I just recently started learning javascript and have a question regarding its 'proper use'. I'm still trying to identify the role of Javascript within a website, and I'm curious whether or not it would be considered OK to have Javascript modify the HTML of a web page.
Let's say I have a panel on a web page. This panel houses a list. I would like users to be prompted to add items to this list.
I was thinking that it would be possible to use Javascript to generate list items to add to the list. However, this would be modifying the actual number of HTML elements on the web page... For some reason, this just seems 'hacky'. When I think of HTML, I think of a static structure that should come to life with CSS and Javascript.
So my question: is it considered okay to have Javascript modify the HTML of a web page? What about the case of adding items to a list?
Thank you!
Javascript is a programming language designed so it can modify the document that is being displayed (the DOM); the actual HTML is never touched.
Javascript has a place on a website, and modifying the document/DOM is perfectly acceptable; without it, javascript would be almost useless. CSS is great for certain tasks, but you can't do everything in CSS, though CSS3 comes pretty close for many tasks.
Rewriting the entire DOM IS bad practice, but using it to shift an element's position based on an action, or creating a popup overlay is perfectly acceptable.
Remember the golden rule:
Modify as little as possible to accomplish the goal.
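Applied to the list case from the question, the standard DOM approach looks roughly like this (the element ids are hypothetical):

    // read the user's input and append it to the panel's list as a new item
    var list = document.getElementById('panel-list');
    var item = document.createElement('li');
    item.textContent = document.getElementById('new-item-input').value;
    list.appendChild(item);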
What matters is the user's experience of the (HTML) document. The representation of "the document" can change by utilising a language like javascript that "manipulates the DOM", and the DOM is like an instance of the HTML document, or a "document session" if you will.
So in a way, yes, the HTML is touched: it is positively manhandled by javascript, indirectly and in a non-persistent way. But if you want to be pedantic... we say it isn't, and leave half the readers confused.
When to use javascript and when not to? It's swings and roundabouts. You need to learn (mostly from experience) when one particular tool suits the situation. It usually boils down to some core considerations:
HTML is for markup. Structure. Meaning.
CSS is for style, feel, appearance.
Javascript is for those situations where none of the above work.
And I'm neglecting to mention server-side processing with the obvious disclaimer that all processing that ought to be done in privacy is done on the server (with a programming language like PHP or Ruby for example).
Sometimes you get the grey area in-between where you can do something either way. Those situations you may ask yourself a question like... would it be processed quicker if the client (user's computer) processes it, or the server... and that's where experience comes in.
It depends on the case whether you should hand-craft the HTML directly or let JS do it.
If you have a static html page, just write your html and hand-craft the DOM. There is no need for JS to lend a hand here.
If you have a semi-static html page where user actions change part of it, let JS handle the changing part.
If you have a highly dynamic html page (like a single-page app), you should let JS render most of the html.
Using plain JS alone, however, is not the best approach in this age of rapid JS evolution. Learn it, but also learn to use good libraries and frameworks that can get you up to speed fast.
I recommend learning jQuery and Angular 2, which feel like learning a superset of JS straight away.
Short disclaimer: Javascript may modify DOM content in the browser, but it never changes the original HTML served from the web server.
The modern Web is unthinkable without javascript. JS makes static HTML interactive and improves the User Experience (UX). Like any sharp tool, JS can produce a masterpiece out of a nearly dead page, or cut the throat of blooming static content.
Everything is up to the developer.
When not to use JS
Do not use JS to deliver evergreen content to the page. Web bots (crawlers) don't run JS, and your message "I come to this world to testify to the truth" may end up "a voice crying out in the desert": non-indexed and thus unread.
When JS is a must
Every time your page visitor does something the page should respond with proper action (using JS or, if possible, just CSS).
This is especially important when a prospect fills in a form. To err is human, so a developer should use JS to help the visitor set wrong things right. In many cases this is possible without a request to the server, and in even more cases the answer must come from the server; JS is your best friend either way.
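As a small illustration of that client-side help (the form and field names are assumptions):

    // catch an obviously bad email address before the form is submitted
    document.getElementById('signup').onsubmit = function () {
        var email = this.elements['email'].value;
        if (email.indexOf('@') === -1) {
            alert('Please enter a valid email address.');
            return false; // block submission so the visitor can fix it
        }
    };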
Javascript never lives alone: the browser object is its trusted ally. Most modern browsers support XMLHttpRequest, A.K.A. AJAX (you may remember this name from ancient Greek history), which communicates with the server without reloading the page.
The idea of AJAX, which stands for Asynchronous Javascript And Xml, is to send a request and receive a response from the server asynchronously, without blocking the page in the browser.
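The classic pattern looks like this (the endpoint is an assumption):

    // ask the server something and update the page when the answer
    // arrives, without reloading
    var xhr = new XMLHttpRequest();
    xhr.open('GET', '/check-username?name=john', true); // true = asynchronous
    xhr.onreadystatechange = function () {
        if (xhr.readyState === 4 && xhr.status === 200) {
            document.getElementById('hint').textContent = xhr.responseText;
        }
    };
    xhr.send();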
Native javascript may be difficult for many beginner developers to understand. To make things easier there are a lot of JS libraries, with jQuery being the best known.
Returning to the OP's question, Should Javascript be used to modify HTML?
The answer is: It Depends.

How can I introduce modules to a legacy javascript project?

I'm a developer on a very large, many-page web app. We're making a push to improve the sanity of our javascript, so I'd like to introduce a module loader. Currently everything is done the old-fashioned way of slapping a bunch of script tags on a page and hoping for the best. Some restrictions are:
Our html pages are templated, inherited, and composed, such that the final page sent to the client brings together pieces from many different source html files. Any of these given files may depend on javascript resources introduced higher up the chain.
This must be achievable piecemeal. The code base is far too large to convert everything at once, so I'm looking for a good solution for new pages, and for migrating existing pages as needed.
This solution needs to coexist on the same page as existing, non-module javascript. Some things (like menus and analytics) exist on every page, and I can't remove global jquery, for instance, as it's used all over the place.
Basically I'd like a way to carve out safe spaces that use modules and modern dependency management. Most tutorials and articles I've read all target new projects.
Requirejs looks like a decent option, and I've played with it a bit. Its fully async nature is a hindrance in some cases though - I'd like to use requirejs to package up global resources but I can't do that if I can't control the execution order. Also, it sucks to have the main function of a page (say, rendering a table) happen after secondary things like analytics calls.
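For concreteness, the shim mechanism for wrapping existing globals looks roughly like this (the paths and module names are placeholders):

    // declare existing global scripts as shimmed modules so new AMD
    // modules can depend on them
    requirejs.config({
        paths: { jquery: 'lib/jquery.min' },
        shim: {
            'legacy/menus': { deps: ['jquery'], exports: 'Menus' }
        }
    });
    require(['jquery', 'legacy/menus'], function ($, Menus) {
        Menus.init(); // runs only after both dependencies have loaded
    });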
Any suggestions?
That's a lot to ask for. :-) So here are my thoughts on this:
Usually, when you load a web page, everything that was being displayed is wiped, so that the new incoming information does not interfere with what was already there. So you really have only a couple of options (that I know of) for keeping everything in memory in the browser while still loading new pages as needed.
The first (and easiest) way is to use FRAMES. These are not used much anymore from what I've seen, but each FRAME allows you to display a different web page. So you make one frame use 100% of the window and the second one use "*" so it isn't seen. You can then use the unseen frame to control what is going on in the visible frame.
The second way is to use Javascript to control everything. In this scenario you create a namespace area and then use jQuery's getScript() function to load in each page as you need it. I did this once by generating the HTML, converting it to hex via bin2hex(), sending it back as a javascript function, then unhexing it in the browser and applying it to the web page. If it was to completely replace the web page, you do just that; if it was an update, you insert it into the HTML already on the page. Any new javascript function always attaches itself to the namespace, and if a function is no longer needed it can be removed from the namespace to free up memory.
Why convert the HTML to hex and then send it? Because then you can use whatever characters you want (like single and double quotes) and it doesn't affect Javascript at all. There are fairly fast hex routines available on GitHub (see my toHex and fromHex routines).
One of the extra benefits of using getScript() is that it can sometimes make it almost impossible for anyone to see your code, a documented quirk of how getScript() works.
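For what it's worth, the namespace-plus-getScript() pattern described above might look like this (the module path and names are assumptions):

    // create the namespace once, then load page code on demand
    window.MyApp = window.MyApp || {};
    $.getScript('/modules/table.js', function () {
        MyApp.table.render('#content'); // the module attached itself to MyApp
    });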

Engineering a better solution than getting a whole HTML document with AJAX

I have a webpage that I am working on for a personal project to learn about javascript and web development. The page consists of a menu bar and some content. My idea is, when a link is clicked, to change the content of the page via AJAX, such that I:
fetch some new content from a different page,
swap out the old content with the new,
animate the content change with some pretty javascript visual effects.
I figure that this is a little more efficient than getting a whole document with a standard HTTP GET request, since the browser won't have to fetch the style sheets and scripts sourced in the <head> tag of the document. I should also add that I am fetching content solely from documents that are served by my web app that I have created and whose content I am fully aware of.
Anyway, I came across this answer on an SO question recently, and it got me wondering about the ideal way to engineer a solution that fits the requirements I have given for the web page. The way I see it, there are two solutions, neither of which seem ideal:
1. Configure the back-end (server) from which I am fetching such that it will return only the content, not the entire page, when asked for content, and then load that content in with AJAX (my current solution), or
2. Get the entire document with AJAX and then use a script to extract the content and load it into the page.
It seems to me that neither solution is quite right. For 1, it seems that I am splitting logic across two different places: the server has to be configured to serve content when asked for content, and the javascript has to know how to ask for it. For 2, this seems an inappropriate use of AJAX (according to the previously mentioned SO answer), given that I am asking for a whole page and then parsing it, when AJAX is meant to fetch small bits and pieces of information rather than whole documents.
So I am asking: which of these two solutions is better from an engineering perspective? Is there another solution which would be better than either of these two options?
animate the content change with some pretty javascript visual effects.
Please don't. Anyway, you seem to be looking for a JS MVC framework like Knockout.
Using such a framework, you can let the server return models, represented in JSON or XML, which a little piece of JS transforms into HTML, using various ways of templating and annotations.
So instead of doing the model-to-HTML translation serverside and send a chunk of HTML to the browser, you just return a list of business objects (say, Addresses) in a way the browser (or rather JS) understands and let Knockout bind that to a grid view, input elements and so on.
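A minimal sketch of that shape (the endpoint and names are assumptions): the server returns plain JSON, and Knockout binds it to markup annotated with data-bind attributes.

    // fetch a list of business objects as JSON and let Knockout render it
    function AddressViewModel() {
        var self = this;
        self.addresses = ko.observableArray([]);
        $.getJSON('/api/addresses', function (data) {
            self.addresses(data); // bound in markup via data-bind="foreach: addresses"
        });
    }
    ko.applyBindings(new AddressViewModel());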

How can Perl's WWW::Mechanize expand HTML pages that add to themselves with JavaScript?

As mentioned in a previous question, I'm coding a crawler for the QuakeLive website.
I've been using WWW::Mechanize to get the web content and this worked fine for all the pages except the one with matches. The problem is that I need to get all these kind of IDs:
<div id="ffa_c14065c8-d433-11df-a920-001a6433f796_50498929" class="areaMapC">
These are used to build the URLs of specific matches, but I simply can't get at them.
I managed to see those IDs only via FireBug; no page downloader, parser, or getter I tried was able to help here. All I can get is a simpler version of the page, whose code is the one you see via "show source code" in Firefox.
Since FireBug shows the IDs I can safely assume they are already loaded, but then I can't understand why nothing else gets them. It might have something to do with JavaScript.
You can find a page example HERE
To get at the DOM containing those IDs you'll probably have to execute the javascript code on that site. I'm not aware of any libraries that'd allow you to do that, and then introspect the resulting DOM within perl, so just controlling an actual browser and later asking it for the DOM, or only parts of it, seems like a good way to go about this.
Various browsers provide ways to be controlled programmatically. With a Mozilla based browser, such as Firefox, this could be as easy as loading mozrepl into the browser, opening a socket from perl space, sending a few lines of javascript code over to actually load that page, and then some more javascript code to give you the parts of the DOM you're interested in back. The result of that you could then parse with one of the many JSON modules on CPAN.
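The DOM-extraction step could be as small as this snippet of javascript sent over the socket (the selector comes from the markup quoted in the question):

    // collect the ids of the match areas from the rendered DOM and hand
    // them back as JSON for the perl side to parse
    var ids = [];
    var nodes = document.querySelectorAll('div.areaMapC');
    for (var i = 0; i < nodes.length; i++) { ids.push(nodes[i].id); }
    JSON.stringify(ids);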
Alternatively, you could work through the javascript code executed on your page and figure out what it actually does, to then mimic that in your crawler.
The problem is that mechanize mimics the networking layer of the browser but not the rendering or javascript execution layer.
Many folks use the web browser control provided by Microsoft. This is a full instance of IE in a control that you can host in a WinForm, WPF or plain old Console app. It allows you to, among other things, load the web page and run javascript as well as send and receive javascript commands.
Here's a reasonable intro into how to host a browser control: http://www.switchonthecode.com/tutorials/csharp-snippet-tutorial-the-web-browser-control
A ton of data is sent over ajax requests. You need to account for that in your crawler somehow.
It looks like they are using AJAX. I can see where the requests are being sent using FireBug. You may need to pick up on this by parsing and executing the javascript that affects the DOM.
You should be able to use WWW::HtmlUnit - it loads and executes javascript.
Read the FAQ. WWW::Mechanize doesn't do javascript. They're probably using javascript to change the page. You'll need a different approach.

Including HTML fragments in a page - methods?

This is an extension of an earlier questions I asked, here:
Django - Parse XML, output as HTML fragments for iFrame?
Basically, we're looking at integrating various HTML fragments into a page. We have a small web app generating little fragments for different results/gadgets, at various URLs. This is not necessarily on the same domain as the main page.
I was wondering what's the best way of including these into the main page? We also need the ability to skin the HTML fragments using CSS from the main page.
My initial thought was iFrames, however, I'm thinking the performance of this might not be great, and there's restrictions on CSS/JS manipulation of the included fragments.
Are SSI a better idea? Or should we use php includes? or JS? Any other suggestions? The two main considerations are performance of the main page, and the ability to style/manipulate the included fragments.
Cheers,
Victor
This sounds similar to what Facebook Platform applications do. One kind simply uses IFRAMEs, the other takes output from a backend and transforms it -- <fb:whatever> elements are expanded, JavaScript executed, and things like buttons are skinned. You could look at them for an example.
Using IFRAMEs would probably make things complicated. By default you cannot modify styles inside them from the outer frames, but you could probably use something like Google Closure's net.IframeIo to work around that.
I would try loading widgets using cross-domain scripting. Then you can add the widget's content to the page, however you wish, such as inserting it into the DOM.
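A sketch of that approach (the endpoint, callback name, and slot id are all assumptions): the widget server returns a script that calls back with the fragment.

    // load a widget from another domain via a script tag (JSONP-style)
    function renderWidget(payload) {
        document.getElementById('widget-slot').innerHTML = payload.html;
    }
    var s = document.createElement('script');
    s.src = 'http://widgets.example.com/gadget.js?callback=renderWidget';
    document.getElementsByTagName('head')[0].appendChild(s);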
iFrames should not be a problem performance-wise - it won't make a difference whether it's the browser doing the querying or your server. You may get design problems though.
SSI and PHP are more or less the same, but they both have the same problem: If the source page is down, rendering of the whole page is delayed.
The best thing performance-wise would be a cached PHP solution that reads the snippet, and is thus less vulnerable towards outages.
Funnily enough, I have written a PHP-based tool for exactly this purpose, and the company I wrote it for has agreed to publish it as Open Source. It will be at least another four weeks, though, until I get around to packaging it and setting up the documentation. Should that be of any interest to you despite the timeframe, let me know and I will keep you updated.
