Recording web page events / ajax calls/results and so on - javascript

I'm mostly looking for directions here.
I'm looking to record events that happen within a web page. Somewhat similar to your average "macro recorder", with the difference that I couldn't care less about exact cursor movement or keyboard input. The kinds of events I would like to record are modification of input fields, hovers, following links, submitting forms, scripts that are launched, ajax calls, ajax results and so on.
I've been thinking of using jQuery to build a little app for this and inserting it on whichever pages I would like to test it on (or, more likely, loading the pages into an iframe or something). I cannot, however, modify the scripts on those pages to cooperate with this, so it has to work regardless of the page's content.
So I guess my first question is: Can this be done? Especially in regards to ajax calls and various script execution.
If it can, how would I go about the ajax/script part of it? If it can't, what language should I look into for this task?
Also: maybe there's something out there that can already do what I'm looking for?
Thanks in advance

Two ways I can think of are:
Use an add-on (Firefox) or an extension (Chrome) to inject script tags that load jQuery and your jQuery app.
Set up a proxy (you can use node.js or some other proxy server) and inject the script tags in the proxy; be sure to adjust the Content-Length header (tricky on https sites).
A much simpler and faster option, where you don't need to capture onload, is to write a JavaScript snippet that loads jQuery and your app by injecting script tags, make that a bookmarklet, and hit the bookmarklet after the page loads.
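A rough sketch of that bookmarklet body, assuming a placeholder URL for your own recorder app (wrap it in javascript:(function(){...})(); when saving it as a bookmark):

(function () {
  function addScript(src, onload) {
    var s = document.createElement('script');
    s.src = src;
    s.onload = onload;
    document.getElementsByTagName('head')[0].appendChild(s);
  }
  // Load jQuery from the CDN first, then load your recorder app (placeholder URL).
  addScript('https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js', function () {
    addScript('https://example.com/recorder-app.js');
  });
})();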

Came across this post when looking for a proxy for tag injection.
Yes, it's quite possible to trap (nearly) all the function and method calls made by a browser via javascript loaded in the page - though usually a javascript debugger (Firebug?) or an HTTP debugger (TamperData / Fiddler) will give you most of what you require with a lot less effort.
OTOH, if you really want to do this with bulk data / arbitrary sites, then (based on what I've seen so far) you could use a Squid proxy with an ICAP server / eCAP module (not trivial - it will involve a significant amount of programming) or implement the javascript via Greasemonkey as a browser extension.
Just to clarify: so far I've worked out how to catch function and method calls (including constructor calls) and proxy them within my own code, but not yet how to deal with processing triggered by directly setting a property (e.g. img.src='http://hackers-r-us.com'), nor how to handle ActiveX neatly.
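For the ajax part specifically, a minimal sketch of that proxying idea applied to XMLHttpRequest; the console.log call is just a stand-in for whatever your recorder does with the data:

(function () {
  var origOpen = XMLHttpRequest.prototype.open;
  var origSend = XMLHttpRequest.prototype.send;
  XMLHttpRequest.prototype.open = function (method, url) {
    this._recorded = { method: method, url: url };  // stash the request details
    return origOpen.apply(this, arguments);
  };
  XMLHttpRequest.prototype.send = function (body) {
    var xhr = this;
    xhr.addEventListener('load', function () {
      // Record the call and its result once the response arrives.
      console.log('ajax', xhr._recorded.method, xhr._recorded.url, xhr.status, xhr.responseText);
    });
    return origSend.apply(this, arguments);
  };
})();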


Triggering Google Analytics on a direct image/pdf view on a web server?

So basically I'm trying to trigger a page view in Google Analytics when someone directly visits an image or pdf, rather than viewing the image on a page, e.g. https://example.com/i/image.jpg. Is there any way I can do this? Thank you for reading, any help would be appreciated!
UPDATE BASED ON SERVER-SIDE REQUIREMENTS
If you want to track direct hits to assets that cannot run javascript (e.g. images and documents), there is no simple way to do this, and it is not possible with the GA javascript library alone. By direct hits, I mean the user accessing assets outside of your normal web workflow, e.g. via a link from an email, a document, a direct download, etc.
It CAN be done, but it can require a lot of setup that I can't fully cover here (mostly because I know nothing about your tech stack), so I will stay high level and you can look up more specific answers that might give you a steer if you need it.
The simpler approach is not to use GA for this at all and just use server logs to find which assets are being served and from where they are being requested.
If you need the traffic to show in GA from a server-side request, you will need to use the Google Analytics "Measurement Protocol", which is just the API used by GA to send data to their servers. You will also need to put a process in place between the request and the asset, and this isn't straightforward, especially if you are trying to add tracking to existing assets that are already in use on the web.
The problem is that the handler needs to be intercepted so you are doing something like this:-
User requests your image "https://example.com/i/image.jpg"
You have an HTTP handler that detects all requests to image assets (for example). How this is done is entirely dependent on your framework (e.g. php, C#, nodejs) and is fairly common practice, but it is not always straightforward and may require fiddling with your web server (or it may be handled directly in the web project) so that the web server knows to go via your code rather than just serving the file up directly.
In the code for the handler you call the GA Measurement protocol, filling in the details that you want to see in your GA traffic reports
Complete the request by sending the image in the response
Another way of looking at it is to think of it as an API request that does a portion of work before delivering the image as a response. This method of image tracking is essentially how 3rd-party tracking pixels work, and how tracking email "opens" works.
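As a very rough sketch, assuming a Node.js/Express stack and a Universal Analytics property; the property ID, client id, paths and route here are placeholders, not anything GA prescribes:

const express = require('express');
const https = require('https');
const path = require('path');
const querystring = require('querystring');

const app = express();

// Hypothetical handler: record the hit via the Measurement Protocol, then serve the file.
app.get('/i/:file', (req, res) => {
  const hit = querystring.stringify({
    v: 1,
    tid: 'UA-XXXXX-Y',          // placeholder property ID
    cid: '555',                 // ideally a real, anonymous client id
    t: 'pageview',
    dp: '/i/' + req.params.file
  });
  https.request({
    method: 'POST',
    host: 'www.google-analytics.com',
    path: '/collect',
    headers: {
      'Content-Type': 'application/x-www-form-urlencoded',
      'Content-Length': Buffer.byteLength(hit)
    }
  }).end(hit);

  // Complete the request by sending the asset in the response.
  // (A real handler would also sanitise req.params.file against path traversal.)
  res.sendFile(path.join(__dirname, 'assets', req.params.file));
});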
I'm sorry that this isn't a step-by-step how-to guide for your issue, but the question is so broad it's difficult to answer precisely.
=================
Using a variety of methods, you can trigger an event when someone clicks on the element. How you do this depends on what framework you are using, how you have implemented the markup and whether you are using GTM or straight GA on the page, which also depends on the GA version (async, universal, gtag).
The simplest method, using Universal Analytics, would look something like this
<a href="/image.jpg" target="_blank"
onclick="ga('send', 'event','Click','Image','Image opened: ' + this.href)"
>View the image</a>
A few things to note:
use a target so the image or pdf will open in a new window. This will also allow the event to fire without accidentally getting stopped by the browser, and you don't need to return false or anything like that
'Click' is the category, 'Image' is the action and last bit is the label - these are arbitrary bits of text and can be whatever you want
this.href pulls the link's href dynamically to use in the label
this method is super inefficient because you potentially need to manually tag a lot of items
this method would make a lot of devs cry because of the inline code
The question needs a bit of refinement, because there are a lot of different ways this can be accomplished depending on the setup. What you want is a generic, catch-all bit of code; here is something that uses jQuery with event delegation.
$(document).on("click","a[href*='.pdf'],a[href*='.jpg']",function() {
ga('send', 'event','Click','Document','Opened: ' + this.href);
});
This would also make devs cry, because the current trend is to stay away from jQuery (and it's a bit inefficient).
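For completeness, a rough vanilla-JS equivalent of the same delegation, assuming the standard Universal Analytics ga() snippet is already on the page:

// Catch-all click tracking for .pdf/.jpg links, delegated from the document.
document.addEventListener('click', function (e) {
  var link = e.target.closest("a[href*='.pdf'], a[href*='.jpg']");
  if (link) {
    ga('send', 'event', 'Click', 'Document', 'Opened: ' + link.href);
  }
});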
But hopefully you can get the gist

Preventing a specific function in a specific script from executing inside a browser

Is it possible to prevent a specific function in a specific script from executing inside a browser, possibly by redirecting calls of that function to a static/modified version of the script file with a predefined rule? (Similar to how we use an adblocker extension or userscript to customize the DOM, but this time to manipulate the scripts themselves.)
Consider this scenario: website.com utilizes client-side rendering heavily. When https://website.com/article.html is visited, the bundled big JS file https://website.com/entire-app.js will render the entire webpage, both content and ads.
In the end, a function named isAdblockerFound() in https://website.com/entire-app.js will be called by antiAdBlockerMethod() in the same script file. It checks if ads on the page are indeed loaded and performs other adblocker detection procedures. If this function returns true, antiAdBlockerMethod() will then trash and replace all the rendered elements in the DOM with some big warning text.
In this situation the script https://website.com/entire-app.js handles all the client-side page rendering, both ads and content, so simply blocking it from loading will render the website inaccessible.
In order to only bypass/fool the isAdblockerFound(), the idea I came up with is to somehow replace the isAdblockerFound() function with a function which always returns false, before it is called. That is, to tell the browser to redirect calls of isAdblockerFound() to a customized isAdblockerFound() in a static/modified version of the script file, hosted locally or resides temporarily in the browser.
I understand that if we don't need a predefined rule, we can use the devtools to freeze the script with a breakpoint and execute anything between lines easily. But how can we do this automatically with a predefined rule? What extensions/tools are needed?
Google didn't give me anything useful (all the results are about routing in express etc).
EDIT: I understand that I can disable my adblocker anytime, and that would be a trivial solution to this question. I also understand why ads exist on the web in the first place, and I appreciate the valuable content made possible by ads. Actually, I have never had a motive to apply this to any website I visit, and I am not aware of any website employing adblocker-checking procedures exactly like website.com in my example. I asked this question because I was simply curious whether it is possible to bypass this kind of checking.
I suppose, in a different context, one website could be malicious and a security engineer would need to perform an analysis. He might find fooling an environment checking procedure useful in that scenario.
Firefox provides the webextension API webRequest.filterResponseData() to inspect and rewrite the content of any network request, including javascript loads. This would allow you to parse the javascript and replace the method in question.
That only leaves the task of building a robust, streaming javascript matching and rewriting engine.
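A bare-bones sketch of that approach from a WebExtension background script; the matched URL comes from the question, and the naive string replacement is only a placeholder for a real parser (the extension would need the webRequest, webRequestBlocking and host permissions):

// Firefox only: intercept entire-app.js and neuter isAdblockerFound() before it runs.
browser.webRequest.onBeforeRequest.addListener(function (details) {
  const filter = browser.webRequest.filterResponseData(details.requestId);
  const decoder = new TextDecoder('utf-8');
  const encoder = new TextEncoder();
  let source = '';

  filter.ondata = function (event) {
    source += decoder.decode(event.data, { stream: true });
  };
  filter.onstop = function () {
    // Naive rewrite: rename the original function and insert a stub that always returns false.
    const patched = source.replace(
      'function isAdblockerFound()',
      'function isAdblockerFound() { return false; } function isAdblockerFound_original()'
    );
    filter.write(encoder.encode(patched));
    filter.close();
  };
}, { urls: ['https://website.com/entire-app.js'] }, ['blocking']);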

what is better? using iframe or something like jquery to load an html file in external website

I want my customers to create their own HTML on my web application and then copy and paste my code into their websites to show the result, with a customized size and other options, in whatever position on the page they want. The output HTML of my web application contains HTML tags and JavaScript code (for example, a web chart created with JavaScript).
I found two ways to do this: one using an iframe and the other using jQuery's .load().
What is better and safer? Is there any other way?
iframe is better - if you are running JavaScript then that script shouldn't execute in the same context as your users' sites: you are asking for a level of trust here that the user shouldn't need to accede to, and your code is all nicely sandboxed so you don't have to worry about the parent document's styles and scripts.
As a front-end web developer and webmaster I've often taken the decision myself to sandbox third-party code in iframes. Below are some of the reasons I've done so:
Script would play with the DOM of the document. Once a third-party widget took it upon itself to introduce buggy and performance-intensive PNG fix hacks for IE across every PNG used in img tags and CSS across our site.
Many scripts overwrite the global onload event, robbing other scripts of their initialisation trigger.
Reading local session info and sending it back to their own repositories.
Loading any number of resources and performing CPU-intensive processes, interrupting and weighing down my site's core experience.
The above are all examples of short-sightedness or malice on the part of third parties (behaviour you may see yourself as above), but the point is that, as one of your service's users, I shouldn't need to take that gamble. If I put your code in an iframe, I know it can happily do its own thing and not screw with my site or its users. I can also choose to delay its load and execution until a moment of my choosing (by dynamically loading the iframe at that moment).
To argue the point in terms of your convenience rather than the users':
You don't have to worry about any of the trust issues associated with XSS. You can honestly tell your users they're not exposing themselves to any unnecessary worry by running your tool.
You don't have to make the extra effort to circumvent the effects of CSS and JS on your users' sites.
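For what it's worth, the snippet you hand to customers can stay tiny; a sketch with a hypothetical widget URL and sizes, injecting the iframe right where the customer pastes the script:

// Customer pastes this where the chart should appear; the widget runs sandboxed in the iframe.
(function () {
  var frame = document.createElement('iframe');
  frame.src = 'https://your-app.example.com/widget?id=12345';  // placeholder widget URL
  frame.width = '400';
  frame.height = '300';
  frame.style.border = '0';
  document.currentScript.parentNode.insertBefore(frame, document.currentScript);
})();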

How can Perl's WWW::Mechanize expand HTML pages that add to themselves with JavaScript?

As mentioned in a previous question, I'm coding a crawler for the QuakeLive website.
I've been using WWW::Mechanize to get the web content, and this worked fine for all the pages except the one with matches. The problem is that I need to get all IDs of this kind:
<div id="ffa_c14065c8-d433-11df-a920-001a6433f796_50498929" class="areaMapC">
These are used to build the URLs of specific matches, but I simply can't retrieve them.
I managed to see those IDs only via Firebug; no page downloader, parser or getter I tried was able to help here. All I can get is a simpler version of the page, whose code is the one you can see via "View Source" in Firefox.
Since FireBug shows the IDs I can safely assume they are already loaded, but then I can't understand why nothing else gets them. It might have something to do with JavaScript.
You can find a page example HERE
To get at the DOM containing those IDs you'll probably have to execute the javascript code on that site. I'm not aware of any libraries that'd allow you to do that, and then introspect the resulting DOM within perl, so just controlling an actual browser and later asking it for the DOM, or only parts of it, seems like a good way to go about this.
Various browsers provide ways to be controlled programmatically. With a Mozilla-based browser, such as Firefox, this could be as easy as loading mozrepl into the browser, opening a socket from Perl space, sending a few lines of javascript code over to actually load that page, and then some more javascript code to give you back the parts of the DOM you're interested in. The result of that you could then parse with one of the many JSON modules on CPAN.
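The few lines of javascript you send over could be as simple as this rough sketch (run in the page after its own scripts have finished rendering; the class name comes from the question):

// Collect the ids of the rendered match containers and hand them back as JSON.
JSON.stringify(
  Array.prototype.map.call(
    document.querySelectorAll('div.areaMapC'),
    function (div) { return div.id; }
  )
);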
Alternatively, you could work through the javascript code executed on your page and figure out what it actually does, to then mimic that in your crawler.
The problem is that mechanize mimics the networking layer of the browser but not the rendering or javascript execution layer.
Many folks use the web browser control provided by Microsoft. This is a full instance of IE in a control that you can host in a WinForm, WPF or plain old Console app. It allows you to, among other things, load the web page and run javascript as well as send and receive javascript commands.
Here's a reasonable intro into how to host a browser control: http://www.switchonthecode.com/tutorials/csharp-snippet-tutorial-the-web-browser-control
A ton of data is sent over ajax requests. You need to account for that in your crawler somehow.
It looks like they are using AJAX; I can see where the requests are being sent using Firebug. You may need to pick up on this by parsing and executing the javascript that affects the DOM.
You should be able to use WWW::HtmlUnit - it loads and executes javascript.
Read the FAQ. WWW::Mechanize doesn't do javascript. They're probably using javascript to change the page. You'll need a different approach.

What are advantages of using google.load('jQuery', ...) vs direct inclusion of hosted script URL?

Google hosts some popular JavaScript libraries at:
http://code.google.com/apis/ajaxlibs/
According to google:
The most powerful way to load the libraries is by using google.load() ...
What are the real advantages of using
google.load("jquery", "1.2.6")
vs.
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js"></script>
?
Aside from the benefit of Google being able to bundle multiple files together on the request, there is no perk to using google.load. In fact, if you know all libraries that you want to use (say just jQuery 1.2.6), you're possibly making the user's browser perform one unneeded HTTP connection. Since the whole point of using Google's hosting is to reduce bandwidth consumption and response time, the best decision - if you're just using 1 library - is to call that library directly.
Also, if your site will be using any SSL certificates, you want to plan for this by calling the script via Google's HTTPS connection. There's no downside to calling an https script from an http page, but calling an http script from an https page will cause more obscure debugging problems than you would want to think about.
It allows you to dynamically load the libraries in your code, wherever you want.
Because it lets you switch directly to a new version of the library in the javascript, without forcing you to rebuild/change templates all across your site.
It lets Google change the URL (but they can't since the URL method is already established)
In theory, if you do several google.load()s, Google can bundle them into one file, but I don't think that is implemented.
I find it's very useful for testing different libraries and different methods, particularly if you're not used to them and want to see their differences side by side, without having to download them. It appears that one of the primary reasons to do it is that the call is asynchronous rather than a synchronous script tag. You also get some neat stuff that is directly included in the Google loader, like client location: you can get the visitor's latitude and longitude from it. Not necessarily useful, but it may be helpful if you're planning to have targeted advertising or something of the like.
Not to mention that dynamic loading is always useful. Particularly to smooth out the initial site load. Keeping the initial "site load time" down to as little as possible is something every web designer is fighting an uphill battle on.
You might want to load a library only under special conditions.
Additionally, the google.load method can speed up the initial page display: if you include plain script tags in your html code, page rendering will block until the file has been loaded.
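For reference, the loader form looks roughly like this; it assumes the Google loader script (https://www.google.com/jsapi) is already on the page, and the element id is just illustrative:

google.load("jquery", "1.2.6");
google.setOnLoadCallback(function () {
  // jQuery has finished loading asynchronously; it is safe to use $ here.
  $("#status").text("jQuery " + $.fn.jquery + " is ready");
});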
Personally, I'm interested in whether there's a caching benefit for browsers that will already have loaded that library as well. Seems like if someone browses to google and loads the right jQuery lib and then browses to my site and loads the right jQuery lib... ...both might well use the same cached jQuery. That's just a speculative possibility, though.
Edit: Yep, at the very least when using direct script tags to the location, the javascript library will be cached if someone has already requested the library from Google (e.g. if it was included by another site somewhere).
If you were to write a boatload of JavaScript that only used the library when a particular event happens, you could wait until the event happens to download the library, which avoids unnecessary HTTP requests for those who don't actually end up triggering the event. However, in the case of libraries like Prototype + Scriptaculous, which downloads over 300kb of JavaScript code, this isn't practical.
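A rough sketch of that lazy-loading pattern, using plain script injection rather than google.load; the button id, library URL and feature code are illustrative:

// Only fetch the heavy library when the user actually triggers the feature that needs it.
var libLoaded = false;
document.getElementById('open-chart').addEventListener('click', function () {
  if (libLoaded) { return; }
  libLoaded = true;
  var s = document.createElement('script');
  s.src = 'https://ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js';
  s.onload = function () {
    // The library is ready; run the feature code that depends on it.
  };
  document.getElementsByTagName('head')[0].appendChild(s);
});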
