Aren't JavaScript analytics scripts susceptible to easy data hacks?

In production environments, JavaScript-based analytics scripts (Google Analytics, Facebook Pixel, etc.) are injected into most web applications, along with a unique ID/pixel ID, in plain JavaScript.
For example, Airbnb uses Google Analytics. I can open up my dev console and run
setInterval(function() {ga('send', 'pageview');}, 1000);
which will cause the analytics pixel to be requested every 1 second, forever. That is 3600 requests an hour from my machine alone.
Now, this can easily be done in a distributed fashion, causing millions of requests per second and completely skewing the Google Analytics data for the pageview event. I understand that the huge amount of data collected would correct this skewing to a certain extent, but that can easily be compensated for by hiking up the number of requests.
My question is this: are there any safeguards to prevent competitors or malicious individuals from destroying the data integrity of applications in this manner? Does GA or Facebook provide such options?

Yes, but the unsafe part doesn't come from the JavaScript. For example, you can use the Measurement Protocol to flood data into an account. Here you can see a lot of people in the same community having trouble with this (and it's quite simple to solve):
https://stackoverflow.com/search?q=spam+google+analytics
All of these measurement systems use HTTP calls to fill the data in your "database". If you are able to build the correct call, you can spam anyone, anywhere (but don't do it; don't be evil).
https://developers.google.com/analytics/devguides/collection/protocol/v1/?hl=es-419
This Google Analytics page explains what the Measurement Protocol is; the JavaScript library only works as a framework to build and send the hit.
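As an illustration (and only that), a hand-built hit is nothing more than a small payload POSTed to the /collect endpoint. A minimal browser-side sketch, with UA-XXXXX-Y as a placeholder property ID; don't send hits to properties you don't own:

// Build a raw Measurement Protocol v1 pageview hit by hand.
var payload = new URLSearchParams({
  v: '1',            // protocol version
  tid: 'UA-XXXXX-Y', // placeholder tracking ID
  cid: '555',        // arbitrary client ID
  t: 'pageview',     // hit type
  dp: '/home'        // document path
}).toString();
navigator.sendBeacon('https://www.google-analytics.com/collect', payload);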
But not everything is lost.
For example, if you try to do that in your browser with that code, the Google Analytics framework limits you to 1 call per second and 150 per session (or cookie value). Yes, it's not complicated to get past that barrier, but after that, other barriers will come.
So if you use the JavaScript framework, you are relatively safe. Now imagine you do the same with Python, sending HTTP hits straight to the Google Analytics servers. It's possible, but two important things stand in the way.
Google Analytics has a proactive "firewall" to detect spammers and ban them (how and when they do this is not public), but in my case I see far fewer spammers than a few years ago.
There are also a couple of good practices to avoid this. For example, store only hits for domains on a whitelist by creating a filter that allows only traffic from your domain:
https://support.google.com/analytics/answer/1033162?hl=en
It's also a very good practice to protect your e-commerce data by using a filter that includes only data from a certain store or with a certain parameter, for example brand == my brand or CustomDimension == true, and by excluding transactions with products over $1,000 (check your limits and apply proactive filters). All these barriers make the setup complex to break.
If you do this, you will protect your domain a lot (because it's much harder to know a valid UA ID + domain combination when building a robot), but, you know, any system can be broken. In my experience I have only seen 2 or 3 cases of damage coming from spammers or people who wanted to do harm, and all of these cases could have been prevented if I had created a proactive filter. Usually spammers only push ads into your account and almost never want to hurt you. With Facebook, Piwik and other tools it's more or less the same.

Related

JavaScript analytics / conversion script: expected failure rate?

Question for developers who work with 3rd-party analytics tools:
is there an industry-standard or expected failure rate with JavaScript tracking?
Scenario: I have a one-page website. I install the Google Analytics, Mixpanel, and Heap third-party analytics JavaScript trackers. The page loads clean and error-free. I use AdWords to buy 100 clicks to my site.
Now, according to raw server logs, I receive all 100 visitors. However, my Analytics dashboards report:
GA: 97 unique visitors
MixPanel: 96 unique visitors
Heap: 99 unique visitors
Report latency isn't an issue (I've waited 48 hours). I don't want to quibble about which analytics tool's definition of a "unique visitor" is best.
What I'm trying to get to the bottom of is this: is there an anticipated error bar I should apply globally to any/all analytics reports? Say that each script loads properly 95% - 99% of the time? (That way I can ignore mismatched numbers so long as they fall within this expected error bar and focus on true outliers.) Additionally, if there's an expected failure rate, I can have greater confidence that, despite the mismatched numbers above, my scripts are reporting properly, and save my IT team a lot of tail-chasing.
File under Anecdotes Not Data: A colleague told me his ecommerce site uses a hosted, JavaScript-based, enterprise-level conversion tracking platform. Based on 400-500 transactions per day, his analytics under-reports conversions consistently by 4-5%. He has several years of data documenting this (99.9% confidence).
What I don't know is, does this hold true globally? Do everyone's analytics scripts misfire, fail to load, or otherwise go CLICK instead of BANG 4-5% of the time?
Here are potential issues I AM aware of (a sketch for measuring the last two yourself follows this list):
Script errors
Script conflicts
Timeouts when pulling from a third-party server
User bounces before scripts complete loading
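For what it's worth, one way to measure script failures directly: after the page settles, check whether each vendor's global object ever appeared, and beacon the result to your own server. A minimal sketch; the /analytics-check endpoint and the three-second grace period are assumptions, and ga, mixpanel, and heap are the globals those libraries conventionally expose:

// Report to our own server whether each third-party tracker actually loaded,
// so the results can be compared against raw server logs.
window.addEventListener('load', function () {
  setTimeout(function () {
    var status = {
      ga: typeof window.ga === 'function',              // Google Analytics
      mixpanel: typeof window.mixpanel !== 'undefined', // Mixpanel
      heap: typeof window.heap !== 'undefined'          // Heap
    };
    navigator.sendBeacon('/analytics-check', JSON.stringify(status));
  }, 3000); // give the async snippets a few seconds to initialise
});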
Not to get all chemtrails on you, but: IF there's an expected fail rate, it's certainly not common knowledge. Nobody I've spoken to at any analytics company admits to consistent failure. Neither do they guarantee 100% accuracy.
So I ask: in your experience, what's the expected accuracy rate of your JavaScript-based, hosted analytics platforms?
I think your fourth bullet may be the most revealing. Paid media will have a very high bounce rate compared to organic/referrer/direct traffic.
What is the bounce rate reported by those various tracking tools? I would wager it is at least 80%, which means the chances of users exiting before the scripts load are high. You could correlate that with your page load time.
One thing you could try (since my expertise is with Google Analytics) is to use the Measurement Protocol to send pageview data on the server-side. Since that no longer requires JavaScript to load, it takes page load time and bounces out of the equation. I would not recommend using this method for production, but it could illuminate the issue.
To summarize, I think your issue is with load abandonment and not necessarily failure of the various tools.
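If you want to try the Measurement Protocol comparison suggested above, here is a minimal Node.js sketch (the tracking ID and visitor ID are placeholders; this assumes the classic Measurement Protocol v1 /collect endpoint):

// Send a pageview server-side so script loading and bounces are out of the equation.
const https = require('https');
const querystring = require('querystring');

function sendPageview(clientId, path) {
  const payload = querystring.stringify({
    v: '1',            // protocol version
    tid: 'UA-XXXXX-Y', // placeholder tracking ID
    cid: clientId,     // stable per-visitor ID
    t: 'pageview',
    dp: path
  });
  const req = https.request({
    hostname: 'www.google-analytics.com',
    path: '/collect',
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' }
  });
  req.on('error', console.error);
  req.end(payload);
}

sendPageview('visitor-123', '/landing-page');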

What's the advantage of client-side analytics over server-side?

I've always used client-side web analytics that uses JavaScript to track visitor hits to the site, and all the useful information that gives. But some people have recently told me they prefer server side analytics because it's faster.
So what I wondered is what are the main advantages of doing it client-side with JavaScript? Which has more features and why?
Server or Client side for Analytics?
Server-side Advantages:
Servers can be set up with far more power than desktop machines and so can crunch "the big numbers".
Performance can be more predictable as the same machines are used for everyone's analysis and generation of results.
Output will not have dependencies on browser / browser version as they just have to display an image.
Output can also be multi-device without any dependencies.
Output can be the same everywhere both reducing client issues and also making the image generation be about supporting 1 output format over many.
Client-side Advantages:
If the number of clients is large, say thousands per minute, it can be good to unload the processing to client machines to avoid having them slow down a central server.
Solutions tend to provide more interactivity and faster results as all the data and the logic is on the client.
Once downloaded initially, views can be changed without being online.
If the traffic varies a lot (say, sometimes a few queries per hour, other times hundreds per minute), client-side processing makes sure that a central server is not overloaded by this effort.
Server-side infrastructure will not be needed and so will not cost (the provider) money.
Many companies use both Google Analytics (client side) and Webtrends (server side/client side) to do web analytics.
One thing about Google Analytics is that it doesn't work when the user doesn't allow scripts. Webtrends can crawl your access logs.
Client-side tracking provides more information in comparison with server-side tracking.
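As a small illustration of that extra information, here is the kind of data a client-side script can see that raw access logs cannot; the /collect endpoint is a placeholder for your own tracker:

// Data that only exists in the browser, never in the server's access log.
var extras = {
  screen: screen.width + 'x' + screen.height,
  viewport: window.innerWidth + 'x' + window.innerHeight,
  language: navigator.language,
  referrer: document.referrer,
  timezoneOffset: new Date().getTimezoneOffset() // minutes from UTC
};
navigator.sendBeacon('/collect', JSON.stringify(extras));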

Best practice for "hidden" JavaScript HTTP request?

I'm not exactly sure how to formulate the question, but I think it's more of a request for suggestions than a question per se.
We are building an HTML5 service in which users get credited (rewarded, in social gaming lingo) for completing a series of offers. Most of these offers involve watching video ads. We already have an implementation of this built on Flash, but for HTML5 I'm running into more issues around how to make the request calls that validate legitimately watched video ads. In the Flash version, the SWF makes a series of HTTP requests: some when video playback starts, some in the middle, and some at the end. Each of those requests is related to the others, meaning the response of one is needed for the next request, and so on. Most of the logic that hides this "algorithm" is lightly concealed in the SWF binary, and it pretty much serves its purpose.
However, for HTML5 we have to rely on world-visible JavaScript, and that "hidden" logic is wide open. So I guess this is a call for suggestions on how these cases are usually handled, so that a skilled person could not (so easily) get access to the logic and exploit the service to get credited programmatically. Obfuscating the JavaScript seems like something that could help, but it in no way protects fully.
There's of course some extra security on the backend (like frequency capping, per-user capping, etc.), but since our capping clears every day, a skilled person could still find a way to get credit for all available offers even without completing them.
It sounds like you want to ensure that your server can distinguish requests that happened as the result of the user interacting with your UI in ways you approve of from requests that did not happen that way.
There are a number of points of attack on such a system.
Inspect the JavaScript to find the event handlers and invoke them via Firebug or another tool.
Inspect any keys from your code, and generate the HTTP requests without involving the browser.
Run code in the browser to programmatically generate events.
Use a 3rd-party tool that instruments the browser to generate clicks.
If you've got reasonable solutions to the instrumentation attacks (the third and fourth points above), then you can look at "Is there any way to hide javascript functions from end user?" for ways to get secrets into the client that allow you to sign your requests. Beyond that, obfuscation is the only (and imperfect) way to stop a not-too-determined attacker from any exploitation, and rate-limiting and UI event logging are probably your best bets for stopping determined attackers from benefiting from wide-scale fraud.
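To make the "response of one is needed for the next" chain from the question concrete, here is a minimal server-side sketch in Node.js (the secret handling and checkpoint layout are assumptions, not a canonical scheme). Each token commits to the previous one, so the final "credit me" request is only valid if every earlier checkpoint was actually requested:

// Server-side only: the secret never ships to the client.
const crypto = require('crypto');
const SECRET = process.env.VIDEO_SECRET;

// Token the client must present at checkpoint `step` of `sessionId`.
function nextToken(previousToken, step, sessionId) {
  return crypto.createHmac('sha256', SECRET)
               .update(sessionId + ':' + step + ':' + previousToken)
               .digest('hex');
}

// Validate the token a client sent for a checkpoint.
function isValid(clientToken, previousToken, step, sessionId) {
  const expected = nextToken(previousToken, step, sessionId);
  if (clientToken.length !== expected.length) return false;
  return crypto.timingSafeEqual(Buffer.from(clientToken), Buffer.from(expected));
}

On top of this, the server should refuse checkpoints that arrive faster than the video could actually play; that is what turns the chain into evidence of elapsed watch time rather than just of a few extra HTTP calls.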
You will not be able to prevent a determined attacker (even with SWF, though it's more obfuscated). Your best bet is to make sure that:
Circumventing your measures is expensive in terms of effort, perhaps by using a computationally expensive crypto algorithm so they can't just set up a bunch of scripts to do it.
The payoff is minimal (user-capping is an example of how to reduce payoff; if you're giving out points, it's fine; if you're mailing out twenty dollar bills, you're out of luck)
Cost-benefit.

Ajax requests/responses: how to make them lightning fast?

I came across a site that does something very similar to Google Suggest. When you type in 2 characters in the search box (e.g. "ca" if you are searching for "canon" products), it makes 4 Ajax requests. Each request seems to get done in less than 125ms. I've casually observed Google Suggest taking 500ms or longer.
In either case, both sites are fast. What are the general concepts/strategies that should be followed in order to get super-fast requests/responses? Thanks.
EDIT 1: by the way, I plan to implement an autocomplete feature for an e-commerce site search that 1) provides search suggestions based on what is being typed and 2) shows a list of potential product matches based on what has been typed so far. I'm trying for something similar to SLI Systems search (see http://www.bedbathstore.com/ for example).
This is a bit of a "how long is a piece of string" question and so I'm making this a community wiki answer — everyone feel free to jump in on it.
I'd say it's a matter of ensuring that:
The server / server farm / cloud you're querying is sized correctly according to the load you're throwing at it and/or can resize itself according to that load
The server /server farm / cloud is attached to a good quick network backbone
The data structures you're querying server-side (database tables or what-have-you) are tuned to respond to those precise requests as quickly as possible
You're not making unnecessary requests (HTTP requests can be expensive to set up; you want to avoid firing off four of them when one will do); you probably also want to throw in a bit of hysteresis management (delaying the request while people are typing, only sending it a couple of seconds after they stop, and resetting that timeout if they start again); see the debounce sketch after this list
You're sending as little information across the wire as can reasonably be used to do the job
Your servers are configured to re-use connections (HTTP 1.1) rather than re-establishing them (this will be the default in most cases)
You're using the right kind of server; if a server has a large number of keep-alive requests, it needs to be designed to handle that gracefully (NodeJS is designed for this, as an example; Apache isn't, particularly, although it is of course an extremely capable server)
You can cache results for common queries so as to avoid going to the underlying data store unnecessarily
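For the debouncing and caching points above, a minimal client-side sketch; the /suggest endpoint and the 300 ms delay are assumptions:

// Debounce keystrokes and cache results so repeated prefixes never hit the server twice.
var cache = {};
var timer = null;

function render(suggestions) { console.log(suggestions); } // stand-in for real UI code

function onQueryChanged(query) {
  clearTimeout(timer); // user is still typing: reset the delay
  timer = setTimeout(function () {
    if (cache[query]) { render(cache[query]); return; } // served from cache, no request
    fetch('/suggest?q=' + encodeURIComponent(query))
      .then(function (res) { return res.json(); })
      .then(function (suggestions) {
        cache[query] = suggestions;
        render(suggestions);
      });
  }, 300); // fire only after the user pauses typing
}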
You will need a web server that is able to respond quickly, but that is usually not the problem. You will also need a database server that is fast and can answer very quickly which popular search results start with 'ca'. Google doesn't use a conventional database for this at all; it uses large clusters of servers and a Cassandra-like database, and most of that data is kept in memory as well for quicker access.
I'm not sure if you will need this, because you can probably get pretty good results using only a single server running PHP and MySQL, but you'll have to make some good choices about the way you store and retrieve the information. You won't get these fast results if you run a query like this:
select q.search
from previousqueries q
where q.search LIKE 'ca%'
group by q.search
order by count(*) DESC
limit 1
This will probably work as long as fewer than 20 people have used your search, but it will likely fail on you well before you reach 100,000.
This link explains how they made instant previews fast. The whole site highscalability.com is very informative.
Furthermore, you should keep everything in memory and avoid retrieving data from disk (slow!). Redis, for example, is lightning fast!
You could start by building a fast search engine for your products. Check out Lucene for full-text searching. It is available for PHP, Java and .NET, amongst others.

Non-invasive javascript performance agent?

I am seeking to (legitimately) plant a bug in my web pages to collect and report information about website performance.
Preference is for internally hosted. While I expect that there are commercial offerings out there (e.g. Google Analytics), I'm keen to find something we can run entirely in-house (it's not a public website and may contain sensitive data).
Also, I'm looking for something that can report back to an independent URL, i.e. not relying on adding a reverse proxy or recording results within existing webserver logs. Indeed, I'd prefer something which does not require access to the webserver logs at all (other than those for the URL the bug reports back to).
I need to be able to monitor bulk traffic, so tools like PageSpeed and Tamper Data are not appropriate.
I've tried googling, but I just seem to get lots of noise about the performance of JavaScript and web pages rather than how to actually measure it.
TIA
You could use the open source analytics software Piwik and write a plugin for it that sends the performance data to it.
Thanks chiborg. I'd kind of forgotten about this; it was so long ago that I asked. Yes, I was aware of Piwik, but I haven't been very impressed with either its implementation or the quality of its documentation.
I'm currently working on a solution using Boomerang.
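If you do roll your own on top of (or instead of) Boomerang, the browser's Navigation Timing API covers most of what such an agent needs. A minimal sketch, with /perf as a placeholder reporting URL on your own infrastructure:

// Collect basic page-load timings and beacon them to an in-house endpoint.
window.addEventListener('load', function () {
  setTimeout(function () {
    var t = performance.timing; // Navigation Timing (Level 1)
    var metrics = {
      dns: t.domainLookupEnd - t.domainLookupStart,
      connect: t.connectEnd - t.connectStart,
      ttfb: t.responseStart - t.requestStart, // time to first byte
      domReady: t.domContentLoadedEventEnd - t.navigationStart,
      fullLoad: t.loadEventEnd - t.navigationStart
    };
    navigator.sendBeacon('/perf', JSON.stringify(metrics));
  }, 0); // loadEventEnd is only populated after the load handler returns
});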
