Re-thinking what anonymous really means on the internet today

Joshua Wöhle
Published in SuperAwesome · Jan 23, 2019 · 7 min read

One of the most important challenges in the kidtech category is solving for a zero-data internet, an idea similar in philosophy to privacy by design. The principle calls for kids' apps to be built in a way that eliminates (not just reduces) the risk of collecting kids' personal data.

Over the course of building SuperAwesome and digging into these challenges, I’ve started to think about the evolving issues of digital privacy in a world of ever-increasing metadata. Fundamentally, we need to re-think what digital anonymity really means today.

It’s all anonymous… until it isn’t

Companies often say their systems are “fully anonymous” because the only things they store are IP addresses, user agents and the like.

In a perfect world this makes sense. None of those pieces of information allows you to actually identify a person by name, and their use is often limited to internal operations. The problem arises when this ‘anonymous’ data set meets today’s online world, which is fuelled by advertising and data. Vast data management platforms (DMPs) operate with the sole purpose of collecting anonymous-but-unique data points, combining them into ‘fingerprints’ (sets of data points, none of which identifies a consumer on its own, but which describe that user almost uniquely when combined) and correlating them with commercial databases of actual people to build massively detailed profiles of named individuals.

There is a foolproof way of avoiding this, which I will get to later on, but first let me provide some more context to the problem.

Why data always ends up being personal — the Anonymisation Fallacy

The theoretical idea of anonymisation is solid. Most developers maintaining a database make sure there is no personal data in the classical sense (no first and last names, addresses, etc.) in their system, identifying records only by “anonymous identifiers”: strings of characters or numbers that should, in theory, never be attributable to a human being.

The problem starts when you realise there isn’t just one database. We live in a world with millions of companies, each with multiple databases, and each of those databases contains data sets that identify their users by anonymous but unique identifiers. As these data sets grow over time, new variables are constantly added to their records, building ever more detailed profiles of their ‘anonymous’ users.

The growth of these databases makes it ever more statistically likely that they contain ‘fingerprints’ of specific users. Very few databases are isolated today: billions of matching, sharing and synchronising transactions have the net effect of making every human who goes online part of a giant, global virtual database. This virtual database is a super-set of all correlatable data, combining both personal and non-personal databases and resulting in a master set which, by definition, is very personal. This is the Anonymisation Fallacy: all it takes is a few non-personal data points to create a personal and unique user profile, which lives and grows forever.

An example

  • A very simple app, maybe even designed for kids, has an authentication system that only asks for a username and a password to register
  • A little further in the app it asks the user for their favourite colour and maybe their favourite sports team to customise the experience
  • For statistical purposes, the app maintains a log of the user agent of the device and the IP address from which it accesses the app, and stores the user ID alongside each recorded user event, in order to analyse user behaviour, optimise conversion, design new features, etc.

The result: anyone accessing this data set could combine the username, favourite colour, favourite sports team, user agent and IP address to create a fingerprint that uniquely identifies this particular user, even though the set contains no information that would traditionally be classed as “personal information.” This fingerprint has real value on the open market, where it is highly likely to end up: traded, further enriched, and sold again and again.
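
To make the mechanics concrete, here is a minimal sketch in Python (my own illustration, with made-up field values) of how a handful of individually ‘non-personal’ values collapse into a single, highly distinctive fingerprint once combined:

```python
import hashlib

# A hypothetical 'anonymous' record as it might sit in the app's analytics store.
# None of these fields is classical personal data on its own.
record = {
    "username": "soccer_fan_07",
    "favourite_colour": "green",
    "favourite_team": "Arsenal",
    "user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 12_1 like Mac OS X)",
    "ip_address": "203.0.113.42",
}

def fingerprint(rec: dict) -> str:
    # Concatenate the fields in a stable order and hash the result.
    # Individually these values are unremarkable; combined, they are
    # close to unique for this user.
    combined = "|".join(f"{k}={rec[k]}" for k in sorted(rec))
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()

print(fingerprint(record))
```

Any other database that happens to hold even a few of the same fields can compute the same kind of key, which is exactly the correlation mechanism DMPs exploit.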

Why intending to comply with the law is not enough

If the only risk here is companies selling or sharing this data without us knowing about it, then the problem could well be solved through legislation. The problem, however, is that data sets like these can no longer be fully controlled. Let me explain.

Everybody gets hacked

It is well known that sites get hacked every single day (this site tracks them). Data sets are exposed daily and immediately hoovered up by online actors who correlate them in exactly this way to increase their resale value.

Organisations are big

Hundreds, if not thousands, of people work on software applications in many organisations. Even where the compliant storage and processing of personal data is taken very seriously, the handling of simple, non-personal, supposedly anonymised data is often not treated with the same rigour.

The nature of this accidental fingerprinting process means that organisations will often have no clue it is happening. It is only once the data set leaves the company (maliciously, inadvertently, or for a benevolent purpose that is later hijacked) that the dangers of this situation become clear.

Data analysis is getting more efficient

Artificial intelligence, and machine learning in particular, is getting extremely efficient at identifying patterns humans can’t spot (and often don’t even understand). This means there are systems able to identify and create fingerprints from ever smaller, more dispersed data points, multiplying the number of identifiable profiles.

OK so what can we actually do about it?

This problem can only be solved through a fundamental change in the way we build software applications. Contrary to what most of us have been taught, we can build systems that do not rely on unique identifiers.

Kryder’s law has made us complacent

Our addiction to unique identifiers and application architectures that are built on them can be traced to two trends in particular:

  • Increased storage capacity has made it too easy to store huge volumes of data and tie each piece back to a unique identifier.
  • The widespread use of the object model in development, which formalised the practice of creating an “object” and tying descriptive properties to it.

Although the benefits of the object model are indisputable, more consideration should be given to whether every single piece of data really needs to be tied back to a user. Data privacy by design starts at the object model.
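
As a rough illustration (a sketch of the habit described above, not code from any real system), this is the pattern the object model encourages: one user object, keyed by a unique identifier, quietly accumulating a new property for every feature shipped:

```python
from dataclasses import dataclass, field

# The default object-model habit: one object per user, keyed by a unique
# identifier, with every new feature adding yet another property to it.
@dataclass
class User:
    user_id: str                                    # everything hangs off this
    username: str
    favourite_colour: str = ""
    favourite_team: str = ""
    events: list = field(default_factory=list)

    def record_event(self, name: str) -> None:
        # Each call ties one more behavioural data point back to user_id.
        self.events.append(name)
```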

Change your data patterns from ID-based to zero-data-based

Rather than tying every single piece of data to a user and building regular aggregation and reporting tools on top (which leaves individual user records bundled with a lot of data, and thus vulnerable to fingerprinting), I would recommend building systems that do not correlate this data to any unique record in the first place. In other words, we need to move

from an approach where we collect every piece of data because we are scared we might need it at a later stage, to an approach where we think up front about the data we want to collect and how it is structured.

If we do this correctly, we can move from data stores that appear to be anonymous but contain unique identifiers to data stores that are truly anonymous.

Example of what this means when designing a new user application:
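
As a sketch of what this could look like (my own illustration, assuming a simple analytics use case and hypothetical function names): the ID-based pattern ties every event to a user ID, while the zero-data pattern decides up front which questions need answering and stores only the aggregates required to answer them.

```python
from collections import Counter
from datetime import date

# ID-based pattern: every event is stored against a unique user ID, so the
# store slowly accumulates a per-user behavioural profile.
id_based_log = []

def log_event_id_based(user_id: str, event: str) -> None:
    id_based_log.append({"user_id": user_id, "event": event,
                         "day": date.today().isoformat()})

# Zero-data pattern: decide up front which questions we want answered
# ("how many sign-ups today?", "which colour is most popular?") and keep
# only the aggregate counters needed to answer them. No user ID is stored.
daily_counters = Counter()

def log_event_zero_data(event: str) -> None:
    daily_counters[(date.today().isoformat(), event)] += 1

# Both approaches answer the same product questions...
log_event_zero_data("sign_up")
log_event_zero_data("picked_colour:green")
print(daily_counters)
# ...but only the first leaves behind records that can be fingerprinted.
```

The trade-off is deliberate: you lose the ability to answer questions you didn’t plan for, but you also lose the ability to leak a behavioural profile you never needed.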

Obviously the above principles cannot be applied to every single data store (user accounts, authentication and the like remain perfectly valid uses of unique identifiers), but simply changing our default position to a zero-data approach will drastically change the game in terms of the unique data points gathered on individuals, without any loss of functionality.

Legislation should focus on data collection, not data usage

Although one can look at what data leaves a device, it is a lot harder to see what happens to that data once it is stored on a server (and only accessible by the company that owns it). This creates a situation where most publishers store enriched user IDs simply because the practical risk of being found out is small.

COPPA and GDPR (and GDPR-K) have established an important blueprint for laws being implemented in other countries. However, they would be a lot more effective if they focused on what data is being collected rather than what it is being used for.

These principles fundamentally contradict the way that applications have been built around the world. In 20 years we’re likely to talk about this approach the same way we talk about smoking today: how could we have thought it was a good idea to store and transmit individual user IDs enriched with countless behavioural data points?

P.S. We are only at the start of our journey to make the internet a safer and better place for kids and have a long way to go. If you want to help us go faster, we’re hiring! (https://www.superawesome.tv/careers)
