May 17, 2019, by Digital Research Team
Anonymisation and research
Human data are among the richest and most complex available to researchers. With such data comes great responsibility, however, not least that of protecting the safety and privacy of subjects. Recent changes in the legal landscape, most prominently the General Data Protection Regulation, have enshrined in law the obligation to safeguard any personal data that we collect. Personal data cannot be published without the explicit and informed consent of the individual, nor should individuals be identifiable from published data that purport to be anonymous. Below, I summarise three of the most common techniques for making it harder to link a dataset back to the identities of its subjects (that is, anonymisation), followed by a brief look at future trends in anonymising data.
Label anonymisation is one of the first steps towards preventing the identification of individuals. The simplest method is to replace subjects’ names and any other direct identifiers with numbers (or a comparable referent). Examples from outside the research sphere include UK National Insurance and NHS Numbers. However, some high-profile cases have revealed the potential weakness of label anonymisation, especially in an age where personal data abound on the internet. For example, Narayanan and Shmatikov, from the University of Texas at Austin, combined anonymised Netflix customer data with ratings posted on the Internet Movie Database (www.imdb.com), and were able to de-anonymise customer records with a success rate of up to 68%. Other cases have involved the comparison of medical records with local newspaper stories to isolate and identify supposedly anonymous health records. When additional attribute data exist, bypassing label anonymisation can be relatively straightforward (cf. Denning, Denning, & Schwartz).
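To make the idea concrete, here is a minimal sketch of label anonymisation in Python. The function name `pseudonymise`, the field names and the `S0001`-style labels are all my own illustrative choices, not part of any standard. Note that the lookup table is itself personal data and would need to be stored separately and securely (or destroyed).

```python
def pseudonymise(records, id_field="name"):
    """Replace a direct identifier with an arbitrary label.

    Returns the cleaned records plus the lookup table that links
    each original identifier to its label.
    """
    lookup = {}       # original identifier -> label (keep this secret)
    anonymised = []
    for record in records:
        original = record[id_field]
        if original not in lookup:
            # Hypothetical label format: S0001, S0002, ...
            lookup[original] = f"S{len(lookup) + 1:04d}"
        # Copy every field except the direct identifier
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["subject_id"] = lookup[original]
        anonymised.append(cleaned)
    return anonymised, lookup


survey = [
    {"name": "Alice Example", "city": "Nottingham", "score": 3},
    {"name": "Bob Example", "city": "Derby", "score": 5},
    {"name": "Alice Example", "city": "Nottingham", "score": 4},
]
cleaned, lookup = pseudonymise(survey)
print(cleaned)
```

The same subject always receives the same label, so repeated measurements can still be linked to each other. The remaining attributes (here `city` and `score`) are untouched, which is exactly what leaves the door open to the linkage attacks described above.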
Alongside label anonymisation, generalisation is a common means of preventing the identification of individual subjects in a dataset. Instead of listing exact addresses, for example, subjects can be grouped by city or country; instead of dates of birth, subjects can be grouped by age range. While it is indeed harder to identify individuals from data aggregated in this manner, the downside is that generalisation leads ipso facto to a less fine-grained and therefore less accurate picture. There can be a tension (analysed and discussed by, e.g., Dinur and Nissim) between carrying out robust anonymisation and the desire to extract the most precise and meaningful information from one’s data.
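A rough sketch of generalisation in Python might look like the following. The ten-year band width, the function names and the record layout are assumptions made purely for illustration; in practice the appropriate granularity depends on the dataset and the risk of re-identification.

```python
from datetime import date


def age_band(date_of_birth, today, width=10):
    """Turn an exact date of birth into an age range such as '30-39'."""
    # Subtract one year if this year's birthday has not happened yet
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    age = today.year - date_of_birth.year - (0 if had_birthday else 1)
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def generalise(record, today):
    """Keep only coarse versions of the date of birth and the address."""
    return {
        "subject_id": record["subject_id"],
        "age_band": age_band(record["date_of_birth"], today),
        "city": record["city"],   # the street address is dropped entirely
    }


subject = {
    "subject_id": "S0001",
    "date_of_birth": date(1985, 6, 15),
    "street": "1 Example Road",
    "city": "Nottingham",
}
print(generalise(subject, today=date(2019, 5, 17)))
```

Widening the bands (say, `width=20`) makes individuals harder to single out but also blurs the picture further, which is the trade-off described above.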
The third basic tool in the anonymiser’s kit is the suppression of records, usually employed in cases where the number of data points is too low to guarantee anonymity. If a survey of 10,000 households includes only two where the number of children in the family is, say, eight (rare cases, so easily identifiable), suppression of those records would stop anyone from recognising the households in question. Suppression has the advantage of leaving the remaining data intact. However, it also distorts one’s overall results based on a subjective choice (what about the four households with seven children? the twenty households with six children? where do we draw the line?). Suppression is also inapplicable in cases with low numbers of subjects, for example studies of rare diseases where researchers work with just a handful of individuals.
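The household example can be sketched in a few lines of Python. The threshold `min_count` is precisely the subjective choice discussed above; the value of 5 used here is an arbitrary assumption, and the function and field names are hypothetical.

```python
from collections import Counter


def suppress_rare(records, field, min_count=5):
    """Drop records whose value for `field` occurs fewer than `min_count` times.

    Returns the retained records and the number of records suppressed.
    """
    counts = Counter(record[field] for record in records)
    kept = [r for r in records if counts[r[field]] >= min_count]
    return kept, len(records) - len(kept)


# Toy survey: most households have two children, two households have eight
households = [{"household": i, "children": 2} for i in range(10)]
households += [{"household": 10, "children": 8}, {"household": 11, "children": 8}]

kept, n_suppressed = suppress_rare(households, "children", min_count=5)
print(len(kept), n_suppressed)
```

Reporting the number of suppressed records alongside the results at least makes the distortion visible to readers, even if it cannot remove it.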
Achieving the balance between protecting subjects’ identities and publishing detailed research results can be difficult, but a number of tools can smooth the path. The Anonymisation Decision-Making Framework has been published by UKAN (UK Anonymisation Network) and explores the techniques above as well as more sophisticated approaches (e.g. k-anonymisation). The European Medicines Agency has full guidelines on the publication of clinical data. Eticas and The Open Data Institute have teamed up to produce a report on the relationship between anonymisation and open data. Finally, the European Union has an opinion on anonymisation techniques with recommendations for risk reduction.
Looking to the future, we may soon be in a position to sidestep the anonymisation issues described above thanks to synthetic data, that is, data that mimic the statistical patterns of an original dataset without reproducing its individual characteristics. Organisations have begun to explore the potential of this technique. An example is The Simulacrum by Health Data Insight, an NHS-aligned entity that specialises in artificial cancer data. But you don’t need to rely on others for your supply of synthetic data. The Open Data Institute has created a really nice tutorial (Python based) that shows you how to make your own. I encourage you to take a look and start creating data!
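To give a flavour of the idea, here is a deliberately simplistic sketch using only the Python standard library. It is not the method used in the Open Data Institute tutorial or by The Simulacrum: it simply resamples each column independently, so it preserves the distribution of each variable on its own but discards any correlations between variables. The function name `synthesise` and the example fields are my own assumptions.

```python
import random


def synthesise(records, n, seed=None):
    """Build n synthetic records by sampling each column independently.

    Each field's overall distribution is roughly preserved, but
    relationships between fields are lost.
    """
    rng = random.Random(seed)
    fields = list(records[0].keys())
    # Collect the observed values of each field
    columns = {f: [r[f] for r in records] for f in fields}
    return [{f: rng.choice(columns[f]) for f in fields} for _ in range(n)]


original = [
    {"age_band": "30-39", "city": "Nottingham", "children": 2},
    {"age_band": "40-49", "city": "Derby", "children": 0},
    {"age_band": "30-39", "city": "Leicester", "children": 1},
    {"age_band": "20-29", "city": "Nottingham", "children": 0},
]
for row in synthesise(original, n=3, seed=42):
    print(row)
```

With a small dataset, independent sampling can still reproduce a real record by chance, so a sketch like this is no substitute for a proper disclosure-risk assessment. More serious tools model the joint distribution of the data and add safeguards against that kind of leakage.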
[Addendum: for those who work with R, take a look at sdcMicro. CRAN link: https://cran.r-project.org/web/packages/sdcMicro/index.html. Informative guide: https://sdcpractice.readthedocs.io/en/latest/index.html]