November 22, 2021, by Digital Research

Automated anonymisation of texts and transcripts

In this blog, we discuss an automated process for anonymising interview transcripts, patient notes, or other free-text data containing personal information. 

Colleagues wishing to share participant notes or interview transcripts, for example as publication appendices or in a research data repository, will likely need to anonymise the data. Anonymisation also comes with a number of benefits, such as these listed by the UK Information Commissioner’s Office:

  • developing greater public trust and confidence that data is being used for the public good, while privacy is protected
  • incentivising researchers and others to use anonymous information instead of personal data, where possible
  • economic and societal benefits deriving from the availability of rich data sources

We have been exploring ways to automate the anonymisation process.

Below, we provide you with a link to some copy-and-paste code that will anonymise texts automatically. But first:

What is de-identification?

De-identification is the process of removing or obscuring personally identifiable information (PII) from a text or dataset. Such data typically includes names, locations, social security/national insurance numbers, contact details, and the like. The process can be approached in a number of ways, but the output usually falls into one of two camps:

a. the masking of PII with labels (“my name is Jane” becomes “my name is <NAME>”)
b. the replacement of PII with dummy data (“my name is Jane” becomes “my name is John”)
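The two styles differ only in what is swapped in for each detected span. The following sketch illustrates both, with the PII spans hard-coded for clarity (in practice a model supplies them):

```python
# Sketch of the two de-identification output styles. The character spans
# below are hard-coded for illustration; in practice an NER model finds them.

def mask(text, spans):
    """Replace each (start, end, label) span with its label, e.g. <NAME>."""
    # Edit right-to-left so earlier character offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text

def replace(text, spans, dummies):
    """Swap each span for a dummy value of the same category."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + dummies[label] + text[end:]
    return text

sentence = "my name is Jane"
spans = [(11, 15, "NAME")]  # the span covering "Jane"

print(mask(sentence, spans))                       # my name is <NAME>
print(replace(sentence, spans, {"NAME": "John"}))  # my name is John
```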

We will focus on the first of these. Here is a fuller example of what an output looks like:

Original text:

00:00:02 Speaker 1: hi john, it’s nice to see you again. how was your weekend? do anything special?

00:00:06 Speaker 2: yep, all good thanks. i was with my sister in derby. We saw, you know, that james bond film. what’s it called? then got a couple of drinks at the pitcher and piano, back in nottingham.

00:00:18 Speaker 1: that’s close to your flat, right?

00:00:25 Speaker 2: yeah, about five minutes away. i live on parliament street, remember?

00:00:39 Speaker 1: of course, i remember. you moved last year after you left your parents’ place.

00:00:39 Speaker 2: yeah, it was my sister’s birthday on sunday, susie, the older one. i told you last time about that new job she got. sainsbury’s, the one by victoria centre.

De-identified text:

00:00:02 Speaker 1: hi PER, it’s nice to see you again. how was your weekend? do anything special?

00:00:06 Speaker 2: yep, all good thanks. i was with my sister in LOC. We saw, you know, that MISC film. what’s it called? then got a couple of drinks at the pitcher and piano, back in LOC.

00:00:18 Speaker 1: that’s close to your flat, right?

00:00:25 Speaker 2: yeah, about five minutes away. i live on LOC, remember?

00:00:39 Speaker 1: of course, i remember. you moved last year after you left your parents’ place.

00:00:39 Speaker 2: yeah, it was my sister’s birthday on sunday, PER, the older one. i told you last time about that new job she got. ORG, the one by LOC.

 

So how can you achieve this?

Colleagues may already be familiar with NLM Scrubber or similar pre-packaged tools. In our case, we have used an open-source model from the Hugging Face community, a repository of pre-trained models for natural language processing. (The output above is based on this particular model.)

With a few lines of Python code, you can produce de-identified text in under a minute.
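A minimal sketch of the approach is below, using the `transformers` library with a CoNLL-style NER model. The model name `dslim/bert-base-NER` is an illustrative assumption; the post links to the specific model we used, and any model producing PER/LOC/ORG/MISC entities will work the same way.

```python
# Sketch of masking PII with a Hugging Face NER pipeline. The model name
# dslim/bert-base-NER is illustrative; substitute the model you prefer.

def mask_entities(text, entities):
    """Replace each detected entity span with its label (PER, LOC, ORG, MISC)."""
    # Edit right-to-left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + ent["entity_group"] + text[ent["end"]:]
    return text

def deidentify(text):
    """Run the NER pipeline over one line of text and mask what it finds."""
    from transformers import pipeline  # pip install transformers

    ner = pipeline("ner", model="dslim/bert-base-NER",
                   aggregation_strategy="simple")
    return mask_entities(text, ner(text))
```

Calling `deidentify("i was with my sister in Derby.")` would return the line with the place name replaced by LOC, matching the output shown earlier.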

ANONYMISE YOUR OWN DATA. If you would like to replicate these results, you can find the code here: https://uniofnottm.sharepoint.com/sites/DigitalResearch/SitePages/Automated-an.aspx

Some caveats

Machine-learning models inherit biases from the data on which they were trained. These biases can lead to errors such as false positives (a non-PII word is identified as PII) or false negatives (PII is not identified, and therefore not anonymised). In the case of the model described above, the creators specifically note that:

This model is limited by its training dataset of entity-annotated news articles from a specific span of time. This may not generalize well for all use cases in different domains. Furthermore, the model occasionally tags subword tokens as entities and post-processing of results may be necessary to handle those cases.

Outputs reliant on pre-trained models should always be checked for errors.
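The subword issue the model card mentions can be handled in post-processing: a BERT tokenizer may split a word like "nottingham" into WordPiece fragments, each tagged separately. The sketch below merges consecutive "##"-prefixed fragments back into whole-word entities; the raw token dictionaries imitate pipeline output, and in recent `transformers` versions the pipeline's `aggregation_strategy="simple"` option performs this merging for you.

```python
# Sketch of one post-processing step for subword tokens. A BERT NER model can
# split "nottingham" into pieces ("not", "##ting", "##ham"); merging the
# "##"-prefixed pieces reassembles the whole-word entity. The input dicts
# imitate transformers pipeline output.

def merge_subwords(tokens):
    """Join '##'-prefixed WordPiece tokens onto the preceding token."""
    merged = []
    for tok in tokens:
        if tok["word"].startswith("##") and merged:
            merged[-1] = {**merged[-1],
                          "word": merged[-1]["word"] + tok["word"][2:],
                          "end": tok["end"]}
        else:
            merged.append(dict(tok))
    return merged

raw = [{"word": "not",    "entity": "B-LOC", "start": 0, "end": 3},
       {"word": "##ting", "entity": "I-LOC", "start": 3, "end": 7},
       {"word": "##ham",  "entity": "I-LOC", "start": 7, "end": 10}]
print(merge_subwords(raw))  # → one LOC entity spanning "nottingham"
```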

Next steps

The process described above requires basic knowledge of running Python scripts. Our next step is to investigate ways of packaging the process so that no coding knowledge is required.

If you are interested in talking to us about this work and getting involved, please feel free to contact one of the team.

Further reading

Berg, H., Henriksson, A., Fors, U., Dalianis, H. (2021) ‘De-identification of clinical text for secondary use: research issues’. HEALTHINF. pp. 592-99.

Johnson, A., Bulgarelli, L., Pollard, T. (2020) ‘Deidentification of free-text medical records using pre-trained bidirectional transformers’. CHIL ’20 Proceedings. pp. 214-21.

Infographic: The Future of Privacy Forum, A Visual Guide to Practical Data De-identification.

Posted in: Advice and Guidance, Automated transcription, Collaboration, Data Analytics, Process Automation, Research Data Management