Tools In Action • 10 minutes • Apr 04, 2024

Anonymising Data with Presidio

Jon Carlton
Lead MLOps Engineer

Large Language Models (LLMs) are becoming increasingly ubiquitous and there are privacy concerns in both how they’re trained and how users interact with them. 

To train LLMs, large quantities of data are scraped from the web, inevitably capturing personally identifiable information (PII). This information becomes embedded in the model’s knowledge base, where it can be surfaced intentionally or unintentionally, posing a privacy risk.

Recently, security researchers demonstrated that ChatGPT would begin to regurgitate its training data, which contained PII, in response to a particular input. Our blog series on Cybersecurity for LLMs covers these types of attacks in more detail, so do check it out.

Once a model is in production, there is a risk of users inadvertently divulging personal information as they interact with it. And, as users’ conversation logs are often used as supplementary training data, the problem is self-perpetuating.

While there are approaches to mitigate some risks, and legal regulations limiting PII storage, wouldn’t it be better to prevent this data from entering the system altogether?

Similar to how a cuttlefish uses camouflage to dynamically disguise itself, anonymisation is one technique we can use to hide information in plain sight.

This blog delves into how you can do that using an open source tool called Presidio, which obscures PII in text. We’ll cover what it is, how it works, and demonstrate anonymising text. The approach isn’t specific to LLMs: it works with any text.

What is Presidio?

Presidio is an open source library from Microsoft for data protection and de-identification. Given text input, it identifies and anonymises PII. There is also functionality to redact PII entities from images, but we won’t cover that in this blog.

Presidio consists of two tools: an analyser and an anonymiser. You’ll see both in action later, but let’s first discuss what they do.

Presidio Analyser

This is where most of the work is done. Input text which could contain PII is passed through a detection flow comprising the following steps:

  • Regular expressions are applied to detect known patterns.
  • Named entity recognition models are used to detect entities in the text, such as a name.
  • Detected patterns are validated, e.g., checking that an identified phone number is a plausible phone number.
  • Context words are identified to increase detection confidence.

Each of these steps has default recognisers, which are customisable depending on the use case. The named entity recognition models are spaCy models by default, but transformer models can also be used. The output from the analyser is designed to be ingested by the anonymiser, which we’ll discuss next.
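To make the detection flow more concrete, here is a minimal sketch of three of those steps (pattern matching, validation, and context scoring) for a single entity type, using only the Python standard library. It is not Presidio's implementation: the pattern, the context words, the score values, and the function names are all assumptions chosen for illustration, and the named entity recognition step is left out.

```python
import re

# Hypothetical pattern, context words, and scores, chosen only for illustration.
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d+")
CONTEXT_WORDS = {"phone", "call", "mobile", "telephone"}
BASE_SCORE = 0.4      # assumed confidence for a bare pattern match
CONTEXT_BOOST = 0.35  # assumed bonus when a context word appears nearby


def is_plausible_phone(candidate):
    """Validation step: accept only candidates with exactly ten digits."""
    return sum(ch.isdigit() for ch in candidate) == 10


def detect_phone_numbers(text, window=40):
    """Pattern match, validate, then raise the score if context supports it."""
    findings = []
    for match in PHONE_PATTERN.finditer(text):
        if not is_plausible_phone(match.group()):
            continue
        score = BASE_SCORE
        # Inspect the words just before the match for supporting context.
        preceding = text[max(0, match.start() - window):match.start()].lower()
        if CONTEXT_WORDS & set(re.findall(r"[a-z]+", preceding)):
            score = min(1.0, score + CONTEXT_BOOST)
        findings.append({
            "type": "PHONE_NUMBER",
            "start": match.start(),
            "end": match.end(),
            "score": score,
        })
    return findings


print(detect_phone_numbers("My phone number is (555) 555-0101."))
```

The same candidate scores lower when no context word precedes it, and a candidate that fails validation is dropped altogether, which is the intuition behind combining the steps rather than relying on a regular expression alone.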

Presidio Anonymiser

Once potential PII entities have been identified by the analyser, the results can be passed to the anonymiser.

It has two roles: anonymisation and deanonymisation. In the former, it applies one of a range of built-in operators to anonymise the text, for example, replacement, masking, redaction, or encryption. In the latter, it reverts anonymisation operations, for example, decrypting encrypted text.

As with the analyser’s recognisers, custom anonymisation operators can be implemented to fit the use case.
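One way to picture an operator is as a function that receives the matched text and returns whatever should stand in its place. The sketch below uses plain Python stand-ins for replacement, masking, redaction, and hashing to show the idea. These are not Presidio's operators: the function names, the placeholder format, and the default of masking four characters are assumptions, and reversible operators such as encryption are left out because they need a cryptography library.

```python
import hashlib


# Simplified stand-ins for operator types; each takes the matched text and
# its entity type and returns the replacement string.
def replace(value, entity_type):
    # The placeholder format is an assumption of this sketch.
    return f"<{entity_type}>"


def mask(value, entity_type, chars_to_mask=4, masking_char="*"):
    keep = max(0, len(value) - chars_to_mask)
    return value[:keep] + masking_char * (len(value) - keep)


def redact(value, entity_type):
    return ""


def hash_value(value, entity_type):
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def apply_operator(text, findings, operator):
    """Rewrite each (entity_type, start, end) span using the operator."""
    # Work from the last span to the first so earlier offsets stay valid.
    for entity_type, start, end in sorted(findings, key=lambda f: f[1], reverse=True):
        text = text[:start] + operator(text[start:end], entity_type) + text[end:]
    return text


sample = "Call Aviso on (555) 555-0101"
spans = [("PERSON", 5, 10), ("PHONE_NUMBER", 14, 28)]

for operator in (replace, mask, redact):
    print(operator.__name__, "->", repr(apply_operator(sample, spans, operator)))
```

A custom operator in this sketch is just another function with the same signature, which is why swapping one anonymisation strategy for another does not require touching the detection side.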

How do I use Presidio to anonymise text?

Out of the box, Presidio is pretty good at anonymising most English (American) text that contains PII, so it’s straightforward to start adding PII detection and anonymisation to your suite of tools.

Let’s walk through a simple example of using Presidio on (fake) text which contains some PII: a name, location, and phone number.

<pre><code>text = "Aviso Sanchez from Anytown, USA, with phone number (555) 555-0101, just won the grand prize in the local pet photo contest. Her goldfish, Bubbles, whose shimmering scales are said to resemble a disco ball, will be featured on the cover of the upcoming town newsletter."</code></pre>

From there, we can start to use the analyser on the text, to identify the PII contained within it.

<pre><code>from presidio_analyzer import AnalyzerEngine

# initialise the engine
analyzer = AnalyzerEngine()

# analyse the text for PII
analyzer_results = analyzer.analyze(text=text, language='en')
</code></pre>

Two parameters are required for analysis: the candidate text and the language. The language is required because Presidio needs to know which spaCy language model to load. In practice, this means that any language supported by spaCy can also be used with Presidio once the corresponding model is configured, which is pretty neat. The result from the analyser is a list containing the following for each identified PII entity: the type, the start position in the text, the end position, and the confidence score. Here is the output for our text:

<pre><code>[type: PERSON, start: 0, end: 13, score: 0.85,
type: LOCATION, start: 19, end: 26, score: 0.85,
type: LOCATION, start: 28, end: 31, score: 0.85,
type: PHONE_NUMBER, start: 51, end: 65, score: 0.75]</code></pre>
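The start and end positions are character offsets into the original string, so slicing the text with them recovers exactly what was flagged. Here is a small sketch of that, with the findings copied by hand from the output above as plain tuples and only the first sentence of the example text. The <code>matched_spans</code> helper and its <code>min_score</code> parameter are hypothetical names used for illustration.

```python
# First sentence of the example text; offsets count from the first character.
text = (
    "Aviso Sanchez from Anytown, USA, with phone number (555) 555-0101, "
    "just won the grand prize in the local pet photo contest."
)

# Findings copied from the analyser output: (type, start, end, score).
findings = [
    ("PERSON", 0, 13, 0.85),
    ("LOCATION", 19, 26, 0.85),
    ("LOCATION", 28, 31, 0.85),
    ("PHONE_NUMBER", 51, 65, 0.75),
]


def matched_spans(text, findings, min_score=0.0):
    """Return (type, matched text) pairs for findings at or above min_score."""
    return [
        (kind, text[start:end])
        for kind, start, end, score in findings
        if score >= min_score
    ]


for kind, value in matched_spans(text, findings):
    print(f"{kind}: {value}")
```

The real analyser results are objects rather than tuples, exposing <code>entity_type</code>, <code>start</code>, <code>end</code>, and <code>score</code> attributes, so the equivalent slice is <code>text[result.start:result.end]</code>. Filtering on the score, as <code>min_score</code> does here, is one simple way to discard low-confidence findings before anonymising.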

We can now pass the analyser’s results to the anonymiser engine, along with the original text:

<pre><code>from presidio_anonymizer import AnonymizerEngine

# initialise the engine
anonymiser = AnonymizerEngine()

# replace the PII in the original text using the results from the analyzer
results = anonymiser.anonymize(
  text=text,
  analyzer_results=analyzer_results
)</code></pre>

By default, the anonymiser replaces each identified PII entity with its <code>type</code> in angle brackets, e.g., a phone number is replaced with <code>&lt;PHONE_NUMBER&gt;</code>. Here’s the result of running the anonymiser:

<pre><code>print(results.text)

&lt;PERSON&gt; from &lt;LOCATION&gt;, &lt;LOCATION&gt;, with phone number &lt;PHONE_NUMBER&gt;, just won the grand prize in the local pet photo contest. Her goldfish, Bubbles, whose shimmering scales are said to resemble a disco ball, will be featured on the cover of the upcoming town newsletter.</code></pre>

You can see that the name has been replaced with <code>&lt;PERSON&gt;</code>, the locations with <code>&lt;LOCATION&gt;</code> tags, and the phone number with <code>&lt;PHONE_NUMBER&gt;</code>. Unfortunately, Bubbles doesn’t get the same treatment.

We now have anonymised text that could be passed to an LLM or some other downstream application for processing.

There’s still some information encoded in the text: you know that it mentions a person, two locations, and a phone number. In some cases, you might want to anonymise this further. Presidio allows you to do this through operator configuration. Here’s an example of changing the replace operator so that it uses <code>ANONYMISED</code> for all PII entities rather than their tags.

<pre><code>from presidio_anonymizer.entities import OperatorConfig

# Update the replace operator with a new replacement value
anonymised_operator_config = OperatorConfig(
  operator_name='replace',
  params={'new_value': 'ANONYMISED'}
)

# Run the anonymiser with the new default operator
results = anonymiser.anonymize(
  text=text,
  analyzer_results=analyzer_results,
  operators={'DEFAULT': anonymised_operator_config}
)

print(results.text)

ANONYMISED from ANONYMISED, ANONYMISED, with phone number ANONYMISED, just won the grand prize in the local pet photo contest. Her goldfish, Bubbles, whose shimmering scales are said to resemble a disco ball, will be featured on the cover of the upcoming town newsletter.</code></pre>

Using the above, you can now start to anonymise data going into your system. Presidio is a highly configurable tool, and I would encourage you to explore the documentation.

It’s worth noting that the anonymisation procedure could also be applied to the output of an LLM, preventing the accidental generation of text containing PII.
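As a rough sketch of what guarding both directions might look like, the wrapper below anonymises the prompt before the model sees it and the reply before the user does. Everything here is hypothetical: <code>guarded_completion</code> is an illustrative name, and the regex-based <code>toy_anonymise</code> and echoing <code>toy_llm</code> are stand-ins so the example runs on its own. In a real system, the <code>anonymise</code> argument would be a function that runs the analyser and anonymiser over the text.

```python
import re


def guarded_completion(prompt, llm, anonymise):
    """Anonymise the prompt going in and the model's reply coming out."""
    safe_prompt = anonymise(prompt)
    reply = llm(safe_prompt)
    return anonymise(reply)


# Stand-ins so the sketch is self-contained: a regex "anonymiser" that only
# knows about phone numbers, and a fake model that leaks a number in its reply.
def toy_anonymise(text):
    return re.sub(r"\(\d{3}\) \d{3}-\d{4}", "<PHONE_NUMBER>", text)


def toy_llm(prompt):
    return f"You said: {prompt} You can also reach support on (555) 555-0199."


print(guarded_completion("My number is (555) 555-0101.", toy_llm, toy_anonymise))
```

Passing the model and the anonymiser in as arguments keeps the wrapper independent of any particular LLM client or anonymisation configuration.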

Conclusion

Privacy is incredibly important, and it should be at the forefront of the minds of those training models and (like us) building systems that contain LLMs and other ML models. Presidio is a great starting point for ensuring that PII is anonymised on its way into or out of your model or system.
