Is Cleardox anonymizing or pseudonymizing data?

What is the difference, you may ask? Aren't both supposed to leave you with a document that can pass an in-depth GDPR check? Well, the answer is both yes and no. Both approaches hide personal and sensitive data in your documents. Yet one of them is substantially easier to achieve, whereas the other exempts you from the GDPR rules altogether. Read on for an easy overview.

Companies want to know the difference

At Cleardox, we help companies redact their documents faster and more efficiently by providing a tool that automatically detects personal information. That is a welcome timesaver for busy companies that need to comply with the GDPR. Overall, our tool gives the user two options:

  1. Blacklining the information.

  2. Replacing or re-classifying the information.
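To make the two options concrete, here is a minimal sketch in Python. It is an illustration only, not the Cleardox tool: the span format (start, end, category), the function names and the sample sentence are all assumptions made up for this example.

```python
from typing import List, Tuple

# Hypothetical detector output: (start, end, category) character spans.
Span = Tuple[int, int, str]


def blackline(text: str, spans: List[Span]) -> str:
    """Option 1: cover each detected span with block characters."""
    out = text
    # Work from the last span to the first so earlier offsets stay valid.
    for start, end, _ in sorted(spans, reverse=True):
        out = out[:start] + "█" * (end - start) + out[end:]
    return out


def replace_with_label(text: str, spans: List[Span]) -> str:
    """Option 2: swap each detected span for its category label."""
    out = text
    for start, end, category in sorted(spans, reverse=True):
        out = out[:start] + f"[{category}]" + out[end:]
    return out


# Invented sample sentence and spans, for illustration only.
text = "Anna Berg lives in Aarhus."
spans = [(0, 9, "NAME"), (19, 25, "LOCATION")]

print(blackline(text, spans))
print(replace_with_label(text, spans))  # [NAME] lives in [LOCATION].
```

Note that both functions only change how the hidden information looks on the page. Neither one, on its own, says anything about whether the original data can still be recovered.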

Now, companies often ask us whether these two techniques can be categorized as:

  1. Anonymization.

  2. Pseudonymization.

That is a fair and intuitive assumption, since blacking out information seems to make it disappear, while replacing a name with a fictional title looks like pseudonymization. Yet the assumption does not hold water.

Companies are so interested in the technique behind our redaction method because the GDPR treats anonymization and pseudonymization differently.

When receiving a redacted document, a company will naturally ask: Do we still need to take the GDPR into account, or can we ignore it from now on?

Back to basics - identifiers vs. identified

Unfortunately, the answer is not black and white, simply because it is not the method (for example, blacklining) that decides whether a redacted document renders its subjects anonymous. Instead, it is the context and the possibility of re-identification. We are going to get a bit technical now. But hang in there, and it will soon make perfect sense! First, let’s take one step back and recall the main purpose of redaction. Redaction is all about obscuring the relationship between identifier and identified. Broadly speaking, identifiers are names, titles and locations that refer to real entities in the world, and the identified are those entities themselves.

Anonymization only occurs when we have obscured the relationship between identifier and identified to such an extent that outsiders cannot recognize the hidden identity under any circumstances.

Anonymization – a very difficult task

Let's take a look at how the GDPR itself puts it. Recital 26 describes anonymized information as:

“Information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is no longer identifiable. This Regulation does not therefore concern the processing of such anonymous information…”

If we cut to the chase, this passage tells us the following: if a document has been effectively anonymized, we no longer need to worry about the GDPR. We are outside the reach of the regulation.

Naturally, that is a huge relief for companies that wish to share their documents freely without worrying about sky-high fines or breaking EU rules. Anonymization is always encouraged, since it is the most effective way of limiting your risk, and since it also benefits your data subjects.

However, the bar for anonymization is quite high (more on that later), and pseudonymization is therefore a far more widespread solution today. Keep in mind that your document will often be pseudonymized rather than anonymized.

Now, let's turn to pseudonymization.

Pseudonymization: Neither anonymous nor identified

The GDPR defines pseudonymization in Article 4(5) as:

“…the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately, and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.”

A rather long quote, but pay attention to the emphasis on “additional information”. Pseudonymization replaces real identifiers with fictional ones, so that one needs additional information to discover the hidden identities of the subjects involved. This is very useful if you want to reduce the risk of exposing sensitive information while keeping the meaning of the document intact.

Let's take an example!

Say a group of lawyers uses certain documents to work on a particular case. The documents, which have been handed over by the client, contain highly sensitive personal information. Yet the documents are vital to the lawyers, who need them to build their case and defend their client in court. In hindsight, they consider their legal argument so unique and valuable that they wish to share it with the rest of the law firm, in case their colleagues face a similar case in the future.

Luckily, the law firm has a digital knowledge library where they can upload the case information. Before they transfer the document, they need to make sure that all personal information has been redacted, so that readers cannot tell who the client is. However, they still want to preserve the meaning of the document. So they choose to redact it by replacing personal information with pseudonyms.

Names are replaced with abstract titles such as “Person 1, 2, 3”. Companies are replaced with identifiers such as “Company X, Y, Z”. The same goes for social security numbers, health information, genetic data and so on. Once the document has been redacted, other lawyers can still read it and make sense of it. The law firm now has two copies of the case document: a redacted version and an original version.
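The replacement step in this example can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how Cleardox works internally: the `pseudonymize` function, the (value, kind) input format and the names “Anna Berg” and “Nordlys ApS” are all invented for the example. The point to notice is the second return value, a table that maps each pseudonym back to the real value. That table plays the same role as the original document: it is the “additional information” the GDPR definition talks about.

```python
import re
from typing import Dict, List, Tuple

COMPANY_LETTERS = "XYZ"  # this sketch labels at most three companies


def pseudonymize(text: str, entities: List[Tuple[str, str]]) -> Tuple[str, Dict[str, str]]:
    """Replace every occurrence of each entity with a stable pseudonym.

    `entities` stands in for a detector's output (hypothetical format):
    (real value, kind) pairs, where kind is "person" or "company".
    Returns the redacted text and a pseudonym -> real value table.
    """
    to_pseudonym: Dict[str, str] = {}
    persons = companies = 0
    for value, kind in entities:
        if value in to_pseudonym:
            continue  # the same entity keeps the same pseudonym
        if kind == "person":
            persons += 1
            to_pseudonym[value] = f"Person {persons}"
        elif kind == "company":
            if companies >= len(COMPANY_LETTERS):
                raise ValueError("this sketch only labels three companies")
            to_pseudonym[value] = f"Company {COMPANY_LETTERS[companies]}"
            companies += 1
        else:
            raise ValueError(f"unknown kind: {kind}")
    if not to_pseudonym:
        return text, {}
    # Single pass over the text; longer values first so overlapping names match fully.
    ordered = sorted(to_pseudonym, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in ordered))
    redacted = pattern.sub(lambda m: to_pseudonym[m.group(0)], text)
    lookup = {pseudonym: value for value, pseudonym in to_pseudonym.items()}
    return redacted, lookup


# Invented sample data, for illustration only.
text = "Anna Berg sued Nordlys ApS. Anna Berg won the case."
entities = [("Anna Berg", "person"), ("Nordlys ApS", "company")]
redacted, lookup = pseudonymize(text, entities)
print(redacted)  # Person 1 sued Company X. Person 1 won the case.
```

Because “Anna Berg” always becomes “Person 1”, a reader can still follow who did what, which is exactly why the document keeps its meaning.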

Even though lawyers outside the case would have a hard time working out who is behind the pseudonyms, it is still possible for the organization to link the redacted material back to the real identities, simply by comparing the redacted and the non-redacted versions.

Pseudonymization leaves the door open for re-identification

In the example above, the documents were merely pseudonymized rather than anonymized. Why? Because it is still possible to single out an individual and link that person across data sets. All it takes is access to both documents. Had the law firm instead discarded the original, non-redacted version, other employees in the company could no longer identify individuals in the case by comparing the two.
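The difference that the original makes can be shown with a small Python sketch. It is only an illustration: the `reidentify` function, the lookup table and the sample names are hypothetical, and the table simply stands in for whatever additional information (such as the non-redacted original) links pseudonyms to real identities.

```python
import re
from typing import Dict, Optional


def reidentify(redacted: str, lookup: Optional[Dict[str, str]]) -> Optional[str]:
    """Restore the real values if the pseudonym table is available.

    Returns None when no table exists, i.e. when this particular route
    back to the identities has been closed.
    """
    if not lookup:
        return None
    # Longer pseudonyms first, so "Person 10" is not read as "Person 1" + "0".
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in ordered))
    return pattern.sub(lambda m: lookup[m.group(0)], redacted)


# Invented sample data, for illustration only.
redacted = "Person 1 sued Company X."
lookup = {"Person 1": "Anna Berg", "Company X": "Nordlys ApS"}

print(reidentify(redacted, lookup))  # Anna Berg sued Nordlys ApS.
print(reidentify(redacted, None))    # None
```

The sketch models only one route to re-identification, namely the stored link between pseudonym and real value. Unique facts left in the text can still give a person away, so a missing table is not by itself proof of anonymization.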

When we said earlier that the bar for anonymization was high, we weren't kidding. In fact, the label only applies when not even the sharpest detective can go back and uncover the identities involved. Pseudonymization is a strong security measure. Yet it doesn't change the status of the data, which is still classified as “personal”. As summaries of the GDPR's approach put it:

“Pseudonymization may facilitate processing personal data beyond original collection purposes. Controllers can use pseudonymization to help meet the GDPR data security requirements.”

How does GDPR regulate anonymization and pseudonymization?

Pseudonymization is a very useful way of safeguarding the processing of personal data for scientific, historical and statistical purposes. Yet you still have to monitor the data. In case of a breach involving “a risk to the rights and freedoms of natural persons”, you are obliged to notify the supervisory authority without undue delay and, where the risk is high, to inform the individuals in question. Pseudonymization is encouraged in different forms of research, since it keeps the meaning intact but significantly reduces the risk of harm to data subjects. In our hypothetical case of pseudonymization, the law firm still needs to comply with the GDPR, yet on less restrictive terms.

On the other hand, anonymized documents are not regulated by the GDPR. They are exempt from the rules altogether, since they have been stripped of any personal information and identifiable subjects. That's why companies love anonymization! But while it may be desirable, it leaves no room for mistakes.

Not only must an organization guarantee that data sets cannot be matched, it must also guarantee the complete redaction of any sensitive information such as social security numbers, license plates, property numbers and phone numbers. And that's even before it starts to get tricky!

What if certain facts in the document are so unique that an outsider can deduce the hidden information from press coverage or public knowledge? Or what if the name of a small company is enough to reveal its CEO, even though that person's name has been redacted?

In the end, proper redaction and anonymization require sound judgement and should always be viewed on a case-by-case basis.

Answer: Cleardox does both!

Let's wrap up the article and go back to our initial question. Is Cleardox anonymizing or pseudonymizing data? Well, we have now established that it is not the redaction technique (blacklining or replacement) that determines whether a document is anonymized or pseudonymized. It is whether outsiders are able to reconstruct the personal information.

So it really depends on the case, the documents and how our tool is applied. If neither the original nor any other compromising data set exists, then the answer is anonymization. Otherwise, the answer is pseudonymization.

Either way, both solutions will save your business lots of time and energy. Although pseudonymized data still falls within the scope of the GDPR, some provisions are relaxed to encourage controllers to use it. Thus, controllers that pseudonymize their data sets will have an easier time using personal data for secondary purposes, while still meeting the requirements on data security.


The Cleardox Team.

