
How to remove metadata and make sure your document is properly redacted!


When we redact documents, we want to make sure that everything is correctly redacted - both personal (GDPR) and classified information. Normally, when we redact a document, we are mostly concerned with the plain text that we as readers can see. But a document also contains hidden data that we don't necessarily notice when we examine and redact it.



This hidden data is often referred to as metadata and it is important to be aware of, since it can contain personal information that should be removed when we redact our document.


But what is metadata? In simple terms metadata can be defined as:



Data that provide information about other data

In other words, metadata is a simple shorthand version of the data it refers to - it describes what the other data really is. Think about metadata as the keywords you enter into Google when you do a search. The keywords you put in are the metadata. Another way to think about metadata is: data about data.


Examples of metadata in documents that are sometimes missed in a redaction process


Metadata comes in many different forms and is normally grouped into the following categories:


  • descriptive metadata

  • structural metadata

  • statistical metadata

  • reference metadata

  • administrative metadata


Descriptive metadata is descriptive information about a resource. It is used for discovery and identification. It includes elements such as title, abstract, author and keywords.


Descriptive metadata is in most cases the category you need to be aware of when you redact documents, because this is the category where you often find personal information.


In the example below you can see a list of metadata from a document.


As you can see there are various examples of metadata that directly or indirectly could identify a person:


  • The author is an example of a direct personal identifier - it shows a name.

  • The title could potentially include a name. I used to work for a law firm where storing names in the title of the case document was common practice (that practice soon changed after the new GDPR rules were implemented).

  • Comments (normally as part of track changes) could potentially also reveal a name.

  • The file path can in some cases also be very revealing, for example when it shows a company name. Company names are often entities that need to be redacted - not because the company name itself is personal information (according to GDPR it is not), but because you can use a company name to infer who the person is.
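If you want to check these fields yourself, here is a minimal Python sketch. It relies on the fact that a .docx file is a ZIP package that keeps its core properties (title, author, last modified by and so on) in a part called docProps/core.xml. The function name read_core_properties and the sample data are our own illustrative choices, and the sketch does not look at comments, tracked changes or the file path.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in a .docx package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Core properties that commonly carry names or other identifying text.
FIELDS = ["dc:title", "dc:subject", "dc:creator", "cp:keywords",
          "dc:description", "cp:lastModifiedBy"]


def read_core_properties(docx):
    """Return the non-empty core properties of a .docx (path or file-like object)."""
    with zipfile.ZipFile(docx) as package:
        if "docProps/core.xml" not in package.namelist():
            return {}
        root = ET.fromstring(package.read("docProps/core.xml"))
    found = {}
    for field in FIELDS:
        element = root.find(field, NS)
        if element is not None and element.text:
            found[field] = element.text
    return found


def make_sample_docx():
    """Build a tiny stand-in package in memory (hypothetical data, not a full document)."""
    core = (
        f'<cp:coreProperties xmlns:cp="{NS["cp"]}" xmlns:dc="{NS["dc"]}">'
        "<dc:title>Case file - Jane Doe</dc:title>"
        "<dc:creator>Jane Doe</dc:creator>"
        "<cp:lastModifiedBy>John Smith</cp:lastModifiedBy>"
        "</cp:coreProperties>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("docProps/core.xml", core)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, value in read_core_properties(make_sample_docx()).items():
        print(f"{name}: {value}")
```

To try it on a real document, pass the path of a .docx file to read_core_properties instead of the in-memory sample.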


How do I remove metadata when redacting a document?


Well, first of all: a professional redaction software tool like Cleardox will do the trick. Most redaction software tools remove metadata as part of the redaction process. However, this is not guaranteed. By the way, if you are looking for a redaction software tool, we recommend that you read this post on the ten recommended features any redaction software tool should have.


If you don't have a redaction software tool yet and are forced to redact documents manually, another (more old-fashioned) way to make sure that your metadata has been removed is to:

Print the document and scan it!


In this way you ensure that a new document is being created with the original metadata removed.


This redaction process, however, takes time (and is not good for the environment either).

An alternative way is to remove the metadata with a metadata removal tool which can be found with a simple Google search.


But Microsoft Word also has a feature that allows you to remove metadata. As you can see in the figure below, you can access the metadata in Word by clicking on File followed by Info. This will take you to a screen that shows much of the same metadata as in the figure above.



As you can see, there is a link near the bottom (Show All Properties). This will provide you with a complete list of all the metadata in the document. If you want to remove the metadata, click Check for Issues and then Inspect Document. This will take you to a new screen, where you have the option to remove all the metadata.
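For those who would rather script this step, the sketch below shows one possible approach in Python. It copies a .docx package and empties its core properties part (docProps/core.xml), which is where the author, title and keywords live. Please treat it as an illustration under assumptions and not as a replacement for the Inspect Document feature: the function name strip_core_properties and the sample data are hypothetical, and the sketch leaves other parts untouched, such as docProps/app.xml (which can hold a company name), comments and tracked changes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CORE_PART = "docProps/core.xml"
CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
ET.register_namespace("cp", CP_NS)  # keep the usual "cp" prefix when re-serialising


def strip_core_properties(source, target):
    """Copy a .docx package, dropping every entry from its core properties part."""
    with zipfile.ZipFile(source) as src, \
            zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name == CORE_PART:
                root = ET.fromstring(data)
                for child in list(root):
                    root.remove(child)  # author, title, keywords, dates, ...
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            dst.writestr(name, data)


# Hypothetical stand-in package, built in memory so the sketch runs on its own.
original = io.BytesIO()
with zipfile.ZipFile(original, "w") as package:
    package.writestr(
        CORE_PART,
        f'<cp:coreProperties xmlns:cp="{CP_NS}" '
        'xmlns:dc="http://purl.org/dc/elements/1.1/">'
        "<dc:creator>Jane Doe</dc:creator></cp:coreProperties>",
    )
    package.writestr("word/document.xml", "<document>body text</document>")

cleaned = io.BytesIO()
strip_core_properties(original, cleaned)

# Read the result back to confirm the name is gone and the body is unchanged.
with zipfile.ZipFile(cleaned) as package:
    cleaned_core = package.read(CORE_PART).decode("utf-8")
    cleaned_body = package.read("word/document.xml").decode("utf-8")
```

With real files you would pass two paths, for example an input .docx and a new output name, so that the original is never overwritten. Always open the result and check it before you release it.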


An embarrassing metadata and redaction slip-up

When former Danish Prime Minister Anders Fogh Rasmussen gave his annual New Year's speech to the public, the speech was shared with the public as a Word file after the ceremony - a standard procedure.





Unfortunately, the document contained metadata that revealed sensitive information. For instance, it revealed that the speech had been written by someone else - someone from another organisation outside the Prime Minister's office. In addition, the metadata showed what corrections the Prime Minister had made to the speech. For instance, it showed that the Prime Minister, in one of the last iterations, chose to delete a sentence that promised more money to municipalities in Denmark.


Though it is hardly a secret that most Prime Ministers don't compose their own speeches anymore, and even though this case is not as bad as the Manaman case, I'm sure that the Office of the Prime Minister wishes that they had redacted the document properly and stripped it of metadata. The Ministry later changed their procedure and now only shares the speeches in PDF format. So does PDF not contain metadata?



The difference between Word and PDF

The story above might falsely suggest that as long as we share documents in PDF format, there is no metadata. But as we learned earlier, that is only the case if you print the document and scan it (thereby creating a new PDF document without the original metadata).


PDF documents do in fact also contain descriptive metadata, such as the author's name, keywords and copyright information, which can contain personal or classified information.


Thus, this metadata should also be removed when you redact a document.


How do you find the metadata in a PDF-document?


In Acrobat, select the “Preview” button to view the hidden text. Then select the “Show Preview” button at the bottom of the dialogue box, and select “Show Hidden Text” from the preview of the document. You can scroll through the pages of your PDF using the double arrow buttons on the gray Acrobat navigation bar.
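If you don't have Acrobat at hand, a rough first check can also be scripted. PDF files can store fields such as Author, Title and Keywords in a document information dictionary, written as entries like /Author (Jane Doe). The Python sketch below naively searches the raw bytes of a file for such entries. It is only a quick illustration: the function name scan_pdf_info and the sample bytes are hypothetical, and the scan will miss metadata that is compressed, encrypted, stored in another string encoding or kept in an XMP packet. It can also pick up a /Title that belongs to a bookmark. An empty result therefore does not prove that a PDF is clean.

```python
import re

# Standard keys of the PDF document information dictionary.
INFO_KEYS = ("Title", "Author", "Subject", "Keywords", "Creator", "Producer")


def scan_pdf_info(raw):
    """Naively look for '/Key (value)' entries in raw PDF bytes.

    Only simple literal strings without nested parentheses or escapes match.
    """
    found = {}
    for key in INFO_KEYS:
        pattern = rb"/" + key.encode("ascii") + rb"\s*\(([^()\\]*)\)"
        match = re.search(pattern, raw)
        if match:
            found[key] = match.group(1).decode("latin-1")
    return found


# Hypothetical fragment of an uncompressed PDF, used here instead of a real file.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n"
    b"<< /Author (Jane Doe) /Title (Case file) /Producer (Example Writer) >>\n"
    b"endobj\n"
)

info = scan_pdf_info(sample)
for key, value in info.items():
    print(f"{key}: {value}")
```

For a real file you would read the bytes first, for example with open(path, "rb"), and pass them to scan_pdf_info. For anything you plan to release, a proper PDF tool remains the safer choice.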


Conclusion

Metadata is hidden information in documents that can be used to directly or indirectly identify a person. Therefore you should be careful to check your metadata and remove it when you redact a document prior to a release to third parties. Metadata can fortunately be removed fairly easily and most redaction software tools automatically remove it for you.


Interested in our product? Sign up for a demo here.


Cheers,

Team Cleardox
