• March 13, 2018
  • News


In data security, there is a clear distinction between two terms: pseudonymization and anonymization. The two techniques differ in one major respect. Pseudonymization replaces the data subject's identity with a substitute, so that additional information is required to re-identify the data subject. Anonymization, as the name implies, destroys any means of identifying the data subject. It is important to understand the distinction, because the two kinds of data are classified very differently under the GDPR.

Let’s take an example for better understanding, and think about it from the perspective of pencil production. Say we have 20 pencils from unnamed makers, and we have no way of telling whether all 20 were produced by the same company or by 15, 16, 17, or even 20 different manufacturers. In that case, the pencil producers are anonymous. Now say we have 20 pencils made by Richmond Woods Stationery (a pencil production company). We know that all 20 pencils were produced by the same company, and we also know that Richmond Woods Stationery is really Royce Brooks. Therefore, Royce produced under a pseudonym.

The table below helps us examine tokenization in practice. As the table shows, each distinct name is given a token, and re-identifying the data requires access to additional information.

Name      Anonymized    Token/Pseudonym
Justin    XXXXXX        Espoins
Dave      XXXXXX        Jums
James     XXXXXX        Poqqa
Avery     XXXXXX        Zwpvs

In the table above, with the pseudonymized data, we are assumed not to know the data subjects’ identities, but we can still correlate entries with specific subjects (records 1 and 7 reference the same person, records 2 and 5 reference the same person, and records 3 and 4 reference the same person). We can get back to the real identity if we have access to the token lookup tables used to re-identify the data. With the anonymized data, however, we only know that there are 7 records, and there is no method to re-identify the data.

Pseudonymization is a method of de-identifying data by substituting a reversible and consistent value. Anonymization is the destruction of the identifiable data.
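The difference between the two columns of the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TokenVault` class, its method names, and the random hex tokens are hypothetical, and it uses a simple in-memory lookup table rather than any vendor's vaultless scheme.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory token lookup table (not a real product API)."""

    def __init__(self):
        self._token_by_name = {}
        self._name_by_token = {}

    def pseudonymize(self, name):
        # Consistent: the same name always receives the same token.
        if name not in self._token_by_name:
            token = secrets.token_hex(4)
            while token in self._name_by_token:  # retry on the rare collision
                token = secrets.token_hex(4)
            self._token_by_name[name] = token
            self._name_by_token[token] = name
        return self._token_by_name[name]

    def reidentify(self, token):
        # Reversible, but only for whoever holds the lookup table.
        return self._name_by_token[token]


def anonymize(name):
    # Irreversible: every name collapses to the same mask.
    return "XXXXXX"


vault = TokenVault()
names = ["Justin", "Dave", "James", "Justin"]
tokens = [vault.pseudonymize(n) for n in names]
masked = [anonymize(n) for n in names]

print(tokens[0] == tokens[3])       # the two "Justin" records can be correlated
print(vault.reidentify(tokens[1]))  # the lookup table recovers "Dave"
print(masked)                       # nothing left to correlate or recover
```

With the tokens, records that belong to the same person still line up, and the lookup table leads back to the name. With the mask, neither is possible.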

We should be concerned about “indirect re-identification” with anonymization. Back to the pencil example: our anonymous pencils may be indirectly identified by analyzing each company’s style of production. We may not be able to put a name to the producers, but we would be able to tell that specific pencils were produced by the same company, because of their distinctive style. We could go further and name a producer only if that producer has also made a pencil under its own name, so that we can compare its style and design against the anonymous pencils.

For instance, assume an organization retains a customer’s purchase history but anonymizes the easily identifiable fields, such as name and address. It may still be possible to identify a record indirectly, since humans are creatures of habit.

Every morning, Monday to Friday, Jack goes to the same coffee shop and buys the same coffee and waffles for breakfast, paying with his debit card. On Thursday night, he always withdraws $100 from the ATM close to his office, because Friday night is party night with his buddies.

Jack’s behavior allows us to indirectly re-identify him (all of these transactions reference the same person, because we can recognize his predictable behavior), even though the organization has “anonymized” Jack’s personally identifiable data (destroyed his name, address, etc.). The data set has therefore not been properly anonymized, and we may have to use additional methods that hide individual behavior to anonymize it effectively. For example, we might store only records based on some kind of grouping:

“40 people went to this coffee shop every morning.”
“100 people got money from this ATM every Sunday.”
“A total of $200,000 was taken from this ATM on Thursday.”
“40 people bought waffles today.”

Now the data has been anonymized, because we have no way of seeing Jack’s predictable pattern of behavior. NXT-Security Enterprise Vaultless Tokenization is an excellent way to accomplish both pseudonymization and anonymization of data. Full anonymization, however, should be undertaken by expert statisticians, data scientists, etc., and tailored to the individual organization that retains such data.
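The grouping idea described above can be sketched in Python. Everything here is an assumption made for illustration: the record layout, the `customer_ref` field, the sample amounts, and the `min_group_size` threshold (which drops groups too small to hide an individual) do not come from any real system, and a real anonymization effort would need proper statistical review.

```python
from collections import defaultdict

# Hypothetical transactions with name and address already removed.
# customer_ref is a made-up internal key, used only to count distinct people.
transactions = [
    {"customer_ref": "c1", "place": "coffee shop", "day": "Mon", "amount": 6},
    {"customer_ref": "c2", "place": "coffee shop", "day": "Mon", "amount": 5},
    {"customer_ref": "c3", "place": "coffee shop", "day": "Mon", "amount": 7},
    {"customer_ref": "c1", "place": "ATM", "day": "Thu", "amount": 100},
    {"customer_ref": "c2", "place": "ATM", "day": "Thu", "amount": 60},
    {"customer_ref": "c3", "place": "ATM", "day": "Fri", "amount": 40},
]


def aggregate(records, min_group_size=2):
    """Keep only per-group counts and totals, never per-person rows."""
    people = defaultdict(set)
    totals = defaultdict(int)
    for record in records:
        key = (record["place"], record["day"])
        people[key].add(record["customer_ref"])
        totals[key] += record["amount"]
    # Assumed safeguard: a group of one person is still an individual,
    # so groups below the threshold are suppressed entirely.
    return {
        key: {"people": len(people[key]), "total": totals[key]}
        for key in people
        if len(people[key]) >= min_group_size
    }


summary = aggregate(transactions)
for (place, day), stats in summary.items():
    print(f"{stats['people']} people used the {place} on {day}, "
          f"total ${stats['total']}")
```

Only the summary would be stored. The per-person rows, which are what expose a routine like Jack's, are discarded, and the lone Friday ATM withdrawal is suppressed because it would describe exactly one person.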