Pseudonymization
Remove all direct identifiers (name, phone number, address, social security number, etc.) and give each person a unique ID that cannot be linked back to them. Possible approaches:
- Lookup table
  - quickly becomes cumbersome to store and keep up to date as new data arrives
- Secret formula
  - easy to store
  - but an attacker can 'fit' the secret formula from only a few known points
  - collisions are an issue
- Cryptographic hash functions
  - easy to store, easy to compute, no collisions (in practice), fixed-length output
  - the hash function is not secret, so an attacker can build a lookup table by hashing every possible ID
- Cryptographic hash functions with salt
  - the salt is a fixed string of arbitrary (but long!) length that is appended to the identifier before hashing it
  - the salt must be kept secret
  - md5("19477") => md5("19477youwillneverguesswhatthissaltis")
  - if the salt is long enough (and secret) it cannot be brute-forced (a minimal sketch follows this list)
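A minimal sketch of salted hashing in Python; the salt string and the choice of sha256 are illustrative (the notes' md5 example works the same way):

```python
import hashlib

# Illustrative secret salt -- in practice a long, randomly generated string,
# stored separately from the data and never published.
SALT = "youwillneverguesswhatthissaltis"

def pseudonymize(identifier: str, salt: str = SALT) -> str:
    """Return a fixed-length pseudonym: hash of the identifier with the secret salt appended."""
    return hashlib.sha256((identifier + salt).encode("utf-8")).hexdigest()

# Without the salt, an attacker could rebuild the mapping by hashing every
# possible ID; with a long, secret salt this brute force becomes infeasible.
print(pseudonymize("19477"))
```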
However, even a properly pseudonymized table is not safe.
| ID | Gender | DOB | Zip-code | Sensitive Data |
|---|---|---|---|---|
| f1f333... | Male | 28-09-1955 | 4444 | **** |
| 93db7... | Female | 12-03-1959 | 4334 | **** |
What if I were to know that the person I’m searching for is a man born on 28-09-1955 and his zip-code is 4444? 63% of the US population is unique given a date of birth, zip code, and sex (Golle, 2006).
Terminology
- Sensitive information: A piece of information about an individual (e.g. disease, drug use) we’re trying to protect (but is relevant for the application).
- Identifier: A piece of information that directly identifies a person (name, address, phone number, IP address, passport number, etc.).
- Quasi-identifier: A piece of information that does not directly identify a person (e.g. nationality, date of birth). But multiple quasi-identifiers taken together could uniquely identify a person. A set of quasi-identifiers could be known to an attacker for a certain individual (auxiliary info).
- Auxiliary information: Information known to an attacker.
Uniqueness Attack
Definition
Uniqueness w.r.t. \(A\): fraction of the dataset that is uniquely identified by the set \(A\) of quasi-identifiers.
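As a minimal sketch, assuming records are dictionaries with illustrative field names, uniqueness w.r.t. a set of quasi-identifiers can be computed as:

```python
from collections import Counter

def uniqueness(records, quasi_identifiers):
    """Fraction of records whose combination of quasi-identifier values is unique."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    unique = sum(1 for r in records
                 if combos[tuple(r[q] for q in quasi_identifiers)] == 1)
    return unique / len(records)

data = [
    {"gender": "Male",   "dob": "28-09-1955", "zip": "4444"},
    {"gender": "Female", "dob": "12-03-1959", "zip": "4334"},
    {"gender": "Female", "dob": "12-03-1959", "zip": "4334"},
]
print(uniqueness(data, ["gender", "dob", "zip"]))  # 1/3: only the first row is unique
```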
\(k\)-anonymity
A table is \(k\)-anonymous if every record in the table is indistinguishable from at least \(k-1\) other records with respect to every set of quasi-identifiers. This means that even if an attacker knows all possible quasi-identifiers, they cannot uniquely identify their target.
An equivalence class is a set of records that have the same values for all the quasi-identifiers.
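A minimal sketch of both notions, on the same assumed dictionary-style records as above:

```python
from collections import defaultdict

def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same values on all quasi-identifiers."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].append(r)
    return classes

def is_k_anonymous(records, quasi_identifiers, k):
    """A table is k-anonymous iff every equivalence class has at least k records."""
    return all(len(c) >= k
               for c in equivalence_classes(records, quasi_identifiers).values())
```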
Loss of information
\[H(D)=-\sum_{i=1}^{k}\frac{|C_i|}{N}\log\frac{|C_i|}{N}\]
where \(N\) is the number of rows in the dataset \(D\); \(C_1,\cdots,C_k\) are the equivalence classes; \(|C_i|\) denotes the number of rows that belong to \(C_i\). The higher the entropy, the more information is contained in \(D\).
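A minimal sketch of this entropy (same assumed record format; the class sizes play the role of \(|C_i|\)):

```python
import math
from collections import Counter

def loss_of_information_entropy(records, quasi_identifiers):
    """Entropy of the partition into equivalence classes; higher means more information kept."""
    n = len(records)
    class_sizes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records).values()
    return -sum((s / n) * math.log(s / n) for s in class_sizes)
```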
Homogeneity Attack
Definition
A homogeneity attack can take place when individuals in the same equivalence class all have the same sensitive attribute value.
| ID | Gender | DOB | Zip-code | Sensitive Data |
|---|---|---|---|---|
| f1f333... | Male | 1955 | Las Vegas | gastritis |
| 34dera... | Male | 1955 | Las Vegas | gastritis |
What if I know that the person I’m searching for is a man, born in 1955, living in Las Vegas?
\(l\)-diversity
An equivalence class is \(l\)-diverse if it contains at least \(l\) distinct values for the sensitive attributes. A table is \(l\)-diverse if every equivalence class is \(l\)-diverse.
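A minimal sketch of an \(l\)-diversity check, again on dictionary-style records with an illustrative sensitive-attribute field:

```python
from collections import defaultdict

def is_l_diverse(records, quasi_identifiers, sensitive_attribute, l):
    """True if every equivalence class contains at least l distinct sensitive values."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].append(r)
    return all(len({r[sensitive_attribute] for r in c}) >= l
               for c in classes.values())
```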
Not enough yet...
Definition
A semantic attack can take place when sensitive attributes of individuals in an equivalence class are distinct but semantically similar. For example, skin cancer and breast cancer are both cancer.
A skewness attack (a probabilistic attack) takes place when the distribution of the sensitive attribute in a class is skewed relative to the general population. In the general population, 99% might test negative for illegal drugs but, in an equivalence class, only 15% test negative: I learned something about the people in this class.
\(t\)-closeness
An equivalence class is said to have \(t\)-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of this attribute in the whole table is no more than a threshold \(t\). A table is said to have \(t\)-closeness if all equivalence classes have \(t\)-closeness.
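The definition leaves the distance measure open (the original \(t\)-closeness paper uses the Earth Mover's Distance); the sketch below uses total variation distance between categorical distributions as a simple stand-in, on the same assumed record format:

```python
from collections import Counter, defaultdict

def distribution(values):
    """Empirical distribution of a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def has_t_closeness(records, quasi_identifiers, sensitive_attribute, t):
    """True if every class's sensitive-value distribution is within distance t of the table's."""
    global_dist = distribution([r[sensitive_attribute] for r in records])
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].append(r)
    for c in classes.values():
        class_dist = distribution([r[sensitive_attribute] for r in c])
        # total variation distance as a simple stand-in for Earth Mover's Distance
        dist = 0.5 * sum(abs(class_dist.get(v, 0.0) - global_dist.get(v, 0.0))
                         for v in set(global_dist) | set(class_dist))
        if dist > t:
            return False
    return True
```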
Final reminder
- Anonymization is hard for small data and probably impossible for big data
  - protect a dataset against a whole range of attacks: uniqueness, homogeneity, semantic, skewness, matching (unicity), profiling
  - by anonymizing it once and only once
  - all the while preserving utility (for all current and future uses)