Pseudonymization
Remove all direct identifiers (e.g. name, phone number, address, social security number) and instead give each person a unique ID that cannot be linked back to them. Possible approaches:
- Lookup table
  - quickly becomes cumbersome to store and keep up to date as new data arrives
- Secret formula
  - easy to store
  - but the secret formula is easy to 'fit' from only a few known points
  - collisions are an issue
- Cryptographic hash functions
  - easy to store, easy to compute, no collisions in practice, fixed-length output
  - the hash function is not secret, so an attacker can build a lookup table by iterating over all possible IDs
- Cryptographic hash functions with a salt
  - the salt is a fixed string of arbitrary (but long!) length that is appended to the identifier before hashing
  - the salt must be kept secret
  - md5("19477") => md5("19477youwillneverguesswhatthissaltis")
  - if the salt is long enough (and stays secret) it cannot be brute-forced; see the sketch after this list
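A minimal Python sketch of the salted-hash idea. The salt value, function name, and use of MD5 simply mirror the toy example above and are not a prescribed implementation; in practice a keyed construction such as HMAC-SHA256 would be preferable.
```python
import hashlib

# Illustrative only: the salt must be long, random, and kept secret.
SECRET_SALT = "youwillneverguesswhatthissaltis"

def pseudonymize(identifier: str, salt: str = SECRET_SALT) -> str:
    """Return a fixed-length pseudonym that cannot be inverted without the salt."""
    # MD5 mirrors the example above; a keyed construction such as HMAC-SHA256
    # is the safer choice in practice.
    return hashlib.md5((identifier + salt).encode("utf-8")).hexdigest()

print(pseudonymize("19477"))  # the same ID always maps to the same pseudonym
```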
 
 
However, even a properly pseudonymized table is not safe.
| ID | Gender | DOB | Zip-code | Sensitive Data | 
|---|---|---|---|---|
| f1f333... | Male | 28-09-1955 | 4444 | **** | 
| 93db7... | Female | 12-03-1959 | 4334 | **** | 
What if I know that the person I'm searching for is a man born on 28-09-1955 whose zip code is 4444? 63% of the US population is uniquely identified by date of birth, zip code, and sex (Golle, 2006).
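A small sketch of this linkage step, with invented records and field names, showing how auxiliary information can single out one row even though the ID column is pseudonymized:
```python
# Invented pseudonymized records (quasi-identifiers left in the clear).
table = [
    {"id": "f1f333...", "gender": "Male",   "dob": "28-09-1955", "zip": "4444", "sensitive": "****"},
    {"id": "93db7...",  "gender": "Female", "dob": "12-03-1959", "zip": "4334", "sensitive": "****"},
]

# Auxiliary information the attacker knows about the target.
aux = {"gender": "Male", "dob": "28-09-1955", "zip": "4444"}

matches = [row for row in table if all(row[k] == v for k, v in aux.items())]
if len(matches) == 1:
    print("Unique match, target re-identified:", matches[0]["id"])
```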
Terminology
- Sensitive information: a piece of information about an individual (e.g. disease, drug use) that we are trying to protect but that is relevant for the application.
- Identifier: a piece of information that directly identifies a person (name, address, phone number, IP address, passport number, etc.).
- Quasi-identifier: a piece of information that does not directly identify a person (e.g. nationality, date of birth), but multiple quasi-identifiers taken together can uniquely identify a person. A set of quasi-identifiers may be known to an attacker for a certain individual (auxiliary information).
- Auxiliary information: information known to an attacker.
 
Uniqueness Attack
Definition
Uniqueness w.r.t. \(A\): fraction of the dataset that is uniquely identified by the set \(A\) of quasi-identifiers.
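A small Python sketch of this measure; the field names and records are invented for illustration.
```python
from collections import Counter

def uniqueness(records, quasi_identifiers):
    """Fraction of records whose combination of quasi-identifier values is unique."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    unique = sum(1 for r in records
                 if combos[tuple(r[q] for q in quasi_identifiers)] == 1)
    return unique / len(records)

data = [
    {"gender": "M", "dob": "28-09-1955", "zip": "4444"},
    {"gender": "F", "dob": "12-03-1959", "zip": "4334"},
    {"gender": "F", "dob": "12-03-1959", "zip": "4334"},
]
print(uniqueness(data, ["gender", "dob", "zip"]))  # ~0.33: only the first record is unique
```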
\(k\)-anonymity
A table is \(k\)-anonymous if every record in the table is indistinguishable from at least \(k-1\) other records with respect to the quasi-identifiers. This means that even if an attacker knows all the quasi-identifiers of her target, she cannot single the target out.
An equivalence class is a set of records that have the same values for all the quasi-identifiers.
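A sketch of both notions, assuming records are plain dicts (field names and values below are invented):
```python
from collections import defaultdict

def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same values for all quasi-identifiers."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].append(r)
    return list(classes.values())

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    return all(len(c) >= k for c in equivalence_classes(records, quasi_identifiers))

data = [
    {"gender": "M", "yob": "1955", "city": "Las Vegas", "disease": "gastritis"},
    {"gender": "M", "yob": "1955", "city": "Las Vegas", "disease": "gastritis"},
    {"gender": "F", "yob": "1959", "city": "Reno",      "disease": "flu"},
]
print(is_k_anonymous(data, ["gender", "yob", "city"], k=2))  # False: the Reno class has size 1
```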
Loss of information
\[H(D) = -\sum_{i=1}^{k}\frac{|C_i|}{N}\log\frac{|C_i|}{N}\]
where \(N\) is the number of rows in the dataset \(D\); \(C_1, \dots, C_k\) are the equivalence classes; and \(|C_i|\) is the number of rows that belong to \(C_i\). The higher the entropy, the more information is contained in \(D\).
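A one-function sketch of this entropy, taking only the class sizes as input:
```python
import math

def partition_entropy(class_sizes):
    """Entropy of a partition of N rows into equivalence classes of the given sizes."""
    n = sum(class_sizes)
    return -sum((c / n) * math.log(c / n) for c in class_sizes)

print(partition_entropy([4]))        # 0.0: one big class, all detail lost
print(partition_entropy([1, 1, 2]))  # ~1.04: a finer partition keeps more information
```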
Homogeneity Attack
Definition
A homogeneity attack can take place when individuals in the same equivalence class all have the same sensitive attribute value.
| ID | Gender | DOB | Zip-code | Sensitive Data | 
|---|---|---|---|---|
| f1f333... | Male | 1955 | Las Vegas | gastritis | 
| 34dera... | Male | 1955 | Las Vegas | gastritis | 
What if I know that the person I’m searching for is a man, born in 1955, living in Las Vegas?
\(l\)-diversity
An equivalence class is \(l\)-diverse if it contains at least \(l\) distinct values for the sensitive attribute. A table is \(l\)-diverse if every equivalence class is \(l\)-diverse.
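A sketch of an \(l\)-diversity check along the same lines as the \(k\)-anonymity sketch above (field names and values are invented):
```python
def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every equivalence class has at least l distinct sensitive values."""
    classes = {}
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        classes.setdefault(key, set()).add(r[sensitive])
    return all(len(values) >= l for values in classes.values())

data = [
    {"gender": "Male", "yob": "1955", "city": "Las Vegas", "disease": "gastritis"},
    {"gender": "Male", "yob": "1955", "city": "Las Vegas", "disease": "gastritis"},
]
print(is_l_diverse(data, ["gender", "yob", "city"], "disease", l=2))  # False: only one value
```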
Not enough yet...
Definition
A semantic attack can take place when sensitive attributes of individuals in an equivalence class are distinct but semantically similar. For example, skin cancer and breast cancer are both cancer.
A skewness attack (a probabilistic attack) takes place when the distribution of the sensitive attribute within an equivalence class is skewed with respect to the general population. In the general population, 99% might test negative for illegal drugs but, in a given equivalence class, only 15% test negative: I have learned something about people in this class.
\(t\)-closeness
An equivalence class is said to have \(t\)-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of this attribute in the whole table is no more than a threshold \(t\). A table is said to have \(t\)-closeness if all equivalence classes have \(t\)-closeness.
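A sketch of a \(t\)-closeness check. For simplicity it uses the total variation distance between distributions, whereas the original formulation of \(t\)-closeness measures distance with the Earth Mover's Distance; field names follow the earlier sketches.
```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a collection of sensitive values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def has_t_closeness(records, quasi_identifiers, sensitive, t):
    """True if every class's sensitive-value distribution stays within t of the table's."""
    table_dist = distribution([r[sensitive] for r in records])
    classes = {}
    for r in records:
        classes.setdefault(tuple(r[q] for q in quasi_identifiers), []).append(r[sensitive])
    for values in classes.values():
        class_dist = distribution(values)
        # Total variation distance; class values are a subset of the table's values.
        dist = 0.5 * sum(abs(class_dist.get(v, 0.0) - p) for v, p in table_dist.items())
        if dist > t:
            return False
    return True
```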
Final reminder
- Anonymization is hard for small data and probably impossible for big data:
  - protect a dataset against a whole range of attacks: uniqueness, homogeneity, semantic, skewness, matching (unicity), profiling
  - by anonymizing it once and only once
  - all the while preserving utility (for all current and future uses)