Pseudonymization
Remove all direct identifiers (name, phone number, address, social security number, etc.) and give each person a unique ID that cannot be linked back to them. Possible approaches:
- Lookup table
  - quickly becomes cumbersome to store and keep up to date as new data arrives
- Secret formula
  - easy to store
  - but an attacker can 'fit' the secret formula from only a few known points
  - collisions are an issue
- Cryptographic hash functions
  - easy to store, easy to compute, no collisions (in practice), fixed-length output
  - the hash function is not secret, so an attacker can build a lookup table by hashing every possible ID
- Cryptographic hash functions with salt
  - the salt is a fixed string of arbitrary (but long!) length that is appended to the identifier before hashing it
  - the salt must be kept secret
  - md5("19477") => md5("19477youwillneverguesswhatthissaltis")
  - if the salt is long enough (and secret) it cannot be brute-forced (a minimal sketch follows this list)
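A minimal sketch of salted hashing in Python; the salt string and the choice of sha256 are illustrative (the notes' md5 example works the same way):

```python
import hashlib

# Illustrative secret salt -- in practice a long, randomly generated string,
# stored separately from the data and never published.
SALT = "youwillneverguesswhatthissaltis"

def pseudonymize(identifier: str, salt: str = SALT) -> str:
    """Return a fixed-length pseudonym: hash of the identifier with the secret salt appended."""
    return hashlib.sha256((identifier + salt).encode("utf-8")).hexdigest()

# Without the salt, an attacker could rebuild the mapping by hashing every
# possible ID; with a long, secret salt this brute force becomes infeasible.
print(pseudonymize("19477"))
```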
However, even a properly pseudonymized table is not safe.
| ID | Gender | DOB | Zip-code | Sensitive Data |
|---|---|---|---|---|
| f1f333... | Male | 28-09-1955 | 4444 | **** |
| 93db7... | Female | 12-03-1959 | 4334 | **** |
What if I were to know that the person I’m searching for is a man born on 28-09-1955 and his zip-code is 4444? 63% of the US population is unique given a date of birth, zip code, and sex (Golle, 2006).
Terminology
- Sensitive information: A piece of information about an individual (e.g. disease, drug use) we’re trying to protect (but is relevant for the application).
- Identifier: A piece of information that directly identifies a person (name, address, phone number, IP address, passport number, etc.).
- Quasi-identifier: A piece of information that does not directly identify a person (e.g. nationality, date of birth). But multiple quasi-identifiers taken together could uniquely identify a person. A set of quasi-identifiers could be known to an attacker for a certain individual (auxiliary info).
- Auxiliary information: Information known to an attacker.
Uniqueness Attack
Definition
Uniqueness w.r.t. \(A\): fraction of the dataset that is uniquely identified by the set \(A\) of quasi-identifiers.
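As a minimal sketch, assuming records are dictionaries with illustrative field names, uniqueness w.r.t. a set of quasi-identifiers can be computed as:

```python
from collections import Counter

def uniqueness(records, quasi_identifiers):
    """Fraction of records whose combination of quasi-identifier values is unique."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    unique = sum(1 for r in records
                 if combos[tuple(r[q] for q in quasi_identifiers)] == 1)
    return unique / len(records)

data = [
    {"gender": "Male",   "dob": "28-09-1955", "zip": "4444"},
    {"gender": "Female", "dob": "12-03-1959", "zip": "4334"},
    {"gender": "Female", "dob": "12-03-1959", "zip": "4334"},
]
print(uniqueness(data, ["gender", "dob", "zip"]))  # 1/3: only the first row is unique
```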
\(k\)-anonymity
A table is \(k\)-anonymous if every record in the table is indistinguishable from at least \(k-1\) other records with respect to every set of quasi-identifiers. This means that even if an attacker knows all possible quasi-identifiers, they cannot uniquely identify their target.
An equivalence class is a set of records that have the same values for all the quasi-identifiers.
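A minimal sketch of both notions, on the same assumed dictionary-style records as above:

```python
from collections import defaultdict

def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same values on all quasi-identifiers."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].append(r)
    return classes

def is_k_anonymous(records, quasi_identifiers, k):
    """A table is k-anonymous iff every equivalence class has at least k records."""
    return all(len(c) >= k
               for c in equivalence_classes(records, quasi_identifiers).values())
```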
Loss of information
\[H(D)=-\sum_{i=1}^{k}\frac{|C_i|}{N}\log\frac{|C_i|}{N}\]
where \(N\) is the number of rows in the dataset \(D\); \(C_1,\cdots,C_k\) are the equivalence classes; \(|C_i|\) denotes the number of rows that belong to \(C_i\). The higher the entropy, the more information is contained in \(D\).
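A minimal sketch of this entropy (same assumed record format; the class sizes play the role of \(|C_i|\)):

```python
import math
from collections import Counter

def loss_of_information_entropy(records, quasi_identifiers):
    """Entropy of the partition into equivalence classes; higher means more information kept."""
    n = len(records)
    class_sizes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records).values()
    return -sum((s / n) * math.log(s / n) for s in class_sizes)
```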
Homogeneity Attack
Definition
A homogeneity attack can take place when individuals in the same equivalence class all have the same sensitive attribute value.
| ID | Gender | DOB | Zip-code | Sensitive Data |
|---|---|---|---|---|
| f1f333... | Male | 1955 | Las Vegas | gastritis |
| 34dera... | Male | 1955 | Las Vegas | gastritis |
What if I know that the person I’m searching for is a man, born in 1955, living in Las Vegas?
\(l\)-diversity
An equivalence class is \(l\)-diverse if it contains at least \(l\) distinct values for the sensitive attributes. A table is \(l\)-diverse if every equivalence class is \(l\)-diverse.
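A minimal sketch of an \(l\)-diversity check, again on dictionary-style records with an illustrative sensitive-attribute field:

```python
from collections import defaultdict

def is_l_diverse(records, quasi_identifiers, sensitive_attribute, l):
    """True if every equivalence class contains at least l distinct sensitive values."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].append(r)
    return all(len({r[sensitive_attribute] for r in c}) >= l
               for c in classes.values())
```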
Not enough yet...
Definition
A semantic attack can take place when sensitive attributes of individuals in an equivalence class are distinct but semantically similar. For example, skin cancer and breast cancer are both cancer.
A skewness attack (a probabilistic attack) takes place when the distribution of the sensitive attribute in a class is skewed relative to the general population. In the general population, 99% might test negative for illegal drugs but, in an equivalence class, only 15% test negative: I learned something about the people in this class.
\(t\)-closeness
An equivalence class is said to have \(t\)-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of this attribute in the whole table is no more than a threshold \(t\). A table is said to have \(t\)-closeness if all equivalence classes have \(t\)-closeness.
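The definition leaves the distance measure open (the original \(t\)-closeness paper uses the Earth Mover's Distance); the sketch below uses total variation distance between categorical distributions as a simple stand-in, on the same assumed record format:

```python
from collections import Counter, defaultdict

def distribution(values):
    """Empirical distribution of a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def has_t_closeness(records, quasi_identifiers, sensitive_attribute, t):
    """True if every class's sensitive-value distribution is within distance t of the table's."""
    global_dist = distribution([r[sensitive_attribute] for r in records])
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].append(r)
    for c in classes.values():
        class_dist = distribution([r[sensitive_attribute] for r in c])
        # total variation distance as a simple stand-in for Earth Mover's Distance
        dist = 0.5 * sum(abs(class_dist.get(v, 0.0) - global_dist.get(v, 0.0))
                         for v in set(global_dist) | set(class_dist))
        if dist > t:
            return False
    return True
```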
Final reminder
- Anonymization is hard for small data and probably impossible for big data
  - protect a dataset against a whole range of attacks: uniqueness, homogeneity, semantic, skewness, matching (unicity), profiling
  - by anonymizing it once and only once
  - all the while preserving utility (for all current and future uses)