
Pseudonymization

The goal is to remove all direct identifiers (name, phone number, address, social security number, etc.) and give each person a unique ID that cannot be linked back to them. Several approaches exist:

  • Lookup table
    • quickly becomes cumbersome to store and keep up to date as new data arrives
  • Secret formula
    • easy to store
    • but easy to 'fit' (reverse-engineer) the secret formula from only a few known input-output pairs
    • collisions are an issue
  • Cryptographic hash functions
    • easy to store, easy to compute, collisions are negligible, fixed-length output
    • the hash function itself is not secret, so an attacker can build a lookup table by iterating over all possible IDs
  • Cryptographic hash functions with salt
    • a salt is a fixed string of arbitrary (but long!) length that is appended to the identifier before hashing
    • the salt must be kept secret
    • md5("19477") => md5("19477youwillneverguesswhatthissaltis")
    • if the salt is long enough (and secret) it cannot be brute-forced; see the sketch after this list
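
A minimal Python sketch of both points: the lookup-table attack on unsalted hashes, and how a secret salt defeats it. md5 is kept only to match the example above (in practice a modern hash such as SHA-256, or a keyed HMAC, would be preferable), and the 5-digit ID space is an assumption for illustration.

```python
import hashlib

# Secret salt; must be long and never published. The value here is the
# (obviously non-secret) example string from above.
SALT = "youwillneverguesswhatthissaltis"

def pseudonymize(identifier: str, salt: str = SALT) -> str:
    """Hash the identifier together with the secret salt."""
    return hashlib.md5((identifier + salt).encode()).hexdigest()

# Attack on unsalted hashes: precompute a lookup table over all possible IDs
# (assumed here to be 5-digit strings).
rainbow = {hashlib.md5(str(i).encode()).hexdigest(): str(i) for i in range(100_000)}
unsalted = hashlib.md5(b"19477").hexdigest()
print(rainbow[unsalted])                 # -> '19477': the unsalted hash is trivially reversed

# With a long secret salt, the same table is useless:
print(pseudonymize("19477") in rainbow)  # -> False
```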

However, even a properly pseudonymized table is not safe.

ID          Gender   DOB          Zip-code   Sensitive Data
f1f333...   Male     28-09-1955   4444       ****
93db7...    Female   12-03-1959   4334       ****

What if I know that the person I’m searching for is a man born on 28-09-1955 whose zip code is 4444? 63% of the US population is uniquely identified by date of birth, zip code, and sex (Golle, 2006).

Terminology

  • Sensitive information: A piece of information about an individual (e.g. disease, drug use) we’re trying to protect (but is relevant for the application).
  • Identifier: A piece of information that directly identifies a person (name, address, phone number, IP address, passport number, etc.).
  • Quasi-identifier: A piece of information that does not directly identify a person (e.g. nationality, date of birth). But multiple quasi-identifiers taken together could uniquely identify a person. A set of quasi-identifiers could be known to an attacker for a certain individual (auxiliary info).
  • Auxiliary information: Information known to an attacker.

Uniqueness Attack

Definition

Uniqueness w.r.t. \(A\): fraction of the dataset that is uniquely identified by the set \(A\) of quasi-identifiers.
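
A small Python sketch of this measure (the column names are hypothetical):

```python
from collections import Counter

def uniqueness(rows: list[dict], quasi_identifiers: list[str]) -> float:
    """Fraction of rows uniquely identified by the given quasi-identifiers."""
    def key(row):
        return tuple(row[a] for a in quasi_identifiers)
    counts = Counter(key(row) for row in rows)
    return sum(1 for row in rows if counts[key(row)] == 1) / len(rows)

# Toy data mirroring the pseudonymized table above:
rows = [
    {"gender": "Male",   "dob": "28-09-1955", "zip": "4444"},
    {"gender": "Female", "dob": "12-03-1959", "zip": "4334"},
    {"gender": "Female", "dob": "12-03-1959", "zip": "4334"},
]
print(uniqueness(rows, ["gender", "dob", "zip"]))  # -> 0.33...: one of three rows is unique
```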

\(k\)-anonymity

A table is \(k\)-anonymous if every record in the table is indistinguishable from at least \(k-1\) other records with respect to every set of quasi-identifiers. This means that even if an attacker knows all the quasi-identifiers of her target, she cannot uniquely identify that target.

An equivalence class is a set of records that have the same values for all the quasi-identifiers.
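
A sketch of a \(k\)-anonymity check under the same assumptions: group rows by their quasi-identifier values and verify that every equivalence class has at least \(k\) members.

```python
from collections import Counter

def is_k_anonymous(rows: list[dict], quasi_identifiers: list[str], k: int) -> bool:
    """True if every equivalence class over the quasi-identifiers has >= k rows."""
    classes = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return min(classes.values()) >= k

# Generalizing the DOB to a birth year and the zip code to a prefix
# yields equivalence classes of size 2:
rows = [
    {"gender": "Male",   "dob": "1955", "zip": "44**"},
    {"gender": "Male",   "dob": "1955", "zip": "44**"},
    {"gender": "Female", "dob": "1959", "zip": "43**"},
    {"gender": "Female", "dob": "1959", "zip": "43**"},
]
print(is_k_anonymous(rows, ["gender", "dob", "zip"], k=2))  # -> True
```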

Loss of information

\[H(D)=-\sum_{i=1}^{k}\frac{|C_i|}{N}\log\frac{|C_i|}{N}\]

where \(N\) is the number of rows in the dataset \(D\); \(C_1,\cdots,C_k\) are the equivalence classes; \(|C_i|\) denotes the number of rows that belong to \(C_i\). The higher the entropy, the more information is contained in \(D\).
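
A direct transcription of the formula, assuming the equivalence class sizes have already been computed:

```python
import math

def entropy(class_sizes: list[int]) -> float:
    """H(D) for a partition of the dataset into equivalence classes."""
    n = sum(class_sizes)
    return -sum(c / n * math.log(c / n) for c in class_sizes)

# Coarser generalization -> fewer, larger classes -> lower entropy:
print(entropy([1] * 8))  # ~2.08 (= log 8): every row is its own class
print(entropy([4, 4]))   # ~0.69 (= log 2)
print(entropy([8]))      # 0: one class, all quasi-identifier information lost
```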

Homogeneity Attack

Definition

A homogeneity attack can take place when individuals in the same equivalence class all have the same sensitive attribute value.

ID          Gender   DOB    City        Sensitive Data
f1f333...   Male     1955   Las Vegas   gastritis
34dera...   Male     1955   Las Vegas   gastritis

What if I know that the person I’m searching for is a man, born in 1955, living in Las Vegas?

\(l\)-diversity

An equivalence class is \(l\)-diverse if it contains at least \(l\) distinct values for the sensitive attributes. A table is \(l\)-diverse if every equivalence class is \(l\)-diverse.
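
A sketch of an \(l\)-diversity check, reusing the hypothetical column names from above:

```python
from collections import defaultdict

def is_l_diverse(rows: list[dict], quasi_identifiers: list[str],
                 sensitive: str, l: int) -> bool:
    """True if every equivalence class has >= l distinct sensitive values."""
    classes = defaultdict(set)
    for row in rows:
        classes[tuple(row[a] for a in quasi_identifiers)].add(row[sensitive])
    return all(len(values) >= l for values in classes.values())

# The Las Vegas class above is 2-anonymous but not 2-diverse:
rows = [
    {"gender": "Male", "dob": "1955", "city": "Las Vegas", "disease": "gastritis"},
    {"gender": "Male", "dob": "1955", "city": "Las Vegas", "disease": "gastritis"},
]
print(is_l_diverse(rows, ["gender", "dob", "city"], "disease", l=2))  # -> False
```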

Not enough yet...

Definition

A semantic attack can take place when sensitive attributes of individuals in an equivalence class are distinct but semantically similar. For example, skin cancer and breast cancer are both cancer.

A skewness attack (a probabilistic attack: the attacker gains probabilistic knowledge rather than certainty) takes place when the distribution of the sensitive attribute in an equivalence class is skewed relative to the general population. For example, 99% of the general population might test negative for illegal drugs but, in a given equivalence class, only 15% test negative: I have learned something about the people in this class.

\(t\)-closeness

An equivalence class is said to have \(t\)-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of this attribute in the whole table is no more than a threshold \(t\). A table is said to have \(t\)-closeness if all equivalence classes have \(t\)-closeness.
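
In the standard formulation the distance is measured with the Earth Mover's Distance; the sketch below substitutes the simpler total variation distance so the check stays self-contained (that substitution, and the column names, are assumptions for illustration).

```python
from collections import Counter, defaultdict

def tv_distance(p: Counter, n_p: int, q: Counter, n_q: int) -> float:
    """Total variation distance between two empirical distributions."""
    return 0.5 * sum(abs(p[v] / n_p - q[v] / n_q) for v in set(p) | set(q))

def has_t_closeness(rows: list[dict], quasi_identifiers: list[str],
                    sensitive: str, t: float) -> bool:
    """True if every class's sensitive-value distribution is within t of the
    table-wide distribution (TV distance here, not the standard EMD)."""
    table = Counter(row[sensitive] for row in rows)
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in quasi_identifiers)].append(row[sensitive])
    return all(
        tv_distance(Counter(vals), len(vals), table, len(rows)) <= t
        for vals in classes.values()
    )

# Table-wide: 7/8 negative. Each class deviates by 0.125 from that distribution:
rows = [{"zip": "44**", "test": "negative"}] * 3 \
     + [{"zip": "44**", "test": "positive"}] \
     + [{"zip": "43**", "test": "negative"}] * 4
print(has_t_closeness(rows, ["zip"], "test", t=0.1))  # -> False
```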

Final reminder

  • Anonymization is hard for small data and probably impossible for big data:
    • you must protect a dataset against a whole range of attacks: uniqueness, homogeneity, semantic, skewness, matching (unicity), profiling
    • by anonymizing it once and only once
    • all the while preserving utility (for all current and future uses)