
Definition

Neighboring dataset

For any two datasets \(D,D^{\prime}\in\mathcal{D}\), \(D\) and \(D^{\prime}\) are neighboring datasets if and only if one of the following two statements is true:

  • \(\exists x\in D^{\prime}\), s.t. \(D^{\prime}=D\cup\{x\}\)
  • \(\exists x\in D\), s.t. \(D=D^{\prime}\cup\{x\}\)

Two neighboring datasets always differ in size by exactly one record, i.e. \(|D|-|D^{\prime}|=\pm1\).

Differential privacy

Let \(M:\mathcal{D}\rightarrow Y\) be a randomized mechanism. \(M\) is \(\varepsilon\)-differentially private if, for any neighboring datasets \(D,D^{\prime}\in\mathcal{D}\) and any \(S\subset \text{img }M\), we have:

\[\text{Pr}[M(D)\in S]\leq e^{\varepsilon}\text{Pr}[M(D^{\prime})\in S]\]

Differential privacy guarantees that the result of a certain query on the dataset must be essentially the same irrespective of the presence of any single individual in the dataset.

Laplace mechanism

Global sensitivity

Let \(f:\mathcal{D}\rightarrow \mathbb{R}^k\), with \(f(D)=(f_1(D),f_2(D),\cdots,f_k(D))\). The global sensitivity of \(f\) is:

\[\Delta f=\max_{D_1,D_2}||f(D_1)-f(D_2)||_1=\max_{D_1,D_2}\sum_{i=1}^k|f_i(D_1)-f_i(D_2)|\]

where \(D_1,D_2\) range over all pairs of neighboring datasets in \(\mathcal{D}\).
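
For example, a counting query changes by at most 1 when a single record is added or removed, and a sum over values assumed to be clipped to a known bound \([0,B]\) changes by at most \(B\):

\[\Delta f_{\text{COUNT}}=1,\qquad \Delta f_{\text{SUM}}=B\]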

Laplace mechanism

For any \(f:\mathcal{D}\rightarrow\mathbb{R}^k\), the mechanism \(M:\mathcal{D}\rightarrow\mathbb{R}^k\) defined by

\[M(D)=f(D)+(Y_1,\cdots,Y_k),\qquad Y_i\stackrel{\text{i.i.d.}}{\sim}\text{Lap}(\Delta f/\varepsilon)\]

is \(\varepsilon\)-DP. This is called the Laplace mechanism.
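
Below is a minimal sketch of the Laplace mechanism in Python (assuming numpy; `laplace_mechanism` and its parameters are illustrative names, not a library API):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return true_value plus Laplace(sensitivity / epsilon) noise.

    Works for scalar or vector-valued queries: each coordinate gets
    independent Laplace noise with scale sensitivity / epsilon.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale, size=np.shape(true_value))

# Example: a COUNT query has global sensitivity 1.
dataset = [21, 35, 42, 57, 63]          # toy data
noisy_count = laplace_mechanism(len(dataset), sensitivity=1, epsilon=0.5)
```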

When \(\Delta f\) is unbounded (or very large), we cannot apply the simple Laplace mechanism without destroying utility. Fortunately, in many cases it is still possible to use other mechanisms.

Composition theorem

Let \(M_1,\cdots,M_k\) be such that every \(M_i\) is an \(\varepsilon_i\)-DP mechanism. Then the combined mechanism \((M_1,\cdots,M_k)\) is \(\left(\sum_{i=1}^k\varepsilon_i\right)\)-DP.

This theorem follows directly from the definition of \(\varepsilon\)-DP: the probability bounds of the individual mechanisms multiply, so the \(\varepsilon_i\) add up. In practice, we can fix a total \(\varepsilon\), called the privacy budget, which constrains how many queries anyone can run on the dataset and how much noise must be added to each result. For example, 10 queries at \(\varepsilon_i=0.1\) protect privacy as well as 2 queries at \(\varepsilon_i=0.5\), because both consume a budget of \(\varepsilon=1\).
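
The following sketch shows one way such budget accounting could look in code; the `PrivacyBudget` class is hypothetical, not part of any DP library:

```python
class PrivacyBudget:
    """Track a total epsilon budget across sequential queries
    (sequential composition: the epsilons simply add up)."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon_i):
        if self.spent + epsilon_i > self.total_epsilon:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon_i

# 10 queries at eps_i = 0.1 and 2 queries at eps_i = 0.5 both use a budget of 1.
budget = PrivacyBudget(total_epsilon=1.0)
for _ in range(10):
    budget.spend(0.1)
# budget.spend(0.1)  # an 11th query would raise: budget exhausted
```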

Optimized mechanism for histograms

A histogram consists of the counting queries

\[Q_i(D)=|\{x\in D: x\in X_i\}|,\qquad i=1,\cdots,k\]

where the bins \(X_1,\cdots,X_k\) are pairwise disjoint, i.e. \(X_i\cap X_j=\varnothing\) for \(i\neq j\).

If we add \(\text{Lap}(1/\varepsilon)\) noise to each query result, then the whole mechanism is \(\varepsilon\)-DP. This optimization is a corollary of the global sensitivity definition: because the bins are disjoint, adding or removing one record changes exactly one count by 1, so the histogram query \((Q_1,\cdots,Q_k)\) has global sensitivity \(\Delta f=1\).
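
A minimal sketch of such a noisy histogram in Python (assuming numpy; the data, bin edges, and the `dp_histogram` name are illustrative):

```python
import numpy as np

def dp_histogram(values, bins, epsilon, rng=None):
    """Noisy histogram: the bins are disjoint, so one record affects only
    one count, the vector of counts has sensitivity 1, and Lap(1/epsilon)
    noise per bin gives epsilon-DP for the whole release."""
    rng = np.random.default_rng() if rng is None else rng
    counts, _ = np.histogram(values, bins=bins)
    return counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)

# Example: ages bucketed into decades, released with epsilon = 0.5.
ages = [23, 27, 34, 35, 41, 58, 61]
noisy = dp_histogram(ages, bins=[20, 30, 40, 50, 60, 70], epsilon=0.5)
```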

Group privacy

Any \(\varepsilon\)-DP mechanism \(M\) is \(k\varepsilon\)-DP for groups of size \(k\). This means that it's "\(k\varepsilon\) hard" to determine whether an entire group of \(k\) individuals belongs to the dataset, based on the output of \(M\).

Local differential privacy

Introduction

In local differential privacy, every user “adds noise” to their own data (locally) and then shares the “privatized” data with the analyst. The analyst can then run any computation on these privatized data.

Local differential privacy

Let \(R\) be a set of responses. A randomized algorithm \(M:R\rightarrow Y\) is \(\varepsilon\)-local differentially private if, for any two responses \(r_1,r_2\in R\) and any \(S\subset \text{img }M\), we have:

\[\text{Pr}[M(r_1)\in S]\leq e^{\varepsilon}\text{Pr}[M(r_2)\in S]\]

Example

Suppose a professor wants to conduct a survey among students, asking the question: “Have you cheated on the exam?”. Every student is asked to answer YES or NO by following these steps:

  1. Flip a biased coin, with probability of tails \(p > 0.5\)
  2. If tails, then respond truthfully
  3. If heads, then lie.

For this mechanism,

\[\varepsilon=\log(\frac{p}{1-p})\]
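
A minimal sketch of this survey mechanism in Python (the function names are illustrative):

```python
import math
import random

def randomized_response(truth, p):
    """With probability p (tails) report the true answer, otherwise lie."""
    return truth if random.random() < p else not truth

def epsilon(p):
    """Privacy parameter of the mechanism: epsilon = log(p / (1 - p))."""
    return math.log(p / (1 - p))

# Example: p = 0.75 gives epsilon = log(3), roughly 1.10.
answers = [randomized_response(truth=True, p=0.75) for _ in range(1000)]
```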