
Introduction

  • Query-based systems aim to give researchers and organizations anonymous access to data without sharing the individual-level records with them
  • This can be done through online interfaces, SQL queries, vetted algorithms, etc., which only return aggregate data
  • In this post, we only consider a COUNT interface

Query size restriction (QSR)

Definition

A query \(Q\) is a logical formula that selects a specific set of rows. Define:

\[Q= C_1\land \cdots \land C_h\]

\[\{Q\}_D=\{x|x\in D,Q(x)=\text{True}\}\]

where \(C_i\) is an expression of the form "attribute ? value", with ? standing for a comparison operator such as \(=\) or \(\neq\).

QSR blocks every query \(Q\) such that \(|\{Q\}_D|\leq t\), where \(t\) is a threshold set by the data curator.
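A minimal sketch of a COUNT interface enforcing QSR; the dataset, predicate, and threshold below are all illustrative:

```python
# Minimal sketch of a COUNT interface with query size restriction (QSR).
# The dataset layout, predicate, and threshold t are illustrative.

def qsr_count(dataset, query, t=5):
    """Return the COUNT of rows matching `query`, or None if the query
    selects at most t rows and is therefore blocked."""
    size = sum(1 for row in dataset if query(row))
    return None if size <= t else size

# A broad query is answered; a narrow one is blocked.
rows = [{"dept": "DoC"}] * 10 + [{"dept": "EE"}] * 2
assert qsr_count(rows, lambda r: r["dept"] == "DoC") == 10
assert qsr_count(rows, lambda r: r["dept"] == "EE") is None
```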

Intersection Attack

COUNT Q1: students in DoC AND code with Notepad?

COUNT Q2: students in DoC AND code with Notepad AND NOT born on 1994-09-23?

Both \(Q_1\) and \(Q_2\) are likely to have answers \(A_1,A_2>t\), so neither will be blocked by the query size restriction. However, if \(A_1-A_2=0\), we learn that the person born on 1994-09-23 doesn't code with Notepad.

An intersection attack combines the answers of multiple queries to learn information about a single individual. Detecting all possible intersection attacks has been shown to be an NP-hard problem.
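The attack above can be sketched on hypothetical data; both counts exceed any reasonable threshold, yet their difference reveals whether the single student born on 1994-09-23 codes with Notepad:

```python
# Sketch of the intersection (difference) attack on hypothetical data:
# rows are (date_of_birth, editor) pairs for DoC students.

def count(rows, pred):
    return sum(1 for r in rows if pred(r))

# Hypothetical dataset: exactly one student born on 1994-09-23.
rows = [("1994-09-23", "vim")] + [("2000-01-01", "Notepad")] * 20

a1 = count(rows, lambda r: r[1] == "Notepad")
a2 = count(rows, lambda r: r[1] == "Notepad" and r[0] != "1994-09-23")

# a1 - a2 == 0: no Notepad coder was born on 1994-09-23, so the one
# student with that birthday does not code with Notepad.
assert a1 - a2 == 0
```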

Unbiased noise addition

Bounded noise

Bounded noise leaks information if the attacker knows the noise mechanism. For example, suppose noise \(u\sim U[-2,2]\) is added to the answers of the intersection attack above. If we observe \(\hat{A_1}=586\) and \(\hat{A_2}=581\), then \(A_1\geq 584\) and \(A_2\leq 583\). If we also know that only Bob was born on that day, \(A_1\) and \(A_2\) differ by at most \(1\), so \(A_1=584\) and \(A_2=583\) with certainty.
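This deduction can be sketched directly, using the numbers from the example above:

```python
# Sketch of why bounded noise leaks: with u ~ U[-2, 2], an observed
# noisy answer pins the true answer to a width-4 interval. Combining
# both intervals with the constraint A2 <= A1 <= A2 + 1 (only Bob is
# born that day) recovers both counts exactly.

noisy_a1, noisy_a2 = 586, 581
bound = 2  # noise is uniform on [-2, 2]

a1_candidates = range(noisy_a1 - bound, noisy_a1 + bound + 1)  # 584..588
a2_candidates = range(noisy_a2 - bound, noisy_a2 + bound + 1)  # 579..583

feasible = [(a1, a2) for a1 in a1_candidates for a2 in a2_candidates
            if a2 <= a1 <= a2 + 1]
assert feasible == [(584, 583)]  # the true counts, with certainty
```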

Unbounded noise

  • Centered at 0, so as not to introduce bias
  • Large perturbations are possible but unlikely

Averaging attack

An averaging attack asks the same question several times and averages the answers to recover the true value. Formally, this can be done in two ways (frequentist and Bayesian):

  • Compute the average of the samples and invoke the central limit theorem (CLT): the average converges to the mean (i.e. the true value) as the number of queries grows
  • Use Bayes’ rule with multiple observations. This immediately gives not only the most likely value (the average) but also a full posterior distribution

Assume Gaussian noise \(N(0,\sigma^2)\) is added. If the attacker wants to decide whether Bob is in the dataset and issues \(n\geq 4\sigma^2z_{\alpha}^2\) queries, they will make an incorrect prediction with probability at most \(\alpha\), where \(z_\alpha\) is the corresponding standard normal quantile.
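The frequentist version can be sketched as a simulation; the true count, noise scale, and number of queries are illustrative:

```python
# Sketch of the frequentist averaging attack: query n times, average
# the Gaussian-noised answers, and recover the true count.
import random

random.seed(0)       # for reproducibility
true_count = 584     # illustrative true answer
sigma = 3.0          # noise standard deviation
n = 10_000           # number of repeated queries

answers = [true_count + random.gauss(0, sigma) for _ in range(n)]
estimate = sum(answers) / n

# By the CLT the average is within ~sigma/sqrt(n) of the true value,
# so rounding recovers the exact count.
assert round(estimate) == true_count
```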

Consistent noise addition

If the noise were consistent, we wouldn’t learn anything by asking the same question again. Consistent noise can be achieved through:

  • Caching: store the noisy result of every query and return it if the same question is asked again
  • Seeded pseudorandom number generator (PRNG): seed the noise generator so that the exact same noise is added whenever the query is the same. The seed could e.g. be a hash of all the parameters of the query
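The PRNG approach can be sketched as follows; the hash construction, salt, and noise scale are illustrative choices, not a specific system's implementation:

```python
# Sketch of consistent noise via a seeded PRNG: the query's parameters
# are hashed (with a secret salt) into a seed, so the identical query
# always receives the identical noise.
import hashlib
import random

def noisy_count(true_count, query_text, salt="secret-salt", sigma=1.0):
    seed = hashlib.sha256((salt + query_text).encode()).hexdigest()
    rng = random.Random(seed)  # per-query deterministic generator
    return true_count + rng.gauss(0, sigma)

# The same query always receives the same noise, defeating averaging.
assert noisy_count(584, "DoC AND Notepad") == noisy_count(584, "DoC AND Notepad")
```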

Semantic attack

If the query language we use is expressive enough, there often exist multiple ways to ask the same question:

COUNT Q1: students in DoC AND code with Notepad AND born between 1994-09-23 00:00 and 1994-09-24 00:00?

COUNT Q2: students in DoC AND code with Notepad AND born between 1994-09-23 00:01 and 1994-09-24 00:01?

Each variant selects (almost) the same set of users but receives independent noise, so with a large number of such answers the attacker can cancel out the consistent noise by averaging.

Diffix's sticky noise

Diffix's output of counting \(Q\) is:

\[\text{COUNT }\tilde{Q}(D)=\text{COUNT }Q(D)+\sum_{i=1}^h\text{static}[C_i]+\sum_{i=1}^h\text{dynamic}_Q[C_i]\]

Static noise

For each condition \(C_i\), the noise value \(\text{static}[C_i]\) is generated by drawing a random value from \(N(0,1)\). The random value is generated using a PRNG seeded with:

\[\text{static_seed}_{C_i}=\text{hash}(C_i,\text{salt})\]

Dynamic noise

For each condition \(C_i\), the noise value \(\text{dynamic}_Q[C_i]\) is generated by drawing a random value from \(N(0,1)\). The random value is generated using a PRNG seeded with:

\[\text{dynamic_seed}_{C_i,Q}=\text{hash}(\text{static_seed}_{C_i},\{Q\}_D)\]
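The two noise layers can be sketched as follows. This is an illustrative reading of the formulas above, not Diffix's production implementation: the hash function, salt, and user-set encoding are all assumptions.

```python
# Sketch of Diffix-style sticky noise for a conjunctive COUNT query:
# one static N(0,1) layer per condition (seeded by the condition and a
# secret salt) plus one dynamic N(0,1) layer per condition (seeded
# additionally by the selected user set {Q}_D).
import hashlib
import random

def _gauss(seed_material):
    """Deterministic N(0,1) draw seeded by `seed_material`."""
    seed = hashlib.sha256(seed_material.encode()).hexdigest()
    return random.Random(seed).gauss(0, 1)

def diffix_count(true_count, conditions, user_ids, salt="secret-salt"):
    user_set = ",".join(sorted(user_ids))  # encoding of {Q}_D (assumed)
    noise = 0.0
    for c in conditions:
        static_seed = f"{c}|{salt}"
        noise += _gauss(static_seed)                   # static[C_i]
        noise += _gauss(f"{static_seed}|{user_set}")   # dynamic_Q[C_i]
    return true_count + noise

conds = ["dept = DoC", "editor = Notepad"]
users = {"alice", "bob", "carol", "dave", "eve"}
# Sticky: repeating the same query over the same user set yields the
# exact same answer, so averaging gains nothing.
assert diffix_count(5, conds, users) == diffix_count(5, conds, users)
```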

Bucket suppression

Besides sticky noise, Diffix includes another protection. This is called bucket suppression, and it is a more sophisticated version of QSR.

The idea is the same: block any query that selects a set of users with size smaller than a certain threshold. However, the threshold is not fixed, but noisy as well:

If \(|\{Q\}_D|\leq1\), the query is suppressed. Otherwise, a noisy threshold \(t\) is drawn from \(N(4,1/2)\) using a PRNG seeded with:

\[\text{threshold_seed}=\text{hash}(\{Q\}_D,\text{salt})\]

The query is then answered only if \(|\{Q\}_D|\geq t\).
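A sketch of this mechanism, with the same caveats as before: the hash, salt, and user-set encoding are illustrative, and \(N(4,1/2)\) is read here as mean 4 with standard deviation 1/2.

```python
# Sketch of bucket suppression with a noisy, sticky threshold:
# queries selecting at most one user are always suppressed; otherwise
# a threshold t ~ N(4, 1/2) is drawn from a seed tied to the user set,
# and the query is answered only if the set is at least that large.
import hashlib
import random

def suppressed(user_ids, salt="secret-salt"):
    if len(user_ids) <= 1:
        return True  # always suppressed
    material = ",".join(sorted(user_ids)) + salt
    seed = hashlib.sha256(material.encode()).hexdigest()
    t = random.Random(seed).gauss(4, 0.5)  # sticky noisy threshold
    return len(user_ids) < t

assert suppressed({"bob"})                           # a single user: blocked
assert not suppressed({f"u{i}" for i in range(50)})  # far above any plausible t
```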

Safe?

Dynamic noise helps protect against many sophisticated attacks. However, the developers of Diffix do not have a mathematical proof that their system protects against every attack, and they acknowledge this.

In October 2018, Aloni Cohen and Kobbi Nissim published a report on their attack on Diffix. This attack is an adaptation of a reconstruction attack, proposed in 2003 by Dinur and Nissim.

In October 2018, researchers from UCL and EPFL published a blog post about their attack on Diffix, based on a previous paper. No technical report for this attack is available yet.

In April 2018, researchers from Imperial College London designed an attack to infer a user’s attribute with high accuracy, see [here](https://cpg.doc.ic.ac.uk/blog/aircloak-diffix-signal-is-in-the-noise/). They called it a noise-exploitation attack.