Definitions#

Adjacent datasets#

Datasets \(D, D' \in \mathcal{D}\) are adjacent if they are equal up to the addition or removal of all entries sharing the same PID. Note that this is a slightly unusual and restricted definition of adjacency, suited to our practical needs. It is close to the one used in the user-level differential privacy literature [LSY+20, WZL+19], where one user may contribute many samples.
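
As an illustration, here is a minimal Python sketch, using a made-up list-of-pairs representation where each record is tagged with its PID, of how an adjacent dataset is obtained by removing every entry sharing one PID:

```python
# Illustrative only: a dataset as a list of (pid, record) pairs.
D = [
    ("alice", {"age": 34, "city": "Paris"}),
    ("alice", {"age": 34, "city": "Lyon"}),
    ("bob", {"age": 51, "city": "Nice"}),
]

def remove_pid(dataset, pid):
    """Return the dataset obtained by dropping *all* entries with this PID."""
    return [(p, record) for p, record in dataset if p != pid]

# D and D_prime are adjacent: they differ by all of "alice"'s entries at once.
D_prime = remove_pid(D, "alice")
```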

Data Owner#

The Data Owner is the person in charge of managing and protecting the datasets in a database. The Data Owner can use Qrlew to rewrite the SQL queries of the Data Practitioners into Differentially Private equivalents, run them on their behalf, and account for the privacy loss.

Data Practitioner#

The Data Practitioner is a data scientist, an analyst, or any end user who wants to leverage sensitive data to carry out some analysis. They are interested in patterns that hold irrespective of any given individual, but still want maximum utility when querying the datasets.

Datasets and Privacy Units (PU)#

In this documentation, datasets refer to collections of elements in some domain \(\mathcal{X}\), each labelled with an identifier \(i \in \mathcal{I}\) designating the entity whose privacy we want to protect. This entity is called the Privacy Unit (PU) and the identifier is referred to as the Privacy ID (PID). Let \(\mathcal{D}\) be the set of datasets of arbitrary sizes with a privacy unit.

Differential Privacy (DP)#

Let \(\mathcal{M}\) be an algorithm that takes a dataset as input and produces a randomized output. The algorithm \(\mathcal{M}\) is said to satisfy \((\varepsilon, \delta)\)-differential privacy if, for all pairs of adjacent datasets \(D, D' \in \mathcal{D}\), and for all measurable sets \(S\) in the range of \(\mathcal{M}\):

\[ \Pr[\mathcal{M}(D) \in S] \leq e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S] + \delta \]
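
As a concrete illustration, the classical Laplace mechanism achieves \((\varepsilon, 0)\)-DP for a counting query. The sketch below is a minimal Python example, assuming each PID contributes at most one row so that the sensitivity of the count is 1; the function name and dataset representation are illustrative, not part of Qrlew's API:

```python
import numpy as np

def dp_count(dataset, epsilon):
    """(epsilon, 0)-DP count via the Laplace mechanism.

    Assumes each PID contributes at most one entry, so that adding or
    removing one PID changes the true count by at most 1 (sensitivity 1).
    """
    true_count = len(dataset)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Runs on adjacent datasets produce statistically close output distributions.
print(dp_count([("alice", 1), ("bob", 2)], epsilon=1.0))
```
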
For more background on DP, you can refer to differentialprivacy.org or the United Nations PET Guide.

Privacy Enhancing Technologies (PET)#

As defined by the United Nations PET Guide, Privacy-enhancing technologies (PETs) are technologies designed to safely process and share sensitive data. There are two broad categories of PETs: those providing input privacy and those providing output privacy.

Input privacy focuses on how one or multiple parties can process data in a manner that guarantees the data is not used outside of that strict context.

Output privacy focuses on modifying the results of a computation such that the output data cannot be used to reverse engineer the original inputs. By using these technologies intelligently, safe data life cycles can be constructed, enabling collaboration, trust and providing confidence to data subjects.

Relation#

In Qrlew, queries are parsed and turned into intermediate representations called Relations.

A Relation is a recursive data structure that may be:

  • Tables: This is simply a data source from a database.

  • Maps: A Map takes an input Relation, filters the rows and transforms them one by one. The filtering conditions and row transforms are expressed with expressions similar to those of SQL. It acts like a SELECT exprs FROM input WHERE expr LIMIT value and therefore preserves the privacy unit ownership structure.

  • Reduces: A Reduce takes an input Relation and aggregates some columns, possibly group by group. It acts like a SELECT aggregates FROM input GROUP BY expr. This is where the rewriting into DP happens, as described below.

  • Joins: This Relation combines two input Relations as a SELECT * FROM left JOIN right ON expr would. The privacy properties are more complex to propagate in this case.

It is close to the notion of relation from relational algebra.
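
To make the recursion concrete, here is a minimal, hypothetical Python sketch of such a structure. Qrlew's actual implementation is written in Rust and carries much more information (schemas, data types, privacy properties), so this is only an outline of the four variants listed above:

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Table:
    name: str                    # a data source from the database

@dataclass
class Map:
    exprs: List[str]             # SELECT exprs
    filter: str                  # WHERE expr
    limit: int                   # LIMIT value
    input: "Relation"

@dataclass
class Reduce:
    aggregates: List[str]        # SELECT aggregates
    group_by: List[str]          # GROUP BY expr
    input: "Relation"

@dataclass
class Join:
    on: str                      # JOIN ... ON expr
    left: "Relation"
    right: "Relation"

# A Relation is recursively one of the four variants above.
Relation = Union[Table, Map, Reduce, Join]
```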

Synthetic Data (SD)#

Synthetic Data refers to data generated by a procedure that preserves the statistical properties of some source dataset.

Generally, the records from the source dataset are assumed to be independently drawn from some probability distribution. This distribution is estimated by fitting a generative model on the source data, and new samples are then drawn from the fitted model.
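
As a toy illustration of this recipe (the model choice is arbitrary and, as noted below, this fit is not differentially private on its own):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Fake source records, e.g. (age, height), standing in for real data.
source = rng.normal(loc=[40.0, 170.0], scale=[10.0, 8.0], size=(1000, 2))

# Estimate the distribution with a generative model fitted on the source data...
model = GaussianMixture(n_components=3, random_state=0).fit(source)
# ...then draw synthetic records from the fitted model.
synthetic, _ = model.sample(n_samples=1000)
```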

Synthetic Data will not protect privacy per se, unless the generative model is fitted with a Differentially Private procedure such as DP-SGD [PHK+23].

In this documentation, which is focused on privacy, we will assume that Synthetic Data is generated with such a Differentially Private procedure.