Getting Started#

Qrlew is an open source library that can turn SQL queries into their differentially private (DP) equivalent.

It takes SQL – the universal language of small and big data analytics – as input, so there is no new language or API to learn.
It returns DP SQL queries that can be executed at scale on any SQL datastore.
The only required extra inputs are the privacy unit specification and privacy budget $(\varepsilon, \delta)$.

With Qrlew a data owner can receive SQL queries from an untrusted data analyst, turn them into DP equivalent, execute them and safely return the result to the data analyst.

In this quick intro, we will setup a database, load a few demo tables. Then we will define the privacy unit and run simple DP SQL queries to show how Qrlew works on various kind of queries with Joins, Subqueries or Common Table Expressions.

Get access to a database#

Let’s start by installing a local PostgreSQL database.

Qrlew currently supports the following SQL dialects:

%%capture
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql
!sudo service postgresql start
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'password'"
!pip install psycopg2

Install Qrlew python bindings: `pyqrlew`#

pyqrlew is distributed on PyPI and simply installed with pip install.

If no binary is available for your platform, you will need a Rust compiler to install the library from sources.

Binaries are available for Linux, MacOS, Windows and major architectures including x86, arm or Apple chips.

%%capture
!pip install pyqrlew

Load a few tables#

Let’s load a pyqrlew demo dataset: IMDb, composed of several related tables.

from pyqrlew.io import PostgreSQL

database = PostgreSQL('postgres', 'postgres', 'password', 5432)
database.load_imdb()
engine = database.engine()

We can display a short extract of each table.

from IPython.display import display, Markdown
import pandas as pd

for table in ['actors', 'directors', 'directors_genres', 'movies', 'movies_directors', 'movies_genres', 'roles']:
  display(Markdown(f'## {table}'))
  display(pd.read_sql(f'SELECT * FROM imdb_ijs.{table} LIMIT 3', engine))

actors

	id	first_name	last_name	gender
0	2	Michael	'babeepower' Viera	M
1	3	Eloy	'Chincheta'	M
2	4	Dieguito	'El Cigala'	M

directors

	id	first_name	last_name
0	1	Todd	1
1	2	Les	12 Poissons
2	3	Lejaren	a'Hiller

directors_genres

	director_id	genre	prob
0	2	Short	1.0
1	3	Drama	1.0
2	5	Documentary	1.0

movies

	id	name	year	rank
0	0	#28	2002	NaN
1	1	#7 Train: An Immigrant Journey, The	2000	NaN
2	2	$	1971	6.4

movies_directors

	director_id	movie_id
0	1	378879
1	2	281325
2	3	30621

movies_genres

	movie_id	genre
0	1	Documentary
1	1	Short
2	2	Comedy

roles

	actor_id	movie_id	role
0	2	280088	Stevie
1	2	396232	Various/lyricist
2	3	376687	Gitano 1

Connect to the database and create a `pyqrlew.Dataset`#

To use Qrlew the SQL tables have to be collected into a Dataset object that will enrich the tables with metadata such as privacy unit or value ranges definitions.

Qrlew basic constructs are relations: abstract representation of SQL queries.

Tables are the most basic relations in qrlew.

from pyqrlew import Dataset

source_dataset = Dataset.from_database('imdb', engine, 'imdb_ijs')

from graphviz import Source

for path, relation in source_dataset.relations():
  display(Source(relation.dot()))

../_images/5e624de244d2ebc95c38b82a3d9bb9db094b21f659595afb7f0f72d3febaae49.svg

../_images/f51daabe6fccf8fa0ed3c5922a40a40dff914ca160816f7e27f6c6c8176cc1b5.svg

../_images/16e413cf4d02c5532366e1096e6e89f9ea85a24c41372720cd8588a5625cdf62.svg

../_images/9999dc5ae7dbcc882d4e3f1a7e23d6520474116235f6f9667df59ce4007f01a6.svg

../_images/dfb6b3aea724192d0db4c31664478c56a5d53069779e719a7b9092ead30ceae1.svg

../_images/a0d9fa206eb8ceeaca7b0093310e6104beef6fa47f7751226b23f40d7446cde7.svg

../_images/b3f99149179670d3d8e1faba39870b17a18cbad3dba5118594258ffd025b9be7.svg

Optionally declare value ranges and unique constraints#

If ranges are known for some numeric columns, it is possible to define them with the following syntax new_dataset = dataset.<schema>.<table>.<column>.with_range(min, max).

Note that inferring those ranges from the data may reveal private information if not done right.

bounded_dataset = source_dataset.imdb_ijs.movies.rank.with_range(0, 20)

If some columns have known possible values, they can also be declared.

Here again, simple inference from the data would be unsafe.

bounded_dataset = bounded_dataset.imdb_ijs.actors.gender.with_possible_values(['M', 'F'])

Some column are known to have a distincts values (although not specified in the DB schema).

dataset = (bounded_dataset
  .imdb_ijs.actors.id.with_unique_constraint()
  .imdb_ijs.directors.id.with_unique_constraint()
  .imdb_ijs.movies.id.with_unique_constraint())

The new dataset enriched with new metadata can be visualized.

for path, relation in dataset.relations():
  display(Source(relation.dot()))

../_images/28334ee2160bb6321f95ef3c3f9721cf1e6ce8a8bf1114282ff53755c9febd28.svg

../_images/6ab5db5b678474a490106a4da0bd4cdbf4d96b9e0d933349032c323255d8354c.svg

../_images/d1ff616bd2ea3ba3eb5925333ef9f77b23e8011b521b4a5f0dd979a23e503c45.svg

Define the privacy unit and budget#

Privacy unit definition#

To protect the privacy of individuals in a dataset, we need to define a privacy unit.

The privacy unit is the “element” we are referring to when we say two datasets “differ on a single element” in the DP definition.

It is therefore what is to be protected in the dataset.

Let’s say we want to protect the privacy of actors and directors. For the tables actors and directors the column id will directly define the privacy unit (i.e. entity to protect).

But the tables directors_genres, movies_directors and roles also define directors and actors, because they refer to them through director_id and actor_id.

Budget definition#

Explaining what $\varepsilon$ and $\delta$ precisely mean is beyond the scope of this document, but in a word:

$\varepsilon$ is the amount of information about any individual that is allowed to leak. It should remain small (<1) so that little information is leaked.
$\delta$ is some tolerance that should remain very small, typically lower than one over the number of elements in the dataset.

# Privacy unit definition
privacy_unit = [
    # The column `id` directly defines the privacy unit
    ("actors", [], "id"),
    ("directors", [], "id"),
    # The column `director_id` refers to the column `id` of table `directors`, the `id` of which defines the privacy unit
    ("directors_genres", [("director_id", "directors", "id")], "id"),
    ("movies_directors", [("director_id", "directors", "id")], "id"),
    # The column `actor_id` refers to the column `id` of table `actors`, the `id` of which defines the privacy unit
    ("roles", [("actor_id", "actors", "id")], "id"),
]
# Budget
budget = {"epsilon": 1.0, "delta": 1e-3}

# Privacy unit is propagted accross the relation
pup_actors = dataset.imdb_ijs.actors.relation().rewrite_as_privacy_unit_preserving(
    dataset,
    privacy_unit,
    budget,
)
display(Source(pup_actors.relation().dot()))

../_images/db93ba6c3563e45949dcd729c26b85635366def9598171d50c69b2a6b4e336df.svg

# Privacy unit is propagted accross the relation
pup_roles = dataset.imdb_ijs.roles.relation().rewrite_as_privacy_unit_preserving(
    dataset,
    privacy_unit,
    budget,
)
display(Source(pup_roles.relation().dot()))

../_images/8f3a3256716e7567760b0de4e7c46c9bbfd502418d703f08d0474057d7e0c123.svg

Convert a simple aggregation query into a DP equivalent#

# We create a basic aggregation relation
query = "SELECT gender, COUNT(*) FROM imdb_ijs.actors GROUP BY gender"
relation = dataset.relation(query)
# The relation is turned into a DP equivalent
relation_with_dp_event = relation.rewrite_with_differential_privacy(
    dataset,
    privacy_unit,
    budget,
)
# Rewritten relation
dp_relation = relation_with_dp_event.relation()
# Privacy loss
dpe = relation_with_dp_event.dp_event()
print(dpe)

DpEvent(Gaussian { noise_multiplier: 3.776479532659047 })

dp_events are compatible with Google’s dp_accounting

%%capture
!pip install dp_accounting==0.4.1

# Google's dp_accounting
from dp_accounting import DpEvent
print(DpEvent.from_named_tuple(dpe.to_named_tuple()))

/usr/local/lib/python3.10/dist-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.5
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"

GaussianDpEvent(noise_multiplier=3.776479532659047)

display(Source(dp_relation.dot()))

../_images/d15e126309a71797b536e23a3530530b06ef088a5f915d69def817a2781135d7.svg

dp_query = dp_relation.to_query()
display(dp_query)

"WITH field_d4_o (field_d4_o) AS (SELECT * FROM (VALUES ('F'), ('M')) AS field_d4_o (field_d4_o)), map_o7od (_PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_, id, first_name, last_name, gender) AS (SELECT MD5(CAST(id AS TEXT)) AS _PRIVACY_UNIT_, 1 AS _PRIVACY_UNIT_WEIGHT_, id AS id, first_name AS first_name, last_name AS last_name, gender AS gender FROM imdb_ijs.actors), map_q1hj (_PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_, field_n9r3, field_d4_o) AS (SELECT _PRIVACY_UNIT_ AS _PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_ AS _PRIVACY_UNIT_WEIGHT_, 1 AS field_n9r3, gender AS field_d4_o FROM map_o7od), join_xczi (field_d4_o, _PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_, field_n9r3, left__q8p1) AS (SELECT * FROM field_d4_o AS _LEFT_ LEFT JOIN map_q1hj AS _RIGHT_ ON (_LEFT_.field_d4_o) = (_RIGHT_.field_d4_o)), map_1v6w (field_d4_o, _PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_, field_n9r3) AS (SELECT field_d4_o AS field_d4_o, _PRIVACY_UNIT_ AS _PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_ AS _PRIVACY_UNIT_WEIGHT_, field_n9r3 AS field_n9r3 FROM join_xczi), map___wz (_PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_, field_d4_o, _ONE_field_n9r3) AS (SELECT COALESCE(CAST(_PRIVACY_UNIT_ AS TEXT), '_PRIVACY_UNIT_DEFAULT_') AS _PRIVACY_UNIT_, COALESCE(_PRIVACY_UNIT_WEIGHT_, 0) AS _PRIVACY_UNIT_WEIGHT_, field_d4_o AS field_d4_o, CASE WHEN field_n9r3 IS NULL THEN 0 ELSE 1 END AS _ONE_field_n9r3 FROM map_1v6w), map_mvf9 (field_1iro, field_grep, field_zyse) AS (SELECT _ONE_field_n9r3 AS field_1iro, _PRIVACY_UNIT_ AS field_grep, field_d4_o AS field_zyse FROM map___wz), reduce_mq2i (_PRIVACY_UNIT_, field_d4_o, _NORM__ONE_field_n9r3) AS (SELECT field_grep AS _PRIVACY_UNIT_, field_zyse AS field_d4_o, SUM(field_1iro) AS _NORM__ONE_field_n9r3 FROM map_mvf9 GROUP BY field_grep, field_zyse), map_5820 (_PRIVACY_UNIT_, field_d4_o, _NORM__ONE_field_n9r3) AS (SELECT _PRIVACY_UNIT_ AS _PRIVACY_UNIT_, field_d4_o AS field_d4_o, ABS((_NORM__ONE_field_n9r3) * (_NORM__ONE_field_n9r3)) AS _NORM__ONE_field_n9r3 FROM reduce_mq2i), map_txex (field_7ra5, field_grep) AS (SELECT _NORM__ONE_field_n9r3 AS field_7ra5, _PRIVACY_UNIT_ AS field_grep FROM map_5820), reduce_sa4i (_PRIVACY_UNIT_, _ONE_field_n9r3) AS (SELECT field_grep AS _PRIVACY_UNIT_, SUM(field_7ra5) AS _ONE_field_n9r3 FROM map_txex GROUP BY field_grep), map_z1nz (_PRIVACY_UNIT_, _ONE_field_n9r3) AS (SELECT _PRIVACY_UNIT_ AS _PRIVACY_UNIT_, (1) / (GREATEST(1, (SQRT(_ONE_field_n9r3)) / (1))) AS _ONE_field_n9r3 FROM reduce_sa4i), join_rcg8 (_PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_, field_d4_o, _ONE_field_n9r3, _SCALE_FACTOR__PRIVACY_UNIT_, _SCALE_FACTOR__ONE_field_n9r3) AS (SELECT * FROM map___wz AS _LEFT_ JOIN map_z1nz AS _RIGHT_ ON (_LEFT_._PRIVACY_UNIT_) = (_RIGHT_._PRIVACY_UNIT_)), map_7dpd (_PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_, field_d4_o, _ONE_field_n9r3, _SCALE_FACTOR__PRIVACY_UNIT_, _SCALE_FACTOR__ONE_field_n9r3, _CLIPPED__ONE_field_n9r3) AS (SELECT _PRIVACY_UNIT_ AS _PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_ AS _PRIVACY_UNIT_WEIGHT_, field_d4_o AS field_d4_o, _ONE_field_n9r3 AS _ONE_field_n9r3, _SCALE_FACTOR__PRIVACY_UNIT_ AS _SCALE_FACTOR__PRIVACY_UNIT_, _SCALE_FACTOR__ONE_field_n9r3 AS _SCALE_FACTOR__ONE_field_n9r3, (_ONE_field_n9r3) * (_SCALE_FACTOR__ONE_field_n9r3) AS _CLIPPED__ONE_field_n9r3 FROM join_rcg8), map_klrf (field_8d51, field_zyse) AS (SELECT _CLIPPED__ONE_field_n9r3 AS field_8d51, field_d4_o AS field_zyse FROM map_7dpd), reduce_14x4 (field_d4_o, _COUNT_field_n9r3) AS (SELECT field_zyse AS field_d4_o, SUM(field_8d51) AS _COUNT_field_n9r3 FROM map_klrf GROUP BY field_zyse), map_uira (field_d4_o, _COUNT_field_n9r3) AS (SELECT field_d4_o AS field_d4_o, LEAST(1635436, GREATEST(0, (COALESCE(_COUNT_field_n9r3, 0)) + ((3.776479532659047) * ((SQRT((-2) * (LN(RANDOM())))) * (COS((6.283185307179586) * (RANDOM()))))))) AS _COUNT_field_n9r3 FROM reduce_14x4), map_wrho (field_i4xy, field_d4_o) AS (SELECT CAST(_COUNT_field_n9r3 AS INTEGER) AS field_i4xy, field_d4_o AS field_d4_o FROM map_uira), map_su7n (field_d4_o, field_i4xy) AS (SELECT field_d4_o AS field_d4_o, field_i4xy AS field_i4xy FROM map_wrho), map__i7v (gender, field_34j2) AS (SELECT field_d4_o AS gender, field_i4xy AS field_34j2 FROM map_su7n) SELECT * FROM map__i7v"

import pandas as pd

true_res = pd.read_sql(query, database.engine())
dp_res = pd.read_sql(dp_query, database.engine())

display(Markdown(f'## Result'))
display(true_res)

display(Markdown(f'## DP Result'))
display(dp_res)

Result

	gender	count
0	F	304412
1	M	513306

DP Result

	gender	field_34j2
0	M	513300
1	F	304413