User Guide#

View On GitHub Open In Colab

Qrlew is an open source library that can turn SQL queries into their differentially private (DP) equivalent.

  • It takes SQL – the universal language of small and big data analytics – as input, so there is no new language or API to learn.

  • It returns DP SQL queries that can be executed at scale on any SQL datastore.

  • The only required extra inputs are the privacy unit specification and privacy budget \((\varepsilon, \delta)\).

With Qrlew a data owner can receive SQL queries from an untrusted data analyst, turn them into DP equivalent, execute them and safely return the result to the data analyst.

This guide will cover several useful notions from Qrlew:

  • We will define a Dataset from a database connection.

  • We will declare value ranges and unique constraints

  • We will define a privacy unit.

  • And we will run simple DP SQL queries.

We will also explore more advanced features:

  • The support of several SQL dialects: PostgreSQL (default), MSSQL or BigQuery.

  • Privacy accounting compatibility with Google’s popular Differential Privacy accounting: dp_accounting.

  • The use of Synthetic Data to further improve the quality of the results.

Define a Dataset#

Let’s define a Dataset object from a SQL connection.

We will first run a local postgresql database. You can connect to any Postgresql, MSSQL or BigQuery database.

%%capture
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql
!sudo service postgresql start
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'password'"
!pip install psycopg2
!pip install pyqrlew
# A few utility functions
from IPython.display import Markdown

def print_sql(query):
  display(Markdown(f"""```sql
{query}
```"""))

def print_result(title, result):
  display(Markdown(f'### {title}'))
  display(result)

def print_dot(dot_graph):
  display(Source(dot_graph))

def print_md(md):
  display(Markdown(md))

Then we load a demo dataset.

from pyqrlew.io import PostgreSQL

database = PostgreSQL('postgres', 'postgres', 'password', 5432)
database.load_imdb()
engine = database.engine()

We can then build a Dataset object that can be used to parse queries into Relation objects that can be transformed into differentially private Relations and rendered into SQL queries.

from pyqrlew import Dataset

source_dataset = Dataset.from_database('imdb', engine, 'imdb_ijs')
for path, relation in source_dataset.relations():
  print_md(f"**{'.'.join(path)}** (`{relation}`)")

imdb.imdb_ijs.actors (SELECT * FROM "imdb_ijs"."actors")

imdb.imdb_ijs.directors (SELECT * FROM "imdb_ijs"."directors")

imdb.imdb_ijs.directors_genres (SELECT * FROM "imdb_ijs"."directors_genres")

imdb.imdb_ijs.movies (SELECT * FROM "imdb_ijs"."movies")

imdb.imdb_ijs.movies_directors (SELECT * FROM "imdb_ijs"."movies_directors")

imdb.imdb_ijs.movies_genres (SELECT * FROM "imdb_ijs"."movies_genres")

imdb.imdb_ijs.roles (SELECT * FROM "imdb_ijs"."roles")

Declare value ranges and unique constraints#

The more information Qrlew knows about data, the more accurate the results. Qrlew enables the user to input:

  • the range of some variables,

  • the list of possible values whan appropriate

  • and UNIQUE constraints on some values (all values are distincts)

bounded_dataset = source_dataset.imdb_ijs.movies.rank.with_range(0, 20)
bounded_dataset = bounded_dataset.imdb_ijs.actors.gender.with_possible_values(['M', 'F'])
dataset = (bounded_dataset
  .imdb_ijs.actors.id.with_unique_constraint()
  .imdb_ijs.directors.id.with_unique_constraint()
  .imdb_ijs.movies.id.with_unique_constraint())

Display a graphical representation of the Relation objects#

SQL tables and their transformations have an abstract internal representation in Qrlew: the Relation (inspired by Relations in the relational DB sense).

Transformed Relation can be a complex graph of intermediate relations so it is possible to represent them graphically in the Graphviz format.

Hide code cell source
# @title Example representation
%%svg
<svg width="448pt" height="260pt" viewBox="0.00 0.00 448.00 260.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 256)">
<title>graph_m0sq</title>
<polygon fill="transparent" stroke="transparent" points="-4,4 -4,-256 444,-256 444,4 -4,4"></polygon>
<!-- graph_m0sq -->
<g id="node1" class="node">
<title>graph_m0sq</title>
<path fill="#428e92" stroke="#000000" stroke-opacity="0.333333" d="M428,-252C428,-252 12,-252 12,-252 6,-252 0,-246 0,-240 0,-240 0,-214 0,-214 0,-208 6,-202 12,-202 12,-202 428,-202 428,-202 434,-202 440,-208 440,-214 440,-214 440,-240 440,-240 440,-246 434,-252 428,-252"></path>
<text text-anchor="start" x="59" y="-230.2" font-family="Red Hat Display,sans-serif" font-weight="bold" font-size="11.00" fill="#000000" fill-opacity="0.733333">RESULTING_RELATION size  int[min_size max_size] (size range)</text>
<text text-anchor="start" x="14" y="-219.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#000000" fill-opacity="0.733333">field_name = expression_of(input_field)  float[0. 2.] (values are in a range of float)</text>
<text text-anchor="start" x="14" y="-207.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#000000" fill-opacity="0.733333">The rows of this relation are computed based on the input Relation below</text>
</g>
<!-- graph_eu9v -->
<g id="node2" class="node">
<title>graph_eu9v</title>
<path fill="#00363a" stroke="#000000" stroke-opacity="0.333333" d="M377,-173C377,-173 63,-173 63,-173 57,-173 51,-167 51,-161 51,-161 51,-124 51,-124 51,-118 57,-112 63,-112 63,-112 377,-112 377,-112 383,-112 389,-118 389,-124 389,-124 389,-161 389,-161 389,-167 383,-173 377,-173"></path>
<text text-anchor="start" x="108.5" y="-151.2" font-family="Red Hat Display,sans-serif" font-weight="bold" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">REDUCE_RELATION size  int[min_size max_size]</text>
<text text-anchor="start" x="65" y="-140.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">input_field = avg(rank)  float[0 20] UNIQUE</text>
<text text-anchor="start" x="162.5" y="-129.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">GROUP BY (field_name)</text>
<text text-anchor="start" x="62.5" y="-117.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">This Relation will aggregate the values of the relation below.</text>
</g>
<!-- graph_m0sq&#45;&gt;graph_eu9v -->
<g id="edge2" class="edge">
<title>graph_m0sq-&gt;graph_eu9v</title>
<path fill="none" stroke="#2b303a" d="M220,-201.69C220,-195.97 220,-189.72 220,-183.5"></path>
<polygon fill="#2b303a" stroke="#2b303a" points="223.5,-183.18 220,-173.18 216.5,-183.18 223.5,-183.18"></polygon>
</g>
<!-- graph_773q -->
<g id="node3" class="node">
<title>graph_773q</title>
<path fill="#ff1744" stroke="#000000" stroke-opacity="0.333333" d="M340,-83C340,-83 100,-83 100,-83 94,-83 88,-77 88,-71 88,-71 88,-12 88,-12 88,-6 94,0 100,0 100,0 340,0 340,0 346,0 352,-6 352,-12 352,-12 352,-71 352,-71 352,-77 346,-83 340,-83"></path>
<text text-anchor="start" x="102" y="-61.2" font-family="Red Hat Display,sans-serif" font-weight="bold" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">NAME_OF_RELATION size  int{known_size}</text>
<text text-anchor="start" x="109" y="-50.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">field_name = name_in_source  data_type</text>
<text text-anchor="start" x="109.5" y="-39.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">name = name  str{list, of, possible, values}</text>
<text text-anchor="start" x="159.5" y="-28.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">year = year  int</text>
<text text-anchor="start" x="132.5" y="-17.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">rank = rank  float[min max]</text>
</g>
<!-- graph_eu9v&#45;&gt;graph_773q -->
<g id="edge1" class="edge">
<title>graph_eu9v-&gt;graph_773q</title>
<path fill="none" stroke="#2b303a" d="M220,-111.8C220,-106.03 220,-99.84 220,-93.61"></path>
<polygon fill="#2b303a" stroke="#2b303a" points="223.5,-93.17 220,-83.17 216.5,-93.17 223.5,-93.17"></polygon>
</g>
</g>
</svg>
../_images/202964d8697543806b8b752c3daede2aa65bb5336a3476502ca2c92fe3fe944b.svg

Tables are the most basic forms of Relation.

from graphviz import Source

for path, relation in source_dataset.relations():
  print_dot(relation.dot())
../_images/5e624de244d2ebc95c38b82a3d9bb9db094b21f659595afb7f0f72d3febaae49.svg../_images/f51daabe6fccf8fa0ed3c5922a40a40dff914ca160816f7e27f6c6c8176cc1b5.svg../_images/16e413cf4d02c5532366e1096e6e89f9ea85a24c41372720cd8588a5625cdf62.svg../_images/9999dc5ae7dbcc882d4e3f1a7e23d6520474116235f6f9667df59ce4007f01a6.svg../_images/dfb6b3aea724192d0db4c31664478c56a5d53069779e719a7b9092ead30ceae1.svg../_images/a0d9fa206eb8ceeaca7b0093310e6104beef6fa47f7751226b23f40d7446cde7.svg../_images/b3f99149179670d3d8e1faba39870b17a18cbad3dba5118594258ffd025b9be7.svg

Qrlew turns queries into composed Relation objects.

query = "SELECT gender, COUNT(role) as number_of_roles FROM imdb_ijs.actors JOIN imdb_ijs.roles ON roles.actor_id=actors.id GROUP BY gender"
relation = dataset.relation(query)
print_dot(relation.dot())
../_images/4a101f9cce0dca2fa9ec9f9b02fdeca7a4e96ff96250eacaf7665a1d10c67667.svg

Those representations can become complex, but they are very usefull to track each step of the computation and how Qrlew infer value ranges step by step.

Define a Privacy Unit#

To protect privacy Qrlew needs to know what is to be protected.

Each row has to be attached to an individual (Privacy Unit or PU) whose privacy will be protected.

Differential Priacy (DP) is all about bounding the influence of each PU on a result and perturbating the results so that the remaining influence is not noticeable.

# Privacy unit definition
privacy_unit = [
    # The column `id` directly defines the privacy unit
    ("actors", [], "id"),
    ("directors", [], "id"),
    # The column `director_id` refers to the column `id` of table `directors`, the `id` of which defines the privacy unit
    ("directors_genres", [("director_id", "directors", "id")], "id"),
    ("movies_directors", [("director_id", "directors", "id")], "id"),
    # The column `actor_id` refers to the column `id` of table `actors`, the `id` of which defines the privacy unit
    ("roles", [("actor_id", "actors", "id")], "id"),
]
# Privacy paramers
budget = {"epsilon": 1.0, "delta": 1e-3}

Run simple DP SQL queries#

Relation objects can be turned into DP equivalent by invoking the rewrite_with_differential_privacy method with the following arguments:

  • A dataset because the definition of the Privacy Unit may require lookups in many tables.

  • A Privacy Unit definition object.

  • An (\(\varepsilon\), \(\delta\)) budget.

# We create a basic aggregation relation
query = "SELECT gender, COUNT(*) as number_of_actors FROM imdb_ijs.actors GROUP BY gender"
relation = dataset.relation(query)
# The relation is turned into a DP equivalent
relation_with_dp_event = relation.rewrite_with_differential_privacy(
    dataset,
    privacy_unit,
    budget,
)
# Rewritten relation
dp_relation = relation_with_dp_event.relation()
dp_query = dp_relation.to_query()
print_sql(dp_query)
# Privacy loss
dpe = relation_with_dp_event.dp_event()
print(dpe)
WITH "field_d4_o" ("field_d4_o") AS (SELECT * FROM (VALUES ('F'), ('M')) AS "field_d4_o" ("field_d4_o")), "map_o7od" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "id", "first_name", "last_name", "gender") AS (SELECT MD5(CAST("id" AS TEXT)) AS "_PRIVACY_UNIT_", 1 AS "_PRIVACY_UNIT_WEIGHT_", "id" AS "id", "first_name" AS "first_name", "last_name" AS "last_name", "gender" AS "gender" FROM "imdb_ijs"."actors"), "map_q1hj" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_n9r3", "field_d4_o") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_" AS "_PRIVACY_UNIT_WEIGHT_", 1 AS "field_n9r3", "gender" AS "field_d4_o" FROM "map_o7od"), "join_j0lq" ("field_d4_o", "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_n9r3", "left__q8p1") AS (SELECT * FROM "field_d4_o" AS "_LEFT_" LEFT JOIN "map_q1hj" AS "_RIGHT_" ON ("_LEFT_"."field_d4_o") = ("_RIGHT_"."field_d4_o")), "map_j37p" ("field_d4_o", "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_n9r3") AS (SELECT "field_d4_o" AS "field_d4_o", "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_" AS "_PRIVACY_UNIT_WEIGHT_", "field_n9r3" AS "field_n9r3" FROM "join_j0lq"), "map_03mx" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_d4_o", "_ONE_field_n9r3") AS (SELECT COALESCE(CAST("_PRIVACY_UNIT_" AS TEXT), '_PRIVACY_UNIT_DEFAULT_') AS "_PRIVACY_UNIT_", COALESCE("_PRIVACY_UNIT_WEIGHT_", 0) AS "_PRIVACY_UNIT_WEIGHT_", "field_d4_o" AS "field_d4_o", CASE WHEN "field_n9r3" IS NULL THEN 0 ELSE 1 END AS "_ONE_field_n9r3" FROM "map_j37p"), "map_a6kd" ("field_1iro", "field_grep", "field_zyse") AS (SELECT "_ONE_field_n9r3" AS "field_1iro", "_PRIVACY_UNIT_" AS "field_grep", "field_d4_o" AS "field_zyse" FROM "map_03mx"), "reduce_65ak" ("_PRIVACY_UNIT_", "field_d4_o", "_NORM__ONE_field_n9r3") AS (SELECT field_grep AS "_PRIVACY_UNIT_", field_zyse AS "field_d4_o", SUM("field_1iro") AS "_NORM__ONE_field_n9r3" FROM "map_a6kd" GROUP BY "field_grep", "field_zyse"), "map_lvtk" ("_PRIVACY_UNIT_", "field_d4_o", "_NORM__ONE_field_n9r3") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "field_d4_o" AS "field_d4_o", ABS(("_NORM__ONE_field_n9r3") * ("_NORM__ONE_field_n9r3")) AS "_NORM__ONE_field_n9r3" FROM "reduce_65ak"), "map_u172" ("field_7ra5", "field_grep") AS (SELECT "_NORM__ONE_field_n9r3" AS "field_7ra5", "_PRIVACY_UNIT_" AS "field_grep" FROM "map_lvtk"), "reduce_x325" ("_PRIVACY_UNIT_", "_ONE_field_n9r3") AS (SELECT field_grep AS "_PRIVACY_UNIT_", SUM("field_7ra5") AS "_ONE_field_n9r3" FROM "map_u172" GROUP BY "field_grep"), "map_a9ps" ("_PRIVACY_UNIT_", "_ONE_field_n9r3") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", (1) / (GREATEST(1, (SQRT("_ONE_field_n9r3")) / (1))) AS "_ONE_field_n9r3" FROM "reduce_x325"), "join_ecw5" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_d4_o", "_ONE_field_n9r3", "_SCALE_FACTOR__PRIVACY_UNIT_", "_SCALE_FACTOR__ONE_field_n9r3") AS (SELECT * FROM "map_03mx" AS "_LEFT_" JOIN "map_a9ps" AS "_RIGHT_" ON ("_LEFT_"."_PRIVACY_UNIT_") = ("_RIGHT_"."_PRIVACY_UNIT_")), "map_gyzu" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_d4_o", "_ONE_field_n9r3", "_SCALE_FACTOR__PRIVACY_UNIT_", "_SCALE_FACTOR__ONE_field_n9r3", "_CLIPPED__ONE_field_n9r3") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_" AS "_PRIVACY_UNIT_WEIGHT_", "field_d4_o" AS "field_d4_o", "_ONE_field_n9r3" AS "_ONE_field_n9r3", "_SCALE_FACTOR__PRIVACY_UNIT_" AS "_SCALE_FACTOR__PRIVACY_UNIT_", "_SCALE_FACTOR__ONE_field_n9r3" AS "_SCALE_FACTOR__ONE_field_n9r3", ("_ONE_field_n9r3") * ("_SCALE_FACTOR__ONE_field_n9r3") AS "_CLIPPED__ONE_field_n9r3" FROM "join_ecw5"), "map_in4e" ("field_8d51", "field_zyse") AS (SELECT "_CLIPPED__ONE_field_n9r3" AS "field_8d51", "field_d4_o" AS "field_zyse" FROM "map_gyzu"), "reduce__03f" ("field_d4_o", "_COUNT_field_n9r3") AS (SELECT field_zyse AS "field_d4_o", SUM("field_8d51") AS "_COUNT_field_n9r3" FROM "map_in4e" GROUP BY "field_zyse"), "map_51tw" ("field_d4_o", "_COUNT_field_n9r3") AS (SELECT "field_d4_o" AS "field_d4_o", LEAST(817718, GREATEST(0, (COALESCE("_COUNT_field_n9r3", 0)) + ((3.776479532659047) * ((SQRT((-2) * (LN(RANDOM())))) * (COS((6.283185307179586) * (RANDOM()))))))) AS "_COUNT_field_n9r3" FROM "reduce__03f"), "map_zoir" ("field_i4xy", "field_d4_o") AS (SELECT CAST("_COUNT_field_n9r3" AS INTEGER) AS "field_i4xy", "field_d4_o" AS "field_d4_o" FROM "map_51tw"), "map_a33c" ("field_d4_o", "field_i4xy") AS (SELECT "field_d4_o" AS "field_d4_o", "field_i4xy" AS "field_i4xy" FROM "map_zoir"), "map_fshk" ("gender", "number_of_actors") AS (SELECT "field_d4_o" AS "gender", "field_i4xy" AS "number_of_actors" FROM "map_a33c") SELECT * FROM "map_fshk"
DpEvent(Gaussian { noise_multiplier: 3.776479532659047 })
import pandas as pd

true_res = pd.read_sql(query, database.engine())
dp_res = pd.read_sql(dp_query, database.engine())

print_result('Result', true_res.sort_values('gender'))
print_result('DP Result', dp_res.sort_values('gender'))

Result

gender number_of_actors
0 F 304412
1 M 513306

DP Result

gender number_of_actors
1 F 304408
0 M 513310

Run DP SQL queries with non-trivial Privacy Unit multiplicity#

When multiple rows belong to the same privacy unit. Bounding the influence of one privacy unit can be tricky. Qrlew evaluates the worst case multiplicity of rows per privacy unit.

It can also be passed as an argument.

# We create a basic aggregation relation
query = "SELECT gender, COUNT(*) as total_number_of_roles FROM imdb_ijs.actors JOIN imdb_ijs.roles ON roles.actor_id=actors.id GROUP BY gender"
relation = dataset.relation(query)
# The relation is turned into a DP equivalent
relation_with_dp_event = relation.rewrite_with_differential_privacy(
    dataset,
    privacy_unit,
    budget,
)
# Rewritten relation
dp_relation = relation_with_dp_event.relation()
dp_query = dp_relation.to_query()
print_sql(dp_query)
# Privacy loss
dpe = relation_with_dp_event.dp_event()
print(dpe)
WITH "field_8dt4" ("field_8dt4") AS (SELECT * FROM (VALUES ('F'), ('M')) AS "field_8dt4" ("field_8dt4")), "map_o7od" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "id", "first_name", "last_name", "gender") AS (SELECT MD5(CAST("id" AS TEXT)) AS "_PRIVACY_UNIT_", 1 AS "_PRIVACY_UNIT_WEIGHT_", "id" AS "id", "first_name" AS "first_name", "last_name" AS "last_name", "gender" AS "gender" FROM "imdb_ijs"."actors"), "join_5pxq" ("field_6w80", "field_adyv", "field_mg8l", "field_ey3i", "field_rvp5", "field_kwgs", "field_cenr") AS (SELECT * FROM "imdb_ijs"."actors" AS "_LEFT_" JOIN "imdb_ijs"."roles" AS "_RIGHT_" ON ("_RIGHT_"."actor_id") = ("_LEFT_"."id")), "map_azcr" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "actor_id", "movie_id", "role") AS (SELECT MD5(CAST("field_6w80" AS TEXT)) AS "_PRIVACY_UNIT_", 1 AS "_PRIVACY_UNIT_WEIGHT_", "field_rvp5" AS "actor_id", "field_kwgs" AS "movie_id", "field_cenr" AS "role" FROM "join_5pxq"), "join_2qan" ("_LEFT_PRIVACY_UNIT_", "_LEFT_PRIVACY_UNIT_WEIGHT_", "field_6w80", "field_adyv", "field_mg8l", "field_ey3i", "_RIGHT_PRIVACY_UNIT_", "_RIGHT_PRIVACY_UNIT_WEIGHT_", "field_rvp5", "field_kwgs", "field_cenr") AS (SELECT * FROM "map_o7od" AS "_LEFT_" JOIN "map_azcr" AS "_RIGHT_" ON (("_LEFT_"."_PRIVACY_UNIT_") = ("_RIGHT_"."_PRIVACY_UNIT_")) AND (("_RIGHT_"."actor_id") = ("_LEFT_"."id"))), "map_pc5d" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_adyv", "field_ey3i", "field_6w80", "field_mg8l", "field_rvp5", "field_kwgs", "field_cenr") AS (SELECT "_LEFT_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", ("_LEFT_PRIVACY_UNIT_WEIGHT_") * ("_RIGHT_PRIVACY_UNIT_WEIGHT_") AS "_PRIVACY_UNIT_WEIGHT_", "field_adyv" AS "field_adyv", "field_ey3i" AS "field_ey3i", "field_6w80" AS "field_6w80", "field_mg8l" AS "field_mg8l", "field_rvp5" AS "field_rvp5", "field_kwgs" AS "field_kwgs", "field_cenr" AS "field_cenr" FROM "join_2qan"), "map_jyze" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_n9r3", "field_8dt4") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_" AS "_PRIVACY_UNIT_WEIGHT_", 1 AS "field_n9r3", "field_ey3i" AS "field_8dt4" FROM "map_pc5d"), "join_jvsu" ("field_8dt4", "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_n9r3", "left__ybda") AS (SELECT * FROM "field_8dt4" AS "_LEFT_" LEFT JOIN "map_jyze" AS "_RIGHT_" ON ("_LEFT_"."field_8dt4") = ("_RIGHT_"."field_8dt4")), "map_qpyp" ("field_8dt4", "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_n9r3") AS (SELECT "field_8dt4" AS "field_8dt4", "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_" AS "_PRIVACY_UNIT_WEIGHT_", "field_n9r3" AS "field_n9r3" FROM "join_jvsu"), "map_pt6d" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_8dt4", "_ONE_field_n9r3") AS (SELECT COALESCE(CAST("_PRIVACY_UNIT_" AS TEXT), '_PRIVACY_UNIT_DEFAULT_') AS "_PRIVACY_UNIT_", COALESCE("_PRIVACY_UNIT_WEIGHT_", 0) AS "_PRIVACY_UNIT_WEIGHT_", "field_8dt4" AS "field_8dt4", CASE WHEN "field_n9r3" IS NULL THEN 0 ELSE 1 END AS "_ONE_field_n9r3" FROM "map_qpyp"), "map_any6" ("field_1iro", "field_grep", "field_kggk") AS (SELECT "_ONE_field_n9r3" AS "field_1iro", "_PRIVACY_UNIT_" AS "field_grep", "field_8dt4" AS "field_kggk" FROM "map_pt6d"), "reduce_x8hs" ("_PRIVACY_UNIT_", "field_8dt4", "_NORM__ONE_field_n9r3") AS (SELECT field_grep AS "_PRIVACY_UNIT_", field_kggk AS "field_8dt4", SUM("field_1iro") AS "_NORM__ONE_field_n9r3" FROM "map_any6" GROUP BY "field_grep", "field_kggk"), "map_1too" ("_PRIVACY_UNIT_", "field_8dt4", "_NORM__ONE_field_n9r3") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "field_8dt4" AS "field_8dt4", ABS(("_NORM__ONE_field_n9r3") * ("_NORM__ONE_field_n9r3")) AS "_NORM__ONE_field_n9r3" FROM "reduce_x8hs"), "map_xd2x" ("field_7ra5", "field_grep") AS (SELECT "_NORM__ONE_field_n9r3" AS "field_7ra5", "_PRIVACY_UNIT_" AS "field_grep" FROM "map_1too"), "reduce_vvwi" ("_PRIVACY_UNIT_", "_ONE_field_n9r3") AS (SELECT field_grep AS "_PRIVACY_UNIT_", SUM("field_7ra5") AS "_ONE_field_n9r3" FROM "map_xd2x" GROUP BY "field_grep"), "map_hidt" ("_PRIVACY_UNIT_", "_ONE_field_n9r3") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", (1) / (GREATEST(1, (SQRT("_ONE_field_n9r3")) / (100))) AS "_ONE_field_n9r3" FROM "reduce_vvwi"), "join_ax3z" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_8dt4", "_ONE_field_n9r3", "_SCALE_FACTOR__PRIVACY_UNIT_", "_SCALE_FACTOR__ONE_field_n9r3") AS (SELECT * FROM "map_pt6d" AS "_LEFT_" JOIN "map_hidt" AS "_RIGHT_" ON ("_LEFT_"."_PRIVACY_UNIT_") = ("_RIGHT_"."_PRIVACY_UNIT_")), "map_cra4" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_8dt4", "_ONE_field_n9r3", "_SCALE_FACTOR__PRIVACY_UNIT_", "_SCALE_FACTOR__ONE_field_n9r3", "_CLIPPED__ONE_field_n9r3") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_" AS "_PRIVACY_UNIT_WEIGHT_", "field_8dt4" AS "field_8dt4", "_ONE_field_n9r3" AS "_ONE_field_n9r3", "_SCALE_FACTOR__PRIVACY_UNIT_" AS "_SCALE_FACTOR__PRIVACY_UNIT_", "_SCALE_FACTOR__ONE_field_n9r3" AS "_SCALE_FACTOR__ONE_field_n9r3", ("_ONE_field_n9r3") * ("_SCALE_FACTOR__ONE_field_n9r3") AS "_CLIPPED__ONE_field_n9r3" FROM "join_ax3z"), "map_wg5e" ("field_8d51", "field_kggk") AS (SELECT "_CLIPPED__ONE_field_n9r3" AS "field_8d51", "field_8dt4" AS "field_kggk" FROM "map_cra4"), "reduce_kn50" ("field_8dt4", "_COUNT_field_n9r3") AS (SELECT field_kggk AS "field_8dt4", SUM("field_8d51") AS "_COUNT_field_n9r3" FROM "map_wg5e" GROUP BY "field_kggk"), "map_s9if" ("field_8dt4", "_COUNT_field_n9r3") AS (SELECT "field_8dt4" AS "field_8dt4", LEAST(3431966, GREATEST(0, (COALESCE("_COUNT_field_n9r3", 0)) + ((377.6479532659047) * ((SQRT((-2) * (LN(RANDOM())))) * (COS((6.283185307179586) * (RANDOM()))))))) AS "_COUNT_field_n9r3" FROM "reduce_kn50"), "map_687m" ("field_i4xy", "field_8dt4") AS (SELECT CAST("_COUNT_field_n9r3" AS INTEGER) AS "field_i4xy", "field_8dt4" AS "field_8dt4" FROM "map_s9if"), "map_0efo" ("field_8dt4", "field_i4xy") AS (SELECT "field_8dt4" AS "field_8dt4", "field_i4xy" AS "field_i4xy" FROM "map_687m"), "map_4gye" ("gender", "total_number_of_roles") AS (SELECT "field_8dt4" AS "gender", "field_i4xy" AS "total_number_of_roles" FROM "map_0efo") SELECT * FROM "map_4gye"
DpEvent(Gaussian { noise_multiplier: 377.6479532659047 })
print_dot(dp_relation.dot())
../_images/8c3f51e3a768df6232e4b6f66875fdbca0f8e694a2f1d65f36bc590f25317c11.svg
import pandas as pd

true_res = pd.read_sql(query, database.engine())
dp_res = pd.read_sql(dp_query, database.engine())

print_result('Result', true_res.sort_values('gender'))
print_result('DP Result', dp_res.sort_values('gender'))

Result

gender total_number_of_roles
0 F 1135174
1 M 2296792

DP Result

gender total_number_of_roles
1 F 1120108
0 M 2186684

In this example Qrlew made assumptions on the number of roles per actor to clip the influence of one actor. This is tunable. We can for example set a max_multiplicity parameter of 1.

# The relation is turned into a DP equivalent
relation_with_dp_event = relation.rewrite_with_differential_privacy(
    dataset,
    privacy_unit,
    budget,
    max_multiplicity=1,
)
# Rewritten relation
dp_relation = relation_with_dp_event.relation()
dp_query = dp_relation.to_query()
dp_res = pd.read_sql(dp_query, database.engine())

print_result('Result', true_res.sort_values('gender'))
print_result('DP Result', dp_res.sort_values('gender'))

Result

gender total_number_of_roles
0 F 1135174
1 M 2296792

DP Result

gender total_number_of_roles
1 F 304411
0 M 513306

The result will have litle noise, but we see it is very biased. We actually count the number of actor instead of the number of role.

We can now set a max_multiplicity parameter of 100000 and a max_multiplicity_share of 100% of estimated size of the dataset as qrlew will use the smallest of both.

# The relation is turned into a DP equivalent
relation_with_dp_event = relation.rewrite_with_differential_privacy(
    dataset,
    privacy_unit,
    budget,
    max_multiplicity=100000,
    max_multiplicity_share=1.,
)
# Rewritten relation
dp_relation = relation_with_dp_event.relation()
dp_query = dp_relation.to_query()
dp_res = pd.read_sql(dp_query, database.engine())

print_result('Result', true_res.sort_values('gender'))
print_result('DP Result', dp_res.sort_values('gender'))

Result

gender total_number_of_roles
0 F 1135174
1 M 2296792

DP Result

gender total_number_of_roles
1 F 1251391
0 M 1988560

Now the value will be unbiased, but very noisy.

Taking a value of 1000 yields good results here.

The defaults is max_multiplicity=100.0 and max_multiplicity_share=0.1.

# The relation is turned into a DP equivalent
relation_with_dp_event = relation.rewrite_with_differential_privacy(
    dataset,
    privacy_unit,
    budget,
    max_multiplicity=1000,
    max_multiplicity_share=1.,
)
# Rewritten relation
dp_relation = relation_with_dp_event.relation()
dp_query = dp_relation.to_query()
dp_res = pd.read_sql(dp_query, database.engine())

print_result('Result', true_res.sort_values('gender'))
print_result('DP Result', dp_res.sort_values('gender'))

Result

gender total_number_of_roles
0 F 1135174
1 M 2296792

DP Result

gender total_number_of_roles
1 F 1140708
0 M 2299726

Qrlew supports several SQL dialects: PostgreSQL (default), MSSQL or BigQuery.#

from pyqrlew import Dialect

# We create a basic relation
query = "SELECT LN(1+rank) as log_rank FROM imdb_ijs.movies GROUP BY gender"
relation = dataset.relation(query)
print_md('### Relation rendered as PostgreSQL')
query = relation.to_query()
print_sql(query)

Relation rendered as PostgreSQL

WITH "reduce_9y_x" ("field_07kk") AS (SELECT rank AS "field_07kk" FROM "imdb_ijs"."movies" GROUP BY "gender"), "map_ucb9" ("log_rank") AS (SELECT LN((1) + ("field_07kk")) AS "log_rank" FROM "reduce_9y_x") SELECT * FROM "map_ucb9"
print_md('### Relation rendered as BigQuery')
query = relation.to_query(dialect=Dialect.BigQuery)
print_sql(query)

Relation rendered as BigQuery

WITH `reduce_9y_x` AS (SELECT rank AS `field_07kk` FROM `imdb_ijs`.`movies` GROUP BY `gender`), `map_ucb9` AS (SELECT LN((1) + (`field_07kk`)) AS `log_rank` FROM `reduce_9y_x`) SELECT * FROM `map_ucb9`
print_md('### Relation rendered as MsSql')
query = relation.to_query(dialect=Dialect.MsSql)
print_sql(query)

Relation rendered as MsSql

WITH "reduce_9y_x" ("field_07kk") AS (SELECT rank AS "field_07kk" FROM "imdb_ijs"."movies" GROUP BY "gender"), "map_ucb9" ("log_rank") AS (SELECT LOG((1) + ("field_07kk")) AS "log_rank" FROM "reduce_9y_x") SELECT * FROM "map_ucb9"

Note how the quoting varies and how LN is converted into LOG for MSSQL.

Not only Qrlew can render relation ito many dialects it can also parse different dialects.

# We create a basic relation
query = """SELECT LN(1+rank) as "log rank" FROM imdb_ijs.movies GROUP BY gender"""
relation = dataset.relation(query, dialect=Dialect.PostgreSql)
print_dot(relation.dot())
../_images/42931158767dcf143b4d3ea3e20a892a9d7fa239e591768d2cf769f11713669d.svg
# We create a basic relation in BigQuery dialect
query = """SELECT LN(1+rank) as `log rank` FROM imdb_ijs.movies GROUP BY gender"""
relation = dataset.relation(query, dialect=Dialect.BigQuery)
print_dot(relation.dot())
../_images/42931158767dcf143b4d3ea3e20a892a9d7fa239e591768d2cf769f11713669d.svg
# We create a basic relation in MSSQL dialect
query = """SELECT LOG(1+rank) as "log rank" FROM imdb_ijs.movies GROUP BY gender"""
relation = dataset.relation(query, dialect=Dialect.MsSql)
print_dot(relation.dot())
../_images/42931158767dcf143b4d3ea3e20a892a9d7fa239e591768d2cf769f11713669d.svg

They all return the same abstract Relation object