User Guide#
Qrlew is an open source library that can turn SQL queries into their differentially private (DP) equivalent.
It takes SQL – the universal language of small and big data analytics – as input, so there is no new language or API to learn.
It returns DP SQL queries that can be executed at scale on any SQL datastore.
The only required extra inputs are the privacy unit specification and privacy budget \((\varepsilon, \delta)\).
With Qrlew a data owner can receive SQL queries from an untrusted data analyst, turn them into DP equivalent, execute them and safely return the result to the data analyst.
This guide will cover several useful notions from Qrlew:
We will define a
Dataset
from a database connection.We will declare value ranges and unique constraints
We will define a privacy unit.
And we will run simple DP SQL queries.
We will also explore more advanced features:
The support of several SQL dialects: PostgreSQL (default), MSSQL or BigQuery.
Privacy accounting compatibility with Google’s popular Differential Privacy accounting: dp_accounting.
The use of Synthetic Data to further improve the quality of the results.
Define a Dataset#
Let’s define a Dataset
object from a SQL connection.
We will first run a local postgresql database. You can connect to any Postgresql, MSSQL or BigQuery database.
%%capture
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql
!sudo service postgresql start
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'password'"
!pip install psycopg2
!pip install pyqrlew
# A few utility functions
from IPython.display import Markdown
def print_sql(query):
display(Markdown(f"""```sql
{query}
```"""))
def print_result(title, result):
display(Markdown(f'### {title}'))
display(result)
def print_dot(dot_graph):
display(Source(dot_graph))
def print_md(md):
display(Markdown(md))
Then we load a demo dataset.
from pyqrlew.io import PostgreSQL
database = PostgreSQL('postgres', 'postgres', 'password', 5432)
database.load_imdb()
engine = database.engine()
We can then build a Dataset
object that can be used to parse queries into Relation
objects that can be transformed into differentially private Relation
s and rendered into SQL queries.
from pyqrlew import Dataset
source_dataset = Dataset.from_database('imdb', engine, 'imdb_ijs')
for path, relation in source_dataset.relations():
print_md(f"**{'.'.join(path)}** (`{relation}`)")
imdb.imdb_ijs.actors (SELECT * FROM "imdb_ijs"."actors"
)
imdb.imdb_ijs.directors (SELECT * FROM "imdb_ijs"."directors"
)
imdb.imdb_ijs.directors_genres (SELECT * FROM "imdb_ijs"."directors_genres"
)
imdb.imdb_ijs.movies (SELECT * FROM "imdb_ijs"."movies"
)
imdb.imdb_ijs.movies_directors (SELECT * FROM "imdb_ijs"."movies_directors"
)
imdb.imdb_ijs.movies_genres (SELECT * FROM "imdb_ijs"."movies_genres"
)
imdb.imdb_ijs.roles (SELECT * FROM "imdb_ijs"."roles"
)
Declare value ranges and unique constraints#
The more information Qrlew knows about data, the more accurate the results. Qrlew enables the user to input:
the range of some variables,
the list of possible values whan appropriate
and
UNIQUE
constraints on some values (all values are distincts)
bounded_dataset = source_dataset.imdb_ijs.movies.rank.with_range(0, 20)
bounded_dataset = bounded_dataset.imdb_ijs.actors.gender.with_possible_values(['M', 'F'])
dataset = (bounded_dataset
.imdb_ijs.actors.id.with_unique_constraint()
.imdb_ijs.directors.id.with_unique_constraint()
.imdb_ijs.movies.id.with_unique_constraint())
Display a graphical representation of the Relation
objects#
SQL tables and their transformations have an abstract internal representation in Qrlew: the Relation
(inspired by Relations in the relational DB sense).
Transformed Relation
can be a complex graph of intermediate relations so it is possible to represent them graphically in the Graphviz format.
Show code cell source
# @title Example representation
%%svg
<svg width="448pt" height="260pt" viewBox="0.00 0.00 448.00 260.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 256)">
<title>graph_m0sq</title>
<polygon fill="transparent" stroke="transparent" points="-4,4 -4,-256 444,-256 444,4 -4,4"></polygon>
<!-- graph_m0sq -->
<g id="node1" class="node">
<title>graph_m0sq</title>
<path fill="#428e92" stroke="#000000" stroke-opacity="0.333333" d="M428,-252C428,-252 12,-252 12,-252 6,-252 0,-246 0,-240 0,-240 0,-214 0,-214 0,-208 6,-202 12,-202 12,-202 428,-202 428,-202 434,-202 440,-208 440,-214 440,-214 440,-240 440,-240 440,-246 434,-252 428,-252"></path>
<text text-anchor="start" x="59" y="-230.2" font-family="Red Hat Display,sans-serif" font-weight="bold" font-size="11.00" fill="#000000" fill-opacity="0.733333">RESULTING_RELATION size ∈ int[min_size max_size] (size range)</text>
<text text-anchor="start" x="14" y="-219.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#000000" fill-opacity="0.733333">field_name = expression_of(input_field) ∈ float[0. 2.] (values are in a range of float)</text>
<text text-anchor="start" x="14" y="-207.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#000000" fill-opacity="0.733333">The rows of this relation are computed based on the input Relation below</text>
</g>
<!-- graph_eu9v -->
<g id="node2" class="node">
<title>graph_eu9v</title>
<path fill="#00363a" stroke="#000000" stroke-opacity="0.333333" d="M377,-173C377,-173 63,-173 63,-173 57,-173 51,-167 51,-161 51,-161 51,-124 51,-124 51,-118 57,-112 63,-112 63,-112 377,-112 377,-112 383,-112 389,-118 389,-124 389,-124 389,-161 389,-161 389,-167 383,-173 377,-173"></path>
<text text-anchor="start" x="108.5" y="-151.2" font-family="Red Hat Display,sans-serif" font-weight="bold" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">REDUCE_RELATION size ∈ int[min_size max_size]</text>
<text text-anchor="start" x="65" y="-140.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">input_field = avg(rank) ∈ float[0 20] UNIQUE</text>
<text text-anchor="start" x="162.5" y="-129.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">GROUP BY (field_name)</text>
<text text-anchor="start" x="62.5" y="-117.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">This Relation will aggregate the values of the relation below.</text>
</g>
<!-- graph_m0sq->graph_eu9v -->
<g id="edge2" class="edge">
<title>graph_m0sq->graph_eu9v</title>
<path fill="none" stroke="#2b303a" d="M220,-201.69C220,-195.97 220,-189.72 220,-183.5"></path>
<polygon fill="#2b303a" stroke="#2b303a" points="223.5,-183.18 220,-173.18 216.5,-183.18 223.5,-183.18"></polygon>
</g>
<!-- graph_773q -->
<g id="node3" class="node">
<title>graph_773q</title>
<path fill="#ff1744" stroke="#000000" stroke-opacity="0.333333" d="M340,-83C340,-83 100,-83 100,-83 94,-83 88,-77 88,-71 88,-71 88,-12 88,-12 88,-6 94,0 100,0 100,0 340,0 340,0 346,0 352,-6 352,-12 352,-12 352,-71 352,-71 352,-77 346,-83 340,-83"></path>
<text text-anchor="start" x="102" y="-61.2" font-family="Red Hat Display,sans-serif" font-weight="bold" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">NAME_OF_RELATION size ∈ int{known_size}</text>
<text text-anchor="start" x="109" y="-50.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">field_name = name_in_source ∈ data_type</text>
<text text-anchor="start" x="109.5" y="-39.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">name = name ∈ str{list, of, possible, values}</text>
<text text-anchor="start" x="159.5" y="-28.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">year = year ∈ int</text>
<text text-anchor="start" x="132.5" y="-17.2" font-family="Red Hat Display,sans-serif" font-size="11.00" fill="#ffffff" fill-opacity="0.733333">rank = rank ∈ float[min max]</text>
</g>
<!-- graph_eu9v->graph_773q -->
<g id="edge1" class="edge">
<title>graph_eu9v->graph_773q</title>
<path fill="none" stroke="#2b303a" d="M220,-111.8C220,-106.03 220,-99.84 220,-93.61"></path>
<polygon fill="#2b303a" stroke="#2b303a" points="223.5,-93.17 220,-83.17 216.5,-93.17 223.5,-93.17"></polygon>
</g>
</g>
</svg>
Tables are the most basic forms of Relation
.
from graphviz import Source
for path, relation in source_dataset.relations():
print_dot(relation.dot())
Qrlew turns queries into composed Relation
objects.
query = "SELECT gender, COUNT(role) as number_of_roles FROM imdb_ijs.actors JOIN imdb_ijs.roles ON roles.actor_id=actors.id GROUP BY gender"
relation = dataset.relation(query)
print_dot(relation.dot())
Those representations can become complex, but they are very usefull to track each step of the computation and how Qrlew infer value ranges step by step.
Define a Privacy Unit#
To protect privacy Qrlew needs to know what is to be protected.
Each row has to be attached to an individual (Privacy Unit or PU) whose privacy will be protected.
Differential Priacy (DP) is all about bounding the influence of each PU on a result and perturbating the results so that the remaining influence is not noticeable.
# Privacy unit definition
privacy_unit = [
# The column `id` directly defines the privacy unit
("actors", [], "id"),
("directors", [], "id"),
# The column `director_id` refers to the column `id` of table `directors`, the `id` of which defines the privacy unit
("directors_genres", [("director_id", "directors", "id")], "id"),
("movies_directors", [("director_id", "directors", "id")], "id"),
# The column `actor_id` refers to the column `id` of table `actors`, the `id` of which defines the privacy unit
("roles", [("actor_id", "actors", "id")], "id"),
]
# Privacy paramers
budget = {"epsilon": 1.0, "delta": 1e-3}
Run simple DP SQL queries#
Relation
objects can be turned into DP equivalent by invoking the rewrite_with_differential_privacy
method with the following arguments:
A
dataset
because the definition of the Privacy Unit may require lookups in many tables.A Privacy Unit definition object.
An (\(\varepsilon\), \(\delta\)) budget.
# We create a basic aggregation relation
query = "SELECT gender, COUNT(*) as number_of_actors FROM imdb_ijs.actors GROUP BY gender"
relation = dataset.relation(query)
# The relation is turned into a DP equivalent
relation_with_dp_event = relation.rewrite_with_differential_privacy(
dataset,
privacy_unit,
budget,
)
# Rewritten relation
dp_relation = relation_with_dp_event.relation()
dp_query = dp_relation.to_query()
print_sql(dp_query)
# Privacy loss
dpe = relation_with_dp_event.dp_event()
print(dpe)
WITH "field_d4_o" ("field_d4_o") AS (SELECT * FROM (VALUES ('F'), ('M')) AS "field_d4_o" ("field_d4_o")), "map_o7od" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "id", "first_name", "last_name", "gender") AS (SELECT MD5(CAST("id" AS TEXT)) AS "_PRIVACY_UNIT_", 1 AS "_PRIVACY_UNIT_WEIGHT_", "id" AS "id", "first_name" AS "first_name", "last_name" AS "last_name", "gender" AS "gender" FROM "imdb_ijs"."actors"), "map_q1hj" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_n9r3", "field_d4_o") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_" AS "_PRIVACY_UNIT_WEIGHT_", 1 AS "field_n9r3", "gender" AS "field_d4_o" FROM "map_o7od"), "join_j0lq" ("field_d4_o", "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_n9r3", "left__q8p1") AS (SELECT * FROM "field_d4_o" AS "_LEFT_" LEFT JOIN "map_q1hj" AS "_RIGHT_" ON ("_LEFT_"."field_d4_o") = ("_RIGHT_"."field_d4_o")), "map_j37p" ("field_d4_o", "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_n9r3") AS (SELECT "field_d4_o" AS "field_d4_o", "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_" AS "_PRIVACY_UNIT_WEIGHT_", "field_n9r3" AS "field_n9r3" FROM "join_j0lq"), "map_03mx" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_d4_o", "_ONE_field_n9r3") AS (SELECT COALESCE(CAST("_PRIVACY_UNIT_" AS TEXT), '_PRIVACY_UNIT_DEFAULT_') AS "_PRIVACY_UNIT_", COALESCE("_PRIVACY_UNIT_WEIGHT_", 0) AS "_PRIVACY_UNIT_WEIGHT_", "field_d4_o" AS "field_d4_o", CASE WHEN "field_n9r3" IS NULL THEN 0 ELSE 1 END AS "_ONE_field_n9r3" FROM "map_j37p"), "map_a6kd" ("field_1iro", "field_grep", "field_zyse") AS (SELECT "_ONE_field_n9r3" AS "field_1iro", "_PRIVACY_UNIT_" AS "field_grep", "field_d4_o" AS "field_zyse" FROM "map_03mx"), "reduce_65ak" ("_PRIVACY_UNIT_", "field_d4_o", "_NORM__ONE_field_n9r3") AS (SELECT field_grep AS "_PRIVACY_UNIT_", field_zyse AS "field_d4_o", SUM("field_1iro") AS "_NORM__ONE_field_n9r3" FROM "map_a6kd" GROUP BY "field_grep", "field_zyse"), "map_lvtk" ("_PRIVACY_UNIT_", "field_d4_o", "_NORM__ONE_field_n9r3") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "field_d4_o" AS "field_d4_o", ABS(("_NORM__ONE_field_n9r3") * ("_NORM__ONE_field_n9r3")) AS "_NORM__ONE_field_n9r3" FROM "reduce_65ak"), "map_u172" ("field_7ra5", "field_grep") AS (SELECT "_NORM__ONE_field_n9r3" AS "field_7ra5", "_PRIVACY_UNIT_" AS "field_grep" FROM "map_lvtk"), "reduce_x325" ("_PRIVACY_UNIT_", "_ONE_field_n9r3") AS (SELECT field_grep AS "_PRIVACY_UNIT_", SUM("field_7ra5") AS "_ONE_field_n9r3" FROM "map_u172" GROUP BY "field_grep"), "map_a9ps" ("_PRIVACY_UNIT_", "_ONE_field_n9r3") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", (1) / (GREATEST(1, (SQRT("_ONE_field_n9r3")) / (1))) AS "_ONE_field_n9r3" FROM "reduce_x325"), "join_ecw5" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_d4_o", "_ONE_field_n9r3", "_SCALE_FACTOR__PRIVACY_UNIT_", "_SCALE_FACTOR__ONE_field_n9r3") AS (SELECT * FROM "map_03mx" AS "_LEFT_" JOIN "map_a9ps" AS "_RIGHT_" ON ("_LEFT_"."_PRIVACY_UNIT_") = ("_RIGHT_"."_PRIVACY_UNIT_")), "map_gyzu" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_d4_o", "_ONE_field_n9r3", "_SCALE_FACTOR__PRIVACY_UNIT_", "_SCALE_FACTOR__ONE_field_n9r3", "_CLIPPED__ONE_field_n9r3") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_" AS "_PRIVACY_UNIT_WEIGHT_", "field_d4_o" AS "field_d4_o", "_ONE_field_n9r3" AS "_ONE_field_n9r3", "_SCALE_FACTOR__PRIVACY_UNIT_" AS "_SCALE_FACTOR__PRIVACY_UNIT_", "_SCALE_FACTOR__ONE_field_n9r3" AS "_SCALE_FACTOR__ONE_field_n9r3", ("_ONE_field_n9r3") * ("_SCALE_FACTOR__ONE_field_n9r3") AS "_CLIPPED__ONE_field_n9r3" FROM "join_ecw5"), "map_in4e" ("field_8d51", "field_zyse") AS (SELECT "_CLIPPED__ONE_field_n9r3" AS "field_8d51", "field_d4_o" AS "field_zyse" FROM "map_gyzu"), "reduce__03f" ("field_d4_o", "_COUNT_field_n9r3") AS (SELECT field_zyse AS "field_d4_o", SUM("field_8d51") AS "_COUNT_field_n9r3" FROM "map_in4e" GROUP BY "field_zyse"), "map_51tw" ("field_d4_o", "_COUNT_field_n9r3") AS (SELECT "field_d4_o" AS "field_d4_o", LEAST(817718, GREATEST(0, (COALESCE("_COUNT_field_n9r3", 0)) + ((3.776479532659047) * ((SQRT((-2) * (LN(RANDOM())))) * (COS((6.283185307179586) * (RANDOM()))))))) AS "_COUNT_field_n9r3" FROM "reduce__03f"), "map_zoir" ("field_i4xy", "field_d4_o") AS (SELECT CAST("_COUNT_field_n9r3" AS INTEGER) AS "field_i4xy", "field_d4_o" AS "field_d4_o" FROM "map_51tw"), "map_a33c" ("field_d4_o", "field_i4xy") AS (SELECT "field_d4_o" AS "field_d4_o", "field_i4xy" AS "field_i4xy" FROM "map_zoir"), "map_fshk" ("gender", "number_of_actors") AS (SELECT "field_d4_o" AS "gender", "field_i4xy" AS "number_of_actors" FROM "map_a33c") SELECT * FROM "map_fshk"
DpEvent(Gaussian { noise_multiplier: 3.776479532659047 })
import pandas as pd
true_res = pd.read_sql(query, database.engine())
dp_res = pd.read_sql(dp_query, database.engine())
print_result('Result', true_res.sort_values('gender'))
print_result('DP Result', dp_res.sort_values('gender'))
Result
gender | number_of_actors | |
---|---|---|
0 | F | 304412 |
1 | M | 513306 |
DP Result
gender | number_of_actors | |
---|---|---|
1 | F | 304408 |
0 | M | 513310 |
Run DP SQL queries with non-trivial Privacy Unit multiplicity#
When multiple rows belong to the same privacy unit. Bounding the influence of one privacy unit can be tricky. Qrlew evaluates the worst case multiplicity of rows per privacy unit.
It can also be passed as an argument.
# We create a basic aggregation relation
query = "SELECT gender, COUNT(*) as total_number_of_roles FROM imdb_ijs.actors JOIN imdb_ijs.roles ON roles.actor_id=actors.id GROUP BY gender"
relation = dataset.relation(query)
# The relation is turned into a DP equivalent
relation_with_dp_event = relation.rewrite_with_differential_privacy(
dataset,
privacy_unit,
budget,
)
# Rewritten relation
dp_relation = relation_with_dp_event.relation()
dp_query = dp_relation.to_query()
print_sql(dp_query)
# Privacy loss
dpe = relation_with_dp_event.dp_event()
print(dpe)
WITH "field_8dt4" ("field_8dt4") AS (SELECT * FROM (VALUES ('F'), ('M')) AS "field_8dt4" ("field_8dt4")), "map_o7od" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "id", "first_name", "last_name", "gender") AS (SELECT MD5(CAST("id" AS TEXT)) AS "_PRIVACY_UNIT_", 1 AS "_PRIVACY_UNIT_WEIGHT_", "id" AS "id", "first_name" AS "first_name", "last_name" AS "last_name", "gender" AS "gender" FROM "imdb_ijs"."actors"), "join_5pxq" ("field_6w80", "field_adyv", "field_mg8l", "field_ey3i", "field_rvp5", "field_kwgs", "field_cenr") AS (SELECT * FROM "imdb_ijs"."actors" AS "_LEFT_" JOIN "imdb_ijs"."roles" AS "_RIGHT_" ON ("_RIGHT_"."actor_id") = ("_LEFT_"."id")), "map_azcr" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "actor_id", "movie_id", "role") AS (SELECT MD5(CAST("field_6w80" AS TEXT)) AS "_PRIVACY_UNIT_", 1 AS "_PRIVACY_UNIT_WEIGHT_", "field_rvp5" AS "actor_id", "field_kwgs" AS "movie_id", "field_cenr" AS "role" FROM "join_5pxq"), "join_2qan" ("_LEFT_PRIVACY_UNIT_", "_LEFT_PRIVACY_UNIT_WEIGHT_", "field_6w80", "field_adyv", "field_mg8l", "field_ey3i", "_RIGHT_PRIVACY_UNIT_", "_RIGHT_PRIVACY_UNIT_WEIGHT_", "field_rvp5", "field_kwgs", "field_cenr") AS (SELECT * FROM "map_o7od" AS "_LEFT_" JOIN "map_azcr" AS "_RIGHT_" ON (("_LEFT_"."_PRIVACY_UNIT_") = ("_RIGHT_"."_PRIVACY_UNIT_")) AND (("_RIGHT_"."actor_id") = ("_LEFT_"."id"))), "map_pc5d" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_adyv", "field_ey3i", "field_6w80", "field_mg8l", "field_rvp5", "field_kwgs", "field_cenr") AS (SELECT "_LEFT_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", ("_LEFT_PRIVACY_UNIT_WEIGHT_") * ("_RIGHT_PRIVACY_UNIT_WEIGHT_") AS "_PRIVACY_UNIT_WEIGHT_", "field_adyv" AS "field_adyv", "field_ey3i" AS "field_ey3i", "field_6w80" AS "field_6w80", "field_mg8l" AS "field_mg8l", "field_rvp5" AS "field_rvp5", "field_kwgs" AS "field_kwgs", "field_cenr" AS "field_cenr" FROM "join_2qan"), "map_jyze" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_n9r3", "field_8dt4") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_" AS "_PRIVACY_UNIT_WEIGHT_", 1 AS "field_n9r3", "field_ey3i" AS "field_8dt4" FROM "map_pc5d"), "join_jvsu" ("field_8dt4", "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_n9r3", "left__ybda") AS (SELECT * FROM "field_8dt4" AS "_LEFT_" LEFT JOIN "map_jyze" AS "_RIGHT_" ON ("_LEFT_"."field_8dt4") = ("_RIGHT_"."field_8dt4")), "map_qpyp" ("field_8dt4", "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_n9r3") AS (SELECT "field_8dt4" AS "field_8dt4", "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_" AS "_PRIVACY_UNIT_WEIGHT_", "field_n9r3" AS "field_n9r3" FROM "join_jvsu"), "map_pt6d" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_8dt4", "_ONE_field_n9r3") AS (SELECT COALESCE(CAST("_PRIVACY_UNIT_" AS TEXT), '_PRIVACY_UNIT_DEFAULT_') AS "_PRIVACY_UNIT_", COALESCE("_PRIVACY_UNIT_WEIGHT_", 0) AS "_PRIVACY_UNIT_WEIGHT_", "field_8dt4" AS "field_8dt4", CASE WHEN "field_n9r3" IS NULL THEN 0 ELSE 1 END AS "_ONE_field_n9r3" FROM "map_qpyp"), "map_any6" ("field_1iro", "field_grep", "field_kggk") AS (SELECT "_ONE_field_n9r3" AS "field_1iro", "_PRIVACY_UNIT_" AS "field_grep", "field_8dt4" AS "field_kggk" FROM "map_pt6d"), "reduce_x8hs" ("_PRIVACY_UNIT_", "field_8dt4", "_NORM__ONE_field_n9r3") AS (SELECT field_grep AS "_PRIVACY_UNIT_", field_kggk AS "field_8dt4", SUM("field_1iro") AS "_NORM__ONE_field_n9r3" FROM "map_any6" GROUP BY "field_grep", "field_kggk"), "map_1too" ("_PRIVACY_UNIT_", "field_8dt4", "_NORM__ONE_field_n9r3") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "field_8dt4" AS "field_8dt4", ABS(("_NORM__ONE_field_n9r3") * ("_NORM__ONE_field_n9r3")) AS "_NORM__ONE_field_n9r3" FROM "reduce_x8hs"), "map_xd2x" ("field_7ra5", "field_grep") AS (SELECT "_NORM__ONE_field_n9r3" AS "field_7ra5", "_PRIVACY_UNIT_" AS "field_grep" FROM "map_1too"), "reduce_vvwi" ("_PRIVACY_UNIT_", "_ONE_field_n9r3") AS (SELECT field_grep AS "_PRIVACY_UNIT_", SUM("field_7ra5") AS "_ONE_field_n9r3" FROM "map_xd2x" GROUP BY "field_grep"), "map_hidt" ("_PRIVACY_UNIT_", "_ONE_field_n9r3") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", (1) / (GREATEST(1, (SQRT("_ONE_field_n9r3")) / (100))) AS "_ONE_field_n9r3" FROM "reduce_vvwi"), "join_ax3z" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_8dt4", "_ONE_field_n9r3", "_SCALE_FACTOR__PRIVACY_UNIT_", "_SCALE_FACTOR__ONE_field_n9r3") AS (SELECT * FROM "map_pt6d" AS "_LEFT_" JOIN "map_hidt" AS "_RIGHT_" ON ("_LEFT_"."_PRIVACY_UNIT_") = ("_RIGHT_"."_PRIVACY_UNIT_")), "map_cra4" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field_8dt4", "_ONE_field_n9r3", "_SCALE_FACTOR__PRIVACY_UNIT_", "_SCALE_FACTOR__ONE_field_n9r3", "_CLIPPED__ONE_field_n9r3") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_" AS "_PRIVACY_UNIT_WEIGHT_", "field_8dt4" AS "field_8dt4", "_ONE_field_n9r3" AS "_ONE_field_n9r3", "_SCALE_FACTOR__PRIVACY_UNIT_" AS "_SCALE_FACTOR__PRIVACY_UNIT_", "_SCALE_FACTOR__ONE_field_n9r3" AS "_SCALE_FACTOR__ONE_field_n9r3", ("_ONE_field_n9r3") * ("_SCALE_FACTOR__ONE_field_n9r3") AS "_CLIPPED__ONE_field_n9r3" FROM "join_ax3z"), "map_wg5e" ("field_8d51", "field_kggk") AS (SELECT "_CLIPPED__ONE_field_n9r3" AS "field_8d51", "field_8dt4" AS "field_kggk" FROM "map_cra4"), "reduce_kn50" ("field_8dt4", "_COUNT_field_n9r3") AS (SELECT field_kggk AS "field_8dt4", SUM("field_8d51") AS "_COUNT_field_n9r3" FROM "map_wg5e" GROUP BY "field_kggk"), "map_s9if" ("field_8dt4", "_COUNT_field_n9r3") AS (SELECT "field_8dt4" AS "field_8dt4", LEAST(3431966, GREATEST(0, (COALESCE("_COUNT_field_n9r3", 0)) + ((377.6479532659047) * ((SQRT((-2) * (LN(RANDOM())))) * (COS((6.283185307179586) * (RANDOM()))))))) AS "_COUNT_field_n9r3" FROM "reduce_kn50"), "map_687m" ("field_i4xy", "field_8dt4") AS (SELECT CAST("_COUNT_field_n9r3" AS INTEGER) AS "field_i4xy", "field_8dt4" AS "field_8dt4" FROM "map_s9if"), "map_0efo" ("field_8dt4", "field_i4xy") AS (SELECT "field_8dt4" AS "field_8dt4", "field_i4xy" AS "field_i4xy" FROM "map_687m"), "map_4gye" ("gender", "total_number_of_roles") AS (SELECT "field_8dt4" AS "gender", "field_i4xy" AS "total_number_of_roles" FROM "map_0efo") SELECT * FROM "map_4gye"
DpEvent(Gaussian { noise_multiplier: 377.6479532659047 })
print_dot(dp_relation.dot())
import pandas as pd
true_res = pd.read_sql(query, database.engine())
dp_res = pd.read_sql(dp_query, database.engine())
print_result('Result', true_res.sort_values('gender'))
print_result('DP Result', dp_res.sort_values('gender'))
Result
gender | total_number_of_roles | |
---|---|---|
0 | F | 1135174 |
1 | M | 2296792 |
DP Result
gender | total_number_of_roles | |
---|---|---|
1 | F | 1120108 |
0 | M | 2186684 |
In this example Qrlew made assumptions on the number of roles per actor to clip the influence of one actor. This is tunable.
We can for example set a max_multiplicity
parameter of 1.
# The relation is turned into a DP equivalent
relation_with_dp_event = relation.rewrite_with_differential_privacy(
dataset,
privacy_unit,
budget,
max_multiplicity=1,
)
# Rewritten relation
dp_relation = relation_with_dp_event.relation()
dp_query = dp_relation.to_query()
dp_res = pd.read_sql(dp_query, database.engine())
print_result('Result', true_res.sort_values('gender'))
print_result('DP Result', dp_res.sort_values('gender'))
Result
gender | total_number_of_roles | |
---|---|---|
0 | F | 1135174 |
1 | M | 2296792 |
DP Result
gender | total_number_of_roles | |
---|---|---|
1 | F | 304411 |
0 | M | 513306 |
The result will have litle noise, but we see it is very biased. We actually count the number of actor instead of the number of role.
We can now set a max_multiplicity
parameter of 100000 and a max_multiplicity_share
of 100% of estimated size of the dataset as qrlew will use the smallest of both.
# The relation is turned into a DP equivalent
relation_with_dp_event = relation.rewrite_with_differential_privacy(
dataset,
privacy_unit,
budget,
max_multiplicity=100000,
max_multiplicity_share=1.,
)
# Rewritten relation
dp_relation = relation_with_dp_event.relation()
dp_query = dp_relation.to_query()
dp_res = pd.read_sql(dp_query, database.engine())
print_result('Result', true_res.sort_values('gender'))
print_result('DP Result', dp_res.sort_values('gender'))
Result
gender | total_number_of_roles | |
---|---|---|
0 | F | 1135174 |
1 | M | 2296792 |
DP Result
gender | total_number_of_roles | |
---|---|---|
1 | F | 1251391 |
0 | M | 1988560 |
Now the value will be unbiased, but very noisy.
Taking a value of 1000 yields good results here.
The defaults is max_multiplicity=100.0
and max_multiplicity_share=0.1
.
# The relation is turned into a DP equivalent
relation_with_dp_event = relation.rewrite_with_differential_privacy(
dataset,
privacy_unit,
budget,
max_multiplicity=1000,
max_multiplicity_share=1.,
)
# Rewritten relation
dp_relation = relation_with_dp_event.relation()
dp_query = dp_relation.to_query()
dp_res = pd.read_sql(dp_query, database.engine())
print_result('Result', true_res.sort_values('gender'))
print_result('DP Result', dp_res.sort_values('gender'))
Result
gender | total_number_of_roles | |
---|---|---|
0 | F | 1135174 |
1 | M | 2296792 |
DP Result
gender | total_number_of_roles | |
---|---|---|
1 | F | 1140708 |
0 | M | 2299726 |
Qrlew supports several SQL dialects: PostgreSQL (default), MSSQL or BigQuery.#
from pyqrlew import Dialect
# We create a basic relation
query = "SELECT LN(1+rank) as log_rank FROM imdb_ijs.movies GROUP BY gender"
relation = dataset.relation(query)
print_md('### Relation rendered as PostgreSQL')
query = relation.to_query()
print_sql(query)
Relation rendered as PostgreSQL
WITH "reduce_9y_x" ("field_07kk") AS (SELECT rank AS "field_07kk" FROM "imdb_ijs"."movies" GROUP BY "gender"), "map_ucb9" ("log_rank") AS (SELECT LN((1) + ("field_07kk")) AS "log_rank" FROM "reduce_9y_x") SELECT * FROM "map_ucb9"
print_md('### Relation rendered as BigQuery')
query = relation.to_query(dialect=Dialect.BigQuery)
print_sql(query)
Relation rendered as BigQuery
WITH `reduce_9y_x` AS (SELECT rank AS `field_07kk` FROM `imdb_ijs`.`movies` GROUP BY `gender`), `map_ucb9` AS (SELECT LN((1) + (`field_07kk`)) AS `log_rank` FROM `reduce_9y_x`) SELECT * FROM `map_ucb9`
print_md('### Relation rendered as MsSql')
query = relation.to_query(dialect=Dialect.MsSql)
print_sql(query)
Relation rendered as MsSql
WITH "reduce_9y_x" ("field_07kk") AS (SELECT rank AS "field_07kk" FROM "imdb_ijs"."movies" GROUP BY "gender"), "map_ucb9" ("log_rank") AS (SELECT LOG((1) + ("field_07kk")) AS "log_rank" FROM "reduce_9y_x") SELECT * FROM "map_ucb9"
Note how the quoting varies and how LN is converted into LOG for MSSQL.
Not only Qrlew can render relation ito many dialects it can also parse different dialects.
# We create a basic relation
query = """SELECT LN(1+rank) as "log rank" FROM imdb_ijs.movies GROUP BY gender"""
relation = dataset.relation(query, dialect=Dialect.PostgreSql)
print_dot(relation.dot())
# We create a basic relation in BigQuery dialect
query = """SELECT LN(1+rank) as `log rank` FROM imdb_ijs.movies GROUP BY gender"""
relation = dataset.relation(query, dialect=Dialect.BigQuery)
print_dot(relation.dot())
# We create a basic relation in MSSQL dialect
query = """SELECT LOG(1+rank) as "log rank" FROM imdb_ijs.movies GROUP BY gender"""
relation = dataset.relation(query, dialect=Dialect.MsSql)
print_dot(relation.dot())
They all return the same abstract Relation
object