SQL analysis with differential privacy guarantees#
The purpose of this notebook is to illustrate the process of transforming a standard SQL query into a differentially private SQL query.
Our goal here is to demonstrate how this complex transformation of an SQL query into a differentially private version can be achieved with ease using Qrlew.
The main idea of Qrlew is to parse and rewrite SQL queries into Relations: an Intermediate Representation (IR) that is well-suited for various rewriting tasks.
Relations may be of different kinds:
Maps typically represent the transform of an input Relation row by row, therefore naturally preserving the privacy unit,
Reduces represent aggregations that can be turned into DP equivalents,
Joins combine two input Relations,
Tables are simply sources.
This representation simplifies the process of rewriting queries and reduces dependencies on the diverse range of syntactic constructs present in SQL.
Install the database and packages#
%%capture
# Load the database
# Inspired by https://colab.research.google.com/github/tensorflow/io/blob/master/docs/tutorials/postgresql.ipynb#scrollTo=YUj0878jPyz7
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql-14 graphviz
# Start postgresql server
!sudo sed -i "s/port = 5432/port = 5433/g" /etc/postgresql/14/main/postgresql.conf
!sudo service postgresql start
# Set password
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'pyqrlew-db'"
!pip install -U pyqrlew numpy pandas matplotlib graphviz
from pyqrlew.io import PostgreSQL
from pyqrlew import Dataset
# Read data
database = PostgreSQL()
dataset = database.retail()
For our analysis, we will use the Qrlew built-in sample dataset retail and a Postgres connection.
Note that you can also load one or several CSV files or use another SQL engine.
The Relation: an intermediate representation for SQL rewriting#
The retail dataset contains 3 tables, each of which can be converted into a Relation::Table:
import graphviz
display(graphviz.Source(dataset.retail.features.relation().dot()))
display(graphviz.Source(dataset.retail.sales.relation().dot()))
display(graphviz.Source(dataset.retail.stores.relation().dot()))
Each Relation::Table contains the size of the table (which can be approximated with DP) and the columns of the table with their names and types.
Before using any differential privacy mechanism, it is essential to determine:
the privacy unit, i.e. the user identifier,
the sensitivity of the aggregation functions, i.e. the maximal contribution a user may have on the result,
the values of the GROUP BY columns that can be safely released.
With Qrlew, it is very easy to set ranges, unique constraints and distinct possible values for specific columns:
dataset = dataset.retail.features.unemployment.with_range(0, 20)
dataset = dataset.retail.features.unemployment.with_unique_constraint()
dataset = dataset.retail.features.store.with_possible_values(["1", "2", "3"])
display(graphviz.Source(dataset.retail.features.relation().dot()))
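Declaring a range matters for DP: once a column is known to lie in [0, 20], each row's contribution to a SUM over it is bounded by the range width. A minimal pure-Python sketch of this clamping idea (an illustration, not Qrlew's API):

```python
def clamp(x: float, lo: float, hi: float) -> float:
    """Clamp a value into the declared range."""
    return max(lo, min(hi, x))

# Values outside the declared range [0, 20] are clipped before aggregation,
# so a single row can change the SUM by at most hi - lo = 20.
values = [5.0, 25.0, -3.0]
clamped = [clamp(v, 0.0, 20.0) for v in values]
print(clamped)       # [5.0, 20.0, 0.0]
print(sum(clamped))  # 25.0
```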
The privacy unit#
In this example, let’s consider that we want to protect the id column of the tables features and sales.
The table stores is linked to features via foreign keys.
Then each table has a privacy unit (even if it is not directly in the table):
features: column id
stores: the privacy unit is the column id of the features table. It is obtained by joining the tables features and stores with the condition features.store = stores.store
sales: column id
We code the privacy unit as:
privacy_unit = [
("features", [], "id"),
("stores", [("store", "features", "store")], "id"),
("sales", [], "id")
]
# Other arguments that will be explained later
budget = {"epsilon": 1.0, "delta": 5e-4}
synthetic_data = [
(["retail", "features"], ["retail", "features_sd"]),
]
The privacy_unit is a vector whose elements are tuples following the convention (table_name, join_path, id_column) where:
table_name is the name of the table,
join_path is a vector storing the joins, each item (column_name, table_name, column) representing a foreign key,
id_column is the name of the column which is the privacy unit.
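To make the convention concrete, here is a small illustrative helper (not part of the pyqrlew API) that renders each entry of privacy_unit as a readable description:

```python
def describe_privacy_unit(privacy_unit):
    """Render each (table_name, join_path, id_column) entry as text.
    Illustrative helper only, not part of the pyqrlew API."""
    lines = []
    for table_name, join_path, id_column in privacy_unit:
        if join_path:
            # Each hop (column_name, ref_table, ref_column) is a foreign key
            hops = ", ".join(
                f"{table_name}.{col} = {ref_table}.{ref_col}"
                for col, ref_table, ref_col in join_path
            )
            lines.append(f"{table_name}: id column {id_column} reached via {hops}")
        else:
            lines.append(f"{table_name}: id column {id_column}")
    return lines

privacy_unit = [
    ("features", [], "id"),
    ("stores", [("store", "features", "store")], "id"),
    ("sales", [], "id"),
]
for line in describe_privacy_unit(privacy_unit):
    print(line)
```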
The Relation method rewrite_as_privacy_unit_preserving allows propagating the privacy unit through all the relations:
pup_features = dataset.retail.features.relation().rewrite_as_privacy_unit_preserving(
dataset=dataset,
privacy_unit=privacy_unit,
epsilon_delta=budget,
synthetic_data=synthetic_data,
)
display(graphviz.Source(pup_features.relation().dot()))
pup_stores = dataset.retail.stores.relation().rewrite_as_privacy_unit_preserving(
dataset=dataset,
privacy_unit=privacy_unit,
epsilon_delta=budget,
synthetic_data=synthetic_data,
)
display(graphviz.Source(pup_stores.relation().dot()))
We notice that to protect the features table we just have to hash the values of the id column and store them into the _PRIVACY_UNIT_ column, while to protect the stores table we had to perform a join first.
The sensitivity#
If we want to release the result of the query SELECT SUM(3 * fuel_price + temperature / 10) FROM retail.features, we have to compute the sensitivity of the SUM(3 * fuel_price + temperature / 10) aggregation.
The user provides the ranges of the input columns in the WHERE clause of the query; these values are then propagated through all the relations.
relation = dataset.relation('SELECT SUM(3 * fuel_price + temperature / 10) FROM retail.features WHERE fuel_price > 2.5 AND fuel_price < 3.9 AND temperature > 28 AND temperature < 93')
display(graphviz.Source(relation.dot()))
More information about the ranges propagation can be found here.
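The propagation above can be mimicked with plain interval arithmetic. A simplified sketch of the bounds Qrlew can derive for the SUM input, given the WHERE clause ranges fuel_price in (2.5, 3.9) and temperature in (28, 93):

```python
# Interval arithmetic sketch: bounds of 3 * fuel_price + temperature / 10
# given the WHERE clause ranges (a simplified view of Qrlew's propagation).
fuel_price = (2.5, 3.9)
temperature = (28.0, 93.0)

lo = 3 * fuel_price[0] + temperature[0] / 10  # 3*2.5 + 2.8, about 10.3
hi = 3 * fuel_price[1] + temperature[1] / 10  # 3*3.9 + 9.3, about 21.0
print((lo, hi))
```

Both endpoints are monotone in the inputs here, so the interval of the expression follows directly from the intervals of the columns.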
The protection of the GROUP BY keys#
As explained in Wilson et al. (2019) and this post, releasing the values of the grouping columns may leak some sensitive information.
To avoid that, two methods can be used:
if the grouping keys have public values, we can release the propagated public values,
otherwise we can use the tau-thresholding mechanism.
In the following cell, we show how the public values are propagated and then can be used in the DP rewriting.
relation = dataset.relation("SELECT store, SUM(fuel_price) FROM retail.features WHERE store IN ('a', 'b') GROUP BY store ")
display(graphviz.Source(relation.dot()))
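When the grouping keys have no known public values, tau-thresholding keeps only the keys supported by enough distinct users. A simplified sketch of the idea (Gaussian noise on distinct-user counts; not Qrlew's exact implementation):

```python
import random

def tau_thresholding(user_key_pairs, tau, noise_scale, seed=0):
    """Release only grouping keys whose noisy distinct-user count exceeds tau.
    Simplified sketch, not Qrlew's exact mechanism."""
    rng = random.Random(seed)
    users_per_key = {}
    for user, key in user_key_pairs:
        users_per_key.setdefault(key, set()).add(user)
    released = []
    for key, users in sorted(users_per_key.items()):
        noisy_count = len(users) + rng.gauss(0.0, noise_scale)
        if noisy_count >= tau:
            released.append(key)
    return released

# Key "a" is supported by 100 distinct users, key "b" by a single user:
pairs = [(u, "a") for u in range(100)] + [(999, "b")]
print(tau_thresholding(pairs, tau=20, noise_scale=1.0))  # ['a']
```

The rarely supported key "b" is suppressed, so the output no longer reveals that a specific user contributed a unique value.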
Rewriting with DP#
The rewriting with DP is done in 3 steps:
Protect the input Relation as described before,
DP-rewriting of the grouping keys: a grouping key can be released if it is already public or if it appears in enough users' rows,
Add noise scaled by the privacy parameters to each aggregation.
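The last step can be sketched in pure Python: each user's summed contribution is rescaled so it stays under a clipping bound (the 1 / GREATEST(1, norm / C) factor visible in rewritten queries), then Gaussian noise is added to the total. A simplified illustration, not Qrlew's exact implementation:

```python
import random

def dp_sum(per_user_values, clip_norm, sigma, seed=0):
    """Norm-clip each user's summed contribution, then add Gaussian noise.
    Simplified sketch of the noise-addition step."""
    rng = random.Random(seed)
    total = 0.0
    for values in per_user_values:
        contribution = sum(values)
        # Mirrors 1 / GREATEST(1, norm / clip_norm) in the rewritten SQL
        scale = 1.0 / max(1.0, abs(contribution) / clip_norm)
        total += contribution * scale
    return total + rng.gauss(0.0, sigma)

# With sigma = 0 the clipping is easy to check: 10 + min(200, 50) = 60
print(dp_sum([[10.0], [150.0, 50.0]], clip_norm=50.0, sigma=0.0))  # 60.0
```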
We use the rewrite_with_differential_privacy method that transforms a Relation into its differentially private equivalent.
It takes as inputs:
dataset: the Dataset we want to query,
privacy_unit: the privacy unit described previously,
privacy parameters: the \((\varepsilon, \delta)\) privacy parameters,
synthetic_data: an optional list giving the correspondence between the original tables and their synthetic versions. Each table must be specified. The list is made of two-element tuples whose first element is a list representing the path to the original table (e.g. ["retail", "features"] for the "retail.features" table or ["features"] for the "features" table) and whose second element is the path to the synthetic table.
privacy_unit = [
("stores", [("store", "features", "store")], "id"),
("features", [], "id"),
("sales", [], "id")
]
budget = {"epsilon": 1.0, "delta": 5e-4}
synthetic_data = None
query = "SELECT SUM(fuel_price) AS my_sum FROM retail.features WHERE fuel_price > 2.514 AND fuel_price < 3.907"
relation = dataset.relation(query)
relation_with_dp_event = relation.rewrite_with_differential_privacy(
dataset=dataset,
privacy_unit=privacy_unit,
epsilon_delta=budget,
synthetic_data=synthetic_data,
)
dp_relation = relation_with_dp_event.relation()
mechanisms_used = relation_with_dp_event.dp_event()
The dp_event is compatible with Google's dp_accounting library:
%%capture
!pip install dp_accounting==0.4.1
from dp_accounting import DpEvent
print(mechanisms_used)
print(DpEvent.from_named_tuple(mechanisms_used.to_named_tuple()))
DpEvent(Gaussian { noise_multiplier: 1406.4215077657532 })
GaussianDpEvent(noise_multiplier=1406.4215077657532)
The output is a tuple made of two elements:
The first element is the Relation rewritten with DP,
the second element is a DpEvent which stores the DP mechanisms invoked during the rewriting.
In the present case, we used only a Gaussian mechanism since our query contains only a simple aggregate and no GROUP BY.
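As a plausibility check (assuming the classic Gaussian mechanism calibration, which may differ from Qrlew's exact derivation), the reported noise multiplier is consistent with sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon, taking the per-user clipping norm 355.537 that appears in the rewritten SQL as the sensitivity:

```python
import math

# Classic Gaussian mechanism calibration (an assumption for this check,
# not taken from Qrlew's documentation):
# sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon
epsilon, delta = 1.0, 5e-4
sensitivity = 355.537  # per-user clipping norm used in the rewritten query
sigma = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
print(sigma)  # close to 1406.42, the DpEvent's noise_multiplier
```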
display(graphviz.Source(dp_relation.dot()))
The Relation can be translated into an SQL query:
MAGENTA_COLOR = '\033[35m'
BLUE_COLOR = '\033[34m'
RESET_COLOR = '\033[0m'
def print_query(query: str):
keywords = ["SELECT", "AS", "GROUP BY", "LIMIT", "ORDER BY", "WHERE"]
colored_query = query
colored_query = colored_query.replace("WITH", "WITH\n ")
colored_query = colored_query.replace(" SELECT", "\nSELECT")
colored_query = colored_query.replace("),", "),\n ")
for word in keywords:
colored_query = colored_query.replace(word, MAGENTA_COLOR + word + RESET_COLOR)
colored_query = colored_query.replace("WITH", BLUE_COLOR + "WITH" + RESET_COLOR)
print(colored_query)
dp_query = dp_relation.to_query()
print_query(dp_query)
WITH
"map_2p3w" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "id", "store", "date", "temperature", "fuel_price", "cpi", "unemployment", "isholiday") AS (SELECT MD5(CAST("id" AS TEXT)) AS "_PRIVACY_UNIT_", 1 AS "_PRIVACY_UNIT_WEIGHT_", "id" AS "id", "store" AS "store", "date" AS "date", "temperature" AS "temperature", "fuel_price" AS "fuel_price", "cpi" AS "cpi", "unemployment" AS "unemployment", "isholiday" AS "isholiday" FROM "retail"."features"),
"map_y19b" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field__6vl") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_" AS "_PRIVACY_UNIT_WEIGHT_", "fuel_price" AS "field__6vl" FROM "map_2p3w" WHERE (("fuel_price") > (2.514)) AND (("fuel_price") < (3.907))),
"map_6c_b" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field__6vl") AS (SELECT COALESCE(CAST("_PRIVACY_UNIT_" AS TEXT),
'_PRIVACY_UNIT_DEFAULT_') AS "_PRIVACY_UNIT_", COALESCE("_PRIVACY_UNIT_WEIGHT_", 0) AS "_PRIVACY_UNIT_WEIGHT_", "field__6vl" AS "field__6vl" FROM "map_y19b"),
"map_pil5" ("field_3d25", "field_grep") AS (SELECT "field__6vl" AS "field_3d25", "_PRIVACY_UNIT_" AS "field_grep" FROM "map_6c_b"),
"reduce_q7zo" ("_PRIVACY_UNIT_", "_NORM_field__6vl") AS (SELECT field_grep AS "_PRIVACY_UNIT_", SUM("field_3d25") AS "_NORM_field__6vl" FROM "map_pil5" GROUP BY "field_grep"),
"map_sulf" ("_PRIVACY_UNIT_", "_NORM_field__6vl") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", ABS(("_NORM_field__6vl") * ("_NORM_field__6vl")) AS "_NORM_field__6vl" FROM "reduce_q7zo"),
"map_c59q" ("field_czvy", "field_grep") AS (SELECT "_NORM_field__6vl" AS "field_czvy", "_PRIVACY_UNIT_" AS "field_grep" FROM "map_sulf"),
"reduce_k2tq" ("_PRIVACY_UNIT_", "field__6vl") AS (SELECT field_grep AS "_PRIVACY_UNIT_", SUM("field_czvy") AS "field__6vl" FROM "map_c59q" GROUP BY "field_grep"),
"map_gqz_" ("_PRIVACY_UNIT_", "field__6vl") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", (1) / (GREATEST(1, (SQRT("field__6vl")) / (355.537))) AS "field__6vl" FROM "reduce_k2tq"),
"join_8dfm" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field__6vl", "_SCALE_FACTOR__PRIVACY_UNIT_", "_SCALE_FACTOR_field__6vl") AS (SELECT * FROM "map_6c_b" AS "_LEFT_" JOIN "map_gqz_" AS "_RIGHT_" ON ("_LEFT_"."_PRIVACY_UNIT_") = ("_RIGHT_"."_PRIVACY_UNIT_")),
"map_82ah" ("_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_", "field__6vl", "_SCALE_FACTOR__PRIVACY_UNIT_", "_SCALE_FACTOR_field__6vl", "_CLIPPED_field__6vl") AS (SELECT "_PRIVACY_UNIT_" AS "_PRIVACY_UNIT_", "_PRIVACY_UNIT_WEIGHT_" AS "_PRIVACY_UNIT_WEIGHT_", "field__6vl" AS "field__6vl", "_SCALE_FACTOR__PRIVACY_UNIT_" AS "_SCALE_FACTOR__PRIVACY_UNIT_", "_SCALE_FACTOR_field__6vl" AS "_SCALE_FACTOR_field__6vl", ("field__6vl") * ("_SCALE_FACTOR_field__6vl") AS "_CLIPPED_field__6vl" FROM "join_8dfm"),
"map_nlz1" ("field_q9u_") AS (SELECT "_CLIPPED_field__6vl" AS "field_q9u_" FROM "map_82ah"),
"reduce_8iqs" ("_SUM_field__6vl") AS (SELECT SUM("field_q9u_") AS "_SUM_field__6vl" FROM "map_nlz1"),
"map_c_77" ("_SUM_field__6vl") AS (SELECT LEAST(3555.37, GREATEST(0, (COALESCE("_SUM_field__6vl", 0)) + ((1406.4215077657532) * ((SQRT((-2) * (LN(RANDOM())))) * (COS((6.283185307179586) * (RANDOM()))))))) AS "_SUM_field__6vl" FROM "reduce_8iqs"),
"map_815z" ("my_sum") AS (SELECT "_SUM_field__6vl" AS "my_sum" FROM "map_c_77"),
"map_5kny" ("my_sum") AS (SELECT "my_sum" AS "my_sum" FROM "map_815z")
SELECT * FROM "map_5kny"
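Notice how the map_c_77 step samples Gaussian noise in plain SQL via the Box-Muller transform: SQRT(-2 * LN(RANDOM())) * COS(2 * pi * RANDOM()). The same sampler written in Python:

```python
import math
import random

def box_muller(sigma, rng):
    """One Gaussian draw via the Box-Muller transform, mirroring
    SQRT(-2 * LN(RANDOM())) * COS(2 * pi * RANDOM()) in the generated SQL."""
    u1 = 1.0 - rng.random()  # in (0, 1], avoids log(0)
    u2 = rng.random()
    return sigma * math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

rng = random.Random(42)
samples = [box_muller(1406.4215077657532, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(round(mean, 1), round(std, 1))  # mean near 0, std near 1406.4
```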
Then the query is sent to the database:
import pandas as pd
true_res = pd.read_sql(query, database.engine())
dp_res = pd.read_sql(dp_query, database.engine())
print(f"Initial query: {true_res.iloc[0].iloc[0]}")
print("DP query: ", dp_res["my_sum"][0])
Initial query: 2939.4329999999995
DP query: 3555.37
We can send the DP query multiple times to see the variations among the results:
N_RUNS = 300
def run(dataset, query, epsilons):
data = {}
privacy_unit = [
("stores", [("store", "features", "store")], "id"),
("features", [], "id"),
("sales", [], "id")
]
synthetic_data = None
delta = 1e-3
relation = dataset.relation(query)
for epsilon in epsilons:
budget = {"epsilon": epsilon, "delta": delta}
rel_with_pq = relation.rewrite_with_differential_privacy(
dataset=dataset,
privacy_unit=privacy_unit,
epsilon_delta=budget,
synthetic_data=synthetic_data,
)
dp_query = rel_with_pq.relation().to_query()
data[epsilon] = [
pd.read_sql(dp_query, database.engine())["my_sum"][0] for _ in range(N_RUNS)
]
return data
true_value = pd.read_sql(query, database.engine())["my_sum"][0]
data = run(dataset, query, [0.5, 1., 5.])
print(query)
SELECT SUM(fuel_price) AS my_sum FROM retail.features WHERE fuel_price > 2.514 AND fuel_price < 3.907
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['text.usetex'] = False
def plot_hist(true_value, data, query):
plt.axvline(true_value, color='red', label="True")
for e, d in data.items():
plt.hist(d, bins=10, alpha=0.5, label = f"eps = {e}")
plt.legend()
plt.title(query.encode('unicode_escape').decode().replace("<", "$<$").replace(">", "$>$"))
plot_hist(true_value, data, query)