{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "P7J2YlmX2kcL" }, "source": [ "# User Guide\n", "[![View On GitHub](https://img.shields.io/badge/View_in_Github-grey?logo=github)](https://github.com/Qrlew/docs/blob/main/tutorials/user_guide.ipynb)\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Qrlew/docs/blob/main/tutorials/user_guide.ipynb)\n", "\n", "[Qrlew](https://qrlew.github.io/) is an [open source library](https://github.com/Qrlew) that can turn SQL queries into their [differentially private (DP)](https://en.wikipedia.org/wiki/Differential_privacy) equivalent.\n", "* It takes SQL – the universal language of small and big data analytics – as input, so there is no new language or API to learn.\n", "* It returns DP SQL queries that can be executed at scale on any SQL datastore.\n", "* The only required extra inputs are the *privacy unit* specification and privacy budget $(\\varepsilon, \\delta)$.\n", "\n", "With [Qrlew](https://qrlew.github.io/) a *data owner* can receive SQL queries from an untrusted *data analyst*, turn them into DP equivalent, execute them and safely return the result to the *data analyst*.\n", "\n", "This guide will cover several useful notions from [Qrlew](https://qrlew.github.io/):\n", "* We will define a `Dataset` from a database connection.\n", "* We will declare value ranges and unique constraints\n", "* We will define a *privacy unit*.\n", "* And we will run simple DP SQL queries.\n", "\n", "We will also explore more advanced features:\n", "* The support of several SQL dialects: PostgreSQL (default), MSSQL or BigQuery.\n", "* Privacy accounting compatibility with Google's popular Differential Privacy accounting: [dp_accounting](https://github.com/google/differential-privacy/tree/main/python/dp_accounting).\n", "* The use of Synthetic Data to further improve the quality of the results." ] }, { "cell_type": "markdown", "metadata": { "id": "tkygHzm42vnP" }, "source": [ "## Define a Dataset\n", "\n", "Let's define a `Dataset` object from a SQL connection.\n", "\n", "We will first run a local postgresql database. You can connect to any Postgresql, MSSQL or BigQuery database." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "nUJv2pcm2kcQ" }, "outputs": [], "source": [ "%%capture\n", "!sudo apt-get -y -qq update\n", "!sudo apt-get -y -qq install postgresql\n", "!sudo service postgresql start\n", "!sudo -u postgres psql -U postgres -c \"ALTER USER postgres PASSWORD 'password'\"\n", "!pip install psycopg2\n", "!pip install pyqrlew" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "E39a8Do_CVow" }, "outputs": [], "source": [ "# A few utility functions\n", "from IPython.display import Markdown\n", "\n", "def print_sql(query):\n", " display(Markdown(f\"\"\"```sql\n", "{query}\n", "```\"\"\"))\n", "\n", "def print_result(title, result):\n", " display(Markdown(f'### {title}'))\n", " display(result)\n", "\n", "def print_dot(dot_graph):\n", " display(Source(dot_graph))\n", "\n", "def print_md(md):\n", " display(Markdown(md))" ] }, { "cell_type": "markdown", "metadata": { "id": "K71zLRH031J5" }, "source": [ "Then we load a demo dataset." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "6IJvQNm83LQi" }, "outputs": [], "source": [ "from pyqrlew.io import PostgreSQL\n", "\n", "database = PostgreSQL('postgres', 'postgres', 'password', 5432)\n", "database.load_imdb()\n", "engine = database.engine()" ] }, { "cell_type": "markdown", "metadata": { "id": "VOIbC7BL36WS" }, "source": [ "We can then build a `Dataset` object that can be used to parse queries into `Relation` objects that can be transformed into differentially private `Relation`s and rendered into SQL queries." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "YpfySH-n4Asg" }, "outputs": [], "source": [ "from pyqrlew import Dataset\n", "\n", "source_dataset = Dataset.from_database('imdb', engine, 'imdb_ijs')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 187 }, "id": "GEvK8XB-9bTy", "outputId": "2136e744-ae86-444c-d9c2-c2ed0f3b094a" }, "outputs": [ { "data": { "text/markdown": [ "**imdb.imdb_ijs.actors** (`SELECT * FROM \"imdb_ijs\".\"actors\"`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**imdb.imdb_ijs.directors** (`SELECT * FROM \"imdb_ijs\".\"directors\"`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**imdb.imdb_ijs.directors_genres** (`SELECT * FROM \"imdb_ijs\".\"directors_genres\"`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**imdb.imdb_ijs.movies** (`SELECT * FROM \"imdb_ijs\".\"movies\"`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**imdb.imdb_ijs.movies_directors** (`SELECT * FROM \"imdb_ijs\".\"movies_directors\"`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**imdb.imdb_ijs.movies_genres** (`SELECT * FROM \"imdb_ijs\".\"movies_genres\"`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**imdb.imdb_ijs.roles** (`SELECT * FROM \"imdb_ijs\".\"roles\"`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for path, relation in source_dataset.relations():\n", " print_md(f\"**{'.'.join(path)}** (`{relation}`)\")" ] }, { "cell_type": "markdown", "metadata": { "id": "myFaXX92419n" }, "source": [ "## Declare value ranges and unique constraints\n", "\n", "The more information Qrlew knows about data, the more accurate the results.\n", "Qrlew enables the user to input:\n", "\n", "* the range of some variables,\n", "* the list of possible values whan appropriate\n", "* and `UNIQUE` constraints on some values (all values are distincts)\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "G8Mjbp3H41XT" }, "outputs": [], "source": [ "bounded_dataset = source_dataset.imdb_ijs.movies.rank.with_range(0, 20)\n", "bounded_dataset = bounded_dataset.imdb_ijs.actors.gender.with_possible_values(['M', 'F'])\n", "dataset = (bounded_dataset\n", " .imdb_ijs.actors.id.with_unique_constraint()\n", " .imdb_ijs.directors.id.with_unique_constraint()\n", " .imdb_ijs.movies.id.with_unique_constraint())" ] }, { "cell_type": "markdown", "metadata": { "id": "4CS51vDa8Bjo" }, "source": [ "## Display a graphical representation of the `Relation` objects\n", "\n", "SQL tables and their transformations have an abstract internal representation in Qrlew: the `Relation` (inspired by [Relations in the relational DB sense](https://en.wikipedia.org/wiki/Relation_(database))).\n", "\n", "Transformed `Relation` can be a complex graph of intermediate relations so it is possible to represent them graphically in the [Graphviz](https://graphviz.org/) format." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "cellView": "form", "colab": { "base_uri": "https://localhost:8080/", "height": 368 }, "id": "s0HbD-tGrK5d", "outputId": "77bf9eb2-c8da-4318-810d-191f7d9e4ae0", "tags": [ "hide-input" ] }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "graph_m0sq\n", "\n", "\n", "\n", "graph_m0sq\n", "\n", "RESULTING_RELATION size ∈ int[min_size max_size] (size range)\n", "field_name = expression_of(input_field) ∈ float[0. 2.] (values are in a range of float)\n", "The rows of this relation are computed based on the input Relation below\n", "\n", "\n", "\n", "graph_eu9v\n", "\n", "REDUCE_RELATION size ∈ int[min_size max_size]\n", "input_field = avg(rank) ∈ float[0 20] UNIQUE\n", "GROUP BY (field_name)\n", "This Relation will aggregate the values of the relation below.\n", "\n", "\n", "\n", "graph_m0sq->graph_eu9v\n", "\n", "\n", "\n", "\n", "\n", "graph_773q\n", "\n", "NAME_OF_RELATION size ∈ int{known_size}\n", "field_name = name_in_source ∈ data_type\n", "name = name ∈ str{list, of, possible, values}\n", "year = year ∈ int\n", "rank = rank ∈ float[min max]\n", "\n", "\n", "\n", "graph_eu9v->graph_773q\n", "\n", "\n", "\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# @title Example representation\n", "%%svg\n", "\n", "\n", "graph_m0sq\n", "\n", "\n", "\n", "graph_m0sq\n", "\n", "RESULTING_RELATION size ∈ int[min_size max_size] (size range)\n", "field_name = expression_of(input_field) ∈ float[0. 2.] (values are in a range of float)\n", "The rows of this relation are computed based on the input Relation below\n", "\n", "\n", "\n", "graph_eu9v\n", "\n", "REDUCE_RELATION size ∈ int[min_size max_size]\n", "input_field = avg(rank) ∈ float[0 20] UNIQUE\n", "GROUP BY (field_name)\n", "This Relation will aggregate the values of the relation below.\n", "\n", "\n", "\n", "graph_m0sq->graph_eu9v\n", "\n", "\n", "\n", "\n", "\n", "graph_773q\n", "\n", "NAME_OF_RELATION size ∈ int{known_size}\n", "field_name = name_in_source ∈ data_type\n", "name = name ∈ str{list, of, possible, values}\n", "year = year ∈ int\n", "rank = rank ∈ float[min max]\n", "\n", "\n", "\n", "graph_eu9v->graph_773q\n", "\n", "\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "id": "nduYCeIizWFk" }, "source": [ "Tables are the most basic forms of `Relation`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 794 }, "id": "_ak643bs8Q5X", "outputId": "c03b1ac5-dd70-4e24-d079-feaf4166dd3a" }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_ma77\n", "\n", "\n", "\n", "graph_ma77\n", "\n", "IMDB_IJS_ACTORS size ∈ int{817718}\n", "id = id ∈ int\n", "first_name = first_name ∈ str\n", "last_name = last_name ∈ str\n", "gender = gender ∈ str\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_8nf1\n", "\n", "\n", "\n", "graph_8nf1\n", "\n", "IMDB_IJS_DIRECTORS size ∈ int{86880}\n", "id = id ∈ int\n", "first_name = first_name ∈ str\n", "last_name = last_name ∈ str\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_yn_2\n", "\n", "\n", "\n", "graph_yn_2\n", "\n", "IMDB_IJS_DIRECTORS_GENRES size ∈ int{156562}\n", "director_id = director_id ∈ int\n", "genre = genre ∈ str\n", "prob = prob ∈ float\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_3yxb\n", "\n", "\n", "\n", "graph_3yxb\n", "\n", "IMDB_IJS_MOVIES size ∈ int{388269}\n", "id = id ∈ int\n", "name = name ∈ str\n", "year = year ∈ int\n", "rank = rank ∈ float\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_mmzc\n", "\n", "\n", "\n", "graph_mmzc\n", "\n", "IMDB_IJS_MOVIES_DIRECTORS size ∈ int{371180}\n", "director_id = director_id ∈ int\n", "movie_id = movie_id ∈ int\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_qpgf\n", "\n", "\n", "\n", "graph_qpgf\n", "\n", "IMDB_IJS_MOVIES_GENRES size ∈ int{395119}\n", "movie_id = movie_id ∈ int\n", "genre = genre ∈ str\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_6mn0\n", "\n", "\n", "\n", "graph_6mn0\n", "\n", "IMDB_IJS_ROLES size ∈ int{3431966}\n", "actor_id = actor_id ∈ int\n", "movie_id = movie_id ∈ int\n", "role = role ∈ str\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from graphviz import Source\n", "\n", "for path, relation in source_dataset.relations():\n", " print_dot(relation.dot())" ] }, { "cell_type": "markdown", "metadata": { "id": "t-y-LAid-OXA" }, "source": [ "Qrlew turns queries into composed `Relation` objects." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 725 }, "id": "9m5lamry-Lo6", "outputId": "a47d3759-2c63-4625-a1ec-8df8ba48b6b4" }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_3cf9\n", "\n", "\n", "\n", "graph_3cf9\n", "\n", "MAP_Z2OJ size ∈ int[0 3431966]\n", "gender = field_8dt4 ∈ str{F, M} UNIQUE\n", "number_of_roles = field_k8_r ∈ int[0 3431966]\n", "\n", "\n", "\n", "graph_esfe\n", "\n", "REDUCE_5QQ9 size ∈ int[0 3431966]\n", "field_8dt4 = first(field_8dt4) ∈ str{F, M} UNIQUE\n", "field_k8_r = count(field_2xvv) ∈ int[0 3431966]\n", "GROUP BY (field_8dt4)\n", "\n", "\n", "\n", "graph_3cf9->graph_esfe\n", "\n", "\n", "\n", "\n", "\n", "graph_m27x\n", "\n", "MAP_VMK6 size ∈ int[0 3431966]\n", "field_2xvv = field_cenr ∈ str\n", "field_8dt4 = field_ey3i ∈ str{F, M}\n", "\n", "\n", "\n", "graph_esfe->graph_m27x\n", "\n", "\n", "\n", "\n", "\n", "graph_mgxe\n", "\n", "JOIN_5PXQ size ∈ int[0 3431966]\n", "field_6w80 = _LEFT_.id ∈ int\n", "field_adyv = _LEFT_.first_name ∈ str\n", "field_mg8l = _LEFT_.last_name ∈ str\n", "field_ey3i = _LEFT_.gender ∈ str{F, M}\n", "field_rvp5 = _RIGHT_.actor_id ∈ int\n", "field_kwgs = _RIGHT_.movie_id ∈ int\n", "field_cenr = _RIGHT_.role ∈ str\n", "INNER ON (_RIGHT_.actor_id = _LEFT_.id)\n", "\n", "\n", "\n", "graph_m27x->graph_mgxe\n", "\n", "\n", "\n", "\n", "\n", "graph_hq3c\n", "\n", "IMDB_IJS_ACTORS size ∈ int{817718}\n", "id = id ∈ int UNIQUE\n", "first_name = first_name ∈ str\n", "last_name = last_name ∈ str\n", "gender = gender ∈ str{F, M}\n", "\n", "\n", "\n", "graph_mgxe->graph_hq3c\n", "\n", "\n", "\n", "\n", "\n", "graph_6mn0\n", "\n", "IMDB_IJS_ROLES size ∈ int{3431966}\n", "actor_id = actor_id ∈ int\n", "movie_id = movie_id ∈ int\n", "role = role ∈ str\n", "\n", "\n", "\n", "graph_mgxe->graph_6mn0\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "query = \"SELECT gender, COUNT(role) as number_of_roles FROM imdb_ijs.actors JOIN imdb_ijs.roles ON roles.actor_id=actors.id GROUP BY gender\"\n", "relation = dataset.relation(query)\n", "print_dot(relation.dot())" ] }, { "cell_type": "markdown", "metadata": { "id": "OGlIfDDgzxHU" }, "source": [ "Those representations can become complex, but they are very usefull to track each step of the computation and how Qrlew infer value ranges step by step." ] }, { "cell_type": "markdown", "metadata": { "id": "G5o4BTBv4jct" }, "source": [ "## Define a Privacy Unit\n", "\n", "To protect privacy Qrlew needs to know what is to be protected.\n", "\n", "Each row has to be attached to an individual (Privacy Unit or PU) whose privacy will be protected.\n", "\n", "Differential Priacy (DP) is all about bounding the influence of each PU on a result and perturbating the results so that the remaining influence is not noticeable." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "QlCij33c4ioh" }, "outputs": [], "source": [ "# Privacy unit definition\n", "privacy_unit = [\n", " # The column `id` directly defines the privacy unit\n", " (\"actors\", [], \"id\"),\n", " (\"directors\", [], \"id\"),\n", " # The column `director_id` refers to the column `id` of table `directors`, the `id` of which defines the privacy unit\n", " (\"directors_genres\", [(\"director_id\", \"directors\", \"id\")], \"id\"),\n", " (\"movies_directors\", [(\"director_id\", \"directors\", \"id\")], \"id\"),\n", " # The column `actor_id` refers to the column `id` of table `actors`, the `id` of which defines the privacy unit\n", " (\"roles\", [(\"actor_id\", \"actors\", \"id\")], \"id\"),\n", "]\n", "# Privacy paramers\n", "budget = {\"epsilon\": 1.0, \"delta\": 1e-3}" ] }, { "cell_type": "markdown", "metadata": { "id": "f2HUrWSH5Xis" }, "source": [ "## Run simple DP SQL queries\n", "\n", "`Relation` objects can be turned into DP equivalent by invoking the `rewrite_with_differential_privacy` method with the following arguments:\n", "\n", "* A `dataset` because the definition of the Privacy Unit may require lookups in many tables.\n", "* A Privacy Unit definition object.\n", "* An ($\\varepsilon$, $\\delta$) budget.\n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 72 }, "id": "R6PmCjDA5YDo", "outputId": "aff974b0-abfd-4027-ebac-747fef56c1ac" }, "outputs": [ { "data": { "text/markdown": [ "```sql\n", "WITH \"field_d4_o\" (\"field_d4_o\") AS (SELECT * FROM (VALUES ('F'), ('M')) AS \"field_d4_o\" (\"field_d4_o\")), \"map_o7od\" (\"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\", \"id\", \"first_name\", \"last_name\", \"gender\") AS (SELECT MD5(CAST(\"id\" AS TEXT)) AS \"_PRIVACY_UNIT_\", 1 AS \"_PRIVACY_UNIT_WEIGHT_\", \"id\" AS \"id\", \"first_name\" AS \"first_name\", \"last_name\" AS \"last_name\", \"gender\" AS \"gender\" FROM \"imdb_ijs\".\"actors\"), \"map_q1hj\" (\"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\", \"field_n9r3\", \"field_d4_o\") AS (SELECT \"_PRIVACY_UNIT_\" AS \"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\" AS \"_PRIVACY_UNIT_WEIGHT_\", 1 AS \"field_n9r3\", \"gender\" AS \"field_d4_o\" FROM \"map_o7od\"), \"join_j0lq\" (\"field_d4_o\", \"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\", \"field_n9r3\", \"left__q8p1\") AS (SELECT * FROM \"field_d4_o\" AS \"_LEFT_\" LEFT JOIN \"map_q1hj\" AS \"_RIGHT_\" ON (\"_LEFT_\".\"field_d4_o\") = (\"_RIGHT_\".\"field_d4_o\")), \"map_j37p\" (\"field_d4_o\", \"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\", \"field_n9r3\") AS (SELECT \"field_d4_o\" AS \"field_d4_o\", \"_PRIVACY_UNIT_\" AS \"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\" AS \"_PRIVACY_UNIT_WEIGHT_\", \"field_n9r3\" AS \"field_n9r3\" FROM \"join_j0lq\"), \"map_03mx\" (\"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\", \"field_d4_o\", \"_ONE_field_n9r3\") AS (SELECT COALESCE(CAST(\"_PRIVACY_UNIT_\" AS TEXT), '_PRIVACY_UNIT_DEFAULT_') AS \"_PRIVACY_UNIT_\", COALESCE(\"_PRIVACY_UNIT_WEIGHT_\", 0) AS \"_PRIVACY_UNIT_WEIGHT_\", \"field_d4_o\" AS \"field_d4_o\", CASE WHEN \"field_n9r3\" IS NULL THEN 0 ELSE 1 END AS \"_ONE_field_n9r3\" FROM \"map_j37p\"), \"map_a6kd\" (\"field_1iro\", \"field_grep\", \"field_zyse\") AS (SELECT \"_ONE_field_n9r3\" AS \"field_1iro\", \"_PRIVACY_UNIT_\" AS \"field_grep\", \"field_d4_o\" AS \"field_zyse\" FROM \"map_03mx\"), \"reduce_65ak\" (\"_PRIVACY_UNIT_\", \"field_d4_o\", \"_NORM__ONE_field_n9r3\") AS (SELECT field_grep AS \"_PRIVACY_UNIT_\", field_zyse AS \"field_d4_o\", SUM(\"field_1iro\") AS \"_NORM__ONE_field_n9r3\" FROM \"map_a6kd\" GROUP BY \"field_grep\", \"field_zyse\"), \"map_lvtk\" (\"_PRIVACY_UNIT_\", \"field_d4_o\", \"_NORM__ONE_field_n9r3\") AS (SELECT \"_PRIVACY_UNIT_\" AS \"_PRIVACY_UNIT_\", \"field_d4_o\" AS \"field_d4_o\", ABS((\"_NORM__ONE_field_n9r3\") * (\"_NORM__ONE_field_n9r3\")) AS \"_NORM__ONE_field_n9r3\" FROM \"reduce_65ak\"), \"map_u172\" (\"field_7ra5\", \"field_grep\") AS (SELECT \"_NORM__ONE_field_n9r3\" AS \"field_7ra5\", \"_PRIVACY_UNIT_\" AS \"field_grep\" FROM \"map_lvtk\"), \"reduce_x325\" (\"_PRIVACY_UNIT_\", \"_ONE_field_n9r3\") AS (SELECT field_grep AS \"_PRIVACY_UNIT_\", SUM(\"field_7ra5\") AS \"_ONE_field_n9r3\" FROM \"map_u172\" GROUP BY \"field_grep\"), \"map_a9ps\" (\"_PRIVACY_UNIT_\", \"_ONE_field_n9r3\") AS (SELECT \"_PRIVACY_UNIT_\" AS \"_PRIVACY_UNIT_\", (1) / (GREATEST(1, (SQRT(\"_ONE_field_n9r3\")) / (1))) AS \"_ONE_field_n9r3\" FROM \"reduce_x325\"), \"join_ecw5\" (\"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\", \"field_d4_o\", \"_ONE_field_n9r3\", \"_SCALE_FACTOR__PRIVACY_UNIT_\", \"_SCALE_FACTOR__ONE_field_n9r3\") AS (SELECT * FROM \"map_03mx\" AS \"_LEFT_\" JOIN \"map_a9ps\" AS \"_RIGHT_\" ON (\"_LEFT_\".\"_PRIVACY_UNIT_\") = (\"_RIGHT_\".\"_PRIVACY_UNIT_\")), \"map_gyzu\" (\"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\", \"field_d4_o\", \"_ONE_field_n9r3\", \"_SCALE_FACTOR__PRIVACY_UNIT_\", \"_SCALE_FACTOR__ONE_field_n9r3\", \"_CLIPPED__ONE_field_n9r3\") AS (SELECT \"_PRIVACY_UNIT_\" AS \"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\" AS \"_PRIVACY_UNIT_WEIGHT_\", \"field_d4_o\" AS \"field_d4_o\", \"_ONE_field_n9r3\" AS \"_ONE_field_n9r3\", \"_SCALE_FACTOR__PRIVACY_UNIT_\" AS \"_SCALE_FACTOR__PRIVACY_UNIT_\", \"_SCALE_FACTOR__ONE_field_n9r3\" AS \"_SCALE_FACTOR__ONE_field_n9r3\", (\"_ONE_field_n9r3\") * (\"_SCALE_FACTOR__ONE_field_n9r3\") AS \"_CLIPPED__ONE_field_n9r3\" FROM \"join_ecw5\"), \"map_in4e\" (\"field_8d51\", \"field_zyse\") AS (SELECT \"_CLIPPED__ONE_field_n9r3\" AS \"field_8d51\", \"field_d4_o\" AS \"field_zyse\" FROM \"map_gyzu\"), \"reduce__03f\" (\"field_d4_o\", \"_COUNT_field_n9r3\") AS (SELECT field_zyse AS \"field_d4_o\", SUM(\"field_8d51\") AS \"_COUNT_field_n9r3\" FROM \"map_in4e\" GROUP BY \"field_zyse\"), \"map_51tw\" (\"field_d4_o\", \"_COUNT_field_n9r3\") AS (SELECT \"field_d4_o\" AS \"field_d4_o\", LEAST(817718, GREATEST(0, (COALESCE(\"_COUNT_field_n9r3\", 0)) + ((3.776479532659047) * ((SQRT((-2) * (LN(RANDOM())))) * (COS((6.283185307179586) * (RANDOM()))))))) AS \"_COUNT_field_n9r3\" FROM \"reduce__03f\"), \"map_zoir\" (\"field_i4xy\", \"field_d4_o\") AS (SELECT CAST(\"_COUNT_field_n9r3\" AS INTEGER) AS \"field_i4xy\", \"field_d4_o\" AS \"field_d4_o\" FROM \"map_51tw\"), \"map_a33c\" (\"field_d4_o\", \"field_i4xy\") AS (SELECT \"field_d4_o\" AS \"field_d4_o\", \"field_i4xy\" AS \"field_i4xy\" FROM \"map_zoir\"), \"map_fshk\" (\"gender\", \"number_of_actors\") AS (SELECT \"field_d4_o\" AS \"gender\", \"field_i4xy\" AS \"number_of_actors\" FROM \"map_a33c\") SELECT * FROM \"map_fshk\"\n", "```" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "DpEvent(Gaussian { noise_multiplier: 3.776479532659047 })\n" ] } ], "source": [ "# We create a basic aggregation relation\n", "query = \"SELECT gender, COUNT(*) as number_of_actors FROM imdb_ijs.actors GROUP BY gender\"\n", "relation = dataset.relation(query)\n", "# The relation is turned into a DP equivalent\n", "relation_with_dp_event = relation.rewrite_with_differential_privacy(\n", " dataset,\n", " privacy_unit,\n", " budget,\n", ")\n", "# Rewritten relation\n", "dp_relation = relation_with_dp_event.relation()\n", "dp_query = dp_relation.to_query()\n", "print_sql(dp_query)\n", "# Privacy loss\n", "dpe = relation_with_dp_event.dp_event()\n", "print(dpe)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 289 }, "id": "gHe8VMey53-W", "outputId": "c7654ffe-8e17-45ed-a59c-35fc204e1ce8" }, "outputs": [ { "data": { "text/markdown": [ "### Result" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gendernumber_of_actors
0F304412
1M513306
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "text/plain": [ " gender number_of_actors\n", "0 F 304412\n", "1 M 513306" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "### DP Result" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gendernumber_of_actors
1F304408
0M513310
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "text/plain": [ " gender number_of_actors\n", "1 F 304408\n", "0 M 513310" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "\n", "true_res = pd.read_sql(query, database.engine())\n", "dp_res = pd.read_sql(dp_query, database.engine())\n", "\n", "print_result('Result', true_res.sort_values('gender'))\n", "print_result('DP Result', dp_res.sort_values('gender'))" ] }, { "cell_type": "markdown", "metadata": { "id": "WTUME5SL6Yez" }, "source": [ "## Run DP SQL queries with non-trivial Privacy Unit multiplicity\n", "\n", "When multiple rows *belong* to the same privacy unit. Bounding the influence of one privacy unit can be tricky.\n", "Qrlew evaluates the worst case *multiplicity* of rows per privacy unit.\n", "\n", "It can also be passed as an argument." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 72 }, "id": "C4W3jAP66Sum", "outputId": "2a8568cc-5b0f-438f-b8ad-754b20ca923b" }, "outputs": [ { "data": { "text/markdown": [ "```sql\n", "WITH \"field_8dt4\" (\"field_8dt4\") AS (SELECT * FROM (VALUES ('F'), ('M')) AS \"field_8dt4\" (\"field_8dt4\")), \"map_o7od\" (\"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\", \"id\", \"first_name\", \"last_name\", \"gender\") AS (SELECT MD5(CAST(\"id\" AS TEXT)) AS \"_PRIVACY_UNIT_\", 1 AS \"_PRIVACY_UNIT_WEIGHT_\", \"id\" AS \"id\", \"first_name\" AS \"first_name\", \"last_name\" AS \"last_name\", \"gender\" AS \"gender\" FROM \"imdb_ijs\".\"actors\"), \"join_5pxq\" (\"field_6w80\", \"field_adyv\", \"field_mg8l\", \"field_ey3i\", \"field_rvp5\", \"field_kwgs\", \"field_cenr\") AS (SELECT * FROM \"imdb_ijs\".\"actors\" AS \"_LEFT_\" JOIN \"imdb_ijs\".\"roles\" AS \"_RIGHT_\" ON (\"_RIGHT_\".\"actor_id\") = (\"_LEFT_\".\"id\")), \"map_azcr\" (\"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\", \"actor_id\", \"movie_id\", \"role\") AS (SELECT MD5(CAST(\"field_6w80\" AS TEXT)) AS \"_PRIVACY_UNIT_\", 1 AS \"_PRIVACY_UNIT_WEIGHT_\", \"field_rvp5\" AS \"actor_id\", \"field_kwgs\" AS \"movie_id\", \"field_cenr\" AS \"role\" FROM \"join_5pxq\"), \"join_2qan\" (\"_LEFT_PRIVACY_UNIT_\", \"_LEFT_PRIVACY_UNIT_WEIGHT_\", \"field_6w80\", \"field_adyv\", \"field_mg8l\", \"field_ey3i\", \"_RIGHT_PRIVACY_UNIT_\", \"_RIGHT_PRIVACY_UNIT_WEIGHT_\", \"field_rvp5\", \"field_kwgs\", \"field_cenr\") AS (SELECT * FROM \"map_o7od\" AS \"_LEFT_\" JOIN \"map_azcr\" AS \"_RIGHT_\" ON ((\"_LEFT_\".\"_PRIVACY_UNIT_\") = (\"_RIGHT_\".\"_PRIVACY_UNIT_\")) AND ((\"_RIGHT_\".\"actor_id\") = (\"_LEFT_\".\"id\"))), \"map_pc5d\" (\"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\", \"field_adyv\", \"field_ey3i\", \"field_6w80\", \"field_mg8l\", \"field_rvp5\", \"field_kwgs\", \"field_cenr\") AS (SELECT \"_LEFT_PRIVACY_UNIT_\" AS \"_PRIVACY_UNIT_\", (\"_LEFT_PRIVACY_UNIT_WEIGHT_\") * (\"_RIGHT_PRIVACY_UNIT_WEIGHT_\") AS \"_PRIVACY_UNIT_WEIGHT_\", \"field_adyv\" AS \"field_adyv\", \"field_ey3i\" AS \"field_ey3i\", \"field_6w80\" AS \"field_6w80\", \"field_mg8l\" AS \"field_mg8l\", \"field_rvp5\" AS \"field_rvp5\", \"field_kwgs\" AS \"field_kwgs\", \"field_cenr\" AS \"field_cenr\" FROM \"join_2qan\"), \"map_jyze\" (\"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\", \"field_n9r3\", \"field_8dt4\") AS (SELECT \"_PRIVACY_UNIT_\" AS \"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\" AS \"_PRIVACY_UNIT_WEIGHT_\", 1 AS \"field_n9r3\", \"field_ey3i\" AS \"field_8dt4\" FROM \"map_pc5d\"), \"join_jvsu\" (\"field_8dt4\", \"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\", \"field_n9r3\", \"left__ybda\") AS (SELECT * FROM \"field_8dt4\" AS \"_LEFT_\" LEFT JOIN \"map_jyze\" AS \"_RIGHT_\" ON (\"_LEFT_\".\"field_8dt4\") = (\"_RIGHT_\".\"field_8dt4\")), \"map_qpyp\" (\"field_8dt4\", \"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\", \"field_n9r3\") AS (SELECT \"field_8dt4\" AS \"field_8dt4\", \"_PRIVACY_UNIT_\" AS \"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\" AS \"_PRIVACY_UNIT_WEIGHT_\", \"field_n9r3\" AS \"field_n9r3\" FROM \"join_jvsu\"), \"map_pt6d\" (\"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\", \"field_8dt4\", \"_ONE_field_n9r3\") AS (SELECT COALESCE(CAST(\"_PRIVACY_UNIT_\" AS TEXT), '_PRIVACY_UNIT_DEFAULT_') AS \"_PRIVACY_UNIT_\", COALESCE(\"_PRIVACY_UNIT_WEIGHT_\", 0) AS \"_PRIVACY_UNIT_WEIGHT_\", \"field_8dt4\" AS \"field_8dt4\", CASE WHEN \"field_n9r3\" IS NULL THEN 0 ELSE 1 END AS \"_ONE_field_n9r3\" FROM \"map_qpyp\"), \"map_any6\" (\"field_1iro\", \"field_grep\", \"field_kggk\") AS (SELECT \"_ONE_field_n9r3\" AS \"field_1iro\", \"_PRIVACY_UNIT_\" AS \"field_grep\", \"field_8dt4\" AS \"field_kggk\" FROM \"map_pt6d\"), \"reduce_x8hs\" (\"_PRIVACY_UNIT_\", \"field_8dt4\", \"_NORM__ONE_field_n9r3\") AS (SELECT field_grep AS \"_PRIVACY_UNIT_\", field_kggk AS \"field_8dt4\", SUM(\"field_1iro\") AS \"_NORM__ONE_field_n9r3\" FROM \"map_any6\" GROUP BY \"field_grep\", \"field_kggk\"), \"map_1too\" (\"_PRIVACY_UNIT_\", \"field_8dt4\", \"_NORM__ONE_field_n9r3\") AS (SELECT \"_PRIVACY_UNIT_\" AS \"_PRIVACY_UNIT_\", \"field_8dt4\" AS \"field_8dt4\", ABS((\"_NORM__ONE_field_n9r3\") * (\"_NORM__ONE_field_n9r3\")) AS \"_NORM__ONE_field_n9r3\" FROM \"reduce_x8hs\"), \"map_xd2x\" (\"field_7ra5\", \"field_grep\") AS (SELECT \"_NORM__ONE_field_n9r3\" AS \"field_7ra5\", \"_PRIVACY_UNIT_\" AS \"field_grep\" FROM \"map_1too\"), \"reduce_vvwi\" (\"_PRIVACY_UNIT_\", \"_ONE_field_n9r3\") AS (SELECT field_grep AS \"_PRIVACY_UNIT_\", SUM(\"field_7ra5\") AS \"_ONE_field_n9r3\" FROM \"map_xd2x\" GROUP BY \"field_grep\"), \"map_hidt\" (\"_PRIVACY_UNIT_\", \"_ONE_field_n9r3\") AS (SELECT \"_PRIVACY_UNIT_\" AS \"_PRIVACY_UNIT_\", (1) / (GREATEST(1, (SQRT(\"_ONE_field_n9r3\")) / (100))) AS \"_ONE_field_n9r3\" FROM \"reduce_vvwi\"), \"join_ax3z\" (\"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\", \"field_8dt4\", \"_ONE_field_n9r3\", \"_SCALE_FACTOR__PRIVACY_UNIT_\", \"_SCALE_FACTOR__ONE_field_n9r3\") AS (SELECT * FROM \"map_pt6d\" AS \"_LEFT_\" JOIN \"map_hidt\" AS \"_RIGHT_\" ON (\"_LEFT_\".\"_PRIVACY_UNIT_\") = (\"_RIGHT_\".\"_PRIVACY_UNIT_\")), \"map_cra4\" (\"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\", \"field_8dt4\", \"_ONE_field_n9r3\", \"_SCALE_FACTOR__PRIVACY_UNIT_\", \"_SCALE_FACTOR__ONE_field_n9r3\", \"_CLIPPED__ONE_field_n9r3\") AS (SELECT \"_PRIVACY_UNIT_\" AS \"_PRIVACY_UNIT_\", \"_PRIVACY_UNIT_WEIGHT_\" AS \"_PRIVACY_UNIT_WEIGHT_\", \"field_8dt4\" AS \"field_8dt4\", \"_ONE_field_n9r3\" AS \"_ONE_field_n9r3\", \"_SCALE_FACTOR__PRIVACY_UNIT_\" AS \"_SCALE_FACTOR__PRIVACY_UNIT_\", \"_SCALE_FACTOR__ONE_field_n9r3\" AS \"_SCALE_FACTOR__ONE_field_n9r3\", (\"_ONE_field_n9r3\") * (\"_SCALE_FACTOR__ONE_field_n9r3\") AS \"_CLIPPED__ONE_field_n9r3\" FROM \"join_ax3z\"), \"map_wg5e\" (\"field_8d51\", \"field_kggk\") AS (SELECT \"_CLIPPED__ONE_field_n9r3\" AS \"field_8d51\", \"field_8dt4\" AS \"field_kggk\" FROM \"map_cra4\"), \"reduce_kn50\" (\"field_8dt4\", \"_COUNT_field_n9r3\") AS (SELECT field_kggk AS \"field_8dt4\", SUM(\"field_8d51\") AS \"_COUNT_field_n9r3\" FROM \"map_wg5e\" GROUP BY \"field_kggk\"), \"map_s9if\" (\"field_8dt4\", \"_COUNT_field_n9r3\") AS (SELECT \"field_8dt4\" AS \"field_8dt4\", LEAST(3431966, GREATEST(0, (COALESCE(\"_COUNT_field_n9r3\", 0)) + ((377.6479532659047) * ((SQRT((-2) * (LN(RANDOM())))) * (COS((6.283185307179586) * (RANDOM()))))))) AS \"_COUNT_field_n9r3\" FROM \"reduce_kn50\"), \"map_687m\" (\"field_i4xy\", \"field_8dt4\") AS (SELECT CAST(\"_COUNT_field_n9r3\" AS INTEGER) AS \"field_i4xy\", \"field_8dt4\" AS \"field_8dt4\" FROM \"map_s9if\"), \"map_0efo\" (\"field_8dt4\", \"field_i4xy\") AS (SELECT \"field_8dt4\" AS \"field_8dt4\", \"field_i4xy\" AS \"field_i4xy\" FROM \"map_687m\"), \"map_4gye\" (\"gender\", \"total_number_of_roles\") AS (SELECT \"field_8dt4\" AS \"gender\", \"field_i4xy\" AS \"total_number_of_roles\" FROM \"map_0efo\") SELECT * FROM \"map_4gye\"\n", "```" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "DpEvent(Gaussian { noise_multiplier: 377.6479532659047 })\n" ] } ], "source": [ "# We create a basic aggregation relation\n", "query = \"SELECT gender, COUNT(*) as total_number_of_roles FROM imdb_ijs.actors JOIN imdb_ijs.roles ON roles.actor_id=actors.id GROUP BY gender\"\n", "relation = dataset.relation(query)\n", "# The relation is turned into a DP equivalent\n", "relation_with_dp_event = relation.rewrite_with_differential_privacy(\n", " dataset,\n", " privacy_unit,\n", " budget,\n", ")\n", "# Rewritten relation\n", "dp_relation = relation_with_dp_event.relation()\n", "dp_query = dp_relation.to_query()\n", "print_sql(dp_query)\n", "# Privacy loss\n", "dpe = relation_with_dp_event.dp_event()\n", "print(dpe)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "ODMkLMSt9Wy-", "outputId": "9692348e-3f65-415e-b5a8-7d8ec592e4d4" }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_1iz3\n", "\n", "\n", "\n", "graph_1iz3\n", "\n", "MAP_4GYE size ∈ int[0 3431966]\n", "gender = field_8dt4 ∈ str{F, M} UNIQUE\n", "total_number_of_roles = field_i4xy ∈ int[0 3431966]\n", "\n", "\n", "\n", "graph_x6xd\n", "\n", "MAP_0EFO size ∈ int[0 3431966]\n", "field_8dt4 = field_8dt4 ∈ str{F, M} UNIQUE\n", "field_i4xy = field_i4xy ∈ int[0 3431966]\n", "\n", "\n", "\n", "graph_1iz3->graph_x6xd\n", "\n", "\n", "\n", "\n", "\n", "graph_f34t\n", "\n", "MAP_687M size ∈ int[0 3431966]\n", "field_i4xy = cast_as_integer(_COUNT_field_n9r3) ∈ int[0 3431966]\n", "field_8dt4 = field_8dt4 ∈ str{F, M} UNIQUE\n", "\n", "\n", "\n", "graph_x6xd->graph_f34t\n", "\n", "\n", "\n", "\n", "\n", "graph_som_\n", "\n", "MAP_S9IF size ∈ int[0 3431966]\n", "field_8dt4 = field_8dt4 ∈ str{F, M} UNIQUE\n", "_COUNT_field_n9r3 = least(3431966, greatest(0, (coalesce(_COUNT_field_n9r3, 0) + (377.6479532659047 * (sqrt((-2 * ln(random()...\n", "\n", "\n", "\n", "graph_f34t->graph_som_\n", "\n", "\n", "\n", "\n", "\n", "graph_4kwf\n", "\n", "REDUCE_KN50 size ∈ int[0 3431966]\n", "field_8dt4 = first(field_kggk) ∈ str{F, M} UNIQUE\n", "_COUNT_field_n9r3 = sum(field_8d51) ∈ option(float[0 3431966])\n", "GROUP BY (field_kggk)\n", "\n", "\n", "\n", "graph_som_->graph_4kwf\n", "\n", "\n", "\n", "\n", "\n", "graph_cmbo\n", "\n", "MAP_WG5E size ∈ int[0 3431966]\n", "field_8d51 = _CLIPPED__ONE_field_n9r3 ∈ option(float[0.00000001572843583455192 1])\n", "field_kggk = field_8dt4 ∈ str{F, M}\n", "\n", "\n", "\n", "graph_4kwf->graph_cmbo\n", "\n", "\n", "\n", "\n", "\n", "graph_6t4r\n", "\n", "MAP_CRA4 size ∈ int[0 3431966]\n", "_PRIVACY_UNIT_ = _PRIVACY_UNIT_ ∈ str\n", "_PRIVACY_UNIT_WEIGHT_ = _PRIVACY_UNIT_WEIGHT_ ∈ float{0, 1}\n", "field_8dt4 = field_8dt4 ∈ str{F, M}\n", "_ONE_field_n9r3 = _ONE_field_n9r3 ∈ option(float{1})\n", "_SCALE_FACTOR__PRIVACY_UNIT_ = _SCALE_FACTOR__PRIVACY_UNIT_ ∈ str\n", "_SCALE_FACTOR__ONE_field_n9r3 = _SCALE_FACTOR__ONE_field_n9r3 ∈ option(float[0.00000001572843583455192 1])\n", "_CLIPPED__ONE_field_n9r3 = (_ONE_field_n9r3 * _SCALE_FACTOR__ONE_field_n9r3) ∈ option(float[0.00000001572843583455192 1])\n", "\n", "\n", "\n", "graph_cmbo->graph_6t4r\n", "\n", "\n", "\n", "\n", "\n", "graph_rffi\n", "\n", "JOIN_AX3Z size ∈ int[0 3431966]\n", "_PRIVACY_UNIT_ = _LEFT_._PRIVACY_UNIT_ ∈ str\n", "_PRIVACY_UNIT_WEIGHT_ = _LEFT_._PRIVACY_UNIT_WEIGHT_ ∈ float{0, 1}\n", "field_8dt4 = _LEFT_.field_8dt4 ∈ str{F, M}\n", "_ONE_field_n9r3 = _LEFT_._ONE_field_n9r3 ∈ option(float{1})\n", "_SCALE_FACTOR__PRIVACY_UNIT_ = _RIGHT_._PRIVACY_UNIT_ ∈ str\n", "_SCALE_FACTOR__ONE_field_n9r3 = _RIGHT_._ONE_field_n9r3 ∈ option(float[0.00000001572843583455192 1])\n", "INNER ON (_LEFT_._PRIVACY_UNIT_ = _RIGHT_._PRIVACY_UNIT_)\n", "\n", "\n", "\n", "graph_6t4r->graph_rffi\n", "\n", "\n", "\n", "\n", "\n", "graph_xovo\n", "\n", "MAP_HIDT size ∈ int[0 3431966]\n", "_PRIVACY_UNIT_ = _PRIVACY_UNIT_ ∈ str UNIQUE\n", "_ONE_field_n9r3 = (1 / greatest(1, (sqrt(_ONE_field_n9r3) / 100))) ∈ option(float[0.00000001572843583455192 1])\n", "\n", "\n", "\n", "graph_rffi->graph_xovo\n", "\n", "\n", "\n", "\n", "\n", "graph_d4bc\n", "\n", "MAP_PT6D size ∈ int[0 3431966]\n", "_PRIVACY_UNIT_ = coalesce(cast_as_text(_PRIVACY_UNIT_), _PRIVACY_UNIT_DEFAULT_) ∈ str\n", "_PRIVACY_UNIT_WEIGHT_ = coalesce(_PRIVACY_UNIT_WEIGHT_, 0) ∈ float{0, 1}\n", "field_8dt4 = field_8dt4 ∈ str{F, M}\n", "_ONE_field_n9r3 = case(is_null(field_n9r3), 0, 1) ∈ option(float{1})\n", "\n", "\n", "\n", "graph_rffi->graph_d4bc\n", "\n", "\n", "\n", "\n", "\n", "graph_69wr\n", "\n", "REDUCE_VVWI size ∈ int[0 3431966]\n", "_PRIVACY_UNIT_ = first(field_grep) ∈ str UNIQUE\n", "_ONE_field_n9r3 = sum(field_7ra5) ∈ option(float[0 40423036160254140000])\n", "GROUP BY (field_grep)\n", "\n", "\n", "\n", "graph_xovo->graph_69wr\n", "\n", "\n", "\n", "\n", "\n", "graph_l3fx\n", "\n", "MAP_XD2X size ∈ int[0 3431966]\n", "field_7ra5 = _NORM__ONE_field_n9r3 ∈ option(float[0 11778390625156])\n", "field_grep = _PRIVACY_UNIT_ ∈ str\n", "\n", "\n", "\n", "graph_69wr->graph_l3fx\n", "\n", "\n", "\n", "\n", "\n", "graph_m5ah\n", "\n", "MAP_1TOO size ∈ int[0 3431966]\n", "_PRIVACY_UNIT_ = _PRIVACY_UNIT_ ∈ str\n", "field_8dt4 = field_8dt4 ∈ str{F, M}\n", "_NORM__ONE_field_n9r3 = abs((_NORM__ONE_field_n9r3 * _NORM__ONE_field_n9r3)) ∈ option(float[0 11778390625156])\n", "\n", "\n", "\n", "graph_l3fx->graph_m5ah\n", "\n", "\n", "\n", "\n", "\n", "graph_l8li\n", "\n", "REDUCE_X8HS size ∈ int[0 3431966]\n", "_PRIVACY_UNIT_ = first(field_grep) ∈ str\n", "field_8dt4 = first(field_kggk) ∈ str{F, M}\n", "_NORM__ONE_field_n9r3 = sum(field_1iro) ∈ option(int[0 3431966])\n", "GROUP BY (field_grep, field_kggk)\n", "\n", "\n", "\n", "graph_m5ah->graph_l8li\n", "\n", "\n", "\n", "\n", "\n", "graph_lds1\n", "\n", "MAP_ANY6 size ∈ int[0 3431966]\n", "field_1iro = _ONE_field_n9r3 ∈ option(float{1})\n", "field_grep = _PRIVACY_UNIT_ ∈ str\n", "field_kggk = field_8dt4 ∈ str{F, M}\n", "\n", "\n", "\n", "graph_l8li->graph_lds1\n", "\n", "\n", "\n", "\n", "\n", "graph_lds1->graph_d4bc\n", "\n", "\n", "\n", "\n", "\n", "graph_5d5_\n", "\n", "MAP_QPYP size ∈ int[0 3431966]\n", "field_8dt4 = field_8dt4 ∈ str{F, M}\n", "_PRIVACY_UNIT_ = _PRIVACY_UNIT_ ∈ option(str)\n", "_PRIVACY_UNIT_WEIGHT_ = _PRIVACY_UNIT_WEIGHT_ ∈ option(int{1})\n", "field_n9r3 = field_n9r3 ∈ option(int{1})\n", "\n", "\n", "\n", "graph_d4bc->graph_5d5_\n", "\n", "\n", "\n", "\n", "\n", "graph_kx7g\n", "\n", "JOIN_JVSU size ∈ int[0 3431966]\n", "field_8dt4 = _LEFT_.field_8dt4 ∈ str{F, M}\n", "_PRIVACY_UNIT_ = _RIGHT_._PRIVACY_UNIT_ ∈ option(str)\n", "_PRIVACY_UNIT_WEIGHT_ = _RIGHT_._PRIVACY_UNIT_WEIGHT_ ∈ option(int{1})\n", "field_n9r3 = _RIGHT_.field_n9r3 ∈ option(int{1})\n", "left__ybda = _RIGHT_.field_8dt4 ∈ option(str{F, M})\n", "LEFT ON (_LEFT_.field_8dt4 = _RIGHT_.field_8dt4)\n", "\n", "\n", "\n", "graph_5d5_->graph_kx7g\n", "\n", "\n", "\n", "\n", "\n", "graph_ukvz\n", "\n", "FIELD_8DT4 size ∈ int{2}\n", "[F, M]\n", "\n", "\n", "\n", "graph_kx7g->graph_ukvz\n", "\n", "\n", "\n", "\n", "\n", "graph_gjqr\n", "\n", "MAP_JYZE size ∈ int[0 3431966]\n", "_PRIVACY_UNIT_ = _PRIVACY_UNIT_ ∈ str\n", "_PRIVACY_UNIT_WEIGHT_ = _PRIVACY_UNIT_WEIGHT_ ∈ int{1}\n", "field_n9r3 = 1 ∈ int{1}\n", "field_8dt4 = field_ey3i ∈ str{F, M}\n", "\n", "\n", "\n", "graph_kx7g->graph_gjqr\n", "\n", "\n", "\n", "\n", "\n", "graph_ix1i\n", "\n", "MAP_PC5D size ∈ int[0 3431966]\n", "_PRIVACY_UNIT_ = _LEFT_PRIVACY_UNIT_ ∈ str\n", "_PRIVACY_UNIT_WEIGHT_ = (_LEFT_PRIVACY_UNIT_WEIGHT_ * _RIGHT_PRIVACY_UNIT_WEIGHT_) ∈ int{1}\n", "field_adyv = field_adyv ∈ str\n", "field_ey3i = field_ey3i ∈ str{F, M}\n", "field_6w80 = field_6w80 ∈ int\n", "field_mg8l = field_mg8l ∈ str\n", "field_rvp5 = field_rvp5 ∈ int\n", "field_kwgs = field_kwgs ∈ int\n", "field_cenr = field_cenr ∈ str\n", "\n", "\n", "\n", "graph_gjqr->graph_ix1i\n", "\n", "\n", "\n", "\n", "\n", "graph__2dq\n", "\n", "JOIN_2QAN size ∈ int[0 3431966]\n", "_LEFT_PRIVACY_UNIT_ = _LEFT_._PRIVACY_UNIT_ ∈ str\n", "_LEFT_PRIVACY_UNIT_WEIGHT_ = _LEFT_._PRIVACY_UNIT_WEIGHT_ ∈ int{1}\n", "field_6w80 = _LEFT_.id ∈ int\n", "field_adyv = _LEFT_.first_name ∈ str\n", "field_mg8l = _LEFT_.last_name ∈ str\n", "field_ey3i = _LEFT_.gender ∈ str{F, M}\n", "_RIGHT_PRIVACY_UNIT_ = _RIGHT_._PRIVACY_UNIT_ ∈ str\n", "_RIGHT_PRIVACY_UNIT_WEIGHT_ = _RIGHT_._PRIVACY_UNIT_WEIGHT_ ∈ int{1}\n", "field_rvp5 = _RIGHT_.actor_id ∈ int\n", "field_kwgs = _RIGHT_.movie_id ∈ int\n", "field_cenr = _RIGHT_.role ∈ str\n", "INNER ON ((_LEFT_._PRIVACY_UNIT_ = _RIGHT_._PRIVACY_UNIT_) and (_RIGHT_.actor_id = _LEFT_.id))\n", "\n", "\n", "\n", "graph_ix1i->graph__2dq\n", "\n", "\n", "\n", "\n", "\n", "graph_n3dl\n", "\n", "MAP_O7OD size ∈ int[0 817718]\n", "_PRIVACY_UNIT_ = md5(cast_as_text(id)) ∈ str UNIQUE\n", "_PRIVACY_UNIT_WEIGHT_ = 1 ∈ int{1}\n", "id = id ∈ int UNIQUE\n", "first_name = first_name ∈ str\n", "last_name = last_name ∈ str\n", "gender = gender ∈ str{F, M}\n", "\n", "\n", "\n", "graph__2dq->graph_n3dl\n", "\n", "\n", "\n", "\n", "\n", "graph_621r\n", "\n", "MAP_AZCR size ∈ int[0 3431966]\n", "_PRIVACY_UNIT_ = md5(cast_as_text(field_6w80)) ∈ str\n", "_PRIVACY_UNIT_WEIGHT_ = 1 ∈ int{1}\n", "actor_id = field_rvp5 ∈ int\n", "movie_id = field_kwgs ∈ int\n", "role = field_cenr ∈ str\n", "\n", "\n", "\n", "graph__2dq->graph_621r\n", "\n", "\n", "\n", "\n", "\n", "graph_hq3c\n", "\n", "IMDB_IJS_ACTORS size ∈ int{817718}\n", "id = id ∈ int UNIQUE\n", "first_name = first_name ∈ str\n", "last_name = last_name ∈ str\n", "gender = gender ∈ str{F, M}\n", "\n", "\n", "\n", "graph_n3dl->graph_hq3c\n", "\n", "\n", "\n", "\n", "\n", "graph_mgxe\n", "\n", "JOIN_5PXQ size ∈ int[0 3431966]\n", "field_6w80 = _LEFT_.id ∈ int\n", "field_adyv = _LEFT_.first_name ∈ str\n", "field_mg8l = _LEFT_.last_name ∈ str\n", "field_ey3i = _LEFT_.gender ∈ str{F, M}\n", "field_rvp5 = _RIGHT_.actor_id ∈ int\n", "field_kwgs = _RIGHT_.movie_id ∈ int\n", "field_cenr = _RIGHT_.role ∈ str\n", "INNER ON (_RIGHT_.actor_id = _LEFT_.id)\n", "\n", "\n", "\n", "graph_621r->graph_mgxe\n", "\n", "\n", "\n", "\n", "\n", "graph_mgxe->graph_hq3c\n", "\n", "\n", "\n", "\n", "\n", "graph_6mn0\n", "\n", "IMDB_IJS_ROLES size ∈ int{3431966}\n", "actor_id = actor_id ∈ int\n", "movie_id = movie_id ∈ int\n", "role = role ∈ str\n", "\n", "\n", "\n", "graph_mgxe->graph_6mn0\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print_dot(dp_relation.dot())" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 289 }, "id": "F90XcmGS6rql", "outputId": "53745421-6808-4d39-bff1-6d22caa61e38" }, "outputs": [ { "data": { "text/markdown": [ "### Result" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gendertotal_number_of_roles
0F1135174
1M2296792
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "text/plain": [ " gender total_number_of_roles\n", "0 F 1135174\n", "1 M 2296792" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "### DP Result" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gendertotal_number_of_roles
1F1120108
0M2186684
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "text/plain": [ " gender total_number_of_roles\n", "1 F 1120108\n", "0 M 2186684" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "\n", "true_res = pd.read_sql(query, database.engine())\n", "dp_res = pd.read_sql(dp_query, database.engine())\n", "\n", "print_result('Result', true_res.sort_values('gender'))\n", "print_result('DP Result', dp_res.sort_values('gender'))" ] }, { "cell_type": "markdown", "metadata": { "id": "eMHTH517lcow" }, "source": [ "In this example Qrlew made assumptions on the number of roles per actor to clip the influence of one actor. This is tunable.\n", "We can for example set a `max_multiplicity` parameter of 1.\n", "\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 289 }, "id": "2_jCz736mSFl", "outputId": "8921bd9d-f49b-4afa-a549-3e955184d5b4" }, "outputs": [ { "data": { "text/markdown": [ "### Result" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gendertotal_number_of_roles
0F1135174
1M2296792
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "text/plain": [ " gender total_number_of_roles\n", "0 F 1135174\n", "1 M 2296792" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "### DP Result" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gendertotal_number_of_roles
1F304411
0M513306
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "text/plain": [ " gender total_number_of_roles\n", "1 F 304411\n", "0 M 513306" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# The relation is turned into a DP equivalent\n", "relation_with_dp_event = relation.rewrite_with_differential_privacy(\n", " dataset,\n", " privacy_unit,\n", " budget,\n", " max_multiplicity=1,\n", ")\n", "# Rewritten relation\n", "dp_relation = relation_with_dp_event.relation()\n", "dp_query = dp_relation.to_query()\n", "dp_res = pd.read_sql(dp_query, database.engine())\n", "\n", "print_result('Result', true_res.sort_values('gender'))\n", "print_result('DP Result', dp_res.sort_values('gender'))" ] }, { "cell_type": "markdown", "metadata": { "id": "KQMcKkRSoFOj" }, "source": [ "The result will have litle noise, but we see it is very biased.\n", "We actually count the number of actor instead of the number of role.\n", "\n", "We can now set a `max_multiplicity` parameter of 100000 and a `max_multiplicity_share` of 100% of estimated size of the dataset as qrlew will use the smallest of both." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 289 }, "id": "ke15efQ9oOlu", "outputId": "73d8fa09-169c-4268-eb73-f5fe52dbd813" }, "outputs": [ { "data": { "text/markdown": [ "### Result" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gendertotal_number_of_roles
0F1135174
1M2296792
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "text/plain": [ " gender total_number_of_roles\n", "0 F 1135174\n", "1 M 2296792" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "### DP Result" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gendertotal_number_of_roles
1F1251391
0M1988560
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "text/plain": [ " gender total_number_of_roles\n", "1 F 1251391\n", "0 M 1988560" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# The relation is turned into a DP equivalent\n", "relation_with_dp_event = relation.rewrite_with_differential_privacy(\n", " dataset,\n", " privacy_unit,\n", " budget,\n", " max_multiplicity=100000,\n", " max_multiplicity_share=1.,\n", ")\n", "# Rewritten relation\n", "dp_relation = relation_with_dp_event.relation()\n", "dp_query = dp_relation.to_query()\n", "dp_res = pd.read_sql(dp_query, database.engine())\n", "\n", "print_result('Result', true_res.sort_values('gender'))\n", "print_result('DP Result', dp_res.sort_values('gender'))" ] }, { "cell_type": "markdown", "metadata": { "id": "JDMIyzg9pnej" }, "source": [ "Now the value will be unbiased, but very noisy.\n", "\n", "Taking a value of 1000 yields good results here.\n", "\n", "The defaults is `max_multiplicity=100.0` and `max_multiplicity_share=0.1`." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 289 }, "id": "F7gDlsWfpz9d", "outputId": "bbbfe62e-1b09-436e-d08e-06d65b9c8d23" }, "outputs": [ { "data": { "text/markdown": [ "### Result" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gendertotal_number_of_roles
0F1135174
1M2296792
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "text/plain": [ " gender total_number_of_roles\n", "0 F 1135174\n", "1 M 2296792" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "### DP Result" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gendertotal_number_of_roles
1F1140708
0M2299726
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "text/plain": [ " gender total_number_of_roles\n", "1 F 1140708\n", "0 M 2299726" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# The relation is turned into a DP equivalent\n", "relation_with_dp_event = relation.rewrite_with_differential_privacy(\n", " dataset,\n", " privacy_unit,\n", " budget,\n", " max_multiplicity=1000,\n", " max_multiplicity_share=1.,\n", ")\n", "# Rewritten relation\n", "dp_relation = relation_with_dp_event.relation()\n", "dp_query = dp_relation.to_query()\n", "dp_res = pd.read_sql(dp_query, database.engine())\n", "\n", "print_result('Result', true_res.sort_values('gender'))\n", "print_result('DP Result', dp_res.sort_values('gender'))" ] }, { "cell_type": "markdown", "metadata": { "id": "wQ5LZ0xet748" }, "source": [ "## Qrlew supports several SQL dialects: PostgreSQL (default), MSSQL or BigQuery.\n", "\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "id": "ncmHaAdY0wzU" }, "outputs": [], "source": [ "from pyqrlew import Dialect\n", "\n", "# We create a basic relation\n", "query = \"SELECT LN(1+rank) as log_rank FROM imdb_ijs.movies GROUP BY gender\"\n", "relation = dataset.relation(query)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "id": "DdyvN-anuPIA", "outputId": "eff46989-5f47-4297-ba6f-bcd46f1f1a68" }, "outputs": [ { "data": { "text/markdown": [ "### Relation rendered as PostgreSQL" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "```sql\n", "WITH \"reduce_9y_x\" (\"field_07kk\") AS (SELECT rank AS \"field_07kk\" FROM \"imdb_ijs\".\"movies\" GROUP BY \"gender\"), \"map_ucb9\" (\"log_rank\") AS (SELECT LN((1) + (\"field_07kk\")) AS \"log_rank\" FROM \"reduce_9y_x\") SELECT * FROM \"map_ucb9\"\n", "```" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print_md('### Relation rendered as PostgreSQL')\n", "query = relation.to_query()\n", "print_sql(query)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "id": "6NLf-7OtugVn", "outputId": "70155a25-f23d-4b54-e579-5a4795989331" }, "outputs": [ { "data": { "text/markdown": [ "### Relation rendered as BigQuery" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "```sql\n", "WITH `reduce_9y_x` AS (SELECT rank AS `field_07kk` FROM `imdb_ijs`.`movies` GROUP BY `gender`), `map_ucb9` AS (SELECT LN((1) + (`field_07kk`)) AS `log_rank` FROM `reduce_9y_x`) SELECT * FROM `map_ucb9`\n", "```" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print_md('### Relation rendered as BigQuery')\n", "query = relation.to_query(dialect=Dialect.BigQuery)\n", "print_sql(query)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "id": "IrKKZru5vcdL", "outputId": "d624fd77-e50f-4674-bcf5-ad13c0a182fc" }, "outputs": [ { "data": { "text/markdown": [ "### Relation rendered as MsSql" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "```sql\n", "WITH \"reduce_9y_x\" (\"field_07kk\") AS (SELECT rank AS \"field_07kk\" FROM \"imdb_ijs\".\"movies\" GROUP BY \"gender\"), \"map_ucb9\" (\"log_rank\") AS (SELECT LOG((1) + (\"field_07kk\")) AS \"log_rank\" FROM \"reduce_9y_x\") SELECT * FROM \"map_ucb9\"\n", "```" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print_md('### Relation rendered as MsSql')\n", "query = relation.to_query(dialect=Dialect.MsSql)\n", "print_sql(query)" ] }, { "cell_type": "markdown", "metadata": { "id": "vxmHupRyHaJh" }, "source": [ "Note how the quoting varies and how LN is converted into LOG for MSSQL.\n", "\n", "Not only Qrlew can render relation ito many dialects it can also parse different dialects." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 368 }, "id": "n8Yzz12GmN68", "outputId": "d469729b-abf1-48a8-85cb-8e5bfdfbaa4e" }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_m0sq\n", "\n", "\n", "\n", "graph_m0sq\n", "\n", "MAP_SQ4T size ∈ int[0 388269]\n", "log rank = ln((1 + field_07kk)) ∈ float[0 3.044522437723423]\n", "\n", "\n", "\n", "graph_eu9v\n", "\n", "REDUCE_9Y_X size ∈ int[0 388269]\n", "field_07kk = first(rank) ∈ float[0 20] UNIQUE\n", "GROUP BY (gender)\n", "\n", "\n", "\n", "graph_m0sq->graph_eu9v\n", "\n", "\n", "\n", "\n", "\n", "graph_773q\n", "\n", "IMDB_IJS_MOVIES size ∈ int{388269}\n", "id = id ∈ int UNIQUE\n", "name = name ∈ str\n", "year = year ∈ int\n", "rank = rank ∈ float[0 20]\n", "\n", "\n", "\n", "graph_eu9v->graph_773q\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# We create a basic relation\n", "query = \"\"\"SELECT LN(1+rank) as \"log rank\" FROM imdb_ijs.movies GROUP BY gender\"\"\"\n", "relation = dataset.relation(query, dialect=Dialect.PostgreSql)\n", "print_dot(relation.dot())" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 368 }, "id": "oNXr0fHSmVGD", "outputId": "211db54a-2121-4372-bd88-b934b53d40ef" }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_m0sq\n", "\n", "\n", "\n", "graph_m0sq\n", "\n", "MAP_SQ4T size ∈ int[0 388269]\n", "log rank = ln((1 + field_07kk)) ∈ float[0 3.044522437723423]\n", "\n", "\n", "\n", "graph_eu9v\n", "\n", "REDUCE_9Y_X size ∈ int[0 388269]\n", "field_07kk = first(rank) ∈ float[0 20] UNIQUE\n", "GROUP BY (gender)\n", "\n", "\n", "\n", "graph_m0sq->graph_eu9v\n", "\n", "\n", "\n", "\n", "\n", "graph_773q\n", "\n", "IMDB_IJS_MOVIES size ∈ int{388269}\n", "id = id ∈ int UNIQUE\n", "name = name ∈ str\n", "year = year ∈ int\n", "rank = rank ∈ float[0 20]\n", "\n", "\n", "\n", "graph_eu9v->graph_773q\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# We create a basic relation in BigQuery dialect\n", "query = \"\"\"SELECT LN(1+rank) as `log rank` FROM imdb_ijs.movies GROUP BY gender\"\"\"\n", "relation = dataset.relation(query, dialect=Dialect.BigQuery)\n", "print_dot(relation.dot())" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 368 }, "id": "cbafUTQimq2P", "outputId": "86db3bfe-34f8-4b3a-9101-8596ddbdc6a4" }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_m0sq\n", "\n", "\n", "\n", "graph_m0sq\n", "\n", "MAP_SQ4T size ∈ int[0 388269]\n", "log rank = ln((1 + field_07kk)) ∈ float[0 3.044522437723423]\n", "\n", "\n", "\n", "graph_eu9v\n", "\n", "REDUCE_9Y_X size ∈ int[0 388269]\n", "field_07kk = first(rank) ∈ float[0 20] UNIQUE\n", "GROUP BY (gender)\n", "\n", "\n", "\n", "graph_m0sq->graph_eu9v\n", "\n", "\n", "\n", "\n", "\n", "graph_773q\n", "\n", "IMDB_IJS_MOVIES size ∈ int{388269}\n", "id = id ∈ int UNIQUE\n", "name = name ∈ str\n", "year = year ∈ int\n", "rank = rank ∈ float[0 20]\n", "\n", "\n", "\n", "graph_eu9v->graph_773q\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# We create a basic relation in MSSQL dialect\n", "query = \"\"\"SELECT LOG(1+rank) as \"log rank\" FROM imdb_ijs.movies GROUP BY gender\"\"\"\n", "relation = dataset.relation(query, dialect=Dialect.MsSql)\n", "print_dot(relation.dot())" ] }, { "cell_type": "markdown", "metadata": { "id": "24_sxs-vm4D7" }, "source": [ "They all return the same abstract `Relation` object" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }