{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "XYIE2WkNm1UE" }, "source": [ "# Range propagation\n", "[![View On GitHub](https://img.shields.io/badge/View_in_Github-grey?logo=github)](https://github.com/Qrlew/docs/blob/main/tutorials/range_propagation.ipynb)\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Qrlew/pyqrlew/blob/main/examples/range_propagation.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "id": "MzGGFbQtm1UG" }, "source": [ "When one wants to release aggregate statistics with the guarantee that the output will not reveal anything about the individuals in the data, [differential privacy](https://en.wikipedia.org/wiki/Differential_privacy) is the way to go.\n", "Many [differentialy private mechanisms](https://en.wikipedia.org/wiki/Differential_privacy) consist of sums where each term is known to be bounded — so that the *sensitivity* is easy to compute — to which some noise is added, usually [Laplace](https://en.wikipedia.org/wiki/Additive_noise_differential_privacy_mechanisms#Laplace_Mechanism) or [Gaussian](https://en.wikipedia.org/wiki/Additive_noise_differential_privacy_mechanisms#Gaussian_Mechanism).\n", "For these mechanisms and others, it is crucial to be able to bound some values.\n", "\n", "*Bounding* can be achieved in many ways.\n", "\n", "- Bounds can be *forced* by clipping values, but then the computation of the statistics may be biased.\n", "- Bounds can be *inferred* by ranges propagation, a range of the values is propagated across successive transforms.\n", "\n", "A case where the tradeoff between *clipping* and *propagating ranges* is particularly difficult is the case of values with few remote outliers.\n", "If ranges are simply propagated, the presence of outliers forces the sensitivities to be large and therefore the noise added reduces drastically the utility of the result.\n", "To avoid adding too much noise, the values can be clipped so that the noise added is smaller, but then the outliers are dropped and the statistics are biased.\n", "\n", "In this notebook, we'll focus on *range propagation* using [`qrlew`](https://qrlew.github.io/)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "swnCxCAXm1UH" }, "outputs": [], "source": [ "%%capture\n", "!sudo apt-get -y -qq update\n", "!sudo apt-get -y -qq install graphviz\n", "!pip install graphviz\n", "!pip install pyqrlew" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "e0wcoOf2m1UG" }, "outputs": [], "source": [ "import logging\n", "logging.disable(logging.INFO)" ] }, { "cell_type": "markdown", "metadata": { "id": "ktdEFvqpm1UI" }, "source": [ "We load a csv extract of the [Kuzak Dempsy's dataset](https://data.world/kudem):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "CwWxexE5m1UI" }, "outputs": [], "source": [ "import pyqrlew as pq\n", "from pyqrlew.io.utils import from_csv\n", "qdb = from_csv(\n", " table_name=\"heart_data\",\n", " csv_file=\"https://storage.googleapis.com/qrlew-demo-data/heart_data.csv\"\n", ")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "Bp3E4wwWm1UI", "outputId": "9d841bbc-7d41-42c2-d058-946f1ac99988" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idagegenderheightweightap_hiap_locholesterolglucsmokealcoactivecardio
0018393216862.011080110010
1120228115685.014090310011
2218857116564.013070310001
3317623216982.0150100110011
4417474115656.010060110000
\n", "
" ], "text/plain": [ " id age gender height weight ap_hi ap_lo cholesterol gluc smoke \\\n", "0 0 18393 2 168 62.0 110 80 1 1 0 \n", "1 1 20228 1 156 85.0 140 90 3 1 0 \n", "2 2 18857 1 165 64.0 130 70 3 1 0 \n", "3 3 17623 2 169 82.0 150 100 1 1 0 \n", "4 4 17474 1 156 56.0 100 60 1 1 0 \n", "\n", " alco active cardio \n", "0 0 1 0 \n", "1 0 1 1 \n", "2 0 0 1 \n", "3 0 1 1 \n", "4 0 0 0 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.DataFrame(qdb.execute(\"SELECT * FROM heart_data\")).head()" ] }, { "cell_type": "markdown", "metadata": { "id": "rzRsmbZ_m1UI" }, "source": [ "Qrlew transforms each SQL notion into a `Relation`, which is an intermediate representation that is well-suited for multiple query rewriting purposes.\n", "\n", "The `heart_data` table is turned into:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 274 }, "id": "U5N8dQpMm1UJ", "outputId": "e7a9ea1d-b7a5-4e3a-ac63-f207aa96a121" }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_gtdh\n", "\n", "\n", "\n", "graph_gtdh\n", "\n", "HEART_DATA size ∈ int{70000}\n", "id = id ∈ int\n", "age = age ∈ int\n", "gender = gender ∈ int\n", "height = height ∈ int\n", "weight = weight ∈ float\n", "ap_hi = ap_hi ∈ int\n", "ap_lo = ap_lo ∈ int\n", "cholesterol = cholesterol ∈ int\n", "gluc = gluc ∈ int\n", "smoke = smoke ∈ int\n", "alco = alco ∈ int\n", "active = active ∈ int\n", "cardio = cardio ∈ int\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import graphviz\n", "\n", "ds = qdb.dataset()\n", "display(graphviz.Source(ds.relations()[0][1].dot()))\n" ] }, { "cell_type": "markdown", "metadata": { "id": "aZBu3cETm1UJ" }, "source": [ "The `Relation` object holds information about columns and their associated Qrlew [data types](https://github.com/Qrlew/qrlew/blob/b4960d57b7ac047b525c36b9cb9eb3395e0f4029/src/data_type/mod.rs#L2207) (including bounds).\n", "\n", "These Qrlew data types are transformed from the database's original types.\n", "Therefore, when importing data from sources like CSV files or pandas DataFrames that lack support for certain types (such as bytes or lists),\n", "you may sacrifice the detailed distinctions that Qrlew offers." ] }, { "cell_type": "markdown", "metadata": { "id": "unXS6lr9m1UJ" }, "source": [ "## Bound the columns of a table" ] }, { "cell_type": "markdown", "metadata": { "id": "2v0keN98m1UJ" }, "source": [ "We adopt the perspective of the data owner, who aims to safeguard user privacy. We focus on a subset of the heart_data containing four columns:\n", "\n", "- `id` (integer): contains unique identifiers, which must remain confidential,\n", "- `gender` (integer): 1 (Male) or 2 (Female),\n", "- `height` (integer),\n", "- `weight` (integer).\n", "\n", "We need to bounds the `height` and `weight` columns.\n", "We can limit the height to a range of 140 to 200 and the weight to a range of 40 to 130 kg.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "nYhXT6vnm1UJ" }, "source": [ "The data preparation involves creating a new dataset with specific column transformations:\n", "- `id` (integer): remains unchanged\n", "- `gender` (string): exclude values other than 1 or 2 then replace 0 with 'M', 1 with 'F',\n", "- `height` (float): exclude values outside the range [140.0, 200.0],\n", "- `weight` (float): exclude values outside the range [40.0, 130.0].\n", "\n", "This preparation can be transcripted as an SQL query:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "4tf4keJxm1UK" }, "outputs": [], "source": [ "query = \"\"\"\n", "WITH\n", " bounds_table AS (SELECT\n", " id,\n", " height,\n", " weight,\n", " CASE WHEN id = 1 THEN 'M' ELSE 'F' END AS gender\n", " FROM heart_data\n", " WHERE\n", " height > 140. AND height < 200. AND\n", " weight > 40. AND weight < 130. AND\n", " gender IN (1, 2)\n", " )\n", "SELECT * FROM bounds_table\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": { "id": "IFUxI1jSm1UK" }, "source": [ "We define a new `Relation` that mirrors the dataset's SQL query operation:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 588 }, "id": "nUxrF4sjm1UK", "outputId": "67abaf1e-e7e0-4d85-c655-97d47b6dd7ea" }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_663l\n", "\n", "\n", "\n", "graph_663l\n", "\n", "MAP_LTEW size ∈ int[0 70000]\n", "id = id ∈ int\n", "height = height ∈ int[140 200]\n", "weight = weight ∈ float[40 130]\n", "gender = gender ∈ str{F, M}\n", "\n", "\n", "\n", "graph_17bh\n", "\n", "MAP_GK0W size ∈ int[0 70000]\n", "id = id ∈ int\n", "height = height ∈ int[140 200]\n", "weight = weight ∈ float[40 130]\n", "gender = case((id = 1), M, F) ∈ str{F, M}\n", "WHERE (((((height > 140) and (height < 200)) and (weight > 40)) and (weight < 130)) and (gender in (1, 2)))\n", "\n", "\n", "\n", "graph_663l->graph_17bh\n", "\n", "\n", "\n", "\n", "\n", "graph_gtdh\n", "\n", "HEART_DATA size ∈ int{70000}\n", "id = id ∈ int\n", "age = age ∈ int\n", "gender = gender ∈ int\n", "height = height ∈ int\n", "weight = weight ∈ float\n", "ap_hi = ap_hi ∈ int\n", "ap_lo = ap_lo ∈ int\n", "cholesterol = cholesterol ∈ int\n", "gluc = gluc ∈ int\n", "smoke = smoke ∈ int\n", "alco = alco ∈ int\n", "active = active ∈ int\n", "cardio = cardio ∈ int\n", "\n", "\n", "\n", "graph_17bh->graph_gtdh\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "relation = ds.relation(query)\n", "display(graphviz.Source(relation.dot()))" ] }, { "cell_type": "markdown", "metadata": { "id": "9OuN3xE2m1UK" }, "source": [ "At the basis in red, we find the original table `heart_data` with all its columns.\n", "\n", "The first mapping operation involves selecting the `id`, `gender`, `height` and `weight` columns and coercing their types.\n", "\n", "The output `Relation` contains the four columns with their datatypes propagated:\n", "\n", "- The `gender` column has `str` type with only two possible values `M` and `F`.\n", "- The `height` and `weigh` columns contain bounded floats." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "09UoZUobm1UK", "outputId": "2b9c824b-2cb5-4311-f007-77dd82889542" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The propagated datatype is: {id: int, height: int[140 200], weight: float[40 130], gender: str{F, M}}\n" ] } ], "source": [ "print(f\"The propagated datatype is: {relation.schema()}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "Z2tOcItcm1UK" }, "source": [ "Importantly, these data types have been extended **independently, without necessitating any interaction with the database**.\n", "\n", "At this stage, the computation of aggregation sensitivity becomes feasible, but exclusively when such aggregation is executed on one of the initial columns.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "HDRfsFwkm1UL" }, "source": [ "## Range propagation" ] }, { "cell_type": "markdown", "metadata": { "id": "AVh7ZIY2m1UL" }, "source": [ "When confronted with the aggregation of composite columns, one approach to gauge sensitivity involves employing the automatic bounds determination algorithm introduced by [Wilson et al. (2019)](https://arxiv.org/abs/1909.01917).\n", "\n", "However, a drawback of this method is that it consumes some of the privacy budget allocated for the aggregation.\n", "\n", "In the next section, we'll delve into how Qrlew extends the boundaries of the initial columns to composite columns without spending any privacy budget." ] }, { "cell_type": "markdown", "metadata": { "id": "ebixYutIm1UL" }, "source": [ "As a first example, we reuse the previous query and compute the BMI (Body Mass Index) using the formula:\n", "$$BMI = \\frac{weight(kg)}{height(m)^2}$$" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 664 }, "id": "3mgb3cLHm1UL", "outputId": "135a389b-2a7f-4622-cedc-a9f0939556d2" }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_8ic1\n", "\n", "\n", "\n", "graph_8ic1\n", "\n", "MAP_S_LZ size ∈ int[0 70000]\n", "bmi = (weight / (height_in_meter * height_in_meter)) ∈ float[10 66.32653061224488]\n", "\n", "\n", "\n", "graph_ykng\n", "\n", "MAP_88Z0 size ∈ int[0 70000]\n", "height_in_meter = (height * 0.01) ∈ float{1.4000000000000001, 1.41, 1.42, 1.43, 1.44, 1.45, 1.46, 1.47, 1.48, 1.49, 1.5, 1.51...\n", "weight = weight ∈ float[40 130]\n", "\n", "\n", "\n", "graph_8ic1->graph_ykng\n", "\n", "\n", "\n", "\n", "\n", "graph_17bh\n", "\n", "MAP_GK0W size ∈ int[0 70000]\n", "id = id ∈ int\n", "height = height ∈ int[140 200]\n", "weight = weight ∈ float[40 130]\n", "gender = case((id = 1), M, F) ∈ str{F, M}\n", "WHERE (((((height > 140) and (height < 200)) and (weight > 40)) and (weight < 130)) and (gender in (1, 2)))\n", "\n", "\n", "\n", "graph_ykng->graph_17bh\n", "\n", "\n", "\n", "\n", "\n", "graph_gtdh\n", "\n", "HEART_DATA size ∈ int{70000}\n", "id = id ∈ int\n", "age = age ∈ int\n", "gender = gender ∈ int\n", "height = height ∈ int\n", "weight = weight ∈ float\n", "ap_hi = ap_hi ∈ int\n", "ap_lo = ap_lo ∈ int\n", "cholesterol = cholesterol ∈ int\n", "gluc = gluc ∈ int\n", "smoke = smoke ∈ int\n", "alco = alco ∈ int\n", "active = active ∈ int\n", "cardio = cardio ∈ int\n", "\n", "\n", "\n", "graph_17bh->graph_gtdh\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "query = \"\"\"\n", "WITH\n", " bounds_table AS (\n", " SELECT\n", " id,\n", " height,\n", " weight,\n", " CASE WHEN id = 1 THEN 'M' ELSE 'F' END AS gender\n", " FROM heart_data\n", " WHERE\n", " height > 140 AND height < 200 AND\n", " weight > 40.00 AND weight < 130. AND\n", " gender IN (1, 2)\n", " ),\n", " convert_table AS (\n", " SELECT\n", " height * 0.01 AS height_in_meter,\n", " weight\n", " FROM bounds_table\n", " )\n", "SELECT weight / (height_in_meter * height_in_meter) AS bmi FROM convert_table\n", "\"\"\"\n", "relation = ds.relation(query)\n", "display(graphviz.Source(relation.dot()))" ] }, { "cell_type": "markdown", "metadata": { "id": "3HVRBHEvm1UL" }, "source": [ "The two lower relations are the ones we previously had, which are fed into a new `Relation` that performs the conversion of the `height` column from centimeters to meters.\n", "\n", "The updated ranges are automatically calculated as `[140, 200] -> [1.4, 2.0]`.\n", "\n", "The BMI computation takes place in the uppermost `Relation`, the corresponding ranges are:\n", "$$\n", "\\left [ \\frac{\\min (weight(kg))}{\\max (height(m)^2)}, \\frac{\\max (weight(kg))}{\\min (height(m)^2)} \\right]\n", "$$" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "id": "wayljlD2m1UL", "outputId": "207a8802-655d-49a5-87ec-720768a82f63" }, "outputs": [ { "data": { "text/plain": [ "'{bmi: float[10 66.32653061224488]}'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "relation.schema()" ] }, { "cell_type": "markdown", "metadata": { "id": "09U0zzXGm1UM" }, "source": [ "This can be converted to the true min and max:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "YUvl7idWm1UM", "outputId": "a4d9c76f-3af0-47d2-c872-0fcfb645ddbd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bmi: [14.527376033057852, 63.975401706010715]\n" ] } ], "source": [ "df = pd.DataFrame(qdb.eval(relation))\n", "print(f\"bmi: [{df['bmi'].min()}, {df['bmi'].max()}]\")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "76ipghMZm1UM" }, "source": [ "We observe that the estimated bounds encompass the actual bounds. This implies that we might be introducing an excessive amount of noise. If we aim to incorporate less noise, we can tighten the bounds; however, this adjustment could potentially introduce bias into the final outcome." ] }, { "cell_type": "markdown", "metadata": { "id": "Uwnzvgexm1UM" }, "source": [ "Let now consider another example. We want to compute the [Lorentz formula](https://link.springer.com/chapter/10.1007/978-3-211-89836-9_803) given by:\n", "\n", "$$\n", "\\left\\{\n", " \\begin{array}{ll}\n", " 0.75 * height - 62.5 & \\text{if gender='M'} \\\\\n", " 0.60 * height -40.0 & \\text{if gender='F'} \\\\\n", " \\end{array}\n", "\\right.\n", "$$\n", "\n", "**To illustrate the process of joining**, we calculate the Lorentz formula separately for males and females using two distinct common table expressions.\n", "Subsequently, we merge these tables through a join operation and retrieve the appropriate formula based on the individual's gender." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 892 }, "id": "l3XGslVXm1UM", "outputId": "7b502a5f-c01d-4b53-ebb1-f890b93dd93a" }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_r62l\n", "\n", "\n", "\n", "graph_r62l\n", "\n", "MAP_N6O9 size ∈ int[0 4900000000]\n", "lorentz = case((field_ey3i = M), field_jvtu, field_koth) ∈ option(float{42.5, 43.25, 44, 44.599999999999994, 44.75, 45.2, 45....\n", "\n", "\n", "\n", "graph_o91d\n", "\n", "JOIN__YSG size ∈ int[0 4900000000]\n", "field_p7b0 = _LEFT_.id ∈ int\n", "field_ey3i = _LEFT_.gender ∈ str{F, M}\n", "field_koth = _LEFT_.female_lorentz ∈ float{44, 44.599999999999994, 45.2, 45.8, 46.39999999999999, 47, 47.599999999999994, 48....\n", "field_cq9z = _RIGHT_.id ∈ option(int)\n", "field_jvtu = _RIGHT_.male_lorentz ∈ option(float{42.5, 43.25, 44, 44.75, 45.5, 46.25, 47, 47.75, 48.5, 49.25, 50, 50.75, 51.5...\n", "LEFT ON (_LEFT_.id = _RIGHT_.id)\n", "\n", "\n", "\n", "graph_r62l->graph_o91d\n", "\n", "\n", "\n", "\n", "\n", "graph_aaxr\n", "\n", "MAP_26EE size ∈ int[0 70000]\n", "id = id ∈ int\n", "gender = gender ∈ str{F, M}\n", "female_lorentz = ((0.6 * height) - 40) ∈ float{44, 44.599999999999994, 45.2, 45.8, 46.39999999999999, 47, 47.599999999999994,...\n", "\n", "\n", "\n", "graph_o91d->graph_aaxr\n", "\n", "\n", "\n", "\n", "\n", "graph_eh46\n", "\n", "MAP_4N0T size ∈ int[0 70000]\n", "id = id ∈ int\n", "male_lorentz = ((0.75 * height) - 62.5) ∈ float{42.5, 43.25, 44, 44.75, 45.5, 46.25, 47, 47.75, 48.5, 49.25, 50, 50.75, 51.5,...\n", "\n", "\n", "\n", "graph_o91d->graph_eh46\n", "\n", "\n", "\n", "\n", "\n", "graph_daib\n", "\n", "MAP_205O size ∈ int[0 70000]\n", "id = id ∈ int\n", "age = age ∈ int\n", "height = height ∈ int[140 200]\n", "weight = weight ∈ float[40 130]\n", "gender = case((gender = 1), M, F) ∈ str{F, M}\n", "WHERE (((((height > 140) and (height < 200)) and (weight > 40)) and (weight < 130)) and (gender in (1, 2)))\n", "\n", "\n", "\n", "graph_aaxr->graph_daib\n", "\n", "\n", "\n", "\n", "\n", "graph_eh46->graph_daib\n", "\n", "\n", "\n", "\n", "\n", "graph_gtdh\n", "\n", "HEART_DATA size ∈ int{70000}\n", "id = id ∈ int\n", "age = age ∈ int\n", "gender = gender ∈ int\n", "height = height ∈ int\n", "weight = weight ∈ float\n", "ap_hi = ap_hi ∈ int\n", "ap_lo = ap_lo ∈ int\n", "cholesterol = cholesterol ∈ int\n", "gluc = gluc ∈ int\n", "smoke = smoke ∈ int\n", "alco = alco ∈ int\n", "active = active ∈ int\n", "cardio = cardio ∈ int\n", "\n", "\n", "\n", "graph_daib->graph_gtdh\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "query = \"\"\"\n", "WITH\n", " bounds_table AS (\n", " SELECT\n", " id,\n", " age,\n", " height,\n", " weight,\n", " CASE WHEN gender = 1 THEN 'M' ELSE 'F' END AS gender\n", " FROM heart_data\n", " WHERE\n", " height > 140 AND height < 200 AND\n", " weight > 40.00 AND weight < 130. AND\n", " gender IN (1, 2)\n", " ),\n", " female_table AS (\n", " SELECT\n", " id,\n", " gender,\n", " 0.6 * height -40.0 AS female_lorentz\n", " FROM bounds_table\n", " ),\n", " male_table AS (\n", " SELECT\n", " id,\n", " 0.75 * height - 62.5 AS male_lorentz\n", " FROM bounds_table\n", " )\n", "SELECT CASE WHEN gender = 'M' THEN male_lorentz else female_lorentz END AS lorentz FROM female_table LEFT JOIN male_table ON female_table.id = male_table.id\n", "\"\"\"\n", "relation = ds.relation(query)\n", "\n", "display(graphviz.Source(relation.dot()))" ] }, { "cell_type": "markdown", "metadata": { "id": "HrDD8tkXm1UM" }, "source": [ "Once more, in this example, you can trace follow the propagation of the ranges as they propagate throughout all the relations.\n", "\n", "Finally, the propagated ranges for the Lorentz formula are:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 107 }, "id": "wWRw3r4nm1UN", "outputId": "10966293-f7f2-4b6b-beaf-29c2617645e6" }, "outputs": [ { "data": { "text/plain": [ "'{lorentz: option(float{42.5, 43.25, 44, 44.599999999999994, 44.75, 45.2, 45.5, 45.8, 46.25, 46.39999999999999, 47, 47.599999999999994, 47.75, 48.2, 48.5, 48.8, 49.25, 49.39999999999999, 50, 50.599999999999994, 50.75, 51.2, 51.5, 51.8, 52.25, 52.39999999999999, 53, 53.599999999999994, 53.75, 54.2, 54.5, 54.8, 55.25, 55.39999999999999, 56, 56.599999999999994, 56.75, 57.2, 57.5, 57.8, 58.25, 58.39999999999999, 59, 59.599999999999994, 59.75, 60.2, 60.5, 60.8, 61.25, 61.39999999999999, 62, 62.599999999999994, 62.75, 63.2, 63.5, 63.8, 64.25, 64.39999999999999, 65, 65.6, 65.75, 66.2, 66.5, 66.8, 67.25, 67.39999999999999, 68, 68.6, 68.75, 69.2, 69.5, 69.8, 70.25, 70.39999999999999, 71, 71.6, 71.75, 72.2, 72.5, 72.8, 73.25, 73.39999999999999, 74, 74.6, 74.75, 75.19999999999999, 75.5, 75.8, 76.25, 76.39999999999999, 77, 77.6, 77.75, 78.19999999999999, 78.5, 78.8, 79.25, 79.39999999999999, 80, 80.75, 81.5, 82.25, 83, 83.75, 84.5, 85.25, 86, 86.75, 87.5})}'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "relation.schema()" ] }, { "cell_type": "markdown", "metadata": { "id": "dOBl5-p_m1UN" }, "source": [ "The exact ranges are:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "nyGB7fxLm1UN", "outputId": "77f70fa4-9c65-47c9-a582-d8f75d67fae8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "lorentz: [43.25, 86.0]\n" ] } ], "source": [ "df = pd.DataFrame(qdb.eval(relation))\n", "print(f\"lorentz: [{df['lorentz'].min()}, {df['lorentz'].max()}]\")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "bjrul39Rm1UN" }, "source": [ "In certain scenarios, when there's a high correlation between columns, it might not be advisable to use propagated ranges.\n", "This is because the accurate range could be influenced by significant correlations that are not evident from external data.\n", "In such instances, it could be beneficial to allocate budget towards computing ranges using the automatic boundary determination algorithm ([Wilson et al. 2019](https://arxiv.org/abs/1909.01917))." ] }, { "cell_type": "markdown", "metadata": { "id": "3RgBQGUKm1UN" }, "source": [ "However, how can an analyst without free database access determine whether he should allocate budget to compute the ranges?\n", "\n", "This is where utilizing Sarus becomes advantageous.\n", "\n", "By employing [Sarus](https://www.sarus.tech/), the analyst gains **access to a synthetic dataset** wherein inter-column correlations are replicated.\n", "\n", "This empowers the analyst to **devise a strategy without expending privacy resources**, as they can scrutinize the synthetic dataset for insights and make informed decisions." ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "myenv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.18" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 0 }