{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Copyright 2023 Google LLC\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# BigQuery DataFrames: Synthetic Data Generation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In addition to BigQuery DataFrames (installing which also installs `pandas` as a dependency) we will use\n",
        "`faker` library as a building block for synthetic data generation."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "suoG7eWDZARj",
        "outputId": "b5c620a9-8f5b-413f-dd38-93448f941846"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Collecting faker\n",
            "  Downloading faker-37.1.0-py3-none-any.whl.metadata (15 kB)\n",
            "Requirement already satisfied: tzdata in /usr/local/google/home/shuowei/src/python-bigquery-dataframes/venv/lib/python3.10/site-packages (from faker) (2024.2)\n",
            "Downloading faker-37.1.0-py3-none-any.whl (1.9 MB)\n",
            "\u001b[2K   \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.9/1.9 MB\u001b[0m \u001b[31m55.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hInstalling collected packages: faker\n",
            "Successfully installed faker-37.1.0\n"
          ]
        }
      ],
      "source": [
        "!pip install faker"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "m3q1oeJALhsG"
      },
      "outputs": [
        {
          "ename": "NameError",
          "evalue": "name 'PROJECT_ID' is not defined",
          "output_type": "error",
          "traceback": [
            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
            "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
            "Cell \u001b[0;32mIn[3], line 2\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mbigframes\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpandas\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mbpd\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m bpd\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mbigquery\u001b[38;5;241m.\u001b[39mproject \u001b[38;5;241m=\u001b[39m \u001b[43mPROJECT_ID\u001b[49m\n",
            "\u001b[0;31mNameError\u001b[0m: name 'PROJECT_ID' is not defined"
          ]
        }
      ],
      "source": [
        "import bigframes.pandas as bpd\n",
        "bpd.options.bigquery.project = PROJECT_ID"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's use `GeminiTextGenerator` for our purpose, which is BigQuery DataFrame's state-of-the-art LLM integration at the time of writing this notebook (Apr 16 2024)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 69
        },
        "id": "lIYdn1woOS1n",
        "outputId": "be474338-44c2-4ce0-955e-d525b8b9c84b"
      },
      "outputs": [],
      "source": [
        "from bigframes.ml.llm import GeminiTextGenerator\n",
        "\n",
        "model = GeminiTextGenerator(model_name=\"gemini-2.0-flash-001\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Craft a prompt for the LLM to indicate the schema of the desired data and hints for the code that could generate such data. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 162
        },
        "id": "SSR-lLScLa95",
        "outputId": "cbaec34e-6fa6-45b4-e54a-f11ca06b61e1"
      },
      "outputs": [],
      "source": [
        "prompt = \"\"\"\\\n",
        "Write python code to generate a pandas dataframe based on the requirements:\n",
        "  Column name: Name, type: string, Description: Latin American Names\n",
        "  Column name: Age, type: int\n",
        "  Column name: Gender, type: string, Description: Inclusive\n",
        "\n",
        "Note:\n",
        "  - Return the code only, no additional texts or comments\n",
        "  - Use faker library\n",
        "  - Generate 100 rows\n",
        "  - The final dataframe should be named 'result_df'.\n",
        "\"\"\"\n",
        "\n",
        "df_prompt = bpd.DataFrame({\"prompt\" : [prompt]})\n",
        "df_prompt"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Be accommodating that LLM may not produce a runnable code in the first go and may need some nudging. We will retry by adding the failing code and the exception it throws as additional context in the prompt."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 277
        },
        "id": "miDe3K4GNvOo",
        "outputId": "f2039e80-5ad7-4551-f8b2-7ef714a89d63"
      },
      "outputs": [],
      "source": [
        "max_tries = 5\n",
        "for i in range(max_tries):\n",
        "  # Get LLM generated code\n",
        "  df_result = model.predict(df_prompt)\n",
        "  llm_result = df_result['ml_generate_text_llm_result'].iloc[0]\n",
        "\n",
        "  # Python code comes back as a markdown code block,\n",
        "  # remove the prefix \"```python\" and suffix \"```\"\n",
        "  code = llm_result[9:-3]\n",
        "  print(code)\n",
        "\n",
        "  # Check if the generated code is runnable\n",
        "  try:\n",
        "    exec(code)\n",
        "    break\n",
        "  except Exception as ex:\n",
        "    print(ex)\n",
        "    error_context = f\"\"\"\n",
        "Previous code:\n",
        "{code}\n",
        "\n",
        "Had this exception:\n",
        "{ex}\"\"\"\n",
        "\n",
        "    # Update the prompt to help LLM correct error\n",
        "    df_prompt[\"prompt\"] += error_context\n",
        "\n",
        "    # If we have exhausted max tries then stop trying\n",
        "    if i+1 == max_tries:\n",
        "      raise Exception(\"Failed to generate runnable code\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Run the generated code and verify that it produced the desired data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 424
        },
        "id": "GODcPwX2PBEu",
        "outputId": "dec4c872-c464-49e4-cd7f-9442fc977d18"
      },
      "outputs": [],
      "source": [
        "execution_context = {}\n",
        "exec(code, execution_context)\n",
        "execution_context.get(\"result_df\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We want to run this code at scale to generate since we want to generate large amount of data. Let's deploy a `remote_function` for this purpose."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 36
        },
        "id": "n-BsGciNqSwU",
        "outputId": "996e5639-a49c-4542-a0dc-ede450e0eb6d"
      },
      "outputs": [],
      "source": [
        "@bpd.remote_function(packages=['faker', 'pandas'], cloud_function_service_account=\"default\")\n",
        "def data_generator(id: int) -> str:\n",
        "  context = {}\n",
        "  exec(code, context)\n",
        "  result_df = context.get(\"result_df\")\n",
        "  return result_df.to_json(orient=\"records\")\n",
        "\n",
        "data_generator.bigframes_cloud_function"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let’s say we want to generate 1 million rows of synthetic data. Since our generated code produces 100 rows in one run, we can initialize an indicator dataframe with 1M/100 = 10K indicator rows. Then we can apply the remote function to produce 100 synthetic data rows for each indicator row."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "id": "Odkmev9nsYqA",
        "outputId": "4aa7a1fd-0c0d-4412-f326-a20e19f583b5"
      },
      "outputs": [],
      "source": [
        "desired_num_rows = 1_000_000 # 1 million rows\n",
        "batch_size = 100 # used in the prompt\n",
        "num_batches = int(desired_num_rows/batch_size)\n",
        "\n",
        "df = bpd.DataFrame({\"row_id\": range(num_batches)})"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "id": "UyBhlJFVsmQC",
        "outputId": "29748df5-673b-4320-bb1f-53abaace3b81"
      },
      "outputs": [],
      "source": [
        "df[\"json_data\"] = df[\"row_id\"].apply(data_generator)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "At this point each item in `df[\"json_data\"]` is a json serialized array of 100 records. Let’s flatten that into 1 record per row using a direct SQL."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 932
        },
        "id": "6p3eM21qvRvy",
        "outputId": "333f4e49-a555-4d2f-b527-02142782b3a7"
      },
      "outputs": [],
      "source": [
        "sql = f\"\"\"\n",
        "WITH T0 AS ({df.sql}),\n",
        "T1 AS (\n",
        "  SELECT PARSE_JSON(json_row) AS json_row\n",
        "  FROM T0, UNNEST(JSON_EXTRACT_ARRAY(json_data)) AS json_row\n",
        ")\n",
        "SELECT STRING(json_row.Name) AS Name,\n",
        "       INT64(json_row.Age) AS Age,\n",
        "       STRING(json_row.Gender) AS Gender\n",
        "FROM T1\n",
        "\"\"\"\n",
        "df_result = bpd.read_gbq(sql)\n",
        "df_result"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "There you have it, 1 million synthetic data rows ready to use, or save them in a BigQuery table for future use."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.15"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}