{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Copyright 2023 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# BigQuery DataFrames: Synthetic Data Generation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to BigQuery DataFrames (installing which also installs `pandas` as a dependency) we will use\n", "`faker` library as a building block for synthetic data generation." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "suoG7eWDZARj", "outputId": "b5c620a9-8f5b-413f-dd38-93448f941846" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting faker\n", " Downloading faker-37.1.0-py3-none-any.whl.metadata (15 kB)\n", "Requirement already satisfied: tzdata in /usr/local/google/home/shuowei/src/python-bigquery-dataframes/venv/lib/python3.10/site-packages (from faker) (2024.2)\n", "Downloading faker-37.1.0-py3-none-any.whl (1.9 MB)\n", "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.9/1.9 MB\u001b[0m \u001b[31m55.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hInstalling collected packages: faker\n", "Successfully installed faker-37.1.0\n" ] } ], "source": [ "!pip install faker" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "m3q1oeJALhsG" }, "outputs": [ { "ename": "NameError", "evalue": "name 'PROJECT_ID' is not defined", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[3], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mbigframes\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpandas\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mbpd\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m bpd\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mbigquery\u001b[38;5;241m.\u001b[39mproject \u001b[38;5;241m=\u001b[39m \u001b[43mPROJECT_ID\u001b[49m\n", "\u001b[0;31mNameError\u001b[0m: name 'PROJECT_ID' is not defined" ] } ], "source": [ "import bigframes.pandas as bpd\n", "bpd.options.bigquery.project = PROJECT_ID" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use `GeminiTextGenerator` for our purpose, which is BigQuery DataFrame's state-of-the-art LLM integration at the time of writing this notebook (Apr 16 2024)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 69 }, "id": "lIYdn1woOS1n", "outputId": "be474338-44c2-4ce0-955e-d525b8b9c84b" }, "outputs": [], "source": [ "from bigframes.ml.llm import GeminiTextGenerator\n", "\n", "model = GeminiTextGenerator(model_name=\"gemini-2.0-flash-001\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Craft a prompt for the LLM to indicate the schema of the desired data and hints for the code that could generate such data. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 162 }, "id": "SSR-lLScLa95", "outputId": "cbaec34e-6fa6-45b4-e54a-f11ca06b61e1" }, "outputs": [], "source": [ "prompt = \"\"\"\\\n", "Write python code to generate a pandas dataframe based on the requirements:\n", " Column name: Name, type: string, Description: Latin American Names\n", " Column name: Age, type: int\n", " Column name: Gender, type: string, Description: Inclusive\n", "\n", "Note:\n", " - Return the code only, no additional texts or comments\n", " - Use faker library\n", " - Generate 100 rows\n", " - The final dataframe should be named 'result_df'.\n", "\"\"\"\n", "\n", "df_prompt = bpd.DataFrame({\"prompt\" : [prompt]})\n", "df_prompt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Be accommodating that LLM may not produce a runnable code in the first go and may need some nudging. We will retry by adding the failing code and the exception it throws as additional context in the prompt." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 277 }, "id": "miDe3K4GNvOo", "outputId": "f2039e80-5ad7-4551-f8b2-7ef714a89d63" }, "outputs": [], "source": [ "max_tries = 5\n", "for i in range(max_tries):\n", " # Get LLM generated code\n", " df_result = model.predict(df_prompt)\n", " llm_result = df_result['ml_generate_text_llm_result'].iloc[0]\n", "\n", " # Python code comes back as a markdown code block,\n", " # remove the prefix \"```python\" and suffix \"```\"\n", " code = llm_result[9:-3]\n", " print(code)\n", "\n", " # Check if the generated code is runnable\n", " try:\n", " exec(code)\n", " break\n", " except Exception as ex:\n", " print(ex)\n", " error_context = f\"\"\"\n", "Previous code:\n", "{code}\n", "\n", "Had this exception:\n", "{ex}\"\"\"\n", "\n", " # Update the prompt to help LLM correct error\n", " df_prompt[\"prompt\"] += error_context\n", "\n", " # If we have exhausted max tries then stop trying\n", " if i+1 == max_tries:\n", " raise Exception(\"Failed to generate runnable code\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the generated code and verify that it produced the desired data." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "id": "GODcPwX2PBEu", "outputId": "dec4c872-c464-49e4-cd7f-9442fc977d18" }, "outputs": [], "source": [ "execution_context = {}\n", "exec(code, execution_context)\n", "execution_context.get(\"result_df\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to run this code at scale to generate since we want to generate large amount of data. Let's deploy a `remote_function` for this purpose." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "id": "n-BsGciNqSwU", "outputId": "996e5639-a49c-4542-a0dc-ede450e0eb6d" }, "outputs": [], "source": [ "@bpd.remote_function(packages=['faker', 'pandas'], cloud_function_service_account=\"default\")\n", "def data_generator(id: int) -> str:\n", " context = {}\n", " exec(code, context)\n", " result_df = context.get(\"result_df\")\n", " return result_df.to_json(orient=\"records\")\n", "\n", "data_generator.bigframes_cloud_function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s say we want to generate 1 million rows of synthetic data. Since our generated code produces 100 rows in one run, we can initialize an indicator dataframe with 1M/100 = 10K indicator rows. Then we can apply the remote function to produce 100 synthetic data rows for each indicator row." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "id": "Odkmev9nsYqA", "outputId": "4aa7a1fd-0c0d-4412-f326-a20e19f583b5" }, "outputs": [], "source": [ "desired_num_rows = 1_000_000 # 1 million rows\n", "batch_size = 100 # used in the prompt\n", "num_batches = int(desired_num_rows/batch_size)\n", "\n", "df = bpd.DataFrame({\"row_id\": range(num_batches)})" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "id": "UyBhlJFVsmQC", "outputId": "29748df5-673b-4320-bb1f-53abaace3b81" }, "outputs": [], "source": [ "df[\"json_data\"] = df[\"row_id\"].apply(data_generator)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point each item in `df[\"json_data\"]` is a json serialized array of 100 records. Let’s flatten that into 1 record per row using a direct SQL." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 932 }, "id": "6p3eM21qvRvy", "outputId": "333f4e49-a555-4d2f-b527-02142782b3a7" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "WITH T0 AS ({df.sql}),\n", "T1 AS (\n", " SELECT PARSE_JSON(json_row) AS json_row\n", " FROM T0, UNNEST(JSON_EXTRACT_ARRAY(json_data)) AS json_row\n", ")\n", "SELECT STRING(json_row.Name) AS Name,\n", " INT64(json_row.Age) AS Age,\n", " STRING(json_row.Gender) AS Gender\n", "FROM T1\n", "\"\"\"\n", "df_result = bpd.read_gbq(sql)\n", "df_result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There you have it, 1 million synthetic data rows ready to use, or save them in a BigQuery table for future use." ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.15" } }, "nbformat": 4, "nbformat_minor": 0 }