{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copyright 2025 Google LLC\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#     https://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# JSON Data Types\n",
    "\n",
    "When using BigQuery DataFrames, columns containing data in BigQuery's [JSON](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type) format (a lightweight standard) are represented as `pandas.ArrowDtype`. The exact underlying Arrow type depends on your library versions. Older environments typically use `db_dtypes.JSONArrowType()` for compatibility, which is an Arrow extension type acting as a light wrapper around `pa.string()`. In contrast, newer setups (pandas 3.0+ and pyarrow 19.0+) utilize the more recent `pa.json_(pa.string())` representation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import bigframes.pandas as bpd\n",
    "import bigframes.bigquery as bbq\n",
    "import db_dtypes\n",
    "import pandas as pd\n",
    "import pyarrow as pa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "REGION = \"US\"  # @param {type: \"string\"}\n",
    "\n",
    "bpd.options.display.progress_bar = None\n",
    "bpd.options.bigquery.location = REGION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Series with JSON columns\n",
    "\n",
    "**Example 1: Create a Series with a JSON dtype from local data**\n",
    "\n",
    "This example demonstrates creating a JSON Series from a list of JSON strings. Note that BigQuery standardizes these strings, for instance, by removing extra spaces and ordering dictionary keys. Specifying the `dtype` is essential; if omitted, a string-type Series will be generated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0                               1\n",
       "1                           \"str\"\n",
       "2                           false\n",
       "3              [\"a\",{\"b\":1},null]\n",
       "4    {\"a\":{\"b\":[1,2,3],\"c\":true}}\n",
       "5                            <NA>\n",
       "dtype: extension<dbjson<JSONArrowType>>[pyarrow]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "json_data = [\n",
    "    \"1\",\n",
    "    '\"str\"',\n",
    "    \"false\",\n",
    "    '[\"a\", {\"b\": 1}, null]',\n",
    "    '{\"a\": {\"b\": [1, 2, 3], \"c\": true}}',\n",
    "    None,\n",
    "]\n",
    "bpd.Series(json_data, dtype=pd.ArrowDtype(db_dtypes.JSONArrowType()))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Example 2: Create a Series with a Nested JSON dtype from local data**\n",
    "\n",
    "To create a BigQuery DataFrame Series containing `JSON` data nested within a `STRUCT` or `LIST` type, you must represent the `JSON` data in a `pa.array` defined with the `pa.string` type. This workaround is necessary because Pyarrow lacks support for creating structs or lists that directly contain extension types (see [issue](https://github.com/apache/arrow/issues/45262))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0                                       [{'key': '1'}]\n",
       "1                                      [{'key': None}]\n",
       "2                           [{'key': '[\"1\",\"3\",\"5\"]'}]\n",
       "3    [{'key': '{\"a\":1,\"b\":[\"x\",\"y\"],\"c\":{\"x\":[],\"z\"...\n",
       "dtype: list<item: struct<key: extension<dbjson<JSONArrowType>>>>[pyarrow]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list_data = [\n",
    "    [{\"key\": \"1\"}],\n",
    "    [{\"key\": None}],\n",
    "    [{\"key\": '[\"1\",\"3\",\"5\"]'}],\n",
    "    [{\"key\": '{\"a\":1,\"b\":[\"x\",\"y\"],\"c\":{\"x\":[],\"z\":false}}'}],\n",
    "]\n",
    "pa_array = pa.array(list_data, type=pa.list_(pa.struct([(\"key\", pa.string())])))\n",
    "bpd.Series(\n",
    "    pd.arrays.ArrowExtensionArray(pa_array),\n",
    "    dtype=pd.ArrowDtype(\n",
    "        pa.list_(pa.struct([(\"key\", db_dtypes.JSONArrowType())])),\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Example 3: Create a Series with a Nested JSON dtype using BigQuery SQLs**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>struct_col</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>{'data': '{\"b\":100}', 'number': 2}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>{'data': '{\"a\":true}', 'number': 1}</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 2 columns</p>\n",
       "</div>[2 rows x 2 columns in total]"
      ],
      "text/plain": [
       "   id                           struct_col\n",
       "0   1   {'data': '{\"b\":100}', 'number': 2}\n",
       "1   0  {'data': '{\"a\":true}', 'number': 1}\n",
       "\n",
       "[2 rows x 2 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sql = \"\"\"\n",
    "SELECT 0 AS id, STRUCT(JSON_OBJECT('a', True) AS data, 1 AS number) AS struct_col\n",
    "UNION ALL\n",
    "SELECT 1, STRUCT(JSON_OBJECT('b', 100), 2),\n",
    "\"\"\"\n",
    "df = bpd.read_gbq(sql)\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "id                                                        Int64\n",
       "struct_col    struct<data: extension<dbjson<JSONArrowType>>,...\n",
       "dtype: object"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Operate on `JSON` data\n",
    "\n",
    "The `bigframes.bigquery` module (often abbreviated as `bbq`) provides access within BigQuery DataFrames to various **[BigQuery built-in functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions)**. Examples relevant for JSON data include [`json_extract`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_json_extract) and [`parse_json`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_parse_json)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extract JSON data via specific JSON path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Example 1: When JSON data is represented as strings**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "fruits = [\n",
    "  '{\"fruits\": [{\"name\": \"apple\"}, {\"name\": \"cherry\"}]}',\n",
    "  '{\"fruits\": [{\"name\": \"guava\"}, {\"name\": \"grapes\"}]}',\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    {\"fruits\": [{\"name\": \"apple\"}, {\"name\": \"cherr...\n",
       "1    {\"fruits\": [{\"name\": \"guava\"}, {\"name\": \"grape...\n",
       "dtype: string"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str_s = bpd.Series(fruits, dtype=\"string\")\n",
    "str_s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    {\"name\":\"apple\"}\n",
       "1    {\"name\":\"guava\"}\n",
       "dtype: string"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bbq.json_extract(str_s, \"$.fruits[0]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Example 2: When JSON data is stored as JSON type**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    {\"fruits\":[{\"name\":\"apple\"},{\"name\":\"cherry\"}]}\n",
       "1    {\"fruits\":[{\"name\":\"guava\"},{\"name\":\"grapes\"}]}\n",
       "dtype: extension<dbjson<JSONArrowType>>[pyarrow]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "json_s = bpd.Series(fruits, dtype=pd.ArrowDtype(db_dtypes.JSONArrowType()))\n",
    "json_s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    {\"name\":\"apple\"}\n",
       "1    {\"name\":\"guava\"}\n",
       "dtype: extension<dbjson<JSONArrowType>>[pyarrow]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bbq.json_extract(json_s, \"$.fruits[0]\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extract an array from JSON data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    ['{\"name\":\"apple\"}' '{\"name\":\"cherry\"}']\n",
       "1    ['{\"name\":\"guava\"}' '{\"name\":\"grapes\"}']\n",
       "dtype: list<item: extension<dbjson<JSONArrowType>>>[pyarrow]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bbq.json_extract_array(json_s, \"$.fruits\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    ['{\"name\":\"apple\"}' '{\"name\":\"cherry\"}']\n",
       "1    ['{\"name\":\"guava\"}' '{\"name\":\"grapes\"}']\n",
       "dtype: list<item: string>[pyarrow]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bbq.json_extract_array(str_s, \"$.fruits\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}