{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Copyright 2025 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# JSON Data Types\n", "\n", "When using BigQuery DataFrames, columns containing data in BigQuery's [JSON](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type) format (a lightweight standard) are represented as `pandas.ArrowDtype`. The exact underlying Arrow type depends on your library versions. Older environments typically use `db_dtypes.JSONArrowType()` for compatibility, which is an Arrow extension type acting as a light wrapper around `pa.string()`. In contrast, newer setups (pandas 3.0+ and pyarrow 19.0+) utilize the more recent `pa.json_(pa.string())` representation." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import bigframes.pandas as bpd\n", "import bigframes.bigquery as bbq\n", "import db_dtypes\n", "import pandas as pd\n", "import pyarrow as pa" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "REGION = \"US\" # @param {type: \"string\"}\n", "\n", "bpd.options.display.progress_bar = None\n", "bpd.options.bigquery.location = REGION" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Series with JSON columns\n", "\n", "**Example 1: Create a Series with a JSON dtype from local data**\n", "\n", "This example demonstrates creating a JSON Series from a list of JSON strings. Note that BigQuery standardizes these strings, for instance, by removing extra spaces and ordering dictionary keys. Specifying the `dtype` is essential; if omitted, a string-type Series will be generated." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 \"str\"\n", "2 false\n", "3 [\"a\",{\"b\":1},null]\n", "4 {\"a\":{\"b\":[1,2,3],\"c\":true}}\n", "5 \n", "dtype: extension>[pyarrow]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "json_data = [\n", " \"1\",\n", " '\"str\"',\n", " \"false\",\n", " '[\"a\", {\"b\": 1}, null]',\n", " '{\"a\": {\"b\": [1, 2, 3], \"c\": true}}',\n", " None,\n", "]\n", "bpd.Series(json_data, dtype=pd.ArrowDtype(db_dtypes.JSONArrowType()))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example 2: Create a Series with a Nested JSON dtype from local data**\n", "\n", "To create a BigQuery DataFrame Series containing `JSON` data nested within a `STRUCT` or `LIST` type, you must represent the `JSON` data in a `pa.array` defined with the `pa.string` type. This workaround is necessary because Pyarrow lacks support for creating structs or lists that directly contain extension types (see [issue](https://github.com/apache/arrow/issues/45262))." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 [{'key': '1'}]\n", "1 [{'key': None}]\n", "2 [{'key': '[\"1\",\"3\",\"5\"]'}]\n", "3 [{'key': '{\"a\":1,\"b\":[\"x\",\"y\"],\"c\":{\"x\":[],\"z\"...\n", "dtype: list>>>[pyarrow]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_data = [\n", " [{\"key\": \"1\"}],\n", " [{\"key\": None}],\n", " [{\"key\": '[\"1\",\"3\",\"5\"]'}],\n", " [{\"key\": '{\"a\":1,\"b\":[\"x\",\"y\"],\"c\":{\"x\":[],\"z\":false}}'}],\n", "]\n", "pa_array = pa.array(list_data, type=pa.list_(pa.struct([(\"key\", pa.string())])))\n", "bpd.Series(\n", " pd.arrays.ArrowExtensionArray(pa_array),\n", " dtype=pd.ArrowDtype(\n", " pa.list_(pa.struct([(\"key\", db_dtypes.JSONArrowType())])),\n", " ),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example 3: Create a Series with a Nested JSON dtype using BigQuery SQLs**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idstruct_col
01{'data': '{\"b\":100}', 'number': 2}
10{'data': '{\"a\":true}', 'number': 1}
\n", "

2 rows × 2 columns

\n", "
[2 rows x 2 columns in total]" ], "text/plain": [ " id struct_col\n", "0 1 {'data': '{\"b\":100}', 'number': 2}\n", "1 0 {'data': '{\"a\":true}', 'number': 1}\n", "\n", "[2 rows x 2 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sql = \"\"\"\n", "SELECT 0 AS id, STRUCT(JSON_OBJECT('a', True) AS data, 1 AS number) AS struct_col\n", "UNION ALL\n", "SELECT 1, STRUCT(JSON_OBJECT('b', 100), 2),\n", "\"\"\"\n", "df = bpd.read_gbq(sql)\n", "df" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id Int64\n", "struct_col struct>,...\n", "dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Operate on `JSON` data\n", "\n", "The `bigframes.bigquery` module (often abbreviated as `bbq`) provides access within BigQuery DataFrames to various **[BigQuery built-in functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions)**. Examples relevant for JSON data include [`json_extract`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_json_extract) and [`parse_json`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_parse_json)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract JSON data via specific JSON path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example 1: When JSON data is represented as strings**" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "fruits = [\n", " '{\"fruits\": [{\"name\": \"apple\"}, {\"name\": \"cherry\"}]}',\n", " '{\"fruits\": [{\"name\": \"guava\"}, {\"name\": \"grapes\"}]}',\n", "]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 {\"fruits\": [{\"name\": \"apple\"}, {\"name\": \"cherr...\n", "1 {\"fruits\": [{\"name\": \"guava\"}, {\"name\": \"grape...\n", "dtype: string" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str_s = bpd.Series(fruits, dtype=\"string\")\n", "str_s" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 {\"name\":\"apple\"}\n", "1 {\"name\":\"guava\"}\n", "dtype: string" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bbq.json_extract(str_s, \"$.fruits[0]\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example 2: When JSON data is stored as JSON type**" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 {\"fruits\":[{\"name\":\"apple\"},{\"name\":\"cherry\"}]}\n", "1 {\"fruits\":[{\"name\":\"guava\"},{\"name\":\"grapes\"}]}\n", "dtype: extension>[pyarrow]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "json_s = bpd.Series(fruits, dtype=pd.ArrowDtype(db_dtypes.JSONArrowType()))\n", "json_s" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 {\"name\":\"apple\"}\n", "1 {\"name\":\"guava\"}\n", "dtype: extension>[pyarrow]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bbq.json_extract(json_s, \"$.fruits[0]\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract an array from JSON data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 ['{\"name\":\"apple\"}' '{\"name\":\"cherry\"}']\n", "1 ['{\"name\":\"guava\"}' '{\"name\":\"grapes\"}']\n", "dtype: list>>[pyarrow]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bbq.json_extract_array(json_s, \"$.fruits\")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 ['{\"name\":\"apple\"}' '{\"name\":\"cherry\"}']\n", "1 ['{\"name\":\"guava\"}' '{\"name\":\"grapes\"}']\n", "dtype: list[pyarrow]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bbq.json_extract_array(str_s, \"$.fruits\")" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" } }, "nbformat": 4, "nbformat_minor": 2 }