{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Copyright 2025 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Array Data Types\n", "\n", "In BigQuery, an [ARRAY](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#array_type) (also called a `repeated` column) is an ordered list of zero or more elements of the same, non-`NULL` data type. It's important to note that BigQuery `ARRAY`s cannot contain nested `ARRAY`s. BigQuery DataFrames represents BigQuery `ARRAY` types to `pandas.ArrowDtype(pa.list_(T))`, where `T` is the underlying Arrow type of the array elements.\n", "\n", "This notebook illustrates how to work with `ARRAY` columns in BigQuery DataFrames. First, let's import the required packages and perform the necessary setup below." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import bigframes.pandas as bpd\n", "import bigframes.bigquery as bbq\n", "import pandas as pd\n", "import pyarrow as pa" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "REGION = \"US\" # @param {type: \"string\"}\n", "\n", "bpd.options.display.progress_bar = None\n", "bpd.options.bigquery.location = REGION" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create DataFrames with an array column\n", "\n", "**Example 1: Creating from a list of lists/tuples**" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameScores
0Alice[95 88 92]
1Bob[78 81]
2Charlie[ 82 89 94 100]
\n", "

3 rows × 2 columns

\n", "
[3 rows x 2 columns in total]" ], "text/plain": [ " Name Scores\n", "0 Alice [95 88 92]\n", "1 Bob [78 81]\n", "2 Charlie [ 82 89 94 100]\n", "\n", "[3 rows x 2 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names = [\"Alice\", \"Bob\", \"Charlie\"]\n", "scores = [\n", " [95, 88, 92],\n", " [78, 81],\n", " [82, 89, 94, 100]\n", "]\n", "df = bpd.DataFrame({\"Name\": names, \"Scores\": scores})\n", "df" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name string[pyarrow]\n", "Scores list[pyarrow]\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example 2: Defining schema explicitly**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 [95. 88. 92.]\n", "1 [78. 81.]\n", "2 [ 82. 89. 94. 100.]\n", "dtype: list[pyarrow]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bpd.Series(data=scores, dtype=pd.ArrowDtype(pa.list_(pa.float64())))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example 3: Reading from a source**" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 [{'tables': {'score': 0.9349926710128784, 'val...\n", "1 [{'tables': {'score': 0.9690881371498108, 'val...\n", "2 [{'tables': {'score': 0.8667634129524231, 'val...\n", "3 [{'tables': {'score': 0.9351968765258789, 'val...\n", "4 [{'tables': {'score': 0.8572560548782349, 'val...\n", "Name: predicted_default_payment_next_month, dtype: list>>[pyarrow]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bpd.read_gbq(\"bigquery-public-data.ml_datasets.credit_card_default\", max_results=5)[\"predicted_default_payment_next_month\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Operate on `ARRAY` data\n", "\n", "BigQuery DataFrames provides two main approaches for operating on list (`ARRAY`) data:\n", "\n", "1. **The `Series.list` accessor**: Provides Pandas-like methods for array column manipulation.\n", "2. **[BigQuery built-in functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/array_functions)**: Allows you to use functions mirroring BigQuery SQL operations, available through the `bigframes.bigquery` module (abbreviated as `bbq` below), such as [`array_agg`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_agg) and [`array_length`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_length)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get the Length of Each Arrray\n", "\n", "**Example 1: Using list accessor to get array length**" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 3\n", "1 2\n", "2 4\n", "Name: Scores, dtype: Int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Scores'].list.len()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example 2: Using BigQuery build-in functions to get array length**" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 3\n", "1 2\n", "2 4\n", "Name: Scores, dtype: Int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bbq.array_length(df['Scores'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Access Element at a Specific Index (e.g., First Element) " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 95\n", "1 78\n", "2 82\n", "Name: Scores, dtype: Int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Scores'].list[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explode/Unnest Array elements into Seperate Rows\n", "\n", "The exploded rows preserving original order when in ordering mode. If an array has multiple elements, exploded rows are ordered by the element's index\n", "within its original array. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 95\n", "0 88\n", "0 92\n", "1 78\n", "1 81\n", "2 82\n", "2 89\n", "2 94\n", "2 100\n", "Name: Scores, dtype: Int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores = df['Scores'].explode()\n", "scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aggregate elements back into an array" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 [100. 93. 97.]\n", "1 [83. 86.]\n", "2 [ 87. 94. 99. 105.]\n", "Name: Scores, dtype: list[pyarrow]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_scores = scores + 5.0\n", "new_scores_arr = bbq.array_agg(new_scores.groupby(level=0))\n", "new_scores_arr" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameScoresNewScores
0Alice[95 88 92][100. 93. 97.]
1Bob[78 81][83. 86.]
2Charlie[ 82 89 94 100][ 87. 94. 99. 105.]
\n", "

3 rows × 3 columns

\n", "
[3 rows x 3 columns in total]" ], "text/plain": [ " Name Scores NewScores\n", "0 Alice [95 88 92] [100. 93. 97.]\n", "1 Bob [78 81] [83. 86.]\n", "2 Charlie [ 82 89 94 100] [ 87. 94. 99. 105.]\n", "\n", "[3 rows x 3 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Add adjusted scores into the DataFrame. This operation requires an implicit join \n", "# between the two tables, necessitating a unique index in the DataFrame (guaranteed \n", "# in the default ordering and index mode).\n", "df['NewScores'] = new_scores_arr\n", "df" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" } }, "nbformat": 4, "nbformat_minor": 2 }