# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Array Data Types#

In BigQuery, an ARRAY (also called a repeated column) is an ordered list of zero or more elements of the same, non-NULL data type. Note that BigQuery ARRAYs cannot contain nested ARRAYs. BigQuery DataFrames represents a BigQuery ARRAY type as pandas.ArrowDtype(pa.list_(T)), where T is the underlying Arrow type of the array elements.

This notebook illustrates how to work with ARRAY columns in BigQuery DataFrames. First, let’s import the required packages and perform the necessary setup below.

import bigframes.pandas as bpd
import bigframes.bigquery as bbq
import pandas as pd
import pyarrow as pa
REGION = "US"  # @param {type: "string"}

bpd.options.display.progress_bar = None
bpd.options.bigquery.location = REGION

Create DataFrames with an array column#

Example 1: Creating from a list of lists/tuples

names = ["Alice", "Bob", "Charlie"]
scores = [
    [95, 88, 92],
    [78, 81],
    [82, 89, 94, 100]
]
df = bpd.DataFrame({"Name": names, "Scores": scores})
df
Name Scores
0 Alice [95 88 92]
1 Bob [78 81]
2 Charlie [ 82 89 94 100]

3 rows × 2 columns

[3 rows x 2 columns in total]
df.dtypes
Name                 string[pyarrow]
Scores    list<item: int64>[pyarrow]
dtype: object

Example 2: Defining schema explicitly

bpd.Series(data=scores, dtype=pd.ArrowDtype(pa.list_(pa.float64())))
0            [95. 88. 92.]
1                [78. 81.]
2    [ 82.  89.  94. 100.]
dtype: list<item: double>[pyarrow]

Example 3: Reading from a source

bpd.read_gbq("bigquery-public-data.ml_datasets.credit_card_default", max_results=5)["predicted_default_payment_next_month"]
0    [{'tables': {'score': 0.9349926710128784, 'val...
1    [{'tables': {'score': 0.9690881371498108, 'val...
2    [{'tables': {'score': 0.8667634129524231, 'val...
3    [{'tables': {'score': 0.9351968765258789, 'val...
4    [{'tables': {'score': 0.8572560548782349, 'val...
Name: predicted_default_payment_next_month, dtype: list<item: struct<tables: struct<score: double, value: string>>>[pyarrow]

Operate on ARRAY data#

BigQuery DataFrames provides two main approaches for operating on list (ARRAY) data:

  1. The Series.list accessor: Provides Pandas-like methods for array column manipulation.

  2. BigQuery built-in functions: Allows you to use functions mirroring BigQuery SQL operations, available through the bigframes.bigquery module (abbreviated as bbq below), such as array_agg and array_length.

Get the Length of Each Array#

Example 1: Using list accessor to get array length

df['Scores'].list.len()
0    3
1    2
2    4
Name: Scores, dtype: Int64

Example 2: Using BigQuery built-in functions to get array length

bbq.array_length(df['Scores'])
0    3
1    2
2    4
Name: Scores, dtype: Int64

Access Element at a Specific Index (e.g., First Element)#

df['Scores'].list[0]
0    95
1    78
2    82
Name: Scores, dtype: Int64

Explode/Unnest Array elements into Separate Rows#

When in ordering mode, the exploded rows preserve the original order: if an array has multiple elements, the exploded rows are ordered by each element's index within its original array.

scores = df['Scores'].explode()
scores
0     95
0     88
0     92
1     78
1     81
2     82
2     89
2     94
2    100
Name: Scores, dtype: Int64
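The same semantics can be previewed with plain pandas, where Series.explode likewise repeats the index once per element (a local sketch using object-dtype lists, independent of BigQuery):

```python
import pandas as pd

s = pd.Series([[95, 88, 92], [78, 81]])

exploded = s.explode()
# The original row index is repeated once per element, preserving
# each element's position within its source array.
print(exploded.index.tolist())  # [0, 0, 0, 1, 1]
print(exploded.tolist())        # [95, 88, 92, 78, 81]
```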

Aggregate elements back into an array#

new_scores = scores + 5.0
new_scores_arr = bbq.array_agg(new_scores.groupby(level=0))
new_scores_arr
0         [100.  93.  97.]
1                [83. 86.]
2    [ 87.  94.  99. 105.]
Name: Scores, dtype: list<item: double>[pyarrow]
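Locally, the equivalent round trip can be sketched with a pandas groupby over the index level, collecting elements back into lists (plain pandas, not bbq.array_agg):

```python
import pandas as pd

# Exploded scores with a repeated index, as produced by explode() above.
scores = pd.Series(
    [95, 88, 92, 78, 81, 82, 89, 94, 100],
    index=[0, 0, 0, 1, 1, 2, 2, 2, 2],
)

new_scores = scores + 5.0
# Group by the original row index and collect elements back into lists.
new_scores_arr = new_scores.groupby(level=0).agg(list)
print(new_scores_arr.tolist())
# [[100.0, 93.0, 97.0], [83.0, 86.0], [87.0, 94.0, 99.0, 105.0]]
```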
# Add adjusted scores into the DataFrame. This operation requires an implicit join 
# between the two tables, necessitating a unique index in the DataFrame (guaranteed 
# in the default ordering and index mode).
df['NewScores'] = new_scores_arr
df
Name Scores NewScores
0 Alice [95 88 92] [100. 93. 97.]
1 Bob [78 81] [83. 86.]
2 Charlie [ 82 89 94 100] [ 87. 94. 99. 105.]

3 rows × 3 columns

[3 rows x 3 columns in total]