# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Array Data Types#
In BigQuery, an ARRAY (also called a repeated column) is an ordered list of zero or more elements of the same, non-NULL data type. It’s important to note that BigQuery ARRAYs cannot contain nested ARRAYs. BigQuery DataFrames represents BigQuery ARRAY types to pandas.ArrowDtype(pa.list_(T)), where T is the underlying Arrow type of the array elements.
This notebook illustrates how to work with ARRAY columns in BigQuery DataFrames. First, let’s import the required packages and perform the necessary setup below.
import bigframes.pandas as bpd
import bigframes.bigquery as bbq
import pandas as pd
import pyarrow as pa
REGION = "US" # @param {type: "string"}
bpd.options.display.progress_bar = None
bpd.options.bigquery.location = REGION
Create DataFrames with an array column#
Example 1: Creating from a list of lists/tuples
names = ["Alice", "Bob", "Charlie"]
scores = [
[95, 88, 92],
[78, 81],
[82, 89, 94, 100]
]
df = bpd.DataFrame({"Name": names, "Scores": scores})
df
| Name | Scores | |
|---|---|---|
| 0 | Alice | [95 88 92] |
| 1 | Bob | [78 81] |
| 2 | Charlie | [ 82 89 94 100] |
3 rows × 2 columns
df.dtypes
Name string[pyarrow]
Scores list<item: int64>[pyarrow]
dtype: object
Example 2: Defining schema explicitly
bpd.Series(data=scores, dtype=pd.ArrowDtype(pa.list_(pa.float64())))
0 [95. 88. 92.]
1 [78. 81.]
2 [ 82. 89. 94. 100.]
dtype: list<item: double>[pyarrow]
Example 3: Reading from a source
bpd.read_gbq("bigquery-public-data.ml_datasets.credit_card_default", max_results=5)["predicted_default_payment_next_month"]
0 [{'tables': {'score': 0.9349926710128784, 'val...
1 [{'tables': {'score': 0.9690881371498108, 'val...
2 [{'tables': {'score': 0.8667634129524231, 'val...
3 [{'tables': {'score': 0.9351968765258789, 'val...
4 [{'tables': {'score': 0.8572560548782349, 'val...
Name: predicted_default_payment_next_month, dtype: list<item: struct<tables: struct<score: double, value: string>>>[pyarrow]
Operate on ARRAY data#
BigQuery DataFrames provides two main approaches for operating on list (ARRAY) data:
The
Series.listaccessor: Provides Pandas-like methods for array column manipulation.BigQuery built-in functions: Allows you to use functions mirroring BigQuery SQL operations, available through the
bigframes.bigquerymodule (abbreviated asbbqbelow), such asarray_aggandarray_length.
Get the Length of Each Arrray#
Example 1: Using list accessor to get array length
df['Scores'].list.len()
0 3
1 2
2 4
Name: Scores, dtype: Int64
Example 2: Using BigQuery build-in functions to get array length
bbq.array_length(df['Scores'])
0 3
1 2
2 4
Name: Scores, dtype: Int64
Access Element at a Specific Index (e.g., First Element)#
df['Scores'].list[0]
0 95
1 78
2 82
Name: Scores, dtype: Int64
Explode/Unnest Array elements into Seperate Rows#
The exploded rows preserving original order when in ordering mode. If an array has multiple elements, exploded rows are ordered by the element’s index within its original array.
scores = df['Scores'].explode()
scores
0 95
0 88
0 92
1 78
1 81
2 82
2 89
2 94
2 100
Name: Scores, dtype: Int64
Aggregate elements back into an array#
new_scores = scores + 5.0
new_scores_arr = bbq.array_agg(new_scores.groupby(level=0))
new_scores_arr
0 [100. 93. 97.]
1 [83. 86.]
2 [ 87. 94. 99. 105.]
Name: Scores, dtype: list<item: double>[pyarrow]
# Add adjusted scores into the DataFrame. This operation requires an implicit join
# between the two tables, necessitating a unique index in the DataFrame (guaranteed
# in the default ordering and index mode).
df['NewScores'] = new_scores_arr
df
| Name | Scores | NewScores | |
|---|---|---|---|
| 0 | Alice | [95 88 92] | [100. 93. 97.] |
| 1 | Bob | [78 81] | [83. 86.] |
| 2 | Charlie | [ 82 89 94 100] | [ 87. 94. 99. 105.] |
3 rows × 3 columns