# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Train a linear regression model with BigQuery DataFrames ML
NOTE: This notebook has been tested in the following environment:
Python version = 3.10
## Overview
Use this notebook to learn how to train a linear regression model using BigQuery ML and the bigframes.bigquery module.
This example is adapted from the BQML linear regression tutorial.
Learn more about BigQuery DataFrames.
## Objective
In this tutorial, you use BigQuery DataFrames to create a linear regression model that predicts the weight of an Adelie penguin based on the penguin’s island of residence, culmen length and depth, flipper length, and sex.
The steps include:

- Creating a DataFrame from a BigQuery table.
- Cleaning and preparing data using pandas.
- Creating a linear regression model using `bigframes.bigquery.ml`.
- Saving the ML model to BigQuery for future use.
## Dataset
This tutorial uses the penguins table (a BigQuery Public Dataset) which includes data on a set of penguins including species, island of residence, weight, culmen length and depth, flipper length, and sex.
## Costs
This tutorial uses billable components of Google Cloud:
- BigQuery (compute)
- BigQuery ML
Learn about BigQuery compute pricing and BigQuery ML pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.
## Installation
If you don’t already have the `bigframes` package installed, uncomment and execute the following cells to:

- Install the package.
- Restart the notebook kernel (Jupyter or Colab) so it can use the package.
# !pip install bigframes
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython
# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)
## Before you begin
Complete the tasks in this section to set up your environment.
### Set up your Google Cloud project
The following steps are required, regardless of your notebook environment.
Select or create a Google Cloud project. When you first create an account, you get a $300 credit towards your compute/storage costs.
If you are running this notebook locally, install the Cloud SDK.
### Set your project ID
If you don’t know your project ID, try the following:

- Run `gcloud config list`.
- Run `gcloud projects list`.
- See the support page: Locate the project ID.
PROJECT_ID = "" # @param {type:"string"}
# Set the project id
! gcloud config set project {PROJECT_ID}
Updated property [core/project].
### Set the region
You can also change the REGION variable used by BigQuery. Learn more about BigQuery regions.
REGION = "US" # @param {type: "string"}
### Authenticate your Google Cloud account
Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below.
Vertex AI Workbench
Do nothing, you are already authenticated.
Local JupyterLab instance
Uncomment and run the following cell:
# ! gcloud auth login
Colab
Uncomment and run the following cell:
# from google.colab import auth
# auth.authenticate_user()
## Import libraries
import bigframes.pandas as bpd
## Set BigQuery DataFrames options
# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = PROJECT_ID
# Note: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bpd.options.bigquery.location = REGION
# Recommended for performance. Disables pandas default ordering of all rows.
bpd.options.bigquery.ordering_mode = "partial"
If you want to reset the location of the created DataFrame or Series objects, reset the session by executing `bpd.close_session()`. After that, you can set `bpd.options.bigquery.location` again to specify another location.
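For example, to move subsequent work to a different location, you could reset the session and then set the option again. This is a configuration sketch; "EU" is just an example value:

```python
# Sketch: close the current session so a new location can take effect.
# "EU" is an example value; use any supported BigQuery location.
import bigframes.pandas as bpd

bpd.close_session()
bpd.options.bigquery.location = "EU"
```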
## Read a BigQuery table into a BigQuery DataFrames DataFrame
Read the penguins table into a BigQuery DataFrames DataFrame:
df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
Take a look at the DataFrame:
df.peek()
| | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie Penguin (Pygoscelis adeliae) | Dream | 36.6 | 18.4 | 184.0 | 3475.0 | FEMALE |
| 1 | Adelie Penguin (Pygoscelis adeliae) | Dream | 39.8 | 19.1 | 184.0 | 4650.0 | MALE |
| 2 | Adelie Penguin (Pygoscelis adeliae) | Dream | 40.9 | 18.9 | 184.0 | 3900.0 | MALE |
| 3 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | FEMALE |
| 4 | Adelie Penguin (Pygoscelis adeliae) | Dream | 37.3 | 16.8 | 192.0 | 3000.0 | FEMALE |
## Clean and prepare data
You can use pandas as you normally would on the BigQuery DataFrames DataFrame, but calculations happen in the BigQuery query engine instead of your local environment.
Because this model will focus on the Adelie Penguin species, you need to filter the data for only those rows representing Adelie penguins. Then you drop the species column because it is no longer needed.
As these functions are applied, only the new DataFrame object adelie_data is modified. The source table and the original DataFrame object df don’t change.
# Filter down to the data to the Adelie Penguin species
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]
# Drop the species column
adelie_data = adelie_data.drop(columns=["species"])
# Take a look at the filtered DataFrame
adelie_data
| | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|
| 0 | Dream | 36.6 | 18.4 | 184.0 | 3475.0 | FEMALE |
| 1 | Dream | 39.8 | 19.1 | 184.0 | 4650.0 | MALE |
| 2 | Dream | 40.9 | 18.9 | 184.0 | 3900.0 | MALE |
| 3 | Dream | 37.3 | 16.8 | 192.0 | 3000.0 | FEMALE |
| 4 | Dream | 43.2 | 18.5 | 192.0 | 4100.0 | MALE |
| 5 | Dream | 40.2 | 20.1 | 200.0 | 3975.0 | MALE |
| 6 | Dream | 40.8 | 18.9 | 208.0 | 4300.0 | MALE |
| 7 | Dream | 39.0 | 18.7 | 185.0 | 3650.0 | MALE |
| 8 | Dream | 37.0 | 16.9 | 185.0 | 3000.0 | FEMALE |
| 9 | Dream | 34.0 | 17.1 | 185.0 | 3400.0 | FEMALE |
10 rows × 6 columns
Drop rows with NULL values in order to create a BigQuery DataFrames DataFrame for the training data:
# Drop rows with nulls to get training data
training_data = adelie_data.dropna()
# Take a peek at the training data
training_data
| | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|
| 0 | Dream | 36.6 | 18.4 | 184.0 | 3475.0 | FEMALE |
| 1 | Dream | 39.8 | 19.1 | 184.0 | 4650.0 | MALE |
| 2 | Dream | 40.9 | 18.9 | 184.0 | 3900.0 | MALE |
| 3 | Dream | 37.3 | 16.8 | 192.0 | 3000.0 | FEMALE |
| 4 | Dream | 43.2 | 18.5 | 192.0 | 4100.0 | MALE |
| 5 | Dream | 40.2 | 20.1 | 200.0 | 3975.0 | MALE |
| 6 | Dream | 40.8 | 18.9 | 208.0 | 4300.0 | MALE |
| 7 | Dream | 39.0 | 18.7 | 185.0 | 3650.0 | MALE |
| 8 | Dream | 37.0 | 16.9 | 185.0 | 3000.0 | FEMALE |
| 9 | Dream | 34.0 | 17.1 | 185.0 | 3400.0 | FEMALE |
10 rows × 6 columns
## Create the linear regression model
In this notebook, you create a linear regression model, a type of regression model that generates a continuous value from a linear combination of input features.
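As a refresher on the underlying idea, a linear regression prediction is just an intercept plus a weighted sum of the input features. The weights below are invented for illustration; they are not the coefficients BigQuery ML will learn:

```python
# Illustrative only: a linear model predicts an intercept plus a weighted
# sum of feature values. These weights are invented for this sketch; they
# are not the coefficients BigQuery ML will learn.
weights = {
    "culmen_length_mm": 20.0,
    "culmen_depth_mm": 15.0,
    "flipper_length_mm": 10.0,
}
intercept = 500.0

def predict_mass(features):
    """Predict body mass (g) as intercept + sum(weight * feature value)."""
    return intercept + sum(weights[name] * value for name, value in features.items())

example = {"culmen_length_mm": 39.8, "culmen_depth_mm": 19.1, "flipper_length_mm": 184.0}
print(predict_mass(example))  # intercept + 20*39.8 + 15*19.1 + 10*184.0
```

BigQuery ML learns the weights and intercept from the training data, and (as described below) automatically encodes categorical features such as island and sex into numeric inputs.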
Create a BigQuery dataset to house the model, adding a name for your dataset as the DATASET_ID variable:
DATASET_ID = "bqml_tutorial" # @param {type:"string"}
from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID)
dataset = bigquery.Dataset(PROJECT_ID + "." + DATASET_ID)
dataset.location = REGION
dataset = client.create_dataset(dataset, exists_ok=True)
print(f"Dataset {dataset.dataset_id} created.")
Dataset bqml_tutorial created.
### Create the model using `bigframes.bigquery.ml.create_model`
When you pass the feature columns without transforms, BigQuery ML uses automatic preprocessing to encode string values and scale numeric values.
BigQuery ML also automatically splits the data into training and evaluation sets, although for datasets with fewer than 500 rows (such as this one), all rows are used for training.
import bigframes.bigquery as bbq
model_name = f"{PROJECT_ID}.{DATASET_ID}.penguin_weight"
model_metadata = bbq.ml.create_model(
model_name,
replace=True,
options={
"model_type": "LINEAR_REG",
},
training_data=training_data.rename(columns={"body_mass_g": "label"})
)
model_metadata
SQL
CREATE OR REPLACE MODEL `bigframes-dev.bqml_tutorial.penguin_weight` OPTIONS(model_type = 'LINEAR_REG') AS SELECT `bfuid_col_3` AS `island`, `bfuid_col_4` AS `culmen_length_mm`, `bfuid_col_5` AS `culmen_depth_mm`, `bfuid_col_6` AS `flipper_length_mm`, `bfuid_col_7` AS `label`, `bfuid_col_8` AS `sex` FROM (SELECT `t0`.`bfuid_col_3`, `t0`.`bfuid_col_4`, `t0`.`bfuid_col_5`, `t0`.`bfuid_col_6`, `t0`.`bfuid_col_7`, `t0`.`bfuid_col_8` FROM `bigframes-dev._63cfa399614a54153cc386c27d6c0c6fdb249f9e._e154f0aa_5b29_492a_b464_a77c5f5a3dbd_bqdf_60fa3196-5a3e-45ae-898e-c2b473bfa1e9` AS `t0`)
etag P3XS+g0ZZM19ywL+hdwUmQ==
modelReference {'projectId': 'bigframes-dev', 'datasetId': 'b...
creationTime 1764779445166
lastModifiedTime 1764779445237
modelType LINEAR_REGRESSION
trainingRuns [{'trainingOptions': {'lossType': 'MEAN_SQUARE...
featureColumns [{'name': 'island', 'type': {'typeKind': 'STRI...
labelColumns [{'name': 'predicted_label', 'type': {'typeKin...
location US
dtype: object
## Evaluate the model
Check how the model performed by using the `evaluate` function. More information on model evaluation is available in the BigQuery ML documentation.
bbq.ml.evaluate(model_name)
| | mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|---|
| 0 | 223.878763 | 78553.601634 | 0.005614 | 181.330911 | 0.623951 | 0.623951 |
1 rows × 6 columns
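As a reminder of what these metrics mean, here is a small pure-Python sketch computing mean absolute error, mean squared error, and R² from a handful of invented values (these are not the model's actual predictions):

```python
# Invented example values; not the model's actual predictions.
y_true = [3475.0, 4650.0, 3900.0, 3000.0]
y_pred = [3600.0, 4400.0, 3800.0, 3100.0]

n = len(y_true)
mean_absolute_error = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
mean_squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
y_mean = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - y_mean) ** 2 for t in y_true)
r2_score = 1 - ss_res / ss_tot

print(mean_absolute_error, mean_squared_error, r2_score)
```

An `r2_score` close to 1 means the model explains most of the variance in body mass; the roughly 0.62 reported above leaves room for improvement, for example by adding features or transforms.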
## Use the model to predict outcomes
Now that you have evaluated your model, the next step is to use it to predict an outcome. You can run the `bigframes.bigquery.ml.predict` function on the model to predict the body mass in grams of all penguins that reside on the Biscoe Islands.
df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
biscoe = df[df["island"].str.contains("Biscoe")]
bbq.ml.predict(model_name, biscoe)
| | predicted_label | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|---|
| 0 | 3945.010052 | Gentoo penguin (Pygoscelis papua) | Biscoe | <NA> | <NA> | <NA> | <NA> | <NA> |
| 1 | 3914.916297 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 39.7 | 18.9 | 184.0 | 3550.0 | MALE |
| 2 | 3278.611224 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 36.4 | 17.1 | 184.0 | 2850.0 | FEMALE |
| 3 | 4006.367355 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 41.6 | 18.0 | 192.0 | 3950.0 | MALE |
| 4 | 3417.610478 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 35.0 | 17.9 | 192.0 | 3725.0 | FEMALE |
| 5 | 4009.612421 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 41.1 | 18.2 | 192.0 | 4050.0 | MALE |
| 6 | 4231.330911 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 42.0 | 19.5 | 200.0 | 4050.0 | MALE |
| 7 | 3554.308906 | Gentoo penguin (Pygoscelis papua) | Biscoe | 43.8 | 13.9 | 208.0 | 4300.0 | FEMALE |
| 8 | 3550.677455 | Gentoo penguin (Pygoscelis papua) | Biscoe | 43.3 | 14.0 | 208.0 | 4575.0 | FEMALE |
| 9 | 3537.882543 | Gentoo penguin (Pygoscelis papua) | Biscoe | 44.0 | 13.6 | 208.0 | 4350.0 | FEMALE |
10 rows × 8 columns
## Explain the prediction results
To understand why the model generates these prediction results, you can use the `explain_predict` function.
bbq.ml.explain_predict(model_name, biscoe, top_k_features=3)
SQL
SELECT * FROM ML.EXPLAIN_PREDICT(MODEL `bigframes-dev.bqml_tutorial.penguin_weight`, (SELECT
`bfuid_col_22` AS `species`,
`bfuid_col_23` AS `island`,
`bfuid_col_24` AS `culmen_length_mm`,
`bfuid_col_25` AS `culmen_depth_mm`,
`bfuid_col_26` AS `flipper_length_mm`,
`bfuid_col_27` AS `body_mass_g`,
`bfuid_col_28` AS `sex`
FROM
(SELECT
`t0`.`species`,
`t0`.`island`,
`t0`.`culmen_length_mm`,
`t0`.`culmen_depth_mm`,
`t0`.`flipper_length_mm`,
`t0`.`body_mass_g`,
`t0`.`sex`,
`t0`.`species` AS `bfuid_col_22`,
`t0`.`island` AS `bfuid_col_23`,
`t0`.`culmen_length_mm` AS `bfuid_col_24`,
`t0`.`culmen_depth_mm` AS `bfuid_col_25`,
`t0`.`flipper_length_mm` AS `bfuid_col_26`,
`t0`.`body_mass_g` AS `bfuid_col_27`,
`t0`.`sex` AS `bfuid_col_28`,
regexp_contains(`t0`.`island`, 'Biscoe') AS `bfuid_col_29`
FROM (
SELECT
`species`,
`island`,
`culmen_length_mm`,
`culmen_depth_mm`,
`flipper_length_mm`,
`body_mass_g`,
`sex`
FROM `bigquery-public-data.ml_datasets.penguins` FOR SYSTEM_TIME AS OF TIMESTAMP('2025-12-03T16:30:18.272882+00:00')
) AS `t0`
WHERE
regexp_contains(`t0`.`island`, 'Biscoe'))), STRUCT(3 AS top_k_features))
| | predicted_label | top_feature_attributions | baseline_prediction_value | prediction_value | approximation_error | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3945.010052 | [{'feature': 'island', 'attribution': 0.0} {'... | 3945.010052 | 3945.010052 | 0.0 | Gentoo penguin (Pygoscelis papua) | Biscoe | <NA> | <NA> | <NA> | <NA> | <NA> |
| 1 | 3914.916297 | [{'feature': 'flipper_length_mm', 'attribution... | 3945.010052 | 3914.916297 | 0.0 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 39.7 | 18.9 | 184.0 | 3550.0 | MALE |
| 2 | 3278.611224 | [{'feature': 'sex', 'attribution': -443.175184... | 3945.010052 | 3278.611224 | 0.0 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 36.4 | 17.1 | 184.0 | 2850.0 | FEMALE |
| 3 | 4006.367355 | [{'feature': 'culmen_length_mm', 'attribution'... | 3945.010052 | 4006.367355 | 0.0 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 41.6 | 18.0 | 192.0 | 3950.0 | MALE |
| 4 | 3417.610478 | [{'feature': 'sex', 'attribution': -443.175184... | 3945.010052 | 3417.610478 | 0.0 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 35.0 | 17.9 | 192.0 | 3725.0 | FEMALE |
| 5 | 4009.612421 | [{'feature': 'culmen_length_mm', 'attribution'... | 3945.010052 | 4009.612421 | 0.0 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 41.1 | 18.2 | 192.0 | 4050.0 | MALE |
| 6 | 4231.330911 | [{'feature': 'flipper_length_mm', 'attribution... | 3945.010052 | 4231.330911 | 0.0 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 42.0 | 19.5 | 200.0 | 4050.0 | MALE |
| 7 | 3554.308906 | [{'feature': 'sex', 'attribution': -443.175184... | 3945.010052 | 3554.308906 | 0.0 | Gentoo penguin (Pygoscelis papua) | Biscoe | 43.8 | 13.9 | 208.0 | 4300.0 | FEMALE |
| 8 | 3550.677455 | [{'feature': 'sex', 'attribution': -443.175184... | 3945.010052 | 3550.677455 | 0.0 | Gentoo penguin (Pygoscelis papua) | Biscoe | 43.3 | 14.0 | 208.0 | 4575.0 | FEMALE |
| 9 | 3537.882543 | [{'feature': 'sex', 'attribution': -443.175184... | 3945.010052 | 3537.882543 | 0.0 | Gentoo penguin (Pygoscelis papua) | Biscoe | 44.0 | 13.6 | 208.0 | 4350.0 | FEMALE |
10 rows × 12 columns
## Globally explain the model
To learn which features are generally the most important for determining penguin weight, you can use the `global_explain` function. To use `global_explain`, you must retrain the model with the `enable_global_explain` option set to `True`.
model_name = f"{PROJECT_ID}.{DATASET_ID}.penguin_weight_with_global_explain"
model_metadata = bbq.ml.create_model(
model_name,
replace=True,
options={
"model_type": "LINEAR_REG",
"input_label_cols": ["body_mass_g"],
"enable_global_explain": True,
},
training_data=training_data,
)
bbq.ml.global_explain(model_name)
| | feature | attribution |
|---|---|---|
| 0 | sex | 221.587592 |
| 1 | flipper_length_mm | 71.311846 |
| 2 | culmen_depth_mm | 66.17986 |
| 3 | culmen_length_mm | 45.443363 |
| 4 | island | 17.258076 |
5 rows × 2 columns
## Compatibility with pandas
The functions in `bigframes.bigquery.ml` can accept pandas DataFrames as well. Use the `to_pandas()` method on the results of methods like `predict()` to get a pandas DataFrame back.
import pandas as pd
predict_df = pd.DataFrame({
"sex": ["MALE", "FEMALE", "MALE", "FEMALE"],
"flipper_length_mm": [180, 190, 200, 210],
"culmen_depth_mm": [15, 16, 17, 18],
"culmen_length_mm": [40, 41, 42, 43],
"island": ["Biscoe", "Biscoe", "Dream", "Dream"],
})
bbq.ml.predict(model_metadata, predict_df).to_pandas()
SQL
SELECT * FROM ML.PREDICT(MODEL `bigframes-dev.bqml_tutorial.penguin_weight_with_global_explain`, (SELECT
`column_0` AS `sex`,
`column_1` AS `flipper_length_mm`,
`column_2` AS `culmen_depth_mm`,
`column_3` AS `culmen_length_mm`,
`column_4` AS `island`
FROM
(SELECT
*
FROM (
SELECT
*
FROM UNNEST(ARRAY<STRUCT<`column_0` STRING, `column_1` INT64, `column_2` INT64, `column_3` INT64, `column_4` STRING>>[STRUCT('MALE', 180, 15, 40, 'Biscoe'), STRUCT('FEMALE', 190, 16, 41, 'Biscoe'), STRUCT('MALE', 200, 17, 42, 'Dream'), STRUCT('FEMALE', 210, 18, 43, 'Dream')]) AS `column_0`
) AS `t0`)))
| | predicted_body_mass_g | sex | flipper_length_mm | culmen_depth_mm | culmen_length_mm | island |
|---|---|---|---|---|---|---|
| 0 | 3596.332211 | MALE | 180 | 15 | 40 | Biscoe |
| 1 | 3384.699918 | FEMALE | 190 | 16 | 41 | Biscoe |
| 2 | 4049.581796 | MALE | 200 | 17 | 42 | Dream |
| 3 | 3837.949503 | FEMALE | 210 | 18 | 43 | Dream |
## Compatibility with bigframes.ml
The models created with `bigframes.bigquery.ml` can be used with the scikit-learn-like `bigframes.ml` modules by using the `read_gbq_model` method.
model = bpd.read_gbq_model(model_name)
model
LinearRegression(enable_global_explain=True,
optimize_strategy='NORMAL_EQUATION')
X = training_data[["sex", "flipper_length_mm", "culmen_depth_mm", "culmen_length_mm", "island"]]
y = training_data[["body_mass_g"]]
model.score(X, y)
| | mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|---|
| 0 | 223.878763 | 78553.601634 | 0.005614 | 181.330911 | 0.623951 | 0.623951 |
1 rows × 6 columns
## Summary and next steps
You’ve created a linear regression model using bigframes.bigquery.ml.
Learn more about BigQuery DataFrames in the documentation and find more sample notebooks in the GitHub repo.
## Cleaning up
To clean up all Google Cloud resources used in this project, you can delete the Google Cloud project you used for the tutorial.
Otherwise, you can uncomment the remaining cells and run them to delete the individual resources you created in this tutorial:
# # Delete the BigQuery dataset and associated ML model
# from google.cloud import bigquery
# client = bigquery.Client(project=PROJECT_ID)
# client.delete_dataset(
# DATASET_ID, delete_contents=True, not_found_ok=True
# )
# print("Deleted dataset '{}'.".format(DATASET_ID))