# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Train a linear regression model with BigQuery DataFrames ML


NOTE: This notebook has been tested in the following environment:

  • Python version = 3.10

Overview

Use this notebook to learn how to train a linear regression model using BigQuery DataFrames ML. BigQuery DataFrames ML provides a scikit-learn-like API for ML powered by the BigQuery engine.

This example is adapted from the BQML linear regression tutorial.

Learn more about BigQuery DataFrames.

Objective

In this tutorial, you use BigQuery DataFrames to create a linear regression model that predicts the weight of an Adelie penguin based on the penguin’s island of residence, culmen length and depth, flipper length, and sex.

The steps include:

  • Creating a DataFrame from a BigQuery table.

  • Cleaning and preparing data using pandas.

  • Creating a linear regression model using bigframes.ml.

  • Saving the ML model to BigQuery for future use.

Dataset

This tutorial uses the penguins table (a BigQuery Public Dataset) which includes data on a set of penguins including species, island of residence, weight, culmen length and depth, flipper length, and sex.

Costs

This tutorial uses billable components of Google Cloud:

  • BigQuery (compute)

  • BigQuery ML

Learn about BigQuery compute pricing and BigQuery ML pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.

Installation

If you don’t have the bigframes package already installed, uncomment and execute the following cells to

  1. Install the package

  2. Restart the notebook kernel (Jupyter or Colab) to work with the package

# !pip install bigframes
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

Before you begin

Complete the tasks in this section to set up your environment.

Set up your Google Cloud project

The following steps are required, regardless of your notebook environment.

  1. Select or create a Google Cloud project. When you first create an account, you get a $300 credit towards your compute/storage costs.

  2. Make sure that billing is enabled for your project.

  3. Enable the BigQuery API.

  4. If you are running this notebook locally, install the Cloud SDK.

Set your project ID

Set your project ID in the cell below. If you don’t know your project ID, you can find it by running gcloud config list or gcloud projects list, or in the project selector of the Google Cloud console.

PROJECT_ID = ""  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

Set the region

You can also change the REGION variable used by BigQuery. Learn more about BigQuery regions.

REGION = "US"  # @param {type: "string"}

Authenticate your Google Cloud account

Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below.

Vertex AI Workbench

Do nothing; you are already authenticated.

Local JupyterLab instance

Uncomment and run the following cell:

# ! gcloud auth login

Colab

Uncomment and run the following cell:

# from google.colab import auth
# auth.authenticate_user()

Import libraries

import bigframes.pandas as bpd

Set BigQuery DataFrames options

# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = PROJECT_ID

# Note: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bpd.options.bigquery.location = REGION

If you want to reset the location of the created DataFrame or Series objects, reset the session by executing bpd.close_session(). After that, you can reuse bpd.options.bigquery.location to specify another location.
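For example, a minimal sketch of switching to a different location (the "EU" multi-region is used here purely as an illustration; use whichever location holds your data):

```python
import bigframes.pandas as bpd

# End the current session so the location can be changed
bpd.close_session()

# Subsequent DataFrames will be created in the new location
bpd.options.bigquery.location = "EU"
```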

Read a BigQuery table into a BigQuery DataFrames DataFrame

Read the penguins table into a BigQuery DataFrames DataFrame:

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

Take a look at the DataFrame:

df.head()

Clean and prepare data

You can use pandas as you normally would on the BigQuery DataFrames DataFrame, but calculations happen in the BigQuery query engine instead of your local environment.

Because this model will focus on the Adelie Penguin species, you need to filter the data for only those rows representing Adelie penguins. Then you drop the species column because it is no longer needed.

As these functions are applied, only the new DataFrame object adelie_data is modified. The source table and the original DataFrame object df don’t change.

# Filter the data down to the Adelie Penguin species
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# Drop the species column
adelie_data = adelie_data.drop(columns=["species"])

# Take a look at the filtered DataFrame
adelie_data

Drop rows with NULL values in order to create a BigQuery DataFrames DataFrame for the training data:

# Drop rows with nulls to get training data
training_data = adelie_data.dropna()

# Take a peek at the training data
training_data

Specify your feature (or input) columns and the label (or output) column:

feature_columns = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
label_columns = training_data[['body_mass_g']]

There is a row within the adelie_data BigQuery DataFrames DataFrame that has a NULL value for the body_mass_g column. body_mass_g is the label column, which is the value that the model you are creating is trying to predict.

Create a new BigQuery DataFrames DataFrame, test_data, for this row so that you can use it as test data on which to make a prediction later:

test_data = adelie_data[adelie_data.body_mass_g.isnull()]

test_data

Create the linear regression model

BigQuery DataFrames ML lets you move from exploring data to creating machine learning models through its scikit-learn-like API, bigframes.ml. BigQuery DataFrames ML supports several types of ML models.

In this notebook, you create a linear regression model, a type of regression model that generates a continuous value from a linear combination of input features.
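Conceptually, the fitted model predicts a label as an intercept plus a weighted sum of the (encoded) input features. A toy sketch in plain Python makes the "linear combination" idea concrete; the intercept and coefficients below are invented for illustration, not taken from the real trained model:

```python
# Hypothetical coefficients for three numeric features (illustrative only)
intercept = -2500.0
coefficients = {
    "culmen_length_mm": 30.0,
    "culmen_depth_mm": 60.0,
    "flipper_length_mm": 25.0,
}

def predict_body_mass_g(features):
    """Linear model: intercept + sum of coefficient * feature value."""
    return intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )

example = {"culmen_length_mm": 38.0, "culmen_depth_mm": 18.0, "flipper_length_mm": 190.0}
print(predict_body_mass_g(example))  # -> 4470.0
```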

When you create a model with BigQuery DataFrames ML, it is saved locally and limited to the BigQuery session. However, as you’ll see in the next section, you can use to_gbq to save the model permanently to your BigQuery project.

Create the model using bigframes.ml

When you pass the feature columns without transforms, BigQuery ML uses automatic preprocessing to encode string values and scale numeric values.

BigQuery ML also automatically splits the data for training and evaluation, although for datasets with fewer than 500 rows (such as this one), all rows are used for training.
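If you prefer explicit control over preprocessing, bigframes.ml also exposes scikit-learn-style transformers and pipelines. The following is a sketch only; the class names are taken from the bigframes.ml modules as I understand them, so verify them against your installed version before relying on this:

```python
from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.linear_model import LinearRegression
from bigframes.ml.pipeline import Pipeline
from bigframes.ml.preprocessing import OneHotEncoder, StandardScaler

# Encode categorical columns and scale numeric columns explicitly
preprocessor = ColumnTransformer([
    ("onehot", OneHotEncoder(), ["island", "sex"]),
    ("scale", StandardScaler(),
     ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]),
])

pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", LinearRegression()),
])

# Fitting runs in BigQuery, as with the plain model:
# pipeline.fit(feature_columns, label_columns)
```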

from bigframes.ml.linear_model import LinearRegression

model = LinearRegression()

model.fit(feature_columns, label_columns)

Score the model

Check how the model performed by using the score method. See the BigQuery ML documentation on model evaluation (ML.EVALUATE) for more information on model scoring.

model.score(feature_columns, label_columns)
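The score method returns regression metrics such as mean squared error and R². As a plain-Python illustration of what an R² value means (this is not the bigframes implementation, just the textbook formula):

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Perfect predictions give an R^2 of 1.0
print(r2_score([3000, 3500, 4000], [3000, 3500, 4000]))  # -> 1.0
```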

Predict using the model

Use the model to predict the body mass of the data row you saved earlier to the test_data DataFrame:

model.predict(test_data)

Save the model in BigQuery

The model is saved locally within this session. You can save the model permanently to BigQuery for use in future sessions, and to make the model sharable with others.

Create a BigQuery dataset to house the model, adding a name for your dataset as the DATASET_ID variable:

DATASET_ID = ""  # @param {type:"string"}

from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID)
dataset = bigquery.Dataset(PROJECT_ID + "." + DATASET_ID)
dataset.location = REGION
dataset = client.create_dataset(dataset, exists_ok=True)
print(f"Dataset {dataset.dataset_id} created.")

Save the model using the to_gbq method:

model.to_gbq(DATASET_ID + ".penguin_weight", replace=True)

You can view the saved model in the BigQuery console under the dataset you created in the first step. Run the following cell and follow the link to view your BigQuery console:

print(f'https://console.developers.google.com/bigquery?p={PROJECT_ID}')

Summary and next steps

You’ve created a linear regression model using bigframes.ml.

Learn more about BigQuery DataFrames in the documentation and find more sample notebooks in the GitHub repo.

Cleaning up

To clean up all Google Cloud resources used in this project, you can delete the Google Cloud project you used for the tutorial.

Otherwise, you can uncomment the remaining cells and run them to delete the individual resources you created in this tutorial:

# # Delete the BigQuery dataset and associated ML model
# from google.cloud import bigquery
# client = bigquery.Client(project=PROJECT_ID)
# client.delete_dataset(
#     DATASET_ID, delete_contents=True, not_found_ok=True
# )
# print("Deleted dataset '{}'.".format(DATASET_ID))