# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Machine Learning Fundamentals with BigQuery DataFrames#


NOTE: This notebook has been tested in the following environment:

  • Python version = 3.10

Overview#

The bigframes.ml module implements Scikit-Learn’s machine learning API in BigQuery DataFrames. It exposes BigQuery’s ML capabilities in a simple, popular API that works seamlessly with the rest of the BigQuery DataFrames API.

Learn more about BigQuery DataFrames.

Objective#

In this tutorial, you will walk through an end-to-end machine learning workflow using BigQuery DataFrames. You will load data, manipulate and prepare it for model training, build supervised and unsupervised models, and evaluate and save a model for future use; all using built-in BigQuery DataFrames functionality.

Dataset#

This tutorial uses the penguins table (a BigQuery public dataset), which contains data on a set of penguins including species, island of residence, weight, culmen length and depth, flipper length, and sex.

Costs#

This tutorial uses billable components of Google Cloud:

  • BigQuery (storage and compute)

  • BigQuery ML

Learn about BigQuery storage pricing, BigQuery compute pricing, and BigQuery ML pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.

Installation#

Depending on your Jupyter environment, you might have to install packages.

Vertex AI Workbench or Colab

Do nothing; the BigQuery DataFrames package is already installed.

Local JupyterLab instance

Uncomment and run the following cell:

# !pip install bigframes

Before you begin#

Complete the tasks in this section to set up your environment.

Set up your Google Cloud project#

The following steps are required, regardless of your notebook environment.

  1. Select or create a Google Cloud project. When you first create an account, you get a $300 credit towards your compute/storage costs.

  2. Make sure that billing is enabled for your project.

  3. Click here to enable the BigQuery API.

  4. If you are running this notebook locally, install the Cloud SDK.

Set your project ID#

Set the PROJECT_ID variable below. If you don't know your project ID, you can find it in the Google Cloud console or by running gcloud projects list.

PROJECT_ID = ""  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}
Updated property [core/project].

Set the region#

You can also change the REGION variable used by BigQuery. Learn more about BigQuery regions.

REGION = "US"  # @param {type: "string"}

Set the dataset ID#

As part of this notebook, you will save BigQuery ML models to your Google Cloud project, which requires a dataset. Create the dataset, if needed, and provide the ID here as the DATASET variable used by BigQuery. Learn how to create a BigQuery dataset.

DATASET = ""  # @param {type: "string"}

Authenticate your Google Cloud account#

Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below.

Vertex AI Workbench

Do nothing; you are already authenticated.

Local JupyterLab instance

Uncomment and run the following cell:

# ! gcloud auth login

Colab

Uncomment and run the following cell:

# from google.colab import auth
# auth.authenticate_user()

Import libraries#

import bigframes.pandas as bpd

Set BigQuery DataFrames options#

# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = PROJECT_ID

# Note: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bpd.options.bigquery.location = REGION

If you want to reset the location of the created DataFrame or Series objects, reset the session by executing bpd.reset_session(). After that, you can reuse bpd.options.bigquery.location to specify another location.

Import data into BigQuery DataFrames#

You can create a DataFrame by reading data from a BigQuery table.

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
df = df.dropna()

# BigQuery DataFrames creates a default numbered index, which we can give a name
df.index.name = "penguin_id"

Take a look at a few rows of the DataFrame:

df.head()
species island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex
penguin_id
0 Gentoo penguin (Pygoscelis papua) Biscoe 50.5 15.9 225.0 5400.0 MALE
1 Gentoo penguin (Pygoscelis papua) Biscoe 45.1 14.5 215.0 5000.0 FEMALE
2 Adelie Penguin (Pygoscelis adeliae) Torgersen 41.4 18.5 202.0 3875.0 MALE
3 Adelie Penguin (Pygoscelis adeliae) Torgersen 38.6 17.0 188.0 2900.0 FEMALE
4 Gentoo penguin (Pygoscelis papua) Biscoe 46.5 14.8 217.0 5200.0 FEMALE

5 rows × 7 columns

[5 rows x 7 columns in total]

Clean and prepare data#

We're going to start with supervised learning, where a linear regression model will learn to predict the body mass (output variable y) from input features such as flipper length, sex, and species (features X).

# Isolate input features and output variable into DataFrames
X = df[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex', 'species']]
y = df[['body_mass_g']]

Part of preparing data for a machine learning task is splitting it into training and testing subsets, so you can check that the solution is not overfitting. By default, BigQuery ML automatically manages splitting the data for you; however, it also supports manually splitting out your training data.

Performing a manual data split can be done with bigframes.ml.model_selection.train_test_split like so:

from bigframes.ml.model_selection import train_test_split

# This will split X and y into test and training sets, with 20% of the rows in the test set,
# and the rest in the training set
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2)

# Show the shape of the data after the split
print(f"""X_train shape: {X_train.shape}
X_test shape: {X_test.shape}
y_train shape: {y_train.shape}
y_test shape: {y_test.shape}""")
X_train shape: (267, 6)
X_test shape: (67, 6)
y_train shape: (267, 1)
y_test shape: (67, 1)

If we look at the data, we can see that random rows were selected for each side of the split:

X_test.head(5)
island culmen_length_mm culmen_depth_mm flipper_length_mm sex species
penguin_id
188 Dream 51.5 18.7 187.0 MALE Chinstrap penguin (Pygoscelis antarctica)
251 Biscoe 49.5 16.1 224.0 MALE Gentoo penguin (Pygoscelis papua)
231 Biscoe 45.7 13.9 214.0 FEMALE Gentoo penguin (Pygoscelis papua)
271 Biscoe 59.6 17.0 230.0 MALE Gentoo penguin (Pygoscelis papua)
128 Biscoe 38.8 17.2 180.0 MALE Adelie Penguin (Pygoscelis adeliae)

5 rows × 6 columns

[5 rows x 6 columns in total]

Note that the y_test data matches the same rows in X_test:

y_test.head(5)
body_mass_g
penguin_id
188 3250.0
251 5650.0
231 4400.0
271 6050.0
128 3800.0

5 rows × 1 columns

[5 rows x 1 columns in total]
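
Conceptually, the split shuffles the row indices and partitions them into an 80/20 divide. The following is a pure-Python sketch of that idea (illustrative names, not the bigframes implementation, which performs the split inside BigQuery):

```python
import random

def simple_train_test_split(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices, then carve off the first test_size fraction
    as the test set and keep the rest for training."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# The cleaned penguins table has 334 rows (267 train + 67 test above)
train_idx, test_idx = simple_train_test_split(334, test_size=0.2)
```

Because the indices are shuffled before partitioning, each subset gets a random, non-overlapping sample of rows, which is why the X/y pairs above stay aligned by penguin_id.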

Estimators#

Following scikit-learn, all learning components are "estimators": objects that can learn from training data and then apply what they learned to new data. Estimators share the following patterns:

  • a constructor that takes a list of parameters

  • a standard string representation that shows the class name and all non-default parameters, e.g. LinearRegression(fit_intercept=False)

  • a .fit(..) method to fit the estimator to training data

These estimators can be further broken down into two main subtypes:

  1. Transformers

  2. Predictors

Let’s walk through each of these with our example model.

Transformers#

Transformers are estimators used to prepare data for consumption by other estimators ("preprocessing"). In addition to .fit(...), a transformer implements a .transform(...) method, which applies a transformation based on what was computed during .fit(...). With this pattern, dynamic preprocessing steps can be applied consistently to both training and test/production data.

An example of a transformer is bigframes.ml.preprocessing.StandardScaler, which rescales a dataset to have a mean of zero and a standard deviation of one:

from bigframes.ml.preprocessing import StandardScaler

# StandardScaler will only work on numeric columns
numeric_columns = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]

scaler = StandardScaler()
scaler.fit(X_train[numeric_columns])

# Now, StandardScaler should transform the numbers to have a mean of zero
# and a standard deviation of one:
scaler.transform(X_train[numeric_columns])
standard_scaled_culmen_length_mm standard_scaled_culmen_depth_mm standard_scaled_flipper_length_mm
penguin_id
0 1.20778 -0.651531 1.772656
2 -0.455602 0.662855 0.100476
3 -0.967412 -0.095445 -0.917372
4 0.476623 -1.207617 1.191028
5 -1.625454 0.359535 -0.626559
7 -0.345929 -1.86481 0.682104
8 0.842202 -1.561491 1.409139
9 0.348671 0.865068 -0.263041
10 0.933596 1.218941 0.827511
11 -1.460943 -0.297658 -0.771966
12 1.317454 -0.449318 1.409139
13 -0.236255 -1.763704 0.900214
14 0.549739 -0.297658 -0.626559
16 0.970154 -1.005404 1.481842
17 -1.058807 -0.348211 -0.190338
18 1.354012 -1.510937 1.263732
19 -0.053466 -1.662597 1.191028
20 -0.199697 -1.510937 0.609401
21 1.152943 0.763962 -0.190338
22 -1.205038 0.308982 -0.699262
24 -0.784623 1.775028 -0.699262
25 -0.83946 1.724474 -0.771966
26 -0.620113 0.359535 -0.990076
27 0.330392 -0.095445 -0.408448
29 2.194842 -0.095445 1.990767

25 rows × 3 columns

[267 rows x 3 columns in total]

We can then repeat this transformation on the test data:

scaler.transform(X_test[numeric_columns])
standard_scaled_culmen_length_mm standard_scaled_culmen_depth_mm standard_scaled_flipper_length_mm
penguin_id
1 0.220718 -1.359277 1.045621
15 -0.510439 0.157322 -0.771966
28 -1.058807 0.713408 -0.771966
32 1.463685 1.168388 0.39129
33 -0.254534 0.056215 -0.990076
34 -0.510439 0.460642 0.318587
37 1.354012 0.511195 -0.263041
41 -0.674949 -0.095445 -1.789814
47 -1.168481 0.662855 -0.117634
52 0.458344 0.308982 -0.699262
56 -1.040528 0.460642 -1.135483
57 -0.967412 0.005662 -0.117634
62 0.988433 -0.752638 1.191028
65 1.756148 1.370601 0.318587
67 0.677691 -1.359277 1.045621
75 -1.113644 1.421155 -0.771966
81 0.677691 0.561748 -0.408448
89 -0.857739 0.713408 -0.771966
92 -0.802902 0.308982 -0.917372
93 -0.309371 1.168388 -0.263041
96 -0.309371 0.662855 -1.499
100 -0.912576 0.814515 -0.771966
101 0.549739 -1.308724 1.554546
102 -0.126582 0.662855 -0.626559
107 1.20778 -1.005404 1.118325

25 rows × 3 columns

[67 rows x 3 columns in total]

Composing transformers#

To process data where different columns need different preprocessors, bigframes.ml.compose.ColumnTransformer can be employed.

Let's create an aggregate transform that applies StandardScaler to the numeric columns and OneHotEncoder to the string columns.

from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import OneHotEncoder

# Create an aggregate transform that applies StandardScaler to the numeric columns,
# and OneHotEncoder to the string columns
preproc = ColumnTransformer([
    ("scale", StandardScaler(), ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]),
    ("encode", OneHotEncoder(), ["species", "sex", "island"])])

# Now we can fit all columns of the training data
preproc.fit(X_train)

processed_X_train = preproc.transform(X_train)
processed_X_test = preproc.transform(X_test)

# View the processed training data
processed_X_train
onehotencoded_island standard_scaled_culmen_length_mm standard_scaled_culmen_depth_mm standard_scaled_flipper_length_mm onehotencoded_sex onehotencoded_species
penguin_id
0 [{'index': 1, 'value': 1.0}] 1.20778 -0.651531 1.772656 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
2 [{'index': 3, 'value': 1.0}] -0.455602 0.662855 0.100476 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
3 [{'index': 3, 'value': 1.0}] -0.967412 -0.095445 -0.917372 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
4 [{'index': 1, 'value': 1.0}] 0.476623 -1.207617 1.191028 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
5 [{'index': 1, 'value': 1.0}] -1.625454 0.359535 -0.626559 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
7 [{'index': 1, 'value': 1.0}] -0.345929 -1.86481 0.682104 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
8 [{'index': 1, 'value': 1.0}] 0.842202 -1.561491 1.409139 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
9 [{'index': 3, 'value': 1.0}] 0.348671 0.865068 -0.263041 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
10 [{'index': 2, 'value': 1.0}] 0.933596 1.218941 0.827511 [{'index': 3, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
11 [{'index': 3, 'value': 1.0}] -1.460943 -0.297658 -0.771966 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
12 [{'index': 1, 'value': 1.0}] 1.317454 -0.449318 1.409139 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
13 [{'index': 1, 'value': 1.0}] -0.236255 -1.763704 0.900214 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
14 [{'index': 2, 'value': 1.0}] 0.549739 -0.297658 -0.626559 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
16 [{'index': 1, 'value': 1.0}] 0.970154 -1.005404 1.481842 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
17 [{'index': 1, 'value': 1.0}] -1.058807 -0.348211 -0.190338 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
18 [{'index': 1, 'value': 1.0}] 1.354012 -1.510937 1.263732 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
19 [{'index': 1, 'value': 1.0}] -0.053466 -1.662597 1.191028 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
20 [{'index': 1, 'value': 1.0}] -0.199697 -1.510937 0.609401 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
21 [{'index': 2, 'value': 1.0}] 1.152943 0.763962 -0.190338 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
22 [{'index': 2, 'value': 1.0}] -1.205038 0.308982 -0.699262 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
24 [{'index': 1, 'value': 1.0}] -0.784623 1.775028 -0.699262 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
25 [{'index': 3, 'value': 1.0}] -0.83946 1.724474 -0.771966 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
26 [{'index': 1, 'value': 1.0}] -0.620113 0.359535 -0.990076 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
27 [{'index': 2, 'value': 1.0}] 0.330392 -0.095445 -0.408448 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
29 [{'index': 1, 'value': 1.0}] 2.194842 -0.095445 1.990767 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]

25 rows × 6 columns

[267 rows x 6 columns in total]
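
The onehotencoded_* columns use a sparse representation: each value is a list holding only the index of the active category. A toy sketch of that encoding (illustrative, not the bigframes internals; the exact index numbering, such as reserving low indices for unseen values, is an assumption here):

```python
def one_hot_sparse(value, categories):
    """Encode a category as a sparse one-hot entry, mirroring the
    [{'index': k, 'value': 1.0}] layout seen in the output above.
    Known categories are numbered starting at 1."""
    index = categories.index(value) + 1
    return [{"index": index, "value": 1.0}]

islands = ["Biscoe", "Dream", "Torgersen"]
one_hot_sparse("Dream", islands)  # [{'index': 2, 'value': 1.0}]
```

Storing only the active index keeps the encoded columns compact even when a categorical feature has many distinct values.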

Predictors#

Predictors are estimators that learn and make predictions. In addition to .fit(...), the predictor implements a .predict(...) method, which will use what was learned during .fit(...) to predict some output.

Predictors can be further broken down into two categories:

  • Supervised predictors

  • Unsupervised predictors
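
The fit/predict pattern common to both can be illustrated with a toy predictor (pure Python, not part of bigframes) that simply learns the mean of the training targets:

```python
class MeanPredictor:
    """A minimal 'predictor': .fit() learns the mean of the training
    targets, and .predict() returns that mean for every new input."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

# Hypothetical body-mass targets (g) for three training rows
model = MeanPredictor().fit([[1], [2], [3]], [4200.0, 3800.0, 5200.0])
model.predict([[4], [5]])  # [4400.0, 4400.0]
```

Real predictors like LinearRegression learn far richer structure during .fit(...), but they expose exactly this same two-method surface.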

Supervised predictors#

Supervised learning is when we train a model on input-output pairs and then ask it to predict the output for new inputs. An example of such a predictor is bigframes.ml.linear_model.LinearRegression.

from bigframes.ml.linear_model import LinearRegression

linreg = LinearRegression()

# Learn from the training data how to predict output y
linreg.fit(processed_X_train, y_train)

# Predict y for the test data
predicted_y_test = linreg.predict(processed_X_test)

# View predictions
predicted_y_test
predicted_body_mass_g onehotencoded_island standard_scaled_culmen_length_mm standard_scaled_culmen_depth_mm standard_scaled_flipper_length_mm onehotencoded_sex onehotencoded_species
penguin_id
1 4772.376044 [{'index': 1, 'value': 1.0}] 0.220718 -1.359277 1.045621 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
15 3883.373922 [{'index': 2, 'value': 1.0}] -0.510439 0.157322 -0.771966 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
28 3479.709088 [{'index': 2, 'value': 1.0}] -1.058807 0.713408 -0.771966 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
32 4223.853626 [{'index': 2, 'value': 1.0}] 1.463685 1.168388 0.39129 [{'index': 3, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
33 3197.623474 [{'index': 2, 'value': 1.0}] -0.254534 0.056215 -0.990076 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
34 4155.26742 [{'index': 2, 'value': 1.0}] -0.510439 0.460642 0.318587 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
37 3991.314095 [{'index': 2, 'value': 1.0}] 1.354012 0.511195 -0.263041 [{'index': 3, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
41 3232.648242 [{'index': 3, 'value': 1.0}] -0.674949 -0.095445 -1.789814 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
47 4017.740788 [{'index': 2, 'value': 1.0}] -1.168481 0.662855 -0.117634 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
52 3365.080596 [{'index': 2, 'value': 1.0}] 0.458344 0.308982 -0.699262 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
56 3791.332002 [{'index': 1, 'value': 1.0}] -1.040528 0.460642 -1.135483 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
57 3547.892992 [{'index': 1, 'value': 1.0}] -0.967412 0.005662 -0.117634 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
62 5372.087702 [{'index': 1, 'value': 1.0}] 0.988433 -0.752638 1.191028 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
65 4263.232169 [{'index': 2, 'value': 1.0}] 1.756148 1.370601 0.318587 [{'index': 3, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
67 5234.45894 [{'index': 1, 'value': 1.0}] 0.677691 -1.359277 1.045621 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
75 3979.314516 [{'index': 1, 'value': 1.0}] -1.113644 1.421155 -0.771966 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
81 3481.331391 [{'index': 2, 'value': 1.0}] 0.677691 0.561748 -0.408448 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
89 3915.240555 [{'index': 2, 'value': 1.0}] -0.857739 0.713408 -0.771966 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
92 3425.563946 [{'index': 2, 'value': 1.0}] -0.802902 0.308982 -0.917372 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
93 4141.497717 [{'index': 1, 'value': 1.0}] -0.309371 1.168388 -0.263041 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
96 3394.72289 [{'index': 2, 'value': 1.0}] -0.309371 0.662855 -1.499 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
100 3507.226918 [{'index': 2, 'value': 1.0}] -0.912576 0.814515 -0.771966 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
101 4922.286202 [{'index': 1, 'value': 1.0}] 0.549739 -1.308724 1.554546 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
102 4016.243221 [{'index': 2, 'value': 1.0}] -0.126582 0.662855 -0.626559 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
107 4933.655362 [{'index': 1, 'value': 1.0}] 1.20778 -1.005404 1.118325 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]

25 rows × 7 columns

[67 rows x 7 columns in total]

Unsupervised predictors#

In unsupervised learning, there are no known outputs in the training data; instead, the model learns from the input data alone and predicts something else. An example of an unsupervised predictor is bigframes.ml.cluster.KMeans, which learns how to fit input data to a target number of clusters.

from bigframes.ml.cluster import KMeans

# Specify KMeans with four clusters
kmeans = KMeans(n_clusters=4)

# Fit data
kmeans.fit(processed_X_train)

# View predictions
kmeans.predict(processed_X_test)
CENTROID_ID NEAREST_CENTROIDS_DISTANCE onehotencoded_island standard_scaled_culmen_length_mm standard_scaled_culmen_depth_mm standard_scaled_flipper_length_mm onehotencoded_sex onehotencoded_species
penguin_id
1 3 [{'CENTROID_ID': 3, 'DISTANCE': 0.857057881337... [{'index': 1, 'value': 1.0}] 0.220718 -1.359277 1.045621 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
15 4 [{'CENTROID_ID': 4, 'DISTANCE': 1.181613302004... [{'index': 2, 'value': 1.0}] -0.510439 0.157322 -0.771966 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
28 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.006856853050... [{'index': 2, 'value': 1.0}] -1.058807 0.713408 -0.771966 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
32 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.237504384283... [{'index': 2, 'value': 1.0}] 1.463685 1.168388 0.39129 [{'index': 3, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
33 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.656439702919... [{'index': 2, 'value': 1.0}] -0.254534 0.056215 -0.990076 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
34 4 [{'CENTROID_ID': 4, 'DISTANCE': 1.343792119214... [{'index': 2, 'value': 1.0}] -0.510439 0.460642 0.318587 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
37 2 [{'CENTROID_ID': 2, 'DISTANCE': 0.816670297369... [{'index': 2, 'value': 1.0}] 1.354012 0.511195 -0.263041 [{'index': 3, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
41 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.317560921596... [{'index': 3, 'value': 1.0}] -0.674949 -0.095445 -1.789814 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
47 4 [{'CENTROID_ID': 4, 'DISTANCE': 1.135112005343... [{'index': 2, 'value': 1.0}] -1.168481 0.662855 -0.117634 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
52 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.004096945181... [{'index': 2, 'value': 1.0}] 0.458344 0.308982 -0.699262 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
56 4 [{'CENTROID_ID': 4, 'DISTANCE': 1.218648668822... [{'index': 1, 'value': 1.0}] -1.040528 0.460642 -1.135483 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
57 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.238466630273... [{'index': 1, 'value': 1.0}] -0.967412 0.005662 -0.117634 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
62 3 [{'CENTROID_ID': 3, 'DISTANCE': 0.876984617451... [{'index': 1, 'value': 1.0}] 0.988433 -0.752638 1.191028 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
65 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.439604004538... [{'index': 2, 'value': 1.0}] 1.756148 1.370601 0.318587 [{'index': 3, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
67 3 [{'CENTROID_ID': 3, 'DISTANCE': 0.763112987694... [{'index': 1, 'value': 1.0}] 0.677691 -1.359277 1.045621 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
75 4 [{'CENTROID_ID': 4, 'DISTANCE': 1.075788925734... [{'index': 1, 'value': 1.0}] -1.113644 1.421155 -0.771966 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
81 2 [{'CENTROID_ID': 2, 'DISTANCE': 0.777307801541... [{'index': 2, 'value': 1.0}] 0.677691 0.561748 -0.408448 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
89 4 [{'CENTROID_ID': 4, 'DISTANCE': 0.891303183824... [{'index': 2, 'value': 1.0}] -0.857739 0.713408 -0.771966 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
92 1 [{'CENTROID_ID': 1, 'DISTANCE': 0.934676470689... [{'index': 2, 'value': 1.0}] -0.802902 0.308982 -0.917372 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
93 4 [{'CENTROID_ID': 4, 'DISTANCE': 0.984620018517... [{'index': 1, 'value': 1.0}] -0.309371 1.168388 -0.263041 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
96 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.446939975674... [{'index': 2, 'value': 1.0}] -0.309371 0.662855 -1.499 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
100 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.101117711572... [{'index': 2, 'value': 1.0}] -0.912576 0.814515 -0.771966 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
101 3 [{'CENTROID_ID': 3, 'DISTANCE': 0.823832007899... [{'index': 1, 'value': 1.0}] 0.549739 -1.308724 1.554546 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
102 4 [{'CENTROID_ID': 4, 'DISTANCE': 0.995348310182... [{'index': 2, 'value': 1.0}] -0.126582 0.662855 -0.626559 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
107 3 [{'CENTROID_ID': 3, 'DISTANCE': 0.930021405831... [{'index': 1, 'value': 1.0}] 1.20778 -1.005404 1.118325 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]

25 rows × 8 columns

[67 rows x 8 columns in total]
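
The CENTROID_ID column is a nearest-centroid assignment: each row is labeled with the cluster whose centroid is closest in Euclidean distance. A pure-Python sketch of that assignment step, using made-up 2-D centroids for illustration:

```python
import math

def nearest_centroid(point, centroids):
    """Return the (1-based) id of the centroid closest to `point`."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    return distances.index(min(distances)) + 1

# Three hypothetical centroids in a 2-D standardized feature space
centroids = [(-1.0, 0.5), (1.2, 0.9), (0.7, -1.3)]
nearest_centroid((0.6, -1.0), centroids)  # 3
```

The NEAREST_CENTROIDS_DISTANCE column in the output above reports those per-centroid distances, which is useful for gauging how confidently a row belongs to its cluster.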

Pipelines#

Transformers and predictors can be chained into a single estimator component using bigframes.ml.pipeline.Pipeline:

from bigframes.ml.pipeline import Pipeline

pipeline = Pipeline([
  ('preproc', preproc),
  ('linreg', linreg)
])

# Print our pipeline
pipeline
Pipeline(steps=[('preproc',
                 ColumnTransformer(transformers=[('scale', StandardScaler(),
                                                  ['culmen_length_mm',
                                                   'culmen_depth_mm',
                                                   'flipper_length_mm']),
                                                 ('encode', OneHotEncoder(),
                                                  ['species', 'sex',
                                                   'island'])])),
                ('linreg', LinearRegression())])

The pipeline simplifies the workflow by applying each of its component steps automatically:

pipeline.fit(X_train, y_train)

predicted_y_test = pipeline.predict(X_test)
predicted_y_test
predicted_body_mass_g island culmen_length_mm culmen_depth_mm flipper_length_mm sex species
penguin_id
1 4772.374547 Biscoe 45.1 14.5 215.0 FEMALE Gentoo penguin (Pygoscelis papua)
15 3883.371052 Dream 41.1 17.5 190.0 MALE Adelie Penguin (Pygoscelis adeliae)
28 3479.706166 Dream 38.1 18.6 190.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
32 4223.851137 Dream 51.9 19.5 206.0 MALE Chinstrap penguin (Pygoscelis antarctica)
33 3197.620461 Dream 42.5 17.3 187.0 FEMALE Chinstrap penguin (Pygoscelis antarctica)
34 4155.265191 Dream 41.1 18.1 205.0 MALE Adelie Penguin (Pygoscelis adeliae)
37 3991.311319 Dream 51.3 18.2 197.0 MALE Chinstrap penguin (Pygoscelis antarctica)
41 3232.644783 Torgersen 40.2 17.0 176.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
47 4017.738303 Dream 37.5 18.5 199.0 MALE Adelie Penguin (Pygoscelis adeliae)
52 3365.077659 Dream 46.4 17.8 191.0 FEMALE Chinstrap penguin (Pygoscelis antarctica)
56 3791.328893 Biscoe 38.2 18.1 185.0 MALE Adelie Penguin (Pygoscelis adeliae)
57 3547.890609 Biscoe 38.6 17.2 199.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
62 5372.086117 Biscoe 49.3 15.7 217.0 MALE Gentoo penguin (Pygoscelis papua)
65 4263.229571 Dream 53.5 19.9 205.0 MALE Chinstrap penguin (Pygoscelis antarctica)
67 5234.457401 Biscoe 47.6 14.5 215.0 MALE Gentoo penguin (Pygoscelis papua)
75 3979.311469 Biscoe 37.8 20.0 190.0 MALE Adelie Penguin (Pygoscelis adeliae)
81 3481.328573 Dream 47.6 18.3 195.0 FEMALE Chinstrap penguin (Pygoscelis antarctica)
89 3915.237615 Dream 39.2 18.6 190.0 MALE Adelie Penguin (Pygoscelis adeliae)
92 3425.560982 Dream 39.5 17.8 188.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
93 4141.494969 Biscoe 42.2 19.5 197.0 MALE Adelie Penguin (Pygoscelis adeliae)
96 3394.719445 Dream 42.2 18.5 180.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
100 3507.223965 Dream 38.9 18.8 190.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
101 4922.284991 Biscoe 46.9 14.6 222.0 FEMALE Gentoo penguin (Pygoscelis papua)
102 4016.240318 Dream 43.2 18.5 192.0 MALE Adelie Penguin (Pygoscelis adeliae)
107 4933.653758 Biscoe 50.5 15.2 216.0 FEMALE Gentoo penguin (Pygoscelis papua)

25 rows × 7 columns

[67 rows x 7 columns in total]

On the backend, a pipeline is compiled into a single BQML model with an embedded TRANSFORM step.
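For intuition, the resulting model resembles a BQML `CREATE MODEL` statement with a `TRANSFORM` clause. The DDL below is a hedged illustration of that shape only; the dataset name, column list, and source query are placeholders, and the exact SQL that BigQuery DataFrames generates may differ:

```sql
-- Illustrative only: rough shape of a pipeline compiled to a single BQML model.
CREATE OR REPLACE MODEL `my_dataset.penguins_pipeline`
  TRANSFORM(
    ML.STANDARD_SCALER(culmen_length_mm) OVER () AS culmen_length_mm,
    ML.STANDARD_SCALER(culmen_depth_mm) OVER () AS culmen_depth_mm,
    ML.STANDARD_SCALER(flipper_length_mm) OVER () AS flipper_length_mm,
    island, sex, species,  -- categorical columns, encoded per the pipeline
    body_mass_g            -- label column passed through
  )
  OPTIONS (model_type = 'linear_reg', input_label_cols = ['body_mass_g'])
AS SELECT /* training columns */ FROM /* source table */;
```

Because the preprocessing lives inside the model, callers of `ML.PREDICT` can pass raw, untransformed rows.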

Evaluating results#

Some models include a convenient .score(X, y) method for evaluation with a preset suite of metrics:

# In the case of a pipeline, this will be equivalent to calling .score on the contained LinearRegression
pipeline.score(X_test, y_test)
Query job c098e1d1-b3ed-4ec5-94c7-6ba3b2b59e3f is DONE. 29.6 kB processed. Open Job
Query job 035234b0-537a-44ce-adff-bb51c40b4ffa is DONE. 0 Bytes processed. Open Job
Query job b4a2a367-3e06-4fa3-9f00-bdbca884cfdd is DONE. 48 Bytes processed. Open Job
mean_absolute_error mean_squared_error mean_squared_log_error median_absolute_error r2_score explained_variance
0 225.883512 77765.989281 0.004457 179.548041 0.873166 0.873315

1 rows × 6 columns

[1 rows x 6 columns in total]

For a more general approach, the bigframes.ml.metrics module provides standalone metric functions:

from bigframes.ml.metrics import r2_score

r2_score(y_test, predicted_y_test["predicted_body_mass_g"])
Query job 20ec1716-3e8e-4d3f-ba08-1f7b9970ce3f is DONE. 28.9 kB processed. Open Job
Query job 6f628f3b-62df-4a5a-8e05-0b313db0ed07 is DONE. 28.9 kB processed. Open Job
Query job c4eee1e5-146f-4a52-8499-83fe5f701f53 is DONE. 30.0 kB processed. Open Job
0.8731660699616813
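The value returned by r2_score is the coefficient of determination, R² = 1 − SS_res / SS_tot. A minimal pure-Python sketch with made-up numbers (not the penguins data) illustrates the formula:

```python
# Coefficient of determination: R^2 = 1 - SS_res / SS_tot.
# Toy values for illustration only; not taken from the penguins dataset.
y_true = [3000.0, 3500.0, 4000.0, 4500.0]
y_pred = [2900.0, 3600.0, 3900.0, 4600.0]

mean_true = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)  # → 0.968
```

An R² of 1.0 means the predictions explain all of the variance in the labels; the ~0.87 above indicates the pipeline explains most, but not all, of the variation in penguin body mass.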

Save to BigQuery#

Estimators can be saved to BigQuery as BQML models and loaded again in the future.

Saving requires the bigquery.tables.create permission, and loading requires the bigquery.models.getMetadata permission. These permissions can be granted at either the project level or the dataset level.

If you have these permissions, uncomment and run the code in the following cells.

linreg.to_gbq(f"{DATASET}.penguins_model", replace=True)
Copy job 06c2b62d-a7aa-46a5-a04a-2f189bafc5ee is DONE. Open Job
Pipeline(steps=[('transform',
                 ColumnTransformer(transformers=[('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'island'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_length_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_depth_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'flipper_length_mm'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'sex'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'species')])),
                ('estimator',
                 LinearRegression(optimize_strategy='NORMAL_EQUATION'))])
bpd.read_gbq_model(f"{DATASET}.penguins_model")
Pipeline(steps=[('transform',
                 ColumnTransformer(transformers=[('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'island'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_length_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_depth_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'flipper_length_mm'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'sex'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'species')])),
                ('estimator',
                 LinearRegression(optimize_strategy='NORMAL_EQUATION'))])

We can also save the pipeline to BigQuery. BigQuery will save this as a single model, with the pre-processing steps embedded in the TRANSFORM property:

pipeline.to_gbq(f"{DATASET}.penguins_pipeline", replace=True)
Copy job a0ed8c1b-3a3f-4995-853c-e151d41560d7 is DONE. Open Job
Pipeline(steps=[('transform',
                 ColumnTransformer(transformers=[('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'island'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_length_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_depth_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'flipper_length_mm'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'sex'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'species')])),
                ('estimator',
                 LinearRegression(optimize_strategy='NORMAL_EQUATION'))])
bpd.read_gbq_model(f"{DATASET}.penguins_pipeline")
Pipeline(steps=[('transform',
                 ColumnTransformer(transformers=[('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'island'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_length_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_depth_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'flipper_length_mm'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'sex'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'species')])),
                ('estimator',
                 LinearRegression(optimize_strategy='NORMAL_EQUATION'))])

Summary and next steps#

You’ve completed an end-to-end machine learning workflow using the built-in capabilities of BigQuery DataFrames.

Learn more about BigQuery DataFrames in the documentation and find more sample notebooks in the GitHub repo.

Cleaning up#

To clean up all Google Cloud resources used in this project, you can delete the Google Cloud project you used for the tutorial.

Otherwise, you can uncomment the remaining cells and run them to delete the individual resources you created in this tutorial:

# # Delete the BQML models
# MODEL_NAME = f"{PROJECT_ID}:{DATASET}.penguins_model"
# ! bq rm -f --model {MODEL_NAME}
# PIPELINE_NAME = f"{PROJECT_ID}:{DATASET}.penguins_pipeline"
# ! bq rm -f --model {PIPELINE_NAME}