# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Machine Learning Fundamentals with BigQuery DataFrames#


NOTE: This notebook has been tested in the following environment:

  • Python version = 3.10

Overview#

The bigframes.ml module implements Scikit-Learn’s machine learning API in BigQuery DataFrames. It exposes BigQuery’s ML capabilities in a simple, popular API that works seamlessly with the rest of the BigQuery DataFrames API.

Learn more about BigQuery DataFrames.

Objective#

In this tutorial, you will walk through an end-to-end machine learning workflow using BigQuery DataFrames. You will load data, manipulate and prepare it for model training, build supervised and unsupervised models, and evaluate and save a model for future use; all using built-in BigQuery DataFrames functionality.

Dataset#

This tutorial uses the penguins table (a BigQuery public dataset), which contains data on a set of penguins including species, island of residence, weight, culmen length and depth, flipper length, and sex.

Costs#

This tutorial uses billable components of Google Cloud:

  • BigQuery (storage and compute)

  • BigQuery ML

Learn about BigQuery storage pricing, BigQuery compute pricing, and BigQuery ML pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.

Installation#

Depending on your Jupyter environment, you might have to install packages.

Vertex AI Workbench or Colab

Do nothing; the BigQuery DataFrames package is already installed.

Local JupyterLab instance

Uncomment and run the following cell:

# !pip install bigframes

Before you begin#

Complete the tasks in this section to set up your environment.

Set up your Google Cloud project#

The following steps are required, regardless of your notebook environment.

  1. Select or create a Google Cloud project. When you first create an account, you get a $300 credit towards your compute/storage costs.

  2. Make sure that billing is enabled for your project.

  3. Click here to enable the BigQuery API.

  4. If you are running this notebook locally, install the Cloud SDK.

Set your project ID#

Set the PROJECT_ID variable below. If you don't know your project ID, you can find it in the Google Cloud console or by running gcloud projects list.

PROJECT_ID = ""  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}
Updated property [core/project].

Set the region#

You can also change the REGION variable used by BigQuery. Learn more about BigQuery regions.

REGION = "US"  # @param {type: "string"}

Set the dataset ID#

As part of this notebook, you will save BigQuery ML models to your Google Cloud project, which requires a dataset. Create the dataset, if needed, and provide the ID here as the DATASET variable used by BigQuery. Learn how to create a BigQuery dataset.

DATASET = ""  # @param {type: "string"}

Authenticate your Google Cloud account#

Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below.

Vertex AI Workbench

Do nothing; you are already authenticated.

Local JupyterLab instance

Uncomment and run the following cell:

# ! gcloud auth login

Colab

Uncomment and run the following cell:

# from google.colab import auth
# auth.authenticate_user()

Import libraries#

import bigframes.pandas as bpd

Set BigQuery DataFrames options#

# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = PROJECT_ID

# Note: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bpd.options.bigquery.location = REGION

If you want to reset the location of the created DataFrame or Series objects, reset the session by executing bpd.reset_session(). After that, you can reuse bpd.options.bigquery.location to specify another location.

Import data into BigQuery DataFrames#

You can create a DataFrame by reading data from a BigQuery table.

df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
df = df.dropna()

# BigQuery DataFrames creates a default numbered index, which we can give a name
df.index.name = "penguin_id"

Take a look at a few rows of the DataFrame:

df.head()
species island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex
penguin_id
0 Gentoo penguin (Pygoscelis papua) Biscoe 50.5 15.9 225.0 5400.0 MALE
1 Gentoo penguin (Pygoscelis papua) Biscoe 45.1 14.5 215.0 5000.0 FEMALE
2 Adelie Penguin (Pygoscelis adeliae) Torgersen 41.4 18.5 202.0 3875.0 MALE
3 Adelie Penguin (Pygoscelis adeliae) Torgersen 38.6 17.0 188.0 2900.0 FEMALE
4 Gentoo penguin (Pygoscelis papua) Biscoe 46.5 14.8 217.0 5200.0 FEMALE

5 rows × 7 columns

[5 rows x 7 columns in total]

Clean and prepare data#

We're going to start with supervised learning, where a linear regression model will learn to predict the body mass (output variable y) from input features such as flipper length, sex, and species (features X).

# Isolate input features and output variable into DataFrames
X = df[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex', 'species']]
y = df[['body_mass_g']]

Part of preparing data for a machine learning task is splitting it into training and testing subsets, so you can check that the solution is not overfitting. By default, BigQuery ML automatically manages splitting the data for you; however, it also supports manually splitting out your training data.

Performing a manual data split can be done with bigframes.ml.model_selection.train_test_split like so:

from bigframes.ml.model_selection import train_test_split

# This will split X and y into test and training sets, with 20% of the rows in the test set,
# and the rest in the training set
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2)

# Show the shape of the data after the split
print(f"""X_train shape: {X_train.shape}
X_test shape: {X_test.shape}
y_train shape: {y_train.shape}
y_test shape: {y_test.shape}""")
X_train shape: (267, 6)
X_test shape: (67, 6)
y_train shape: (267, 1)
y_test shape: (67, 1)

If we look at the data, we can see that random rows were selected for each side of the split:

X_test.head(5)
island culmen_length_mm culmen_depth_mm flipper_length_mm sex species
penguin_id
188 Dream 51.5 18.7 187.0 MALE Chinstrap penguin (Pygoscelis antarctica)
251 Biscoe 49.5 16.1 224.0 MALE Gentoo penguin (Pygoscelis papua)
231 Biscoe 45.7 13.9 214.0 FEMALE Gentoo penguin (Pygoscelis papua)
271 Biscoe 59.6 17.0 230.0 MALE Gentoo penguin (Pygoscelis papua)
128 Biscoe 38.8 17.2 180.0 MALE Adelie Penguin (Pygoscelis adeliae)

5 rows × 6 columns

[5 rows x 6 columns in total]

Note that the y_test data matches the same rows in X_test:

y_test.head(5)
body_mass_g
penguin_id
188 3250.0
251 5650.0
231 4400.0
271 6050.0
128 3800.0

5 rows × 1 columns

[5 rows x 1 columns in total]
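
Conceptually, the split shuffles the row indices and partitions them into an 80/20 divide. The following is a pure-Python sketch of that idea (illustrative names, not the bigframes implementation, which performs the split inside BigQuery):

```python
import random

def simple_train_test_split(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices, then carve off the first test_size fraction
    as the test set and keep the rest for training."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# The cleaned penguins table has 334 rows (267 train + 67 test above)
train_idx, test_idx = simple_train_test_split(334, test_size=0.2)
```

Because the indices are shuffled before partitioning, each subset gets a random, non-overlapping sample of rows, which is why the X/y pairs above stay aligned by penguin_id.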

Estimators#

Following scikit-learn, all learning components are "estimators": objects that can learn from training data and then apply what they learned to new data. Estimators share the following patterns:

  • a constructor that takes a list of parameters

  • a standard string representation that shows the class name and all non-default parameters, e.g. LinearRegression(fit_intercept=False)

  • a .fit(..) method to fit the estimator to training data

These estimators can be further broken down into two main subtypes:

  1. Transformers

  2. Predictors

Let’s walk through each of these with our example model.

Transformers#

Transformers are estimators used to prepare data for consumption by other estimators ("preprocessing"). In addition to .fit(...), a transformer implements a .transform(...) method, which applies a transformation based on what was computed during .fit(...). With this pattern, dynamic preprocessing steps can be applied consistently to both training and test/production data.

An example of a transformer is bigframes.ml.preprocessing.StandardScaler, which rescales a dataset to have a mean of zero and a standard deviation of one:

from bigframes.ml.preprocessing import StandardScaler

# StandardScaler will only work on numeric columns
numeric_columns = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]

scaler = StandardScaler()
scaler.fit(X_train[numeric_columns])

# Now, StandardScaler should transform the numbers to have a mean of zero
# and a standard deviation of one:
scaler.transform(X_train[numeric_columns])
standard_scaled_culmen_length_mm standard_scaled_culmen_depth_mm standard_scaled_flipper_length_mm
penguin_id
0 1.20778 -0.651531 1.772656
2 -0.455602 0.662855 0.100476
3 -0.967412 -0.095445 -0.917372
4 0.476623 -1.207617 1.191028
5 -1.625454 0.359535 -0.626559
7 -0.345929 -1.86481 0.682104
8 0.842202 -1.561491 1.409139
9 0.348671 0.865068 -0.263041
10 0.933596 1.218941 0.827511
11 -1.460943 -0.297658 -0.771966
12 1.317454 -0.449318 1.409139
13 -0.236255 -1.763704 0.900214
14 0.549739 -0.297658 -0.626559
16 0.970154 -1.005404 1.481842
17 -1.058807 -0.348211 -0.190338
18 1.354012 -1.510937 1.263732
19 -0.053466 -1.662597 1.191028
20 -0.199697 -1.510937 0.609401
21 1.152943 0.763962 -0.190338
22 -1.205038 0.308982 -0.699262
24 -0.784623 1.775028 -0.699262
25 -0.83946 1.724474 -0.771966
26 -0.620113 0.359535 -0.990076
27 0.330392 -0.095445 -0.408448
29 2.194842 -0.095445 1.990767

25 rows × 3 columns

[267 rows x 3 columns in total]

We can then repeat this transformation on the test data:

scaler.transform(X_test[numeric_columns])
standard_scaled_culmen_length_mm standard_scaled_culmen_depth_mm standard_scaled_flipper_length_mm
penguin_id
1 0.220718 -1.359277 1.045621
15 -0.510439 0.157322 -0.771966
28 -1.058807 0.713408 -0.771966
32 1.463685 1.168388 0.39129
33 -0.254534 0.056215 -0.990076
34 -0.510439 0.460642 0.318587
37 1.354012 0.511195 -0.263041
41 -0.674949 -0.095445 -1.789814
47 -1.168481 0.662855 -0.117634
52 0.458344 0.308982 -0.699262
56 -1.040528 0.460642 -1.135483
57 -0.967412 0.005662 -0.117634
62 0.988433 -0.752638 1.191028
65 1.756148 1.370601 0.318587
67 0.677691 -1.359277 1.045621
75 -1.113644 1.421155 -0.771966
81 0.677691 0.561748 -0.408448
89 -0.857739 0.713408 -0.771966
92 -0.802902 0.308982 -0.917372
93 -0.309371 1.168388 -0.263041
96 -0.309371 0.662855 -1.499
100 -0.912576 0.814515 -0.771966
101 0.549739 -1.308724 1.554546
102 -0.126582 0.662855 -0.626559
107 1.20778 -1.005404 1.118325

25 rows × 3 columns

[67 rows x 3 columns in total]

Composing transformers#

To process data where different columns need different preprocessors, bigframes.ml.compose.ColumnTransformer can be employed.

Let's create an aggregate transform that applies StandardScaler to the numeric columns and OneHotEncoder to the string columns.

from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import OneHotEncoder

# Create an aggregate transform that applies StandardScaler to the numeric columns,
# and OneHotEncoder to the string columns
preproc = ColumnTransformer([
    ("scale", StandardScaler(), ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]),
    ("encode", OneHotEncoder(), ["species", "sex", "island"])])

# Now we can fit all columns of the training data
preproc.fit(X_train)

processed_X_train = preproc.transform(X_train)
processed_X_test = preproc.transform(X_test)

# View the processed training data
processed_X_train
onehotencoded_island standard_scaled_culmen_length_mm standard_scaled_culmen_depth_mm standard_scaled_flipper_length_mm onehotencoded_sex onehotencoded_species
penguin_id
0 [{'index': 1, 'value': 1.0}] 1.20778 -0.651531 1.772656 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
2 [{'index': 3, 'value': 1.0}] -0.455602 0.662855 0.100476 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
3 [{'index': 3, 'value': 1.0}] -0.967412 -0.095445 -0.917372 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
4 [{'index': 1, 'value': 1.0}] 0.476623 -1.207617 1.191028 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
5 [{'index': 1, 'value': 1.0}] -1.625454 0.359535 -0.626559 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
7 [{'index': 1, 'value': 1.0}] -0.345929 -1.86481 0.682104 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
8 [{'index': 1, 'value': 1.0}] 0.842202 -1.561491 1.409139 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
9 [{'index': 3, 'value': 1.0}] 0.348671 0.865068 -0.263041 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
10 [{'index': 2, 'value': 1.0}] 0.933596 1.218941 0.827511 [{'index': 3, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
11 [{'index': 3, 'value': 1.0}] -1.460943 -0.297658 -0.771966 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
12 [{'index': 1, 'value': 1.0}] 1.317454 -0.449318 1.409139 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
13 [{'index': 1, 'value': 1.0}] -0.236255 -1.763704 0.900214 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
14 [{'index': 2, 'value': 1.0}] 0.549739 -0.297658 -0.626559 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
16 [{'index': 1, 'value': 1.0}] 0.970154 -1.005404 1.481842 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
17 [{'index': 1, 'value': 1.0}] -1.058807 -0.348211 -0.190338 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
18 [{'index': 1, 'value': 1.0}] 1.354012 -1.510937 1.263732 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
19 [{'index': 1, 'value': 1.0}] -0.053466 -1.662597 1.191028 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
20 [{'index': 1, 'value': 1.0}] -0.199697 -1.510937 0.609401 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
21 [{'index': 2, 'value': 1.0}] 1.152943 0.763962 -0.190338 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
22 [{'index': 2, 'value': 1.0}] -1.205038 0.308982 -0.699262 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
24 [{'index': 1, 'value': 1.0}] -0.784623 1.775028 -0.699262 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
25 [{'index': 3, 'value': 1.0}] -0.83946 1.724474 -0.771966 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
26 [{'index': 1, 'value': 1.0}] -0.620113 0.359535 -0.990076 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
27 [{'index': 2, 'value': 1.0}] 0.330392 -0.095445 -0.408448 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
29 [{'index': 1, 'value': 1.0}] 2.194842 -0.095445 1.990767 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]

25 rows × 6 columns

[267 rows x 6 columns in total]
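
The onehotencoded_* columns use a sparse representation: each value is a list holding only the index of the active category. A toy sketch of that encoding (illustrative, not the bigframes internals; the exact index numbering, such as reserving low indices for unseen values, is an assumption here):

```python
def one_hot_sparse(value, categories):
    """Encode a category as a sparse one-hot entry, mirroring the
    [{'index': k, 'value': 1.0}] layout seen in the output above.
    Known categories are numbered starting at 1."""
    index = categories.index(value) + 1
    return [{"index": index, "value": 1.0}]

islands = ["Biscoe", "Dream", "Torgersen"]
one_hot_sparse("Dream", islands)  # [{'index': 2, 'value': 1.0}]
```

Storing only the active index keeps the encoded columns compact even when a categorical feature has many distinct values.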

Predictors#

Predictors are estimators that learn and make predictions. In addition to .fit(...), the predictor implements a .predict(...) method, which will use what was learned during .fit(...) to predict some output.

Predictors can be further broken down into two categories:

  • Supervised predictors

  • Unsupervised predictors
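
The fit/predict pattern common to both can be illustrated with a toy predictor (pure Python, not part of bigframes) that simply learns the mean of the training targets:

```python
class MeanPredictor:
    """A minimal 'predictor': .fit() learns the mean of the training
    targets, and .predict() returns that mean for every new input."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

# Hypothetical body-mass targets (g) for three training rows
model = MeanPredictor().fit([[1], [2], [3]], [4200.0, 3800.0, 5200.0])
model.predict([[4], [5]])  # [4400.0, 4400.0]
```

Real predictors like LinearRegression learn far richer structure during .fit(...), but they expose exactly this same two-method surface.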

Supervised predictors#

Supervised learning is when we train a model on input-output pairs and then ask it to predict the output for new inputs. An example of such a predictor is bigframes.ml.linear_model.LinearRegression.

from bigframes.ml.linear_model import LinearRegression

linreg = LinearRegression()

# Learn from the training data how to predict output y
linreg.fit(processed_X_train, y_train)

# Predict y for the test data
predicted_y_test = linreg.predict(processed_X_test)

# View predictions
predicted_y_test
predicted_body_mass_g onehotencoded_island standard_scaled_culmen_length_mm standard_scaled_culmen_depth_mm standard_scaled_flipper_length_mm onehotencoded_sex onehotencoded_species
penguin_id
1 4772.376044 [{'index': 1, 'value': 1.0}] 0.220718 -1.359277 1.045621 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
15 3883.373922 [{'index': 2, 'value': 1.0}] -0.510439 0.157322 -0.771966 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
28 3479.709088 [{'index': 2, 'value': 1.0}] -1.058807 0.713408 -0.771966 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
32 4223.853626 [{'index': 2, 'value': 1.0}] 1.463685 1.168388 0.39129 [{'index': 3, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
33 3197.623474 [{'index': 2, 'value': 1.0}] -0.254534 0.056215 -0.990076 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
34 4155.26742 [{'index': 2, 'value': 1.0}] -0.510439 0.460642 0.318587 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
37 3991.314095 [{'index': 2, 'value': 1.0}] 1.354012 0.511195 -0.263041 [{'index': 3, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
41 3232.648242 [{'index': 3, 'value': 1.0}] -0.674949 -0.095445 -1.789814 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
47 4017.740788 [{'index': 2, 'value': 1.0}] -1.168481 0.662855 -0.117634 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
52 3365.080596 [{'index': 2, 'value': 1.0}] 0.458344 0.308982 -0.699262 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
56 3791.332002 [{'index': 1, 'value': 1.0}] -1.040528 0.460642 -1.135483 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
57 3547.892992 [{'index': 1, 'value': 1.0}] -0.967412 0.005662 -0.117634 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
62 5372.087702 [{'index': 1, 'value': 1.0}] 0.988433 -0.752638 1.191028 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
65 4263.232169 [{'index': 2, 'value': 1.0}] 1.756148 1.370601 0.318587 [{'index': 3, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
67 5234.45894 [{'index': 1, 'value': 1.0}] 0.677691 -1.359277 1.045621 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
75 3979.314516 [{'index': 1, 'value': 1.0}] -1.113644 1.421155 -0.771966 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
81 3481.331391 [{'index': 2, 'value': 1.0}] 0.677691 0.561748 -0.408448 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
89 3915.240555 [{'index': 2, 'value': 1.0}] -0.857739 0.713408 -0.771966 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
92 3425.563946 [{'index': 2, 'value': 1.0}] -0.802902 0.308982 -0.917372 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
93 4141.497717 [{'index': 1, 'value': 1.0}] -0.309371 1.168388 -0.263041 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
96 3394.72289 [{'index': 2, 'value': 1.0}] -0.309371 0.662855 -1.499 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
100 3507.226918 [{'index': 2, 'value': 1.0}] -0.912576 0.814515 -0.771966 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
101 4922.286202 [{'index': 1, 'value': 1.0}] 0.549739 -1.308724 1.554546 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
102 4016.243221 [{'index': 2, 'value': 1.0}] -0.126582 0.662855 -0.626559 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
107 4933.655362 [{'index': 1, 'value': 1.0}] 1.20778 -1.005404 1.118325 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]

25 rows × 7 columns

[67 rows x 7 columns in total]

Unsupervised predictors#

In unsupervised learning, there are no known outputs in the training data; instead, the model learns from the input data alone and predicts something else. An example of an unsupervised predictor is bigframes.ml.cluster.KMeans, which learns how to fit input data to a target number of clusters.

from bigframes.ml.cluster import KMeans

# Specify KMeans with four clusters
kmeans = KMeans(n_clusters=4)

# Fit data
kmeans.fit(processed_X_train)

# View predictions
kmeans.predict(processed_X_test)
CENTROID_ID NEAREST_CENTROIDS_DISTANCE onehotencoded_island standard_scaled_culmen_length_mm standard_scaled_culmen_depth_mm standard_scaled_flipper_length_mm onehotencoded_sex onehotencoded_species
penguin_id
1 3 [{'CENTROID_ID': 3, 'DISTANCE': 0.857057881337... [{'index': 1, 'value': 1.0}] 0.220718 -1.359277 1.045621 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
15 4 [{'CENTROID_ID': 4, 'DISTANCE': 1.181613302004... [{'index': 2, 'value': 1.0}] -0.510439 0.157322 -0.771966 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
28 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.006856853050... [{'index': 2, 'value': 1.0}] -1.058807 0.713408 -0.771966 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
32 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.237504384283... [{'index': 2, 'value': 1.0}] 1.463685 1.168388 0.39129 [{'index': 3, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
33 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.656439702919... [{'index': 2, 'value': 1.0}] -0.254534 0.056215 -0.990076 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
34 4 [{'CENTROID_ID': 4, 'DISTANCE': 1.343792119214... [{'index': 2, 'value': 1.0}] -0.510439 0.460642 0.318587 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
37 2 [{'CENTROID_ID': 2, 'DISTANCE': 0.816670297369... [{'index': 2, 'value': 1.0}] 1.354012 0.511195 -0.263041 [{'index': 3, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
41 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.317560921596... [{'index': 3, 'value': 1.0}] -0.674949 -0.095445 -1.789814 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
47 4 [{'CENTROID_ID': 4, 'DISTANCE': 1.135112005343... [{'index': 2, 'value': 1.0}] -1.168481 0.662855 -0.117634 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
52 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.004096945181... [{'index': 2, 'value': 1.0}] 0.458344 0.308982 -0.699262 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
56 4 [{'CENTROID_ID': 4, 'DISTANCE': 1.218648668822... [{'index': 1, 'value': 1.0}] -1.040528 0.460642 -1.135483 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
57 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.238466630273... [{'index': 1, 'value': 1.0}] -0.967412 0.005662 -0.117634 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
62 3 [{'CENTROID_ID': 3, 'DISTANCE': 0.876984617451... [{'index': 1, 'value': 1.0}] 0.988433 -0.752638 1.191028 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
65 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.439604004538... [{'index': 2, 'value': 1.0}] 1.756148 1.370601 0.318587 [{'index': 3, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
67 3 [{'CENTROID_ID': 3, 'DISTANCE': 0.763112987694... [{'index': 1, 'value': 1.0}] 0.677691 -1.359277 1.045621 [{'index': 3, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
75 4 [{'CENTROID_ID': 4, 'DISTANCE': 1.075788925734... [{'index': 1, 'value': 1.0}] -1.113644 1.421155 -0.771966 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
81 2 [{'CENTROID_ID': 2, 'DISTANCE': 0.777307801541... [{'index': 2, 'value': 1.0}] 0.677691 0.561748 -0.408448 [{'index': 2, 'value': 1.0}] [{'index': 2, 'value': 1.0}]
89 4 [{'CENTROID_ID': 4, 'DISTANCE': 0.891303183824... [{'index': 2, 'value': 1.0}] -0.857739 0.713408 -0.771966 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
92 1 [{'CENTROID_ID': 1, 'DISTANCE': 0.934676470689... [{'index': 2, 'value': 1.0}] -0.802902 0.308982 -0.917372 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
93 4 [{'CENTROID_ID': 4, 'DISTANCE': 0.984620018517... [{'index': 1, 'value': 1.0}] -0.309371 1.168388 -0.263041 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
96 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.446939975674... [{'index': 2, 'value': 1.0}] -0.309371 0.662855 -1.499 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
100 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.101117711572... [{'index': 2, 'value': 1.0}] -0.912576 0.814515 -0.771966 [{'index': 2, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
101 3 [{'CENTROID_ID': 3, 'DISTANCE': 0.823832007899... [{'index': 1, 'value': 1.0}] 0.549739 -1.308724 1.554546 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]
102 4 [{'CENTROID_ID': 4, 'DISTANCE': 0.995348310182... [{'index': 2, 'value': 1.0}] -0.126582 0.662855 -0.626559 [{'index': 3, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
107 3 [{'CENTROID_ID': 3, 'DISTANCE': 0.930021405831... [{'index': 1, 'value': 1.0}] 1.20778 -1.005404 1.118325 [{'index': 2, 'value': 1.0}] [{'index': 3, 'value': 1.0}]

25 rows × 8 columns

[67 rows x 8 columns in total]
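
The CENTROID_ID column is a nearest-centroid assignment: each row is labeled with the cluster whose centroid is closest in Euclidean distance. A pure-Python sketch of that assignment step, using made-up 2-D centroids for illustration:

```python
import math

def nearest_centroid(point, centroids):
    """Return the (1-based) id of the centroid closest to `point`."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    return distances.index(min(distances)) + 1

# Three hypothetical centroids in a 2-D standardized feature space
centroids = [(-1.0, 0.5), (1.2, 0.9), (0.7, -1.3)]
nearest_centroid((0.6, -1.0), centroids)  # 3
```

The NEAREST_CENTROIDS_DISTANCE column in the output above reports those per-centroid distances, which is useful for gauging how confidently a row belongs to its cluster.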

Pipelines#

Transformers and predictors can be chained into a single estimator component using bigframes.ml.pipeline.Pipeline:

from bigframes.ml.pipeline import Pipeline

pipeline = Pipeline([
  ('preproc', preproc),
  ('linreg', linreg)
])

# Print our pipeline
pipeline
Pipeline(steps=[('preproc',
                 ColumnTransformer(transformers=[('scale', StandardScaler(),
                                                  ['culmen_length_mm',
                                                   'culmen_depth_mm',
                                                   'flipper_length_mm']),
                                                 ('encode', OneHotEncoder(),
                                                  ['species', 'sex',
                                                   'island'])])),
                ('linreg', LinearRegression())])

The pipeline simplifies the workflow by applying each of its component steps automatically:

pipeline.fit(X_train, y_train)

predicted_y_test = pipeline.predict(X_test)
predicted_y_test
predicted_body_mass_g island culmen_length_mm culmen_depth_mm flipper_length_mm sex species
penguin_id
1 4772.374547 Biscoe 45.1 14.5 215.0 FEMALE Gentoo penguin (Pygoscelis papua)
15 3883.371052 Dream 41.1 17.5 190.0 MALE Adelie Penguin (Pygoscelis adeliae)
28 3479.706166 Dream 38.1 18.6 190.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
32 4223.851137 Dream 51.9 19.5 206.0 MALE Chinstrap penguin (Pygoscelis antarctica)
33 3197.620461 Dream 42.5 17.3 187.0 FEMALE Chinstrap penguin (Pygoscelis antarctica)
34 4155.265191 Dream 41.1 18.1 205.0 MALE Adelie Penguin (Pygoscelis adeliae)
37 3991.311319 Dream 51.3 18.2 197.0 MALE Chinstrap penguin (Pygoscelis antarctica)
41 3232.644783 Torgersen 40.2 17.0 176.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
47 4017.738303 Dream 37.5 18.5 199.0 MALE Adelie Penguin (Pygoscelis adeliae)
52 3365.077659 Dream 46.4 17.8 191.0 FEMALE Chinstrap penguin (Pygoscelis antarctica)
56 3791.328893 Biscoe 38.2 18.1 185.0 MALE Adelie Penguin (Pygoscelis adeliae)
57 3547.890609 Biscoe 38.6 17.2 199.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
62 5372.086117 Biscoe 49.3 15.7 217.0 MALE Gentoo penguin (Pygoscelis papua)
65 4263.229571 Dream 53.5 19.9 205.0 MALE Chinstrap penguin (Pygoscelis antarctica)
67 5234.457401 Biscoe 47.6 14.5 215.0 MALE Gentoo penguin (Pygoscelis papua)
75 3979.311469 Biscoe 37.8 20.0 190.0 MALE Adelie Penguin (Pygoscelis adeliae)
81 3481.328573 Dream 47.6 18.3 195.0 FEMALE Chinstrap penguin (Pygoscelis antarctica)
89 3915.237615 Dream 39.2 18.6 190.0 MALE Adelie Penguin (Pygoscelis adeliae)
92 3425.560982 Dream 39.5 17.8 188.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
93 4141.494969 Biscoe 42.2 19.5 197.0 MALE Adelie Penguin (Pygoscelis adeliae)
96 3394.719445 Dream 42.2 18.5 180.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
100 3507.223965 Dream 38.9 18.8 190.0 FEMALE Adelie Penguin (Pygoscelis adeliae)
101 4922.284991 Biscoe 46.9 14.6 222.0 FEMALE Gentoo penguin (Pygoscelis papua)
102 4016.240318 Dream 43.2 18.5 192.0 MALE Adelie Penguin (Pygoscelis adeliae)
107 4933.653758 Biscoe 50.5 15.2 216.0 FEMALE Gentoo penguin (Pygoscelis papua)

25 rows × 7 columns

[67 rows x 7 columns in total]

On the backend, a pipeline is compiled into a single BQML model with an embedded TRANSFORM step.
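For intuition, the resulting model resembles a BQML `CREATE MODEL` statement with a `TRANSFORM` clause. The DDL below is a hedged illustration of that shape only; the dataset name, column list, and source query are placeholders, and the exact SQL that BigQuery DataFrames generates may differ:

```sql
-- Illustrative only: rough shape of a pipeline compiled to a single BQML model.
CREATE OR REPLACE MODEL `my_dataset.penguins_pipeline`
  TRANSFORM(
    ML.STANDARD_SCALER(culmen_length_mm) OVER () AS culmen_length_mm,
    ML.STANDARD_SCALER(culmen_depth_mm) OVER () AS culmen_depth_mm,
    ML.STANDARD_SCALER(flipper_length_mm) OVER () AS flipper_length_mm,
    island, sex, species,  -- categorical columns, encoded per the pipeline
    body_mass_g            -- label column passed through
  )
  OPTIONS (model_type = 'linear_reg', input_label_cols = ['body_mass_g'])
AS SELECT /* training columns */ FROM /* source table */;
```

Because the preprocessing lives inside the model, callers of `ML.PREDICT` can pass raw, untransformed rows.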

Evaluating results#

Some models include a convenient .score(X, y) method for evaluation with a preset suite of metrics:

# In the case of a pipeline, this will be equivalent to calling .score on the contained LinearRegression
pipeline.score(X_test, y_test)
Query job c098e1d1-b3ed-4ec5-94c7-6ba3b2b59e3f is DONE. 29.6 kB processed. Open Job
Query job 035234b0-537a-44ce-adff-bb51c40b4ffa is DONE. 0 Bytes processed. Open Job
Query job b4a2a367-3e06-4fa3-9f00-bdbca884cfdd is DONE. 48 Bytes processed. Open Job
mean_absolute_error mean_squared_error mean_squared_log_error median_absolute_error r2_score explained_variance
0 225.883512 77765.989281 0.004457 179.548041 0.873166 0.873315

1 rows × 6 columns

[1 rows x 6 columns in total]

For a more general approach, the bigframes.ml.metrics module provides standalone metric functions:

from bigframes.ml.metrics import r2_score

r2_score(y_test, predicted_y_test["predicted_body_mass_g"])
Query job 20ec1716-3e8e-4d3f-ba08-1f7b9970ce3f is DONE. 28.9 kB processed. Open Job
Query job 6f628f3b-62df-4a5a-8e05-0b313db0ed07 is DONE. 28.9 kB processed. Open Job
Query job c4eee1e5-146f-4a52-8499-83fe5f701f53 is DONE. 30.0 kB processed. Open Job
0.8731660699616813
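The value returned by r2_score is the coefficient of determination, R² = 1 − SS_res / SS_tot. A minimal pure-Python sketch with made-up numbers (not the penguins data) illustrates the formula:

```python
# Coefficient of determination: R^2 = 1 - SS_res / SS_tot.
# Toy values for illustration only; not taken from the penguins dataset.
y_true = [3000.0, 3500.0, 4000.0, 4500.0]
y_pred = [2900.0, 3600.0, 3900.0, 4600.0]

mean_true = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)  # → 0.968
```

An R² of 1.0 means the predictions explain all of the variance in the labels; the ~0.87 above indicates the pipeline explains most, but not all, of the variation in penguin body mass.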

Save to BigQuery#

Estimators can be saved to BigQuery as BQML models and loaded again in the future.

Saving requires the bigquery.tables.create permission, and loading requires the bigquery.models.getMetadata permission. These permissions can be granted at either the project level or the dataset level.

If you have these permissions, uncomment and run the code in the following cells.

linreg.to_gbq(f"{DATASET}.penguins_model", replace=True)
Copy job 06c2b62d-a7aa-46a5-a04a-2f189bafc5ee is DONE. Open Job
Pipeline(steps=[('transform',
                 ColumnTransformer(transformers=[('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'island'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_length_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_depth_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'flipper_length_mm'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'sex'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'species')])),
                ('estimator',
                 LinearRegression(optimize_strategy='NORMAL_EQUATION'))])
bpd.read_gbq_model(f"{DATASET}.penguins_model")
Pipeline(steps=[('transform',
                 ColumnTransformer(transformers=[('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'island'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_length_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_depth_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'flipper_length_mm'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'sex'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'species')])),
                ('estimator',
                 LinearRegression(optimize_strategy='NORMAL_EQUATION'))])

We can also save the pipeline to BigQuery. BigQuery will save this as a single model, with the pre-processing steps embedded in the TRANSFORM property:

pipeline.to_gbq(f"{DATASET}.penguins_pipeline", replace=True)
Copy job a0ed8c1b-3a3f-4995-853c-e151d41560d7 is DONE. Open Job
Pipeline(steps=[('transform',
                 ColumnTransformer(transformers=[('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'island'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_length_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_depth_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'flipper_length_mm'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'sex'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'species')])),
                ('estimator',
                 LinearRegression(optimize_strategy='NORMAL_EQUATION'))])
bpd.read_gbq_model(f"{DATASET}.penguins_pipeline")
Pipeline(steps=[('transform',
                 ColumnTransformer(transformers=[('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'island'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_length_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_depth_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'flipper_length_mm'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'sex'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'species')])),
                ('estimator',
                 LinearRegression(optimize_strategy='NORMAL_EQUATION'))])

Summary and next steps#

You’ve completed an end-to-end machine learning workflow using the built-in capabilities of BigQuery DataFrames.

Learn more about BigQuery DataFrames in the documentation and find more sample notebooks in the GitHub repo.

Cleaning up#

To clean up all Google Cloud resources used in this project, you can delete the Google Cloud project you used for the tutorial.

Otherwise, you can uncomment the remaining cells and run them to delete the individual resources you created in this tutorial:

# # Delete the BQML models
# MODEL_NAME = f"{PROJECT_ID}:{DATASET}.penguins_model"
# ! bq rm -f --model {MODEL_NAME}
# PIPELINE_NAME = f"{PROJECT_ID}:{DATASET}.penguins_pipeline"
# ! bq rm -f --model {PIPELINE_NAME}