# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Machine Learning Fundamentals with BigQuery DataFrames#
Run in Colab
NOTE: This notebook has been tested in the following environment:
Python version = 3.10
Overview#
The bigframes.ml module implements Scikit-Learn’s machine learning API in
BigQuery DataFrames. It exposes BigQuery’s ML capabilities in a simple, popular
API that works seamlessly with the rest of the BigQuery DataFrames API.
Learn more about BigQuery DataFrames.
Objective#
In this tutorial, you will walk through an end-to-end machine learning workflow using BigQuery DataFrames. You will load data, manipulate and prepare it for model training, build supervised and unsupervised models, and evaluate and save a model for future use; all using built-in BigQuery DataFrames functionality.
Dataset#
This tutorial uses the penguins table (a BigQuery public dataset), which contains data on a set of penguins including species, island of residence, weight, culmen length and depth, flipper length, and sex.
Costs#
This tutorial uses billable components of Google Cloud:
BigQuery (storage and compute)
BigQuery ML
Learn about BigQuery storage pricing, BigQuery compute pricing, and BigQuery ML pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.
Installation#
Depending on your Jupyter environment, you might have to install packages.
Vertex AI Workbench or Colab
Do nothing; the BigQuery DataFrames package is already installed.
Local JupyterLab instance
Uncomment and run the following cell:
# !pip install bigframes
Before you begin#
Complete the tasks in this section to set up your environment.
Set up your Google Cloud project#
The following steps are required, regardless of your notebook environment.
Select or create a Google Cloud project. When you first create an account, you get a $300 credit towards your compute/storage costs.
Click here to enable the BigQuery API.
If you are running this notebook locally, install the Cloud SDK.
Set your project ID#
If you don’t know your project ID, try the following:
Run gcloud config list.
Run gcloud projects list.
See the support page: Locate the project ID.
PROJECT_ID = "" # @param {type:"string"}
# Set the project id
! gcloud config set project {PROJECT_ID}
Updated property [core/project].
Set the region#
You can also change the REGION variable used by BigQuery. Learn more about BigQuery regions.
REGION = "US" # @param {type: "string"}
Set the dataset ID#
As part of this notebook, you will save BigQuery ML models to your Google Cloud project, which requires a dataset. Create the dataset, if needed, and provide the ID here as the DATASET variable used by BigQuery. Learn how to create a BigQuery dataset.
DATASET = "" # @param {type: "string"}
Authenticate your Google Cloud account#
Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below.
Vertex AI Workbench
Do nothing; you are already authenticated.
Local JupyterLab instance
Uncomment and run the following cell:
# ! gcloud auth login
Colab
Uncomment and run the following cell:
# from google.colab import auth
# auth.authenticate_user()
Import libraries#
import bigframes.pandas as bpd
Set BigQuery DataFrames options#
# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = PROJECT_ID
# Note: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bpd.options.bigquery.location = REGION
If you want to reset the location of the created DataFrame or Series objects, reset the session by executing bpd.reset_session(). After that, you can reuse bpd.options.bigquery.location to specify another location.
Import data into BigQuery DataFrames#
You can create a DataFrame by reading data from a BigQuery table.
df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
df = df.dropna()
# BigQuery DataFrames creates a default numbered index, which we can give a name
df.index.name = "penguin_id"
Take a look at a few rows of the DataFrame:
df.head()
| species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex | |
|---|---|---|---|---|---|---|---|
| penguin_id | |||||||
| 0 | Gentoo penguin (Pygoscelis papua) | Biscoe | 50.5 | 15.9 | 225.0 | 5400.0 | MALE |
| 1 | Gentoo penguin (Pygoscelis papua) | Biscoe | 45.1 | 14.5 | 215.0 | 5000.0 | FEMALE |
| 2 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 41.4 | 18.5 | 202.0 | 3875.0 | MALE |
| 3 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 38.6 | 17.0 | 188.0 | 2900.0 | FEMALE |
| 4 | Gentoo penguin (Pygoscelis papua) | Biscoe | 46.5 | 14.8 | 217.0 | 5200.0 | FEMALE |
5 rows × 7 columns
Clean and prepare data#
We’re going to start with supervised learning, where a linear regression model will learn to predict the body mass (output variable y) using input features such as flipper length, sex, species, and more (features X).
# Isolate input features and output variable into DataFrames
X = df[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex', 'species']]
y = df[['body_mass_g']]
Part of preparing data for a machine learning task is splitting it into training and test subsets to check that the solution is not overfitting. By default, BigQuery ML (BQML) manages this split for you automatically, but it also supports splitting out your training data manually.
Performing a manual data split can be done with bigframes.ml.model_selection.train_test_split like so:
from bigframes.ml.model_selection import train_test_split
# This will split X and y into test and training sets, with 20% of the rows in the test set,
# and the rest in the training set
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2)
# Show the shape of the data after the split
print(f"""X_train shape: {X_train.shape}
X_test shape: {X_test.shape}
y_train shape: {y_train.shape}
y_test shape: {y_test.shape}""")
X_train shape: (267, 6)
X_test shape: (67, 6)
y_train shape: (267, 1)
y_test shape: (67, 1)
If we look at the data, we can see that random rows were selected for each side of the split:
X_test.head(5)
| island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | sex | species | |
|---|---|---|---|---|---|---|
| penguin_id | ||||||
| 188 | Dream | 51.5 | 18.7 | 187.0 | MALE | Chinstrap penguin (Pygoscelis antarctica) |
| 251 | Biscoe | 49.5 | 16.1 | 224.0 | MALE | Gentoo penguin (Pygoscelis papua) |
| 231 | Biscoe | 45.7 | 13.9 | 214.0 | FEMALE | Gentoo penguin (Pygoscelis papua) |
| 271 | Biscoe | 59.6 | 17.0 | 230.0 | MALE | Gentoo penguin (Pygoscelis papua) |
| 128 | Biscoe | 38.8 | 17.2 | 180.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
5 rows × 6 columns
Note that the y_test data matches the same rows in X_test:
y_test.head(5)
| body_mass_g | |
|---|---|
| penguin_id | |
| 188 | 3250.0 |
| 251 | 5650.0 |
| 231 | 4400.0 |
| 271 | 6050.0 |
| 128 | 3800.0 |
5 rows × 1 columns
Estimators#
Following scikit-learn, all learning components are “estimators”: objects that can learn from training data and then apply themselves to new data. Estimators share the following patterns:
a constructor that takes a list of parameters
a standard string representation that shows the class name and all non-default parameters, e.g. LinearRegression(fit_intercept=False)
a .fit(...) method to fit the estimator to training data
These estimators can be further broken down into two main subtypes:
Transformers
Predictors
Let’s walk through each of these with our example model.
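Before diving in, the shared pattern can be sketched in plain Python with a hypothetical transformer. This is an illustration of the interface only, not bigframes code; the class name and parameter are made up:

```python
# A minimal sketch of the estimator pattern: a constructor that takes
# parameters, a repr showing non-default parameters, and a .fit() that
# returns self so calls can be chained.
class CenteringScaler:
    """Hypothetical transformer: subtracts the mean learned during fit."""

    def __init__(self, with_mean=True):
        self.with_mean = with_mean

    def __repr__(self):
        # Show only non-default parameters, scikit-learn style
        return "CenteringScaler()" if self.with_mean else "CenteringScaler(with_mean=False)"

    def fit(self, values):
        self.mean_ = sum(values) / len(values) if self.with_mean else 0.0
        return self

    def transform(self, values):
        return [v - self.mean_ for v in values]

scaler = CenteringScaler().fit([1.0, 2.0, 3.0])
print(scaler)                    # CenteringScaler()
print(scaler.transform([4.0]))  # [2.0]
```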
Transformers#
Transformers are estimators that are used to prepare data for consumption by other estimators (‘preprocessing’). In addition to .fit(...), a transformer implements a .transform(...) method, which applies a transformation based on what was computed during .fit(...). With this pattern, dynamic preprocessing steps can be applied consistently to both training and test/production data.
An example of a transformer is bigframes.ml.preprocessing.StandardScaler, which rescales a dataset to have a mean of zero and a standard deviation of one:
from bigframes.ml.preprocessing import StandardScaler
# StandardScaler will only work on numeric columns
numeric_columns = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]
scaler = StandardScaler()
scaler.fit(X_train[numeric_columns])
# Now, StandardScaler should transform the numbers to have a mean of zero
# and standard deviation of one:
scaler.transform(X_train[numeric_columns])
| standard_scaled_culmen_length_mm | standard_scaled_culmen_depth_mm | standard_scaled_flipper_length_mm | |
|---|---|---|---|
| penguin_id | |||
| 0 | 1.20778 | -0.651531 | 1.772656 |
| 2 | -0.455602 | 0.662855 | 0.100476 |
| 3 | -0.967412 | -0.095445 | -0.917372 |
| 4 | 0.476623 | -1.207617 | 1.191028 |
| 5 | -1.625454 | 0.359535 | -0.626559 |
| 7 | -0.345929 | -1.86481 | 0.682104 |
| 8 | 0.842202 | -1.561491 | 1.409139 |
| 9 | 0.348671 | 0.865068 | -0.263041 |
| 10 | 0.933596 | 1.218941 | 0.827511 |
| 11 | -1.460943 | -0.297658 | -0.771966 |
| 12 | 1.317454 | -0.449318 | 1.409139 |
| 13 | -0.236255 | -1.763704 | 0.900214 |
| 14 | 0.549739 | -0.297658 | -0.626559 |
| 16 | 0.970154 | -1.005404 | 1.481842 |
| 17 | -1.058807 | -0.348211 | -0.190338 |
| 18 | 1.354012 | -1.510937 | 1.263732 |
| 19 | -0.053466 | -1.662597 | 1.191028 |
| 20 | -0.199697 | -1.510937 | 0.609401 |
| 21 | 1.152943 | 0.763962 | -0.190338 |
| 22 | -1.205038 | 0.308982 | -0.699262 |
| 24 | -0.784623 | 1.775028 | -0.699262 |
| 25 | -0.83946 | 1.724474 | -0.771966 |
| 26 | -0.620113 | 0.359535 | -0.990076 |
| 27 | 0.330392 | -0.095445 | -0.408448 |
| 29 | 2.194842 | -0.095445 | 1.990767 |
25 rows × 3 columns
We can then repeat this transformation on the test data:
scaler.transform(X_test[numeric_columns])
| standard_scaled_culmen_length_mm | standard_scaled_culmen_depth_mm | standard_scaled_flipper_length_mm | |
|---|---|---|---|
| penguin_id | |||
| 1 | 0.220718 | -1.359277 | 1.045621 |
| 15 | -0.510439 | 0.157322 | -0.771966 |
| 28 | -1.058807 | 0.713408 | -0.771966 |
| 32 | 1.463685 | 1.168388 | 0.39129 |
| 33 | -0.254534 | 0.056215 | -0.990076 |
| 34 | -0.510439 | 0.460642 | 0.318587 |
| 37 | 1.354012 | 0.511195 | -0.263041 |
| 41 | -0.674949 | -0.095445 | -1.789814 |
| 47 | -1.168481 | 0.662855 | -0.117634 |
| 52 | 0.458344 | 0.308982 | -0.699262 |
| 56 | -1.040528 | 0.460642 | -1.135483 |
| 57 | -0.967412 | 0.005662 | -0.117634 |
| 62 | 0.988433 | -0.752638 | 1.191028 |
| 65 | 1.756148 | 1.370601 | 0.318587 |
| 67 | 0.677691 | -1.359277 | 1.045621 |
| 75 | -1.113644 | 1.421155 | -0.771966 |
| 81 | 0.677691 | 0.561748 | -0.408448 |
| 89 | -0.857739 | 0.713408 | -0.771966 |
| 92 | -0.802902 | 0.308982 | -0.917372 |
| 93 | -0.309371 | 1.168388 | -0.263041 |
| 96 | -0.309371 | 0.662855 | -1.499 |
| 100 | -0.912576 | 0.814515 | -0.771966 |
| 101 | 0.549739 | -1.308724 | 1.554546 |
| 102 | -0.126582 | 0.662855 | -0.626559 |
| 107 | 1.20778 | -1.005404 | 1.118325 |
25 rows × 3 columns
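What StandardScaler computes can be illustrated in plain Python. This is a sketch, not the bigframes implementation, and it uses the population standard deviation for simplicity (BQML's exact convention may differ); the key point is that the statistics come from the training data only and are then reused on the test data:

```python
# Hypothetical training values, e.g. flipper_length_mm
train = [181.0, 186.0, 217.0, 230.0, 176.0]

# Statistics are computed on the training data only
mean = sum(train) / len(train)
std = (sum((x - mean) ** 2 for x in train) / len(train)) ** 0.5

def standardize(values):
    return [(x - mean) / std for x in values]

# The training data itself ends up with mean ~0 and standard deviation ~1
scaled_train = standardize(train)

# Test rows are scaled with the *training* statistics, so their own
# mean need not be exactly zero
print(standardize([215.0]))
```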
Composing transformers#
To process data where different columns need different preprocessors, bigframes.ml.compose.ColumnTransformer can be employed.
Let’s create an aggregate transform that applies StandardScaler to the numeric columns and OneHotEncoder to the string columns.
from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import OneHotEncoder
# Create an aggregate transform that applies StandardScaler to the numeric columns,
# and OneHotEncoder to the string columns
preproc = ColumnTransformer([
("scale", StandardScaler(), ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]),
("encode", OneHotEncoder(), ["species", "sex", "island"])])
# Now we can fit all columns of the training data
preproc.fit(X_train)
processed_X_train = preproc.transform(X_train)
processed_X_test = preproc.transform(X_test)
# View the processed training data
processed_X_train
| onehotencoded_island | standard_scaled_culmen_length_mm | standard_scaled_culmen_depth_mm | standard_scaled_flipper_length_mm | onehotencoded_sex | onehotencoded_species | |
|---|---|---|---|---|---|---|
| penguin_id | ||||||
| 0 | [{'index': 1, 'value': 1.0}] | 1.20778 | -0.651531 | 1.772656 | [{'index': 3, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 2 | [{'index': 3, 'value': 1.0}] | -0.455602 | 0.662855 | 0.100476 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 3 | [{'index': 3, 'value': 1.0}] | -0.967412 | -0.095445 | -0.917372 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 4 | [{'index': 1, 'value': 1.0}] | 0.476623 | -1.207617 | 1.191028 | [{'index': 2, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 5 | [{'index': 1, 'value': 1.0}] | -1.625454 | 0.359535 | -0.626559 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 7 | [{'index': 1, 'value': 1.0}] | -0.345929 | -1.86481 | 0.682104 | [{'index': 2, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 8 | [{'index': 1, 'value': 1.0}] | 0.842202 | -1.561491 | 1.409139 | [{'index': 3, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 9 | [{'index': 3, 'value': 1.0}] | 0.348671 | 0.865068 | -0.263041 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 10 | [{'index': 2, 'value': 1.0}] | 0.933596 | 1.218941 | 0.827511 | [{'index': 3, 'value': 1.0}] | [{'index': 2, 'value': 1.0}] |
| 11 | [{'index': 3, 'value': 1.0}] | -1.460943 | -0.297658 | -0.771966 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 12 | [{'index': 1, 'value': 1.0}] | 1.317454 | -0.449318 | 1.409139 | [{'index': 3, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 13 | [{'index': 1, 'value': 1.0}] | -0.236255 | -1.763704 | 0.900214 | [{'index': 2, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 14 | [{'index': 2, 'value': 1.0}] | 0.549739 | -0.297658 | -0.626559 | [{'index': 2, 'value': 1.0}] | [{'index': 2, 'value': 1.0}] |
| 16 | [{'index': 1, 'value': 1.0}] | 0.970154 | -1.005404 | 1.481842 | [{'index': 3, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 17 | [{'index': 1, 'value': 1.0}] | -1.058807 | -0.348211 | -0.190338 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 18 | [{'index': 1, 'value': 1.0}] | 1.354012 | -1.510937 | 1.263732 | [{'index': 3, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 19 | [{'index': 1, 'value': 1.0}] | -0.053466 | -1.662597 | 1.191028 | [{'index': 2, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 20 | [{'index': 1, 'value': 1.0}] | -0.199697 | -1.510937 | 0.609401 | [{'index': 2, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 21 | [{'index': 2, 'value': 1.0}] | 1.152943 | 0.763962 | -0.190338 | [{'index': 2, 'value': 1.0}] | [{'index': 2, 'value': 1.0}] |
| 22 | [{'index': 2, 'value': 1.0}] | -1.205038 | 0.308982 | -0.699262 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 24 | [{'index': 1, 'value': 1.0}] | -0.784623 | 1.775028 | -0.699262 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 25 | [{'index': 3, 'value': 1.0}] | -0.83946 | 1.724474 | -0.771966 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 26 | [{'index': 1, 'value': 1.0}] | -0.620113 | 0.359535 | -0.990076 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 27 | [{'index': 2, 'value': 1.0}] | 0.330392 | -0.095445 | -0.408448 | [{'index': 2, 'value': 1.0}] | [{'index': 2, 'value': 1.0}] |
| 29 | [{'index': 1, 'value': 1.0}] | 2.194842 | -0.095445 | 1.990767 | [{'index': 3, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
25 rows × 6 columns
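The onehotencoded_* columns above hold arrays of {'index': i, 'value': v} structs, a sparse representation of a one-hot vector. Conceptually, one-hot encoding maps each category to a vocabulary index, as this plain-Python sketch shows (the exact index numbering BQML uses in the output above may differ from the zero-based numbering here):

```python
# Build a category -> index vocabulary from the observed values
def fit_vocabulary(values):
    return {cat: i for i, cat in enumerate(sorted(set(values)))}

# Encode one value as a dense one-hot vector; the sparse struct form
# carries the same information without storing the zeros
def one_hot(value, vocab):
    vec = [0.0] * len(vocab)
    vec[vocab[value]] = 1.0
    return vec

vocab = fit_vocabulary(["Biscoe", "Dream", "Torgersen", "Biscoe"])
print(vocab)                    # {'Biscoe': 0, 'Dream': 1, 'Torgersen': 2}
print(one_hot("Dream", vocab))  # [0.0, 1.0, 0.0]
```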
Predictors#
Predictors are estimators that learn and make predictions. In addition to .fit(...), the predictor implements a .predict(...) method, which will use what was learned during .fit(...) to predict some output.
Predictors can be further broken down into two categories:
Supervised predictors
Unsupervised predictors
Supervised predictors#
Supervised learning is when we train a model on input-output pairs, and then ask it to predict the output for new inputs. An example of such a predictor is bigframes.ml.linear_model.LinearRegression.
from bigframes.ml.linear_model import LinearRegression
linreg = LinearRegression()
# Learn from the training data how to predict output y
linreg.fit(processed_X_train, y_train)
# Predict y for the test data
predicted_y_test = linreg.predict(processed_X_test)
# View predictions
predicted_y_test
| predicted_body_mass_g | onehotencoded_island | standard_scaled_culmen_length_mm | standard_scaled_culmen_depth_mm | standard_scaled_flipper_length_mm | onehotencoded_sex | onehotencoded_species | |
|---|---|---|---|---|---|---|---|
| penguin_id | |||||||
| 1 | 4772.376044 | [{'index': 1, 'value': 1.0}] | 0.220718 | -1.359277 | 1.045621 | [{'index': 2, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 15 | 3883.373922 | [{'index': 2, 'value': 1.0}] | -0.510439 | 0.157322 | -0.771966 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 28 | 3479.709088 | [{'index': 2, 'value': 1.0}] | -1.058807 | 0.713408 | -0.771966 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 32 | 4223.853626 | [{'index': 2, 'value': 1.0}] | 1.463685 | 1.168388 | 0.39129 | [{'index': 3, 'value': 1.0}] | [{'index': 2, 'value': 1.0}] |
| 33 | 3197.623474 | [{'index': 2, 'value': 1.0}] | -0.254534 | 0.056215 | -0.990076 | [{'index': 2, 'value': 1.0}] | [{'index': 2, 'value': 1.0}] |
| 34 | 4155.26742 | [{'index': 2, 'value': 1.0}] | -0.510439 | 0.460642 | 0.318587 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 37 | 3991.314095 | [{'index': 2, 'value': 1.0}] | 1.354012 | 0.511195 | -0.263041 | [{'index': 3, 'value': 1.0}] | [{'index': 2, 'value': 1.0}] |
| 41 | 3232.648242 | [{'index': 3, 'value': 1.0}] | -0.674949 | -0.095445 | -1.789814 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 47 | 4017.740788 | [{'index': 2, 'value': 1.0}] | -1.168481 | 0.662855 | -0.117634 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 52 | 3365.080596 | [{'index': 2, 'value': 1.0}] | 0.458344 | 0.308982 | -0.699262 | [{'index': 2, 'value': 1.0}] | [{'index': 2, 'value': 1.0}] |
| 56 | 3791.332002 | [{'index': 1, 'value': 1.0}] | -1.040528 | 0.460642 | -1.135483 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 57 | 3547.892992 | [{'index': 1, 'value': 1.0}] | -0.967412 | 0.005662 | -0.117634 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 62 | 5372.087702 | [{'index': 1, 'value': 1.0}] | 0.988433 | -0.752638 | 1.191028 | [{'index': 3, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 65 | 4263.232169 | [{'index': 2, 'value': 1.0}] | 1.756148 | 1.370601 | 0.318587 | [{'index': 3, 'value': 1.0}] | [{'index': 2, 'value': 1.0}] |
| 67 | 5234.45894 | [{'index': 1, 'value': 1.0}] | 0.677691 | -1.359277 | 1.045621 | [{'index': 3, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 75 | 3979.314516 | [{'index': 1, 'value': 1.0}] | -1.113644 | 1.421155 | -0.771966 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 81 | 3481.331391 | [{'index': 2, 'value': 1.0}] | 0.677691 | 0.561748 | -0.408448 | [{'index': 2, 'value': 1.0}] | [{'index': 2, 'value': 1.0}] |
| 89 | 3915.240555 | [{'index': 2, 'value': 1.0}] | -0.857739 | 0.713408 | -0.771966 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 92 | 3425.563946 | [{'index': 2, 'value': 1.0}] | -0.802902 | 0.308982 | -0.917372 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 93 | 4141.497717 | [{'index': 1, 'value': 1.0}] | -0.309371 | 1.168388 | -0.263041 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 96 | 3394.72289 | [{'index': 2, 'value': 1.0}] | -0.309371 | 0.662855 | -1.499 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 100 | 3507.226918 | [{'index': 2, 'value': 1.0}] | -0.912576 | 0.814515 | -0.771966 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 101 | 4922.286202 | [{'index': 1, 'value': 1.0}] | 0.549739 | -1.308724 | 1.554546 | [{'index': 2, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 102 | 4016.243221 | [{'index': 2, 'value': 1.0}] | -0.126582 | 0.662855 | -0.626559 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 107 | 4933.655362 | [{'index': 1, 'value': 1.0}] | 1.20778 | -1.005404 | 1.118325 | [{'index': 2, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
25 rows × 7 columns
Unsupervised predictors#
In unsupervised learning, there are no known outputs in the training data; instead, the model learns from the input data alone and predicts something else. An example of an unsupervised predictor is bigframes.ml.cluster.KMeans, which learns how to fit input data to a target number of clusters.
from bigframes.ml.cluster import KMeans
# Specify KMeans with four clusters
kmeans = KMeans(n_clusters=4)
# Fit data
kmeans.fit(processed_X_train)
# View predictions
kmeans.predict(processed_X_test)
| CENTROID_ID | NEAREST_CENTROIDS_DISTANCE | onehotencoded_island | standard_scaled_culmen_length_mm | standard_scaled_culmen_depth_mm | standard_scaled_flipper_length_mm | onehotencoded_sex | onehotencoded_species | |
|---|---|---|---|---|---|---|---|---|
| penguin_id | ||||||||
| 1 | 3 | [{'CENTROID_ID': 3, 'DISTANCE': 0.857057881337... | [{'index': 1, 'value': 1.0}] | 0.220718 | -1.359277 | 1.045621 | [{'index': 2, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 15 | 4 | [{'CENTROID_ID': 4, 'DISTANCE': 1.181613302004... | [{'index': 2, 'value': 1.0}] | -0.510439 | 0.157322 | -0.771966 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 28 | 1 | [{'CENTROID_ID': 1, 'DISTANCE': 1.006856853050... | [{'index': 2, 'value': 1.0}] | -1.058807 | 0.713408 | -0.771966 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 32 | 2 | [{'CENTROID_ID': 2, 'DISTANCE': 1.237504384283... | [{'index': 2, 'value': 1.0}] | 1.463685 | 1.168388 | 0.39129 | [{'index': 3, 'value': 1.0}] | [{'index': 2, 'value': 1.0}] |
| 33 | 2 | [{'CENTROID_ID': 2, 'DISTANCE': 1.656439702919... | [{'index': 2, 'value': 1.0}] | -0.254534 | 0.056215 | -0.990076 | [{'index': 2, 'value': 1.0}] | [{'index': 2, 'value': 1.0}] |
| 34 | 4 | [{'CENTROID_ID': 4, 'DISTANCE': 1.343792119214... | [{'index': 2, 'value': 1.0}] | -0.510439 | 0.460642 | 0.318587 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 37 | 2 | [{'CENTROID_ID': 2, 'DISTANCE': 0.816670297369... | [{'index': 2, 'value': 1.0}] | 1.354012 | 0.511195 | -0.263041 | [{'index': 3, 'value': 1.0}] | [{'index': 2, 'value': 1.0}] |
| 41 | 1 | [{'CENTROID_ID': 1, 'DISTANCE': 1.317560921596... | [{'index': 3, 'value': 1.0}] | -0.674949 | -0.095445 | -1.789814 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 47 | 4 | [{'CENTROID_ID': 4, 'DISTANCE': 1.135112005343... | [{'index': 2, 'value': 1.0}] | -1.168481 | 0.662855 | -0.117634 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 52 | 2 | [{'CENTROID_ID': 2, 'DISTANCE': 1.004096945181... | [{'index': 2, 'value': 1.0}] | 0.458344 | 0.308982 | -0.699262 | [{'index': 2, 'value': 1.0}] | [{'index': 2, 'value': 1.0}] |
| 56 | 4 | [{'CENTROID_ID': 4, 'DISTANCE': 1.218648668822... | [{'index': 1, 'value': 1.0}] | -1.040528 | 0.460642 | -1.135483 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 57 | 1 | [{'CENTROID_ID': 1, 'DISTANCE': 1.238466630273... | [{'index': 1, 'value': 1.0}] | -0.967412 | 0.005662 | -0.117634 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 62 | 3 | [{'CENTROID_ID': 3, 'DISTANCE': 0.876984617451... | [{'index': 1, 'value': 1.0}] | 0.988433 | -0.752638 | 1.191028 | [{'index': 3, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 65 | 2 | [{'CENTROID_ID': 2, 'DISTANCE': 1.439604004538... | [{'index': 2, 'value': 1.0}] | 1.756148 | 1.370601 | 0.318587 | [{'index': 3, 'value': 1.0}] | [{'index': 2, 'value': 1.0}] |
| 67 | 3 | [{'CENTROID_ID': 3, 'DISTANCE': 0.763112987694... | [{'index': 1, 'value': 1.0}] | 0.677691 | -1.359277 | 1.045621 | [{'index': 3, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 75 | 4 | [{'CENTROID_ID': 4, 'DISTANCE': 1.075788925734... | [{'index': 1, 'value': 1.0}] | -1.113644 | 1.421155 | -0.771966 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 81 | 2 | [{'CENTROID_ID': 2, 'DISTANCE': 0.777307801541... | [{'index': 2, 'value': 1.0}] | 0.677691 | 0.561748 | -0.408448 | [{'index': 2, 'value': 1.0}] | [{'index': 2, 'value': 1.0}] |
| 89 | 4 | [{'CENTROID_ID': 4, 'DISTANCE': 0.891303183824... | [{'index': 2, 'value': 1.0}] | -0.857739 | 0.713408 | -0.771966 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 92 | 1 | [{'CENTROID_ID': 1, 'DISTANCE': 0.934676470689... | [{'index': 2, 'value': 1.0}] | -0.802902 | 0.308982 | -0.917372 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 93 | 4 | [{'CENTROID_ID': 4, 'DISTANCE': 0.984620018517... | [{'index': 1, 'value': 1.0}] | -0.309371 | 1.168388 | -0.263041 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 96 | 1 | [{'CENTROID_ID': 1, 'DISTANCE': 1.446939975674... | [{'index': 2, 'value': 1.0}] | -0.309371 | 0.662855 | -1.499 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 100 | 1 | [{'CENTROID_ID': 1, 'DISTANCE': 1.101117711572... | [{'index': 2, 'value': 1.0}] | -0.912576 | 0.814515 | -0.771966 | [{'index': 2, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 101 | 3 | [{'CENTROID_ID': 3, 'DISTANCE': 0.823832007899... | [{'index': 1, 'value': 1.0}] | 0.549739 | -1.308724 | 1.554546 | [{'index': 2, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
| 102 | 4 | [{'CENTROID_ID': 4, 'DISTANCE': 0.995348310182... | [{'index': 2, 'value': 1.0}] | -0.126582 | 0.662855 | -0.626559 | [{'index': 3, 'value': 1.0}] | [{'index': 1, 'value': 1.0}] |
| 107 | 3 | [{'CENTROID_ID': 3, 'DISTANCE': 0.930021405831... | [{'index': 1, 'value': 1.0}] | 1.20778 | -1.005404 | 1.118325 | [{'index': 2, 'value': 1.0}] | [{'index': 3, 'value': 1.0}] |
25 rows × 8 columns
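Conceptually, KMeans.predict assigns each row to its nearest learned centroid, which is what the CENTROID_ID and NEAREST_CENTROIDS_DISTANCE columns above report. A plain-Python sketch of that assignment step (the centroid values here are made up for illustration):

```python
# Euclidean distance between two equal-length vectors
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Assign a row to its nearest centroid; ids here start at 1, like BQML's
def assign(row, centroids):
    distances = {cid: euclidean(row, c) for cid, c in centroids.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]

centroids = {1: [0.0, 0.0], 2: [1.0, 1.0], 3: [-1.0, 1.0]}
print(assign([0.9, 1.2], centroids))  # nearest is centroid 2
```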
Pipelines#
Transformers and predictors can be chained into a single estimator component using bigframes.ml.pipeline.Pipeline:
from bigframes.ml.pipeline import Pipeline
pipeline = Pipeline([
('preproc', preproc),
('linreg', linreg)
])
# Print our pipeline
pipeline
Pipeline(steps=[('preproc',
ColumnTransformer(transformers=[('scale', StandardScaler(),
['culmen_length_mm',
'culmen_depth_mm',
'flipper_length_mm']),
('encode', OneHotEncoder(),
['species', 'sex',
'island'])])),
('linreg', LinearRegression())])
The pipeline simplifies the workflow by applying each of its component steps automatically:
pipeline.fit(X_train, y_train)
predicted_y_test = pipeline.predict(X_test)
predicted_y_test
| predicted_body_mass_g | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | sex | species | |
|---|---|---|---|---|---|---|---|
| penguin_id | |||||||
| 1 | 4772.374547 | Biscoe | 45.1 | 14.5 | 215.0 | FEMALE | Gentoo penguin (Pygoscelis papua) |
| 15 | 3883.371052 | Dream | 41.1 | 17.5 | 190.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
| 28 | 3479.706166 | Dream | 38.1 | 18.6 | 190.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
| 32 | 4223.851137 | Dream | 51.9 | 19.5 | 206.0 | MALE | Chinstrap penguin (Pygoscelis antarctica) |
| 33 | 3197.620461 | Dream | 42.5 | 17.3 | 187.0 | FEMALE | Chinstrap penguin (Pygoscelis antarctica) |
| 34 | 4155.265191 | Dream | 41.1 | 18.1 | 205.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
| 37 | 3991.311319 | Dream | 51.3 | 18.2 | 197.0 | MALE | Chinstrap penguin (Pygoscelis antarctica) |
| 41 | 3232.644783 | Torgersen | 40.2 | 17.0 | 176.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
| 47 | 4017.738303 | Dream | 37.5 | 18.5 | 199.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
| 52 | 3365.077659 | Dream | 46.4 | 17.8 | 191.0 | FEMALE | Chinstrap penguin (Pygoscelis antarctica) |
| 56 | 3791.328893 | Biscoe | 38.2 | 18.1 | 185.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
| 57 | 3547.890609 | Biscoe | 38.6 | 17.2 | 199.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
| 62 | 5372.086117 | Biscoe | 49.3 | 15.7 | 217.0 | MALE | Gentoo penguin (Pygoscelis papua) |
| 65 | 4263.229571 | Dream | 53.5 | 19.9 | 205.0 | MALE | Chinstrap penguin (Pygoscelis antarctica) |
| 67 | 5234.457401 | Biscoe | 47.6 | 14.5 | 215.0 | MALE | Gentoo penguin (Pygoscelis papua) |
| 75 | 3979.311469 | Biscoe | 37.8 | 20.0 | 190.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
| 81 | 3481.328573 | Dream | 47.6 | 18.3 | 195.0 | FEMALE | Chinstrap penguin (Pygoscelis antarctica) |
| 89 | 3915.237615 | Dream | 39.2 | 18.6 | 190.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
| 92 | 3425.560982 | Dream | 39.5 | 17.8 | 188.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
| 93 | 4141.494969 | Biscoe | 42.2 | 19.5 | 197.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
| 96 | 3394.719445 | Dream | 42.2 | 18.5 | 180.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
| 100 | 3507.223965 | Dream | 38.9 | 18.8 | 190.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
| 101 | 4922.284991 | Biscoe | 46.9 | 14.6 | 222.0 | FEMALE | Gentoo penguin (Pygoscelis papua) |
| 102 | 4016.240318 | Dream | 43.2 | 18.5 | 192.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
| 107 | 4933.653758 | Biscoe | 50.5 | 15.2 | 216.0 | FEMALE | Gentoo penguin (Pygoscelis papua) |
25 rows × 7 columns
On the backend, the pipeline is compiled into a single BQML model with an embedded TRANSFORM step.
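The fit/predict semantics a pipeline automates can be sketched in plain Python. The toy transformer and predictor below are hypothetical, not the bigframes classes; the point is the order of operations:

```python
# .fit() fits the preprocessor, transforms the training data, and fits
# the final predictor on the result; .predict() replays the transform
# before predicting.
class MiniPipeline:
    def __init__(self, transformer, predictor):
        self.transformer = transformer
        self.predictor = predictor

    def fit(self, X, y):
        self.transformer.fit(X)
        self.predictor.fit(self.transformer.transform(X), y)
        return self

    def predict(self, X):
        return self.predictor.predict(self.transformer.transform(X))

class MeanCenter:
    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self
    def transform(self, X):
        return [x - self.mean_ for x in X]

class SlopeOnly:
    """Hypothetical predictor: y = w * x, with w fit by least squares."""
    def fit(self, X, y):
        self.w_ = sum(a * b for a, b in zip(X, y)) / sum(a * a for a in X)
        return self
    def predict(self, X):
        return [self.w_ * x for x in X]

pipe = MiniPipeline(MeanCenter(), SlopeOnly()).fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(pipe.predict([4.0]))
```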
Evaluating results#
Some models include a convenient .score(X, y) method for evaluation with a preset collection of metrics:
# In the case of a pipeline, this will be equivalent to calling .score on the contained LinearRegression
pipeline.score(X_test, y_test)
| mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance | |
|---|---|---|---|---|---|---|
| 0 | 225.883512 | 77765.989281 | 0.004457 | 179.548041 | 0.873166 | 0.873315 |
1 rows × 6 columns
For a more general approach, the bigframes.ml.metrics module is provided:
from bigframes.ml.metrics import r2_score
r2_score(y_test, predicted_y_test["predicted_body_mass_g"])
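Under the hood, r2_score computes the coefficient of determination, 1 - SS_res/SS_tot. A plain-Python sketch of the formula (an illustration, not the bigframes implementation):

```python
# R^2: 1 minus the ratio of residual sum of squares to total sum of
# squares. 1.0 means perfect predictions; 0.0 means no better than
# always predicting the mean of y.
def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

print(r2([3.0, 5.0, 7.0], [2.9, 5.1, 7.0]))
```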
Save to BigQuery#
Estimators can be saved to BigQuery as BQML models and loaded again in the future.
Saving requires the bigquery.tables.create permission, and loading requires the bigquery.models.getMetadata permission.
These permissions can be granted at the project level or the dataset level.
If you have those permissions, uncomment and run the code in the following cells.
linreg.to_gbq(f"{DATASET}.penguins_model", replace=True)
Pipeline(steps=[('transform',
ColumnTransformer(transformers=[('ont_hot_encoder',
OneHotEncoder(max_categories=1000001,
min_frequency=0),
'island'),
('standard_scaler',
StandardScaler(),
'culmen_length_mm'),
('standard_scaler',
StandardScaler(),
'culmen_depth_mm'),
('standard_scaler',
StandardScaler(),
'flipper_length_mm'),
('ont_hot_encoder',
OneHotEncoder(max_categories=1000001,
min_frequency=0),
'sex'),
('ont_hot_encoder',
OneHotEncoder(max_categories=1000001,
min_frequency=0),
'species')])),
('estimator',
LinearRegression(optimize_strategy='NORMAL_EQUATION'))])
bpd.read_gbq_model(f"{DATASET}.penguins_model")
Pipeline(steps=[('transform',
ColumnTransformer(transformers=[('ont_hot_encoder',
OneHotEncoder(max_categories=1000001,
min_frequency=0),
'island'),
('standard_scaler',
StandardScaler(),
'culmen_length_mm'),
('standard_scaler',
StandardScaler(),
'culmen_depth_mm'),
('standard_scaler',
StandardScaler(),
'flipper_length_mm'),
('ont_hot_encoder',
OneHotEncoder(max_categories=1000001,
min_frequency=0),
'sex'),
('ont_hot_encoder',
OneHotEncoder(max_categories=1000001,
min_frequency=0),
'species')])),
('estimator',
LinearRegression(optimize_strategy='NORMAL_EQUATION'))])
We can also save the pipeline to BigQuery. BigQuery will save this as a single model, with the pre-processing steps embedded in the TRANSFORM property:
pipeline.to_gbq(f"{DATASET}.penguins_pipeline", replace=True)
Pipeline(steps=[('transform',
ColumnTransformer(transformers=[('ont_hot_encoder',
OneHotEncoder(max_categories=1000001,
min_frequency=0),
'island'),
('standard_scaler',
StandardScaler(),
'culmen_length_mm'),
('standard_scaler',
StandardScaler(),
'culmen_depth_mm'),
('standard_scaler',
StandardScaler(),
'flipper_length_mm'),
('ont_hot_encoder',
OneHotEncoder(max_categories=1000001,
min_frequency=0),
'sex'),
('ont_hot_encoder',
OneHotEncoder(max_categories=1000001,
min_frequency=0),
'species')])),
('estimator',
LinearRegression(optimize_strategy='NORMAL_EQUATION'))])
bpd.read_gbq_model(f"{DATASET}.penguins_pipeline")
Pipeline(steps=[('transform',
ColumnTransformer(transformers=[('ont_hot_encoder',
OneHotEncoder(max_categories=1000001,
min_frequency=0),
'island'),
('standard_scaler',
StandardScaler(),
'culmen_length_mm'),
('standard_scaler',
StandardScaler(),
'culmen_depth_mm'),
('standard_scaler',
StandardScaler(),
'flipper_length_mm'),
('ont_hot_encoder',
OneHotEncoder(max_categories=1000001,
min_frequency=0),
'sex'),
('ont_hot_encoder',
OneHotEncoder(max_categories=1000001,
min_frequency=0),
'species')])),
('estimator',
LinearRegression(optimize_strategy='NORMAL_EQUATION'))])
Summary and next steps#
You’ve completed an end-to-end machine learning workflow using the built-in capabilities of BigQuery DataFrames.
Learn more about BigQuery DataFrames in the documentation and find more sample notebooks in the GitHub repo.
Cleaning up#
To clean up all Google Cloud resources used in this project, you can delete the Google Cloud project you used for the tutorial.
Otherwise, you can uncomment the remaining cells and run them to delete the individual resources you created in this tutorial:
# # Delete the BQML models
# MODEL_NAME = f"{PROJECT_ID}:{DATASET}.penguins_model"
# ! bq rm -f --model {MODEL_NAME}
# PIPELINE_NAME = f"{PROJECT_ID}:{DATASET}.penguins_pipeline"
# ! bq rm -f --model {PIPELINE_NAME}