Using ML - Easy linear regression

This demo shows how BigQuery DataFrames ML provides a scikit-learn-like experience for training a linear regression model.

In this “easy” version of linear regression, we use a couple of BQML features to simplify our code:

We rely on automatic preprocessing to encode string values and scale numeric values.
We rely on automatic data split and evaluation to test the model.
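
To make the first point concrete, here is a minimal pure-Python sketch of the two transformations that "encode string values and scale numeric values" usually refers to: one-hot encoding and standardization. This illustrates the idea only; it is not BQML's implementation. The helper names `one_hot` and `standardize` and the sample values are invented for this example.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Map each string to a 0/1 indicator vector over the sorted distinct categories."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, encoded


def standardize(values):
    """Rescale numbers to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


# Invented sample values, loosely shaped like the penguins columns.
islands = ["Biscoe", "Dream", "Biscoe", "Torgersen"]
flipper_mm = [180.0, 190.0, 200.0, 210.0]

categories, encoded = one_hot(islands)
scaled = standardize(flipper_mm)

print(categories)
print(encoded)
print(scaled)
```

With automatic preprocessing, none of this has to appear in the notebook: the raw string and numeric columns are passed to the model directly.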

This example is adapted from the BQML linear regression tutorial.

1. Init & load data

Import the bigframes.pandas module and get the default session.

import bigframes.pandas
session = bigframes.pandas.get_global_session()

Define a dataset for storing the BQML model, and create it if it does not exist.

dataset = f"{session.bqclient.project}.bqml_tutorial"
session.bqclient.create_dataset(dataset, exists_ok=True)

Define the path where the model will be saved.

penguins_model = f"{dataset}.penguins_model"

Read the penguins data.

# read a BigQuery table into a BigQuery DataFrame
df = bigframes.pandas.read_gbq("bigquery-public-data.ml_datasets.penguins")

# take a peek at the dataframe
df

2. Data cleaning / prep

# filter down to the data we want to analyze
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# drop the species column, which is constant after the filter above
adelie_data = adelie_data.drop(columns=["species"])

# drop rows with nulls to get our training data
training_data = adelie_data.dropna()

# take a peek at the training data
training_data

# pick feature columns and label column
feature_columns = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
label_columns = training_data[['body_mass_g']]

# also get the rows that we want to make predictions for (i.e. where the label column is null)
missing_body_mass = adelie_data[adelie_data.body_mass_g.isnull()]
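
The two selections above are not complements of each other, which is easy to miss. The following pure-Python sketch mimics the same logic on a few invented rows (plain dicts stand in for DataFrame rows, and the helper `has_null` is hypothetical): one group keeps rows with no missing values at all, the other keeps rows whose label is missing.

```python
# Invented sample rows; None plays the role of a null value.
rows = [
    {"island": "Dream", "culmen_length_mm": 39.1, "sex": "MALE", "body_mass_g": 3750.0},
    {"island": "Biscoe", "culmen_length_mm": 38.2, "sex": None, "body_mass_g": 3400.0},
    {"island": "Torgersen", "culmen_length_mm": 40.3, "sex": "FEMALE", "body_mass_g": None},
]


def has_null(row):
    """Return True if any value in the row is missing."""
    return any(value is None for value in row.values())


# Same spirit as dropna(): keep only rows with no missing values.
training_rows = [row for row in rows if not has_null(row)]

# Same spirit as the isnull() filter: keep rows whose label is missing.
to_predict = [row for row in rows if row["body_mass_g"] is None]

print(len(training_rows), len(to_predict))
```

Note that the "Biscoe" row lands in neither group: it has a label but is missing a feature, so it is dropped from training and is not a prediction target either.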

3. Create, fit, score, predict

from bigframes.ml.linear_model import LinearRegression

model = LinearRegression()

# Here we pass the feature columns without transforms - BQML will then use
# automatic preprocessing to encode these columns
model.fit(feature_columns, label_columns)

# check how the model performed
model.score(feature_columns, label_columns)

# use the model to predict the missing labels
model.predict(missing_body_mass)
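
For intuition about what fit, score, and predict do conceptually, here is a small self-contained sketch of ordinary least squares with a single feature, followed by an R² calculation. It is a simplification under stated assumptions: the real model uses several features and BQML's own training procedure, the data points are invented, and R² is assumed here as a representative regression metric rather than taken from the output of `model.score`.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict(slope, intercept, xs):
    """Apply the fitted line to new feature values."""
    return [slope * x + intercept for x in xs]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Invented data: flipper length (mm) as the feature, body mass (g) as the label.
flipper_mm = [180.0, 185.0, 190.0, 195.0, 200.0]
body_mass_g = [3400.0, 3550.0, 3650.0, 3850.0, 3950.0]

slope, intercept = fit_line(flipper_mm, body_mass_g)               # "fit"
r2 = r2_score(body_mass_g, predict(slope, intercept, flipper_mm))  # "score"
estimate = predict(slope, intercept, [192.0])[0]                   # "predict"

print(slope, intercept, r2, estimate)
```

An R² close to 1 means the fitted line explains most of the variance in the label. Scoring on the same rows used for fitting, as this sketch does, tends to give an optimistic number.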

4. Save in BigQuery

# save the model to a permanent location in BigQuery, so we can use it in future sessions (and elsewhere in BQ)
model.to_gbq(penguins_model, replace=True)

5. Reload from BigQuery

# WARNING - until b/281709360 is fixed & pipeline is updated, pipelines will load as models,
# and details of their transform steps will be lost (the loaded model will behave the same)
bigframes.pandas.read_gbq_model(penguins_model)