Using ML - Easy linear regression
This demo shows how BigQuery DataFrames ML provides a scikit-learn-like experience for training a linear regression model.
In this “easy” version of linear regression, we use a couple of BQML features to simplify our code:
We rely on automatic preprocessing to encode string values and scale numeric values.
We rely on automatic data splitting and evaluation to test the model.
This example is adapted from the BQML linear regression tutorial.
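To make the automatic preprocessing mentioned above more concrete, here is a minimal pure-Python sketch of the two ideas involved: one-hot encoding for string columns and standardization for numeric columns. The helper names `standardize` and `one_hot` and the sample values are hypothetical, and the exact formulas BQML applies internally may differ; this only illustrates the kind of transformation you would otherwise write by hand.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale numeric values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode string categories as 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Hypothetical sample values, not taken from the penguins table.
islands = ["Biscoe", "Dream", "Biscoe", "Torgersen"]
flipper_lengths = [180.0, 190.0, 200.0, 210.0]

categories, encoded_islands = one_hot(islands)
scaled_flippers = standardize(flipper_lengths)
```

With automatic preprocessing, none of this code is needed in the notebook: the raw string and numeric columns are passed to the model as they are.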
1. Init & load data
Import the bigframes.pandas module and get the default session.
import bigframes.pandas
session = bigframes.pandas.get_global_session()
Define a dataset for storing the BQML model, and create it if it does not exist.
dataset = f"{session.bqclient.project}.bqml_tutorial"
session.bqclient.create_dataset(dataset, exists_ok=True)
Define a model path
penguins_model = f"{dataset}.penguins_model"
Read the penguins data.
# read a BigQuery table to a BigQuery DataFrame
df = bigframes.pandas.read_gbq("bigquery-public-data.ml_datasets.penguins")
# take a peek at the dataframe
df
2. Data cleaning / prep
# filter down to the data we want to analyze
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]
# drop the columns we don't care about
adelie_data = adelie_data.drop(columns=["species"])
# drop rows with nulls to get our training data
training_data = adelie_data.dropna()
# take a peek at the training data
training_data
# pick feature columns and label column
feature_columns = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
label_columns = training_data[['body_mass_g']]
# also get the rows that we want to make predictions for (i.e. where the label column, body_mass_g, is null)
missing_body_mass = adelie_data[adelie_data.body_mass_g.isnull()]
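The two selections above follow different rules, which is easy to miss. The sketch below mimics them in plain Python on a few hypothetical rows (the values are made up, and `None` stands in for a null): one filter keeps only fully populated rows, the other keeps rows whose label is missing. It is an illustration of the logic only, not of how BigQuery DataFrames executes it.

```python
# Hypothetical rows standing in for the Adelie subset; None marks a null.
rows = [
    {"island": "Dream", "sex": "MALE", "flipper_length_mm": 190.0, "body_mass_g": 3800.0},
    {"island": "Biscoe", "sex": None, "flipper_length_mm": 185.0, "body_mass_g": 3600.0},
    {"island": "Torgersen", "sex": "FEMALE", "flipper_length_mm": 181.0, "body_mass_g": None},
]

# Like dropna(): keep only rows where every column has a value.
training_rows = [r for r in rows if all(v is not None for v in r.values())]

# Like filtering on body_mass_g.isnull(): keep rows whose label is missing.
rows_to_predict = [r for r in rows if r["body_mass_g"] is None]

# Rows with a known label but a null feature land in neither set.
unused_rows = [r for r in rows if r not in training_rows and r not in rows_to_predict]
```

In this sketch the second row is used neither for training nor for prediction, because its label is present but one of its features is null.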
3. Create, fit, score, predict
from bigframes.ml.linear_model import LinearRegression
model = LinearRegression()
# Here we pass the feature columns without transforms - BQML will then use
# automatic preprocessing to encode these columns
model.fit(feature_columns, label_columns)
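For intuition about what fitting a linear regression means, the following is a minimal sketch of ordinary least squares with a single numeric feature, written in plain Python. The function name `fit_simple_linear_regression` and the data are hypothetical, and this is not a description of how BQML trains the model, which handles multiple features and runs inside BigQuery.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical data constructed so that mass = 20 * flipper + 100 exactly.
flipper = [180.0, 190.0, 200.0, 210.0]
mass = [3700.0, 3900.0, 4100.0, 4300.0]

slope, intercept = fit_simple_linear_regression(flipper, mass)

# Predict the label for a new feature value with the fitted line.
predicted_mass = slope * 195.0 + intercept
```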
# check how the model performed
model.score(feature_columns, label_columns)
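Scoring a regression model usually means comparing predicted labels with known labels. As a rough guide to reading such results, here is a plain-Python sketch of three common regression metrics. The function `regression_metrics`, its dictionary keys, and the numbers are illustrative assumptions; the exact set of metrics reported by `model.score` is not shown in this demo. Note also that the call above scores the model on the same rows it was fitted on.

```python
def regression_metrics(actual, predicted):
    """Compute mean absolute error, mean squared error, and R^2."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mean_absolute_error": mae, "mean_squared_error": mse, "r2_score": r2}


# Hypothetical labels and predictions, in grams.
actual_mass = [3700.0, 3900.0, 4100.0, 4300.0]
predicted_mass = [3750.0, 3850.0, 4150.0, 4250.0]

metrics = regression_metrics(actual_mass, predicted_mass)
```

An R^2 close to 1 means the predictions explain most of the variation in the labels, while the error metrics are in the units of the label (or its square).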
# use the model to predict the missing labels
model.predict(missing_body_mass)
4. Save in BigQuery
# save the model to a permanent location in BigQuery, so we can use it in future sessions (and elsewhere in BQ)
model.to_gbq(penguins_model, replace=True)
5. Reload from BigQuery
# WARNING - until b/281709360 is fixed & pipeline is updated, pipelines will load as models,
# and details of their transform steps will be lost (the loaded model will behave the same)
bigframes.pandas.read_gbq_model(penguins_model)