Using ML - SKLearn linear regression#

This demo shows how we can implement a linear regression in BigQuery DataFrames ML, with API that is exactly compatible with scikit-learn.

1. Init & load data#

# Initialize BigQuery DataFrame
import bigframes.pandas

# read a BigQuery table to a BigQuery DataFrame
df = bigframes.pandas.read_gbq("bigframes-dev.bqml_tutorial.penguins")

# take a peek at the dataframe
df
Query job f201b84b-5506-4038-92e6-b4a82318df8f is DONE. 0 Bytes processed. Open Job
Query job 12e0f983-695e-4903-8ff1-2f353d7e8cba is DONE. 28.9 kB processed. Open Job
species island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Penguin (Pygoscelis adeliae) Biscoe 40.1 18.9 188.0 4300.0 MALE
1 Adelie Penguin (Pygoscelis adeliae) Torgersen 39.1 18.7 181.0 3750.0 MALE
2 Gentoo penguin (Pygoscelis papua) Biscoe 47.4 14.6 212.0 4725.0 FEMALE
3 Chinstrap penguin (Pygoscelis antarctica) Dream 42.5 16.7 187.0 3350.0 FEMALE
4 Adelie Penguin (Pygoscelis adeliae) Biscoe 43.2 19.0 197.0 4775.0 MALE
5 Gentoo penguin (Pygoscelis papua) Biscoe 46.7 15.3 219.0 5200.0 MALE
6 Adelie Penguin (Pygoscelis adeliae) Biscoe 41.3 21.1 195.0 4400.0 MALE
7 Gentoo penguin (Pygoscelis papua) Biscoe 45.2 13.8 215.0 4750.0 FEMALE
8 Gentoo penguin (Pygoscelis papua) Biscoe 46.5 13.5 210.0 4550.0 FEMALE
9 Gentoo penguin (Pygoscelis papua) Biscoe 50.5 15.2 216.0 5000.0 FEMALE
10 Gentoo penguin (Pygoscelis papua) Biscoe 48.2 15.6 221.0 5100.0 MALE
11 Adelie Penguin (Pygoscelis adeliae) Dream 38.1 18.6 190.0 3700.0 FEMALE
12 Gentoo penguin (Pygoscelis papua) Biscoe 50.7 15.0 223.0 5550.0 MALE
13 Adelie Penguin (Pygoscelis adeliae) Biscoe 37.8 20.0 190.0 4250.0 MALE
14 Adelie Penguin (Pygoscelis adeliae) Biscoe 35.0 17.9 190.0 3450.0 FEMALE
15 Gentoo penguin (Pygoscelis papua) Biscoe 48.7 15.7 208.0 5350.0 MALE
16 Adelie Penguin (Pygoscelis adeliae) Torgersen 34.6 21.1 198.0 4400.0 MALE
17 Gentoo penguin (Pygoscelis papua) Biscoe 46.8 15.4 215.0 5150.0 MALE
18 Chinstrap penguin (Pygoscelis antarctica) Dream 50.3 20.0 197.0 3300.0 MALE
19 Adelie Penguin (Pygoscelis adeliae) Dream 37.2 18.1 178.0 3900.0 MALE
20 Chinstrap penguin (Pygoscelis antarctica) Dream 51.0 18.8 203.0 4100.0 MALE
21 Adelie Penguin (Pygoscelis adeliae) Biscoe 40.5 17.9 187.0 3200.0 FEMALE
22 Gentoo penguin (Pygoscelis papua) Biscoe 45.5 13.9 210.0 4200.0 FEMALE
23 Adelie Penguin (Pygoscelis adeliae) Dream 42.2 18.5 180.0 3550.0 FEMALE
24 Chinstrap penguin (Pygoscelis antarctica) Dream 51.7 20.3 194.0 3775.0 MALE

25 rows × 7 columns

[344 rows x 7 columns in total]

2. Data cleaning / prep#

# filter down to the data we want to analyze
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# drop the columns we don't care about
adelie_data = adelie_data.drop(columns=["species"])

# drop rows with nulls to get our training data
training_data = adelie_data.dropna()

# take a peek at the training data
training_data
Query job 81305962-a96a-4c86-949c-471b2ae7c86d is DONE. 28.9 kB processed. Open Job
Query job 2af0b0d6-c11b-499e-8d25-a2c628b2853b is DONE. 28.9 kB processed. Open Job
island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex
0 Biscoe 40.1 18.9 188.0 4300.0 MALE
1 Torgersen 39.1 18.7 181.0 3750.0 MALE
4 Biscoe 43.2 19.0 197.0 4775.0 MALE
6 Biscoe 41.3 21.1 195.0 4400.0 MALE
11 Dream 38.1 18.6 190.0 3700.0 FEMALE
13 Biscoe 37.8 20.0 190.0 4250.0 MALE
14 Biscoe 35.0 17.9 190.0 3450.0 FEMALE
16 Torgersen 34.6 21.1 198.0 4400.0 MALE
19 Dream 37.2 18.1 178.0 3900.0 MALE
21 Biscoe 40.5 17.9 187.0 3200.0 FEMALE
23 Dream 42.2 18.5 180.0 3550.0 FEMALE
30 Dream 39.2 21.1 196.0 4150.0 MALE
32 Torgersen 42.9 17.6 196.0 4700.0 MALE
38 Dream 41.1 17.5 190.0 3900.0 MALE
40 Torgersen 38.6 21.2 191.0 3800.0 MALE
42 Biscoe 35.5 16.2 195.0 3350.0 FEMALE
44 Dream 39.2 18.6 190.0 4250.0 MALE
45 Torgersen 35.2 15.9 186.0 3050.0 FEMALE
46 Dream 43.2 18.5 192.0 4100.0 MALE
49 Biscoe 39.6 17.7 186.0 3500.0 FEMALE
53 Biscoe 45.6 20.3 191.0 4600.0 MALE
58 Torgersen 40.9 16.8 191.0 3700.0 FEMALE
60 Torgersen 40.3 18.0 195.0 3250.0 FEMALE
62 Dream 36.0 18.5 186.0 3100.0 FEMALE
63 Torgersen 39.3 20.6 190.0 3650.0 MALE

25 rows × 6 columns

[146 rows x 6 columns in total]

3. Use model_selection.train_test_split to prepare training data#

from bigframes.ml.model_selection import train_test_split

feature_columns = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
label_columns = training_data[['body_mass_g']] 

X_train, X_test, y_train, y_test = train_test_split(
  feature_columns, label_columns, test_size=0.2)
Query job 0808457b-a0df-4a37-b7a5-8885f4a4588c is DONE. 28.9 kB processed. Open Job

4. Configure a linear regression pipeline with preprocessing#

from bigframes.ml.linear_model import LinearRegression
from bigframes.ml.pipeline import Pipeline
from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import StandardScaler, OneHotEncoder

preprocessing = ColumnTransformer([
  ("onehot", OneHotEncoder(), ["island", "sex"]),
  ("scaler", StandardScaler(), ["culmen_depth_mm", "culmen_length_mm", "flipper_length_mm"]),
])

model = LinearRegression(fit_intercept=False)

pipeline = Pipeline([
  ('preproc', preprocessing),
  ('linreg', model)
])

# TODO(bmil): pretty printing for pipelines
pipeline
Pipeline(steps=[('preproc',
                 ColumnTransformer(transformers=[('onehot', OneHotEncoder(),
                                                  ['island', 'species', 'sex']),
                                                 ('scaler', StandardScaler(),
                                                  ['culmen_depth_mm',
                                                   'culmen_length_mm',
                                                   'flipper_length_mm'])])),
                ('linreg', LinearRegression(fit_intercept=False))])

5. Fit the pipeline to the training data#

This will create a temporary BQML model in BigQuery

pipeline.fit(X_train, y_train)
Query job e9bfa6a5-a53f-4d8b-ae8c-cc8cd55d0947 is DONE. 28.9 kB processed. Open Job
Query job d8d553cf-3d36-49aa-b18b-9a05576a1fb0 is DONE. 28.9 kB processed. Open Job
Query job 75ef0083-9a4f-4ffb-a6c6-d82974a1659f is DONE. 0 Bytes processed. Open Job
Pipeline(steps=[('preproc',
                 ColumnTransformer(transformers=[('onehot', OneHotEncoder(),
                                                  ['island', 'species', 'sex']),
                                                 ('scaler', StandardScaler(),
                                                  ['culmen_depth_mm',
                                                   'culmen_length_mm',
                                                   'flipper_length_mm'])])),
                ('linreg', LinearRegression(fit_intercept=False))])

6. Score the pipeline on the test data with metrics.r2_score#

from bigframes.ml.metrics import r2_score

y_pred = pipeline.predict(X_test)["predicted_body_mass_g"]

r2_score(y_test, y_pred)
Query job 55c5a9ce-8159-4a1a-99a4-af3a906640ba is DONE. 29.3 kB processed. Open Job
Query job 3e41c470-de70-4f13-89d9-c5564d0b2836 is DONE. 232 Bytes processed. Open Job
Query job ed2f9042-a737-4d13-bd21-8c3d29cd61a2 is DONE. 28.9 kB processed. Open Job
Query job 815d16b5-0a5d-42be-a766-1cff5b8f22f2 is DONE. 28.9 kB processed. Open Job
Query job 37a38dc6-5073-4544-a1e3-da145a843922 is DONE. 29.4 kB processed. Open Job
0.2655729213572775

5. Inference the model on new data#

import pandas

new_penguins = bigframes.pandas.read_pandas(
        pandas.DataFrame(
            {
                "tag_number": [1633, 1672, 1690],
                "species": [
                    "Adelie Penguin (Pygoscelis adeliae)",
                    "Adelie Penguin (Pygoscelis adeliae)",
                    "Adelie Penguin (Pygoscelis adeliae)",
                ],
                "island": ["Torgersen", "Torgersen", "Dream"],
                "culmen_length_mm": [39.5, 38.5, 37.9],
                "culmen_depth_mm": [18.8, 17.2, 18.1],
                "flipper_length_mm": [196.0, 181.0, 188.0],
                "sex": ["MALE", "FEMALE", "FEMALE"],
            }
        ).set_index("tag_number")
    )
Load job 7b46750c-70b4-468d-87ba-9f84f579f2a6 is DONE. Open Job
pipeline.predict(new_penguins)
Query job d10dd37d-5e8e-4e15-9c83-a7e9a4c592a8 is DONE. 593 Bytes processed. Open Job
Query job 207cb787-cf8a-43ea-8e73-644d3f58b11a is DONE. 24 Bytes processed. Open Job
Query job c5dc5075-cac0-4947-9e9f-06aa9cc5bd2a is DONE. 0 Bytes processed. Open Job
Query job 2ca4a569-7186-48ed-b3e4-004dca704798 is DONE. 282 Bytes processed. Open Job
predicted_body_mass_g species island culmen_length_mm culmen_depth_mm flipper_length_mm sex
tag_number
1633 4017.203152 Adelie Penguin (Pygoscelis adeliae) Torgersen 39.5 18.8 196.0 MALE
1672 3127.601519 Adelie Penguin (Pygoscelis adeliae) Torgersen 38.5 17.2 181.0 FEMALE
1690 3386.101231 Adelie Penguin (Pygoscelis adeliae) Dream 37.9 18.1 188.0 FEMALE

3 rows × 7 columns

[3 rows x 7 columns in total]

6. Save in BigQuery#

pipeline.to_gbq("bigframes-dev.bigframes_demo_us.penguin_model", replace=True)
Copy job d1def4a4-1da1-43a9-8ae5-4459444d993d is DONE. Open Job
Pipeline(steps=[('transform',
                 ColumnTransformer(transformers=[('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'island'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_length_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_depth_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'flipper_length_mm'),
                                                 ('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'sex')])),
                ('estimator',
                 LinearRegression(fit_intercept=False,
                                  optimize_strategy='NORMAL_EQUATION'))])