Using ML - SKLearn linear regression#
This demo shows how we can implement a linear regression in BigQuery DataFrames ML, with API that is exactly compatible with scikit-learn.
1. Init & load data#
# Initialize BigQuery DataFrame
import bigframes.pandas
# read a BigQuery table to a BigQuery DataFrame
df = bigframes.pandas.read_gbq("bigframes-dev.bqml_tutorial.penguins")
# take a peek at the dataframe
df
Query job f201b84b-5506-4038-92e6-b4a82318df8f is DONE. 0 Bytes processed. Open Job
Query job 12e0f983-695e-4903-8ff1-2f353d7e8cba is DONE. 28.9 kB processed. Open Job
| species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex | |
|---|---|---|---|---|---|---|---|
| 0 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 40.1 | 18.9 | 188.0 | 4300.0 | MALE |
| 1 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 2 | Gentoo penguin (Pygoscelis papua) | Biscoe | 47.4 | 14.6 | 212.0 | 4725.0 | FEMALE |
| 3 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 42.5 | 16.7 | 187.0 | 3350.0 | FEMALE |
| 4 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 43.2 | 19.0 | 197.0 | 4775.0 | MALE |
| 5 | Gentoo penguin (Pygoscelis papua) | Biscoe | 46.7 | 15.3 | 219.0 | 5200.0 | MALE |
| 6 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 41.3 | 21.1 | 195.0 | 4400.0 | MALE |
| 7 | Gentoo penguin (Pygoscelis papua) | Biscoe | 45.2 | 13.8 | 215.0 | 4750.0 | FEMALE |
| 8 | Gentoo penguin (Pygoscelis papua) | Biscoe | 46.5 | 13.5 | 210.0 | 4550.0 | FEMALE |
| 9 | Gentoo penguin (Pygoscelis papua) | Biscoe | 50.5 | 15.2 | 216.0 | 5000.0 | FEMALE |
| 10 | Gentoo penguin (Pygoscelis papua) | Biscoe | 48.2 | 15.6 | 221.0 | 5100.0 | MALE |
| 11 | Adelie Penguin (Pygoscelis adeliae) | Dream | 38.1 | 18.6 | 190.0 | 3700.0 | FEMALE |
| 12 | Gentoo penguin (Pygoscelis papua) | Biscoe | 50.7 | 15.0 | 223.0 | 5550.0 | MALE |
| 13 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 37.8 | 20.0 | 190.0 | 4250.0 | MALE |
| 14 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 35.0 | 17.9 | 190.0 | 3450.0 | FEMALE |
| 15 | Gentoo penguin (Pygoscelis papua) | Biscoe | 48.7 | 15.7 | 208.0 | 5350.0 | MALE |
| 16 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 34.6 | 21.1 | 198.0 | 4400.0 | MALE |
| 17 | Gentoo penguin (Pygoscelis papua) | Biscoe | 46.8 | 15.4 | 215.0 | 5150.0 | MALE |
| 18 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 50.3 | 20.0 | 197.0 | 3300.0 | MALE |
| 19 | Adelie Penguin (Pygoscelis adeliae) | Dream | 37.2 | 18.1 | 178.0 | 3900.0 | MALE |
| 20 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 51.0 | 18.8 | 203.0 | 4100.0 | MALE |
| 21 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 40.5 | 17.9 | 187.0 | 3200.0 | FEMALE |
| 22 | Gentoo penguin (Pygoscelis papua) | Biscoe | 45.5 | 13.9 | 210.0 | 4200.0 | FEMALE |
| 23 | Adelie Penguin (Pygoscelis adeliae) | Dream | 42.2 | 18.5 | 180.0 | 3550.0 | FEMALE |
| 24 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 51.7 | 20.3 | 194.0 | 3775.0 | MALE |
25 rows × 7 columns
2. Data cleaning / prep#
# filter down to the data we want to analyze
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]
# drop the columns we don't care about
adelie_data = adelie_data.drop(columns=["species"])
# drop rows with nulls to get our training data
training_data = adelie_data.dropna()
# take a peek at the training data
training_data
Query job 81305962-a96a-4c86-949c-471b2ae7c86d is DONE. 28.9 kB processed. Open Job
Query job 2af0b0d6-c11b-499e-8d25-a2c628b2853b is DONE. 28.9 kB processed. Open Job
| island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex | |
|---|---|---|---|---|---|---|
| 0 | Biscoe | 40.1 | 18.9 | 188.0 | 4300.0 | MALE |
| 1 | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 4 | Biscoe | 43.2 | 19.0 | 197.0 | 4775.0 | MALE |
| 6 | Biscoe | 41.3 | 21.1 | 195.0 | 4400.0 | MALE |
| 11 | Dream | 38.1 | 18.6 | 190.0 | 3700.0 | FEMALE |
| 13 | Biscoe | 37.8 | 20.0 | 190.0 | 4250.0 | MALE |
| 14 | Biscoe | 35.0 | 17.9 | 190.0 | 3450.0 | FEMALE |
| 16 | Torgersen | 34.6 | 21.1 | 198.0 | 4400.0 | MALE |
| 19 | Dream | 37.2 | 18.1 | 178.0 | 3900.0 | MALE |
| 21 | Biscoe | 40.5 | 17.9 | 187.0 | 3200.0 | FEMALE |
| 23 | Dream | 42.2 | 18.5 | 180.0 | 3550.0 | FEMALE |
| 30 | Dream | 39.2 | 21.1 | 196.0 | 4150.0 | MALE |
| 32 | Torgersen | 42.9 | 17.6 | 196.0 | 4700.0 | MALE |
| 38 | Dream | 41.1 | 17.5 | 190.0 | 3900.0 | MALE |
| 40 | Torgersen | 38.6 | 21.2 | 191.0 | 3800.0 | MALE |
| 42 | Biscoe | 35.5 | 16.2 | 195.0 | 3350.0 | FEMALE |
| 44 | Dream | 39.2 | 18.6 | 190.0 | 4250.0 | MALE |
| 45 | Torgersen | 35.2 | 15.9 | 186.0 | 3050.0 | FEMALE |
| 46 | Dream | 43.2 | 18.5 | 192.0 | 4100.0 | MALE |
| 49 | Biscoe | 39.6 | 17.7 | 186.0 | 3500.0 | FEMALE |
| 53 | Biscoe | 45.6 | 20.3 | 191.0 | 4600.0 | MALE |
| 58 | Torgersen | 40.9 | 16.8 | 191.0 | 3700.0 | FEMALE |
| 60 | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 62 | Dream | 36.0 | 18.5 | 186.0 | 3100.0 | FEMALE |
| 63 | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | MALE |
25 rows × 6 columns
3. Use model_selection.train_test_split to prepare training data#
from bigframes.ml.model_selection import train_test_split
feature_columns = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
label_columns = training_data[['body_mass_g']]
X_train, X_test, y_train, y_test = train_test_split(
feature_columns, label_columns, test_size=0.2)
Query job 0808457b-a0df-4a37-b7a5-8885f4a4588c is DONE. 28.9 kB processed. Open Job
4. Configure a linear regression pipeline with preprocessing#
from bigframes.ml.linear_model import LinearRegression
from bigframes.ml.pipeline import Pipeline
from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import StandardScaler, OneHotEncoder
preprocessing = ColumnTransformer([
("onehot", OneHotEncoder(), ["island", "sex"]),
("scaler", StandardScaler(), ["culmen_depth_mm", "culmen_length_mm", "flipper_length_mm"]),
])
model = LinearRegression(fit_intercept=False)
pipeline = Pipeline([
('preproc', preprocessing),
('linreg', model)
])
# TODO(bmil): pretty printing for pipelines
pipeline
Pipeline(steps=[('preproc',
ColumnTransformer(transformers=[('onehot', OneHotEncoder(),
['island', 'species', 'sex']),
('scaler', StandardScaler(),
['culmen_depth_mm',
'culmen_length_mm',
'flipper_length_mm'])])),
('linreg', LinearRegression(fit_intercept=False))])
5. Fit the pipeline to the training data#
This will create a temporary BQML model in BigQuery
pipeline.fit(X_train, y_train)
Query job e9bfa6a5-a53f-4d8b-ae8c-cc8cd55d0947 is DONE. 28.9 kB processed. Open Job
Query job d8d553cf-3d36-49aa-b18b-9a05576a1fb0 is DONE. 28.9 kB processed. Open Job
Query job 75ef0083-9a4f-4ffb-a6c6-d82974a1659f is DONE. 0 Bytes processed. Open Job
Pipeline(steps=[('preproc',
ColumnTransformer(transformers=[('onehot', OneHotEncoder(),
['island', 'species', 'sex']),
('scaler', StandardScaler(),
['culmen_depth_mm',
'culmen_length_mm',
'flipper_length_mm'])])),
('linreg', LinearRegression(fit_intercept=False))])
6. Score the pipeline on the test data with metrics.r2_score#
from bigframes.ml.metrics import r2_score
y_pred = pipeline.predict(X_test)["predicted_body_mass_g"]
r2_score(y_test, y_pred)
Query job 55c5a9ce-8159-4a1a-99a4-af3a906640ba is DONE. 29.3 kB processed. Open Job
Query job 3e41c470-de70-4f13-89d9-c5564d0b2836 is DONE. 232 Bytes processed. Open Job
Query job ed2f9042-a737-4d13-bd21-8c3d29cd61a2 is DONE. 28.9 kB processed. Open Job
Query job 815d16b5-0a5d-42be-a766-1cff5b8f22f2 is DONE. 28.9 kB processed. Open Job
Query job 37a38dc6-5073-4544-a1e3-da145a843922 is DONE. 29.4 kB processed. Open Job
0.2655729213572775
5. Inference the model on new data#
import pandas
new_penguins = bigframes.pandas.read_pandas(
pandas.DataFrame(
{
"tag_number": [1633, 1672, 1690],
"species": [
"Adelie Penguin (Pygoscelis adeliae)",
"Adelie Penguin (Pygoscelis adeliae)",
"Adelie Penguin (Pygoscelis adeliae)",
],
"island": ["Torgersen", "Torgersen", "Dream"],
"culmen_length_mm": [39.5, 38.5, 37.9],
"culmen_depth_mm": [18.8, 17.2, 18.1],
"flipper_length_mm": [196.0, 181.0, 188.0],
"sex": ["MALE", "FEMALE", "FEMALE"],
}
).set_index("tag_number")
)
Load job 7b46750c-70b4-468d-87ba-9f84f579f2a6 is DONE. Open Job
pipeline.predict(new_penguins)
Query job d10dd37d-5e8e-4e15-9c83-a7e9a4c592a8 is DONE. 593 Bytes processed. Open Job
Query job 207cb787-cf8a-43ea-8e73-644d3f58b11a is DONE. 24 Bytes processed. Open Job
Query job c5dc5075-cac0-4947-9e9f-06aa9cc5bd2a is DONE. 0 Bytes processed. Open Job
Query job 2ca4a569-7186-48ed-b3e4-004dca704798 is DONE. 282 Bytes processed. Open Job
| predicted_body_mass_g | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | sex | |
|---|---|---|---|---|---|---|---|
| tag_number | |||||||
| 1633 | 4017.203152 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 39.5 | 18.8 | 196.0 | MALE |
| 1672 | 3127.601519 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 38.5 | 17.2 | 181.0 | FEMALE |
| 1690 | 3386.101231 | Adelie Penguin (Pygoscelis adeliae) | Dream | 37.9 | 18.1 | 188.0 | FEMALE |
3 rows × 7 columns
6. Save in BigQuery#
pipeline.to_gbq("bigframes-dev.bigframes_demo_us.penguin_model", replace=True)
Copy job d1def4a4-1da1-43a9-8ae5-4459444d993d is DONE. Open Job
Pipeline(steps=[('transform',
ColumnTransformer(transformers=[('ont_hot_encoder',
OneHotEncoder(max_categories=1000001,
min_frequency=0),
'island'),
('standard_scaler',
StandardScaler(),
'culmen_length_mm'),
('standard_scaler',
StandardScaler(),
'culmen_depth_mm'),
('standard_scaler',
StandardScaler(),
'flipper_length_mm'),
('ont_hot_encoder',
OneHotEncoder(max_categories=1000001,
min_frequency=0),
'sex')])),
('estimator',
LinearRegression(fit_intercept=False,
optimize_strategy='NORMAL_EQUATION'))])