{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "id": "ur8xi4C7S06n" }, "outputs": [], "source": [ "# Copyright 2023 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "JAPoU8Sm5E6e" }, "source": [ "# Train a linear regression model with BigQuery DataFrames ML", "\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "
\n", " \n", " \"Colab Run in Colab\n", " \n", " \n", " \n", " \"GitHub\n", " View on GitHub\n", " \n", " \n", " \n", " \"Vertex\n", " Open in Vertex AI Workbench\n", " \n", " \n", " \n", " \"BQ\n", " Open in BQ Studio\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "24743cf4a1e1" }, "source": [ "**_NOTE_**: This notebook has been tested in the following environment:\n", "\n", "* Python version = 3.10" ] }, { "cell_type": "markdown", "metadata": { "id": "tvgnzT1CKxrO" }, "source": [ "## Overview\n", "\n", "Use this notebook to learn how to train a linear regression model using BigQuery ML and the `bigframes.bigquery` module.\n", "\n", "This example is adapted from the [BQML linear regression tutorial](https://cloud.google.com/bigquery-ml/docs/linear-regression-tutorial).\n", "\n", "Learn more about [BigQuery DataFrames](https://dataframes.bigquery.dev/)." ] }, { "cell_type": "markdown", "metadata": { "id": "d975e698c9a4" }, "source": [ "### Objective\n", "\n", "In this tutorial, you use BigQuery DataFrames to create a linear regression model that predicts the weight of an Adelie penguin based on the penguin's island of residence, culmen length and depth, flipper length, and sex.\n", "\n", "The steps include:\n", "\n", "- Creating a DataFrame from a BigQuery table.\n", "- Cleaning and preparing data using pandas.\n", "- Creating a linear regression model using `bigframes.ml`.\n", "- Saving the ML model to BigQuery for future use." ] }, { "cell_type": "markdown", "metadata": { "id": "08d289fa873f" }, "source": [ "### Dataset\n", "\n", "This tutorial uses the [```penguins``` table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=ml_datasets&t=penguins) (a BigQuery Public Dataset) which includes data on a set of penguins including species, island of residence, weight, culmen length and depth, flipper length, and sex." ] }, { "cell_type": "markdown", "metadata": { "id": "aed92deeb4a0" }, "source": [ "### Costs\n", "\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "* BigQuery (compute)\n", "* BigQuery ML\n", "\n", "Learn about [BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models)\n", "and [BigQuery ML pricing](https://cloud.google.com/bigquery/pricing#bqml),\n", "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n", "to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": { "id": "i7EUnXsZhAGF" }, "source": [ "## Installation\n", "\n", "If you don't have [bigframes](https://pypi.org/project/bigframes/) package already installed, uncomment and execute the following cells to\n", "\n", "1. Install the package\n", "1. Restart the notebook kernel (Jupyter or Colab) to work with the package" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "9O0Ka4W2MNF3" }, "outputs": [], "source": [ "# !pip install bigframes" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "f200f10a1da3" }, "outputs": [], "source": [ "# Automatically restart kernel after installs so that your environment can access the new packages\n", "# import IPython\n", "\n", "# app = IPython.Application.instance()\n", "# app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": { "id": "BF1j6f9HApxa" }, "source": [ "## Before you begin\n", "\n", "Complete the tasks in this section to set up your environment." ] }, { "cell_type": "markdown", "metadata": { "id": "oDfTjfACBvJk" }, "source": [ "### Set up your Google Cloud project\n", "\n", "**The following steps are required, regardless of your notebook environment.**\n", "\n", "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 credit towards your compute/storage costs.\n", "\n", "2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n", "\n", "3. [Enable the BigQuery API](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com).\n", "\n", "4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk)." ] }, { "cell_type": "markdown", "metadata": { "id": "WReHDGG5g0XY" }, "source": [ "#### Set your project ID\n", "\n", "If you don't know your project ID, try the following:\n", "* Run `gcloud config list`.\n", "* Run `gcloud projects list`.\n", "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "oM1iC_MfAts1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Updated property [core/project].\n" ] } ], "source": [ "PROJECT_ID = \"\" # @param {type:\"string\"}\n", "\n", "# Set the project id\n", "! gcloud config set project {PROJECT_ID}" ] }, { "cell_type": "markdown", "metadata": { "id": "region" }, "source": [ "#### Set the region\n", "\n", "You can also change the `REGION` variable used by BigQuery. Learn more about [BigQuery regions](https://cloud.google.com/bigquery/docs/locations#supported_locations)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "eF-Twtc4XGem" }, "outputs": [], "source": [ "REGION = \"US\" # @param {type: \"string\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "sBCra4QMA2wR" }, "source": [ "### Authenticate your Google Cloud account\n", "\n", "Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below." ] }, { "cell_type": "markdown", "metadata": { "id": "74ccc9e52986" }, "source": [ "**Vertex AI Workbench**\n", "\n", "Do nothing, you are already authenticated." ] }, { "cell_type": "markdown", "metadata": { "id": "de775a3773ba" }, "source": [ "**Local JupyterLab instance**\n", "\n", "Uncomment and run the following cell:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "254614fa0c46" }, "outputs": [], "source": [ "# ! gcloud auth login" ] }, { "cell_type": "markdown", "metadata": { "id": "ef21552ccea8" }, "source": [ "**Colab**\n", "\n", "Uncomment and run the following cell:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "603adbbf0532" }, "outputs": [], "source": [ "# from google.colab import auth\n", "# auth.authenticate_user()" ] }, { "cell_type": "markdown", "metadata": { "id": "960505627ddf" }, "source": [ "### Import libraries" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "PyQmSRbKA8r-" }, "outputs": [], "source": [ "import bigframes.pandas as bpd" ] }, { "cell_type": "markdown", "metadata": { "id": "init_aip:mbsdk,all" }, "source": [ "### Set BigQuery DataFrames options" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "NPPMuw2PXGeo" }, "outputs": [], "source": [ "# Note: The project option is not required in all environments.\n", "# On BigQuery Studio, the project ID is automatically detected.\n", "bpd.options.bigquery.project = PROJECT_ID\n", "\n", "# Note: The location option is not required.\n", "# It defaults to the location of the first table or query\n", "# passed to read_gbq(). For APIs where a location can't be\n", "# auto-detected, the location defaults to the \"US\" location.\n", "bpd.options.bigquery.location = REGION\n", "\n", "# Recommended for performance. Disables pandas default ordering of all rows.\n", "bpd.options.bigquery.ordering_mode = \"partial\"" ] }, { "cell_type": "markdown", "metadata": { "id": "D21CoOlfFTYI" }, "source": [ "If you want to reset the location of the created DataFrame or Series objects, reset the session by executing `bpd.close_session()`. After that, you can reuse `bpd.options.bigquery.location` to specify another location." ] }, { "cell_type": "markdown", "metadata": { "id": "9EMAqR37AfLS" }, "source": [ "## Read a BigQuery table into a BigQuery DataFrames DataFrame\n", "\n", "Read the [```penguins``` table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=ml_datasets&t=penguins) into a BigQuery DataFrames DataFrame:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "EDAaIwHpQCDZ" }, "outputs": [], "source": [ "df = bpd.read_gbq(\"bigquery-public-data.ml_datasets.penguins\")" ] }, { "cell_type": "markdown", "metadata": { "id": "DJu837YEXD7B" }, "source": [ "Take a look at the DataFrame:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "_gPD0Zn1Stdb" }, "outputs": [ { "data": { "text/html": [ "✅ Completed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.microsoft.datawrangler.viewer.v0+json": { "columns": [ { "name": "index", "rawType": "int64", "type": "integer" }, { "name": "species", "rawType": "string", "type": "string" }, { "name": "island", "rawType": "string", "type": "string" }, { "name": "culmen_length_mm", "rawType": "Float64", "type": "float" }, { "name": "culmen_depth_mm", "rawType": "Float64", "type": "float" }, { "name": "flipper_length_mm", "rawType": "Float64", "type": "float" }, { "name": "body_mass_g", "rawType": "Float64", "type": "float" }, { "name": "sex", "rawType": "string", "type": "string" } ], "ref": "a652ba52-0445-4228-a2d5-baf837933515", "rows": [ [ "0", "Adelie Penguin (Pygoscelis adeliae)", "Dream", "36.6", "18.4", "184.0", "3475.0", "FEMALE" ], [ "1", "Adelie Penguin (Pygoscelis adeliae)", "Dream", "39.8", "19.1", "184.0", "4650.0", "MALE" ], [ "2", "Adelie Penguin (Pygoscelis adeliae)", "Dream", "40.9", "18.9", "184.0", "3900.0", "MALE" ], [ "3", "Chinstrap penguin (Pygoscelis antarctica)", "Dream", "46.5", "17.9", "192.0", "3500.0", "FEMALE" ], [ "4", "Adelie Penguin (Pygoscelis adeliae)", "Dream", "37.3", "16.8", "192.0", "3000.0", "FEMALE" ] ], "shape": { "columns": 7, "rows": 5 } }, "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesislandculmen_length_mmculmen_depth_mmflipper_length_mmbody_mass_gsex
0Adelie Penguin (Pygoscelis adeliae)Dream36.618.4184.03475.0FEMALE
1Adelie Penguin (Pygoscelis adeliae)Dream39.819.1184.04650.0MALE
2Adelie Penguin (Pygoscelis adeliae)Dream40.918.9184.03900.0MALE
3Chinstrap penguin (Pygoscelis antarctica)Dream46.517.9192.03500.0FEMALE
4Adelie Penguin (Pygoscelis adeliae)Dream37.316.8192.03000.0FEMALE
\n", "
" ], "text/plain": [ " species island culmen_length_mm \\\n", "0 Adelie Penguin (Pygoscelis adeliae) Dream 36.6 \n", "1 Adelie Penguin (Pygoscelis adeliae) Dream 39.8 \n", "2 Adelie Penguin (Pygoscelis adeliae) Dream 40.9 \n", "3 Chinstrap penguin (Pygoscelis antarctica) Dream 46.5 \n", "4 Adelie Penguin (Pygoscelis adeliae) Dream 37.3 \n", "\n", " culmen_depth_mm flipper_length_mm body_mass_g sex \n", "0 18.4 184.0 3475.0 FEMALE \n", "1 19.1 184.0 4650.0 MALE \n", "2 18.9 184.0 3900.0 MALE \n", "3 17.9 192.0 3500.0 FEMALE \n", "4 16.8 192.0 3000.0 FEMALE " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.peek()" ] }, { "cell_type": "markdown", "metadata": { "id": "rwPLjqW2Ajzh" }, "source": [ "## Clean and prepare data\n", "\n", "You can use pandas as you normally would on the BigQuery DataFrames DataFrame, but calculations happen in the BigQuery query engine instead of your local environment.\n", "\n", "Because this model will focus on the Adelie Penguin species, you need to filter the data for only those rows representing Adelie penguins. Then you drop the `species` column because it is no longer needed.\n", "\n", "As these functions are applied, only the new DataFrame object `adelie_data` is modified. The source table and the original DataFrame object `df` don't change." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "6i6HkFJZa8na" }, "outputs": [ { "data": { "text/html": [ "✅ Completed. \n", " Query processed 28.9 kB in 12 seconds of slot time. [Job bigframes-dev:US.bb256e8c-f2c7-4eff-b5f3-fcc6836110cf details]\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "✅ Completed. \n", " Query processed 8.4 kB in a moment of slot time.\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "✅ Completed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
islandculmen_length_mmculmen_depth_mmflipper_length_mmbody_mass_gsex
0Dream36.618.4184.03475.0FEMALE
1Dream39.819.1184.04650.0MALE
2Dream40.918.9184.03900.0MALE
3Dream37.316.8192.03000.0FEMALE
4Dream43.218.5192.04100.0MALE
5Dream40.220.1200.03975.0MALE
6Dream40.818.9208.04300.0MALE
7Dream39.018.7185.03650.0MALE
8Dream37.016.9185.03000.0FEMALE
9Dream34.017.1185.03400.0FEMALE
\n", "

10 rows × 6 columns

\n", "
[152 rows x 6 columns in total]" ], "text/plain": [ "island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g \\\n", " Dream 36.6 18.4 184.0 3475.0 \n", " Dream 39.8 19.1 184.0 4650.0 \n", " Dream 40.9 18.9 184.0 3900.0 \n", " Dream 37.3 16.8 192.0 3000.0 \n", " Dream 43.2 18.5 192.0 4100.0 \n", " Dream 40.2 20.1 200.0 3975.0 \n", " Dream 40.8 18.9 208.0 4300.0 \n", " Dream 39.0 18.7 185.0 3650.0 \n", " Dream 37.0 16.9 185.0 3000.0 \n", " Dream 34.0 17.1 185.0 3400.0 \n", "\n", " sex \n", "FEMALE \n", " MALE \n", " MALE \n", "FEMALE \n", " MALE \n", " MALE \n", " MALE \n", " MALE \n", "FEMALE \n", "FEMALE \n", "...\n", "\n", "[152 rows x 6 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Filter down to the data to the Adelie Penguin species\n", "adelie_data = df[df.species == \"Adelie Penguin (Pygoscelis adeliae)\"]\n", "\n", "# Drop the species column\n", "adelie_data = adelie_data.drop(columns=[\"species\"])\n", "\n", "# Take a look at the filtered DataFrame\n", "adelie_data" ] }, { "cell_type": "markdown", "metadata": { "id": "jhK2OlyMbY4L" }, "source": [ "Drop rows with `NULL` values in order to create a BigQuery DataFrames DataFrame for the training data:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "0am3hdlXZfxZ" }, "outputs": [ { "data": { "text/html": [ "Starting." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "✅ Completed. \n", " Query processed 8.1 kB in a moment of slot time.\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "✅ Completed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
islandculmen_length_mmculmen_depth_mmflipper_length_mmbody_mass_gsex
0Dream36.618.4184.03475.0FEMALE
1Dream39.819.1184.04650.0MALE
2Dream40.918.9184.03900.0MALE
3Dream37.316.8192.03000.0FEMALE
4Dream43.218.5192.04100.0MALE
5Dream40.220.1200.03975.0MALE
6Dream40.818.9208.04300.0MALE
7Dream39.018.7185.03650.0MALE
8Dream37.016.9185.03000.0FEMALE
9Dream34.017.1185.03400.0FEMALE
\n", "

10 rows × 6 columns

\n", "
[146 rows x 6 columns in total]" ], "text/plain": [ "island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g \\\n", " Dream 36.6 18.4 184.0 3475.0 \n", " Dream 39.8 19.1 184.0 4650.0 \n", " Dream 40.9 18.9 184.0 3900.0 \n", " Dream 37.3 16.8 192.0 3000.0 \n", " Dream 43.2 18.5 192.0 4100.0 \n", " Dream 40.2 20.1 200.0 3975.0 \n", " Dream 40.8 18.9 208.0 4300.0 \n", " Dream 39.0 18.7 185.0 3650.0 \n", " Dream 37.0 16.9 185.0 3000.0 \n", " Dream 34.0 17.1 185.0 3400.0 \n", "\n", " sex \n", "FEMALE \n", " MALE \n", " MALE \n", "FEMALE \n", " MALE \n", " MALE \n", " MALE \n", " MALE \n", "FEMALE \n", "FEMALE \n", "...\n", "\n", "[146 rows x 6 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Drop rows with nulls to get training data\n", "training_data = adelie_data.dropna()\n", "\n", "# Take a peek at the training data\n", "training_data" ] }, { "cell_type": "markdown", "metadata": { "id": "Fx4lsNqMorJ-" }, "source": [ "## Create the linear regression model\n", "\n", "In this notebook, you create a linear regression model, a type of regression model that generates a continuous value from a linear combination of input features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a BigQuery dataset to house the model, adding a name for your dataset as the `DATASET_ID` variable:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset bqml_tutorial created.\n" ] } ], "source": [ "DATASET_ID = \"bqml_tutorial\" # @param {type:\"string\"}\n", "\n", "from google.cloud import bigquery\n", "client = bigquery.Client(project=PROJECT_ID)\n", "dataset = bigquery.Dataset(PROJECT_ID + \".\" + DATASET_ID)\n", "dataset.location = REGION\n", "dataset = client.create_dataset(dataset, exists_ok=True)\n", "print(f\"Dataset {dataset.dataset_id} created.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "EloGtMnverFF" }, "source": [ "### Create the model using `bigframes.bigquery.ml.create_model`\n", "\n", "When you pass the feature columns without transforms, BigQuery ML uses\n", "[automatic preprocessing](https://cloud.google.com/bigquery/docs/auto-preprocessing) to encode string values and scale numeric values.\n", "\n", "BigQuery ML also [automatically splits the data for training and evaluation](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-glm#data_split_method), although for datasets with less than 500 rows (such as this one), all rows are used for training." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "id": "GskyyUQPowBT" }, "outputs": [ { "data": { "text/html": [ "\n", " Query started with request ID bigframes-dev:US.a33b3628-730b-46e8-ad17-c78bb48619ce.
SQL
CREATE OR REPLACE MODEL `bigframes-dev.bqml_tutorial.penguin_weight`\n",
       "OPTIONS(model_type = 'LINEAR_REG')\n",
       "AS SELECT\n",
       "`bfuid_col_3` AS `island`,\n",
       "`bfuid_col_4` AS `culmen_length_mm`,\n",
       "`bfuid_col_5` AS `culmen_depth_mm`,\n",
       "`bfuid_col_6` AS `flipper_length_mm`,\n",
       "`bfuid_col_7` AS `label`,\n",
       "`bfuid_col_8` AS `sex`\n",
       "FROM\n",
       "(SELECT\n",
       "  `t0`.`bfuid_col_3`,\n",
       "  `t0`.`bfuid_col_4`,\n",
       "  `t0`.`bfuid_col_5`,\n",
       "  `t0`.`bfuid_col_6`,\n",
       "  `t0`.`bfuid_col_7`,\n",
       "  `t0`.`bfuid_col_8`\n",
       "FROM `bigframes-dev._63cfa399614a54153cc386c27d6c0c6fdb249f9e._e154f0aa_5b29_492a_b464_a77c5f5a3dbd_bqdf_60fa3196-5a3e-45ae-898e-c2b473bfa1e9` AS `t0`)\n",
       "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.microsoft.datawrangler.viewer.v0+json": { "columns": [ { "name": "index", "rawType": "object", "type": "string" }, { "name": "0", "rawType": "object", "type": "unknown" } ], "ref": "851c170c-08a5-4c06-8c0b-4547dbde3f18", "rows": [ [ "etag", "P3XS+g0ZZM19ywL+hdwUmQ==" ], [ "modelReference", "{'projectId': 'bigframes-dev', 'datasetId': 'bqml_tutorial', 'modelId': 'penguin_weight'}" ], [ "creationTime", "1764779445166" ], [ "lastModifiedTime", "1764779445237" ], [ "modelType", "LINEAR_REGRESSION" ], [ "trainingRuns", "[{'trainingOptions': {'lossType': 'MEAN_SQUARED_LOSS', 'l2Regularization': 0, 'inputLabelColumns': ['label'], 'dataSplitMethod': 'AUTO_SPLIT', 'optimizationStrategy': 'NORMAL_EQUATION', 'calculatePValues': False, 'enableGlobalExplain': False, 'categoryEncodingMethod': 'ONE_HOT_ENCODING', 'fitIntercept': True, 'standardizeFeatures': True}, 'trainingStartTime': '1764779429690', 'results': [{'index': 0, 'durationMs': '3104', 'trainingLoss': 78553.60163372214}], 'evaluationMetrics': {'regressionMetrics': {'meanAbsoluteError': 223.87876300779865, 'meanSquaredError': 78553.60163372215, 'meanSquaredLogError': 0.005614202871872688, 'medianAbsoluteError': 181.33091105963013, 'rSquared': 0.6239507555914934}}, 'startTime': '2025-12-03T16:30:29.690Z'}]" ], [ "featureColumns", "[{'name': 'island', 'type': {'typeKind': 'STRING'}}, {'name': 'culmen_length_mm', 'type': {'typeKind': 'FLOAT64'}}, {'name': 'culmen_depth_mm', 'type': {'typeKind': 'FLOAT64'}}, {'name': 'flipper_length_mm', 'type': {'typeKind': 'FLOAT64'}}, {'name': 'sex', 'type': {'typeKind': 'STRING'}}]" ], [ "labelColumns", "[{'name': 'predicted_label', 'type': {'typeKind': 'FLOAT64'}}]" ], [ "location", "US" ] ], "shape": { "columns": 1, "rows": 9 } }, "text/plain": [ "etag P3XS+g0ZZM19ywL+hdwUmQ==\n", "modelReference {'projectId': 'bigframes-dev', 'datasetId': 'b...\n", "creationTime 1764779445166\n", "lastModifiedTime 1764779445237\n", "modelType LINEAR_REGRESSION\n", "trainingRuns [{'trainingOptions': {'lossType': 'MEAN_SQUARE...\n", "featureColumns [{'name': 'island', 'type': {'typeKind': 'STRI...\n", "labelColumns [{'name': 'predicted_label', 'type': {'typeKin...\n", "location US\n", "dtype: object" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import bigframes.bigquery as bbq\n", "\n", "model_name = f\"{PROJECT_ID}.{DATASET_ID}.penguin_weight\"\n", "model_metadata = bbq.ml.create_model(\n", " model_name,\n", " replace=True,\n", " options={\n", " \"model_type\": \"LINEAR_REG\",\n", " },\n", " training_data=training_data.rename(columns={\"body_mass_g\": \"label\"})\n", ")\n", "model_metadata" ] }, { "cell_type": "markdown", "metadata": { "id": "GskyyUQPowBT" }, "source": [ "### Evaluate the model\n", "\n", "Check how the model performed by using the `evalutate` function. More information on model evaluation can be found [here](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate#mlevaluate_output)." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "id": "kGBJKafpo0dl" }, "outputs": [ { "data": { "text/html": [ "✅ Completed. \n", " Query processed 0 Bytes in a moment of slot time.\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "✅ Completed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "✅ Completed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_absolute_errormean_squared_errormean_squared_log_errormedian_absolute_errorr2_scoreexplained_variance
0223.87876378553.6016340.005614181.3309110.6239510.623951
\n", "

1 rows × 6 columns

\n", "
[1 rows x 6 columns in total]" ], "text/plain": [ " mean_absolute_error mean_squared_error mean_squared_log_error \\\n", "0 223.878763 78553.601634 0.005614 \n", "\n", " median_absolute_error r2_score explained_variance \n", "0 181.330911 0.623951 0.623951 \n", "\n", "[1 rows x 6 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bbq.ml.evaluate(model_name)" ] }, { "cell_type": "markdown", "metadata": { "id": "P2lUiZZ_cjri" }, "source": [ "### Use the model to predict outcomes\n", "\n", "Now that you have evaluated your model, the next step is to use it to predict an\n", "outcome. You can run `bigframes.bigquery.ml.predict` function on the model to\n", "predict the body mass in grams of all penguins that reside on the Biscoe\n", "Islands." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "id": "bsQ9cmoWo0Ps" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes/bigframes/core/log_adapter.py:182: TimeTravelCacheWarning: Reading cached table from 2025-12-03 16:30:18.272882+00:00 to avoid\n", "incompatibilies with previous reads of this table. To read the latest\n", "version, set `use_cache=False` or close the current session with\n", "Session.close() or bigframes.pandas.close_session().\n", " return method(*args, **kwargs)\n" ] }, { "data": { "text/html": [ "✅ Completed. \n", " Query processed 29.3 kB in a moment of slot time.\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "✅ Completed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "✅ Completed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
predicted_labelspeciesislandculmen_length_mmculmen_depth_mmflipper_length_mmbody_mass_gsex
03945.010052Gentoo penguin (Pygoscelis papua)Biscoe<NA><NA><NA><NA><NA>
13914.916297Adelie Penguin (Pygoscelis adeliae)Biscoe39.718.9184.03550.0MALE
23278.611224Adelie Penguin (Pygoscelis adeliae)Biscoe36.417.1184.02850.0FEMALE
34006.367355Adelie Penguin (Pygoscelis adeliae)Biscoe41.618.0192.03950.0MALE
43417.610478Adelie Penguin (Pygoscelis adeliae)Biscoe35.017.9192.03725.0FEMALE
54009.612421Adelie Penguin (Pygoscelis adeliae)Biscoe41.118.2192.04050.0MALE
64231.330911Adelie Penguin (Pygoscelis adeliae)Biscoe42.019.5200.04050.0MALE
73554.308906Gentoo penguin (Pygoscelis papua)Biscoe43.813.9208.04300.0FEMALE
83550.677455Gentoo penguin (Pygoscelis papua)Biscoe43.314.0208.04575.0FEMALE
93537.882543Gentoo penguin (Pygoscelis papua)Biscoe44.013.6208.04350.0FEMALE
\n", "

10 rows × 8 columns

\n", "
[168 rows x 8 columns in total]" ], "text/plain": [ " predicted_label species island \\\n", "0 3945.010052 Gentoo penguin (Pygoscelis papua) Biscoe \n", "1 3914.916297 Adelie Penguin (Pygoscelis adeliae) Biscoe \n", "2 3278.611224 Adelie Penguin (Pygoscelis adeliae) Biscoe \n", "3 4006.367355 Adelie Penguin (Pygoscelis adeliae) Biscoe \n", "4 3417.610478 Adelie Penguin (Pygoscelis adeliae) Biscoe \n", "5 4009.612421 Adelie Penguin (Pygoscelis adeliae) Biscoe \n", "6 4231.330911 Adelie Penguin (Pygoscelis adeliae) Biscoe \n", "7 3554.308906 Gentoo penguin (Pygoscelis papua) Biscoe \n", "8 3550.677455 Gentoo penguin (Pygoscelis papua) Biscoe \n", "9 3537.882543 Gentoo penguin (Pygoscelis papua) Biscoe \n", "\n", " culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex \n", "0 \n", "1 39.7 18.9 184.0 3550.0 MALE \n", "2 36.4 17.1 184.0 2850.0 FEMALE \n", "3 41.6 18.0 192.0 3950.0 MALE \n", "4 35.0 17.9 192.0 3725.0 FEMALE \n", "5 41.1 18.2 192.0 4050.0 MALE \n", "6 42.0 19.5 200.0 4050.0 MALE \n", "7 43.8 13.9 208.0 4300.0 FEMALE \n", "8 43.3 14.0 208.0 4575.0 FEMALE \n", "9 44.0 13.6 208.0 4350.0 FEMALE \n", "...\n", "\n", "[168 rows x 8 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = bpd.read_gbq(\"bigquery-public-data.ml_datasets.penguins\")\n", "biscoe = df[df[\"island\"].str.contains(\"Biscoe\")]\n", "bbq.ml.predict(model_name, biscoe)" ] }, { "cell_type": "markdown", "metadata": { "id": "GTRdUw-Ro5R1" }, "source": [ "### Explain the prediction results\n", "\n", "To understand why the model is generating these prediction results, you can use the `explain_predict` function." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " Query started with request ID bigframes-dev:US.161bba69-c852-4916-a2df-bb5b309be6e4.
SQL
SELECT * FROM ML.EXPLAIN_PREDICT(MODEL `bigframes-dev.bqml_tutorial.penguin_weight`, (SELECT\n",
       "`bfuid_col_22` AS `species`,\n",
       "`bfuid_col_23` AS `island`,\n",
       "`bfuid_col_24` AS `culmen_length_mm`,\n",
       "`bfuid_col_25` AS `culmen_depth_mm`,\n",
       "`bfuid_col_26` AS `flipper_length_mm`,\n",
       "`bfuid_col_27` AS `body_mass_g`,\n",
       "`bfuid_col_28` AS `sex`\n",
       "FROM\n",
       "(SELECT\n",
       "  `t0`.`species`,\n",
       "  `t0`.`island`,\n",
       "  `t0`.`culmen_length_mm`,\n",
       "  `t0`.`culmen_depth_mm`,\n",
       "  `t0`.`flipper_length_mm`,\n",
       "  `t0`.`body_mass_g`,\n",
       "  `t0`.`sex`,\n",
       "  `t0`.`species` AS `bfuid_col_22`,\n",
       "  `t0`.`island` AS `bfuid_col_23`,\n",
       "  `t0`.`culmen_length_mm` AS `bfuid_col_24`,\n",
       "  `t0`.`culmen_depth_mm` AS `bfuid_col_25`,\n",
       "  `t0`.`flipper_length_mm` AS `bfuid_col_26`,\n",
       "  `t0`.`body_mass_g` AS `bfuid_col_27`,\n",
       "  `t0`.`sex` AS `bfuid_col_28`,\n",
       "  regexp_contains(`t0`.`island`, 'Biscoe') AS `bfuid_col_29`\n",
       "FROM (\n",
       "  SELECT\n",
       "    `species`,\n",
       "    `island`,\n",
       "    `culmen_length_mm`,\n",
       "    `culmen_depth_mm`,\n",
       "    `flipper_length_mm`,\n",
       "    `body_mass_g`,\n",
       "    `sex`\n",
       "  FROM `bigquery-public-data.ml_datasets.penguins` FOR SYSTEM_TIME AS OF TIMESTAMP('2025-12-03T16:30:18.272882+00:00')\n",
       ") AS `t0`\n",
       "WHERE\n",
       "  regexp_contains(`t0`.`island`, 'Biscoe'))), STRUCT(3 AS top_k_features))\n",
       "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "✅ Completed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "✅ Completed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
predicted_labeltop_feature_attributionsbaseline_prediction_valueprediction_valueapproximation_errorspeciesislandculmen_length_mmculmen_depth_mmflipper_length_mmbody_mass_gsex
03945.010052[{'feature': 'island', 'attribution': 0.0}\n", " {'...3945.0100523945.0100520.0Gentoo penguin (Pygoscelis papua)Biscoe<NA><NA><NA><NA><NA>
13914.916297[{'feature': 'flipper_length_mm', 'attribution...3945.0100523914.9162970.0Adelie Penguin (Pygoscelis adeliae)Biscoe39.718.9184.03550.0MALE
23278.611224[{'feature': 'sex', 'attribution': -443.175184...3945.0100523278.6112240.0Adelie Penguin (Pygoscelis adeliae)Biscoe36.417.1184.02850.0FEMALE
34006.367355[{'feature': 'culmen_length_mm', 'attribution'...3945.0100524006.3673550.0Adelie Penguin (Pygoscelis adeliae)Biscoe41.618.0192.03950.0MALE
43417.610478[{'feature': 'sex', 'attribution': -443.175184...3945.0100523417.6104780.0Adelie Penguin (Pygoscelis adeliae)Biscoe35.017.9192.03725.0FEMALE
54009.612421[{'feature': 'culmen_length_mm', 'attribution'...3945.0100524009.6124210.0Adelie Penguin (Pygoscelis adeliae)Biscoe41.118.2192.04050.0MALE
64231.330911[{'feature': 'flipper_length_mm', 'attribution...3945.0100524231.3309110.0Adelie Penguin (Pygoscelis adeliae)Biscoe42.019.5200.04050.0MALE
73554.308906[{'feature': 'sex', 'attribution': -443.175184...3945.0100523554.3089060.0Gentoo penguin (Pygoscelis papua)Biscoe43.813.9208.04300.0FEMALE
83550.677455[{'feature': 'sex', 'attribution': -443.175184...3945.0100523550.6774550.0Gentoo penguin (Pygoscelis papua)Biscoe43.314.0208.04575.0FEMALE
93537.882543[{'feature': 'sex', 'attribution': -443.175184...3945.0100523537.8825430.0Gentoo penguin (Pygoscelis papua)Biscoe44.013.6208.04350.0FEMALE
\n", "

10 rows × 12 columns

\n", "
[168 rows x 12 columns in total]" ], "text/plain": [ " predicted_label top_feature_attributions \\\n", "0 3945.010052 [{'feature': 'island', 'attribution': 0.0}\n", " {'... \n", "1 3914.916297 [{'feature': 'flipper_length_mm', 'attribution... \n", "2 3278.611224 [{'feature': 'sex', 'attribution': -443.175184... \n", "3 4006.367355 [{'feature': 'culmen_length_mm', 'attribution'... \n", "4 3417.610478 [{'feature': 'sex', 'attribution': -443.175184... \n", "5 4009.612421 [{'feature': 'culmen_length_mm', 'attribution'... \n", "6 4231.330911 [{'feature': 'flipper_length_mm', 'attribution... \n", "7 3554.308906 [{'feature': 'sex', 'attribution': -443.175184... \n", "8 3550.677455 [{'feature': 'sex', 'attribution': -443.175184... \n", "9 3537.882543 [{'feature': 'sex', 'attribution': -443.175184... \n", "\n", " baseline_prediction_value prediction_value approximation_error \\\n", "0 3945.010052 3945.010052 0.0 \n", "1 3945.010052 3914.916297 0.0 \n", "2 3945.010052 3278.611224 0.0 \n", "3 3945.010052 4006.367355 0.0 \n", "4 3945.010052 3417.610478 0.0 \n", "5 3945.010052 4009.612421 0.0 \n", "6 3945.010052 4231.330911 0.0 \n", "7 3945.010052 3554.308906 0.0 \n", "8 3945.010052 3550.677455 0.0 \n", "9 3945.010052 3537.882543 0.0 \n", "\n", " species island culmen_length_mm \\\n", "0 Gentoo penguin (Pygoscelis papua) Biscoe \n", "1 Adelie Penguin (Pygoscelis adeliae) Biscoe 39.7 \n", "2 Adelie Penguin (Pygoscelis adeliae) Biscoe 36.4 \n", "3 Adelie Penguin (Pygoscelis adeliae) Biscoe 41.6 \n", "4 Adelie Penguin (Pygoscelis adeliae) Biscoe 35.0 \n", "5 Adelie Penguin (Pygoscelis adeliae) Biscoe 41.1 \n", "6 Adelie Penguin (Pygoscelis adeliae) Biscoe 42.0 \n", "7 Gentoo penguin (Pygoscelis papua) Biscoe 43.8 \n", "8 Gentoo penguin (Pygoscelis papua) Biscoe 43.3 \n", "9 Gentoo penguin (Pygoscelis papua) Biscoe 44.0 \n", "\n", " culmen_depth_mm flipper_length_mm body_mass_g sex \n", "0 \n", "1 18.9 184.0 3550.0 MALE \n", "2 17.1 184.0 2850.0 FEMALE \n", "3 18.0 192.0 3950.0 MALE \n", "4 17.9 192.0 3725.0 FEMALE \n", "5 18.2 192.0 4050.0 MALE \n", "6 19.5 200.0 4050.0 MALE \n", "7 13.9 208.0 4300.0 FEMALE \n", "8 14.0 208.0 4575.0 FEMALE \n", "9 13.6 208.0 4350.0 FEMALE \n", "...\n", "\n", "[168 rows x 12 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bbq.ml.explain_predict(model_name, biscoe, top_k_features=3)" ] }, { "cell_type": "markdown", "metadata": { "id": "K0mPaoGpcwwy" }, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Globally explain the model\n", "\n", "To know which features are generally the most important to determine penguin\n", "weight, you can use the `global_explain` function. In order to use\n", "`global_explain`, you must retrain the model with the `enable_global_explain`\n", "option set to `True`." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "id": "ZSP7gt13QrQt" }, "outputs": [ { "data": { "text/html": [ "✅ Completed. \n", " Query processed 6.9 kB in 53 seconds of slot time. [Job bigframes-dev:US.job_welN8ErlZ_sTG7oOEULsWUgmIg7l details]\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model_name = f\"{PROJECT_ID}.{DATASET_ID}.penguin_weight_with_global_explain\"\n", "model_metadata = bbq.ml.create_model(\n", " model_name,\n", " replace=True,\n", " options={\n", " \"model_type\": \"LINEAR_REG\",\n", " \"input_label_cols\": [\"body_mass_g\"],\n", " \"enable_global_explain\": True,\n", " },\n", " training_data=training_data,\n", ")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "✅ Completed. \n", " Query processed 0 Bytes in a moment of slot time.\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "✅ Completed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "✅ Completed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
featureattribution
0sex221.587592
1flipper_length_mm71.311846
2culmen_depth_mm66.17986
3culmen_length_mm45.443363
4island17.258076
\n", "

5 rows × 2 columns

\n", "
[5 rows x 2 columns in total]" ], "text/plain": [ " feature attribution\n", "0 sex 221.587592\n", "1 flipper_length_mm 71.311846\n", "2 culmen_depth_mm 66.17986\n", "3 culmen_length_mm 45.443363\n", "4 island 17.258076\n", "\n", "[5 rows x 2 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bbq.ml.global_explain(model_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compatibility with pandas\n", "\n", "The functions in `bigframes.bigquery.ml` can accept pandas DataFrames as well. Use the `to_pandas()` method on the results of methods like `predict()` to get a pandas DataFrame back." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " Query started with request ID bigframes-dev:US.18d9027b-7d55-42c9-ad1b-dabccdda80dc.
SQL
SELECT * FROM ML.PREDICT(MODEL `bigframes-dev.bqml_tutorial.penguin_weight_with_global_explain`, (SELECT\n",
       "`column_0` AS `sex`,\n",
       "`column_1` AS `flipper_length_mm`,\n",
       "`column_2` AS `culmen_depth_mm`,\n",
       "`column_3` AS `culmen_length_mm`,\n",
       "`column_4` AS `island`\n",
       "FROM\n",
       "(SELECT\n",
       "  *\n",
       "FROM (\n",
       "  SELECT\n",
       "    *\n",
       "  FROM UNNEST(ARRAY<STRUCT<`column_0` STRING, `column_1` INT64, `column_2` INT64, `column_3` INT64, `column_4` STRING>>[STRUCT('MALE', 180, 15, 40, 'Biscoe'), STRUCT('FEMALE', 190, 16, 41, 'Biscoe'), STRUCT('MALE', 200, 17, 42, 'Dream'), STRUCT('FEMALE', 210, 18, 43, 'Dream')]) AS `column_0`\n",
       ") AS `t0`)))\n",
       "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "✅ Completed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.microsoft.datawrangler.viewer.v0+json": { "columns": [ { "name": "index", "rawType": "Int64", "type": "integer" }, { "name": "predicted_body_mass_g", "rawType": "Float64", "type": "float" }, { "name": "sex", "rawType": "string", "type": "string" }, { "name": "flipper_length_mm", "rawType": "Int64", "type": "integer" }, { "name": "culmen_depth_mm", "rawType": "Int64", "type": "integer" }, { "name": "culmen_length_mm", "rawType": "Int64", "type": "integer" }, { "name": "island", "rawType": "string", "type": "string" } ], "ref": "01d67015-64b6-463e-8c16-e8ac1363ff67", "rows": [ [ "0", "3596.332210728767", "MALE", "180", "15", "40", "Biscoe" ], [ "1", "3384.6999176328636", "FEMALE", "190", "16", "41", "Biscoe" ], [ "2", "4049.581795919061", "MALE", "200", "17", "42", "Dream" ], [ "3", "3837.9495028231568", "FEMALE", "210", "18", "43", "Dream" ] ], "shape": { "columns": 6, "rows": 4 } }, "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
predicted_body_mass_gsexflipper_length_mmculmen_depth_mmculmen_length_mmisland
03596.332211MALE1801540Biscoe
13384.699918FEMALE1901641Biscoe
24049.581796MALE2001742Dream
33837.949503FEMALE2101843Dream
\n", "
" ], "text/plain": [ " predicted_body_mass_g sex flipper_length_mm culmen_depth_mm \\\n", "0 3596.332211 MALE 180 15 \n", "1 3384.699918 FEMALE 190 16 \n", "2 4049.581796 MALE 200 17 \n", "3 3837.949503 FEMALE 210 18 \n", "\n", " culmen_length_mm island \n", "0 40 Biscoe \n", "1 41 Biscoe \n", "2 42 Dream \n", "3 43 Dream " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "predict_df = pd.DataFrame({\n", " \"sex\": [\"MALE\", \"FEMALE\", \"MALE\", \"FEMALE\"],\n", " \"flipper_length_mm\": [180, 190, 200, 210],\n", " \"culmen_depth_mm\": [15, 16, 17, 18],\n", " \"culmen_length_mm\": [40, 41, 42, 43],\n", " \"island\": [\"Biscoe\", \"Biscoe\", \"Dream\", \"Dream\"],\n", "})\n", "bbq.ml.predict(model_metadata, predict_df).to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Compatibility with `bigframes.ml`\n", "\n", "The models created with `bigframes.bigquery.ml` can be used with the scikit-learn-like `bigframes.ml` modules by using the `read_gbq_model` method.\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LinearRegression(enable_global_explain=True,\n", " optimize_strategy='NORMAL_EQUATION')" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = bpd.read_gbq_model(model_name)\n", "model" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "✅ Completed. \n", " Query processed 7.3 kB in a moment of slot time. [Job bigframes-dev:US.f2f86927-bbd1-431d-b89e-3d6a064268d7 details]\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "✅ Completed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "✅ Completed. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_absolute_errormean_squared_errormean_squared_log_errormedian_absolute_errorr2_scoreexplained_variance
0223.87876378553.6016340.005614181.3309110.6239510.623951
\n", "

1 rows × 6 columns

\n", "
[1 rows x 6 columns in total]" ], "text/plain": [ " mean_absolute_error mean_squared_error mean_squared_log_error \\\n", " 223.878763 78553.601634 0.005614 \n", "\n", " median_absolute_error r2_score explained_variance \n", " 181.330911 0.623951 0.623951 \n", "\n", "[1 rows x 6 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = training_data[[\"sex\", \"flipper_length_mm\", \"culmen_depth_mm\", \"culmen_length_mm\", \"island\"]]\n", "y = training_data[[\"body_mass_g\"]]\n", "model.score(X, y)" ] }, { "cell_type": "markdown", "metadata": { "id": "G_wjSfXpWTuy" }, "source": [ "# Summary and next steps\n", "\n", "You've created a linear regression model using `bigframes.bigquery.ml`.\n", "\n", "Learn more about BigQuery DataFrames in the [documentation](https://dataframes.bigquery.dev/) and find more sample notebooks in the [GitHub repo](https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks)." ] }, { "cell_type": "markdown", "metadata": { "id": "TpV-iwP9qw9c" }, "source": [ "## Cleaning up\n", "\n", "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n", "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", "\n", "Otherwise, you can uncomment the remaining cells and run them to delete the individual resources you created in this tutorial:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "id": "sx_vKniMq9ZX" }, "outputs": [], "source": [ "# # Delete the BigQuery dataset and associated ML model\n", "# from google.cloud import bigquery\n", "# client = bigquery.Client(project=PROJECT_ID)\n", "# client.delete_dataset(\n", "# DATASET_ID, delete_contents=True, not_found_ok=True\n", "# )\n", "# print(\"Deleted dataset '{}'.\".format(DATASET_ID))" ] } ], "metadata": { "colab": { "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 0 }