{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "ur8xi4C7S06n"
      },
      "outputs": [],
      "source": [
        "# Copyright 2024 Google LLC\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JAPoU8Sm5E6e"
      },
      "source": [
        "# Machine Learning Fundamentals with BigQuery DataFrames\n",
        "\n",
        "<table align=\"left\">\n",
        "\n",
        "  <td>\n",
        "    <a href=\"https://colab.research.google.com/github/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/ml_fundamentals_bq_dataframes.ipynb\">\n",
        "      <img src=\"https://raw.githubusercontent.com/googleapis/python-bigquery-dataframes/refs/heads/main/third_party/logo/colab-logo.png\" alt=\"Colab logo\"> Run in Colab\n",
        "    </a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/ml_fundamentals_bq_dataframes.ipynb\">\n",
        "      <img src=\"https://raw.githubusercontent.com/googleapis/python-bigquery-dataframes/refs/heads/main/third_party/logo/github-logo.png\" width=\"32\" alt=\"GitHub logo\">\n",
        "      View on GitHub\n",
        "    </a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/googleapis/python-bigquery-dataframes/main/notebooks/getting_started/ml_fundamentals_bq_dataframes.ipynb\">\n",
        "      <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\">\n",
        "      Open in Vertex AI Workbench\n",
        "    </a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://console.cloud.google.com/bigquery/import?url=https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/ml_fundamentals_bq_dataframes.ipynb\">\n",
        "      <img src=\"https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTW1gvOovVlbZAIZylUtf5Iu8-693qS1w5NJw&s\" alt=\"BQ logo\" width=\"35\">\n",
        "      Open in BQ Studio\n",
        "    </a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "24743cf4a1e1"
      },
      "source": [
        "**_NOTE_**: This notebook has been tested in the following environment:\n",
        "\n",
        "* Python version = 3.10"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tvgnzT1CKxrO"
      },
      "source": [
        "## Overview\n",
        "\n",
        "The `bigframes.ml` module implements scikit-learn's machine learning API in\n",
        "BigQuery DataFrames. It exposes BigQuery's ML capabilities in a simple, popular\n",
        "API that works seamlessly with the rest of the BigQuery DataFrames API.\n",
        "\n",
        "Learn more about [BigQuery DataFrames](https://cloud.google.com/python/docs/reference/bigframes/latest)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "d975e698c9a4"
      },
      "source": [
        "### Objective\n",
        "\n",
        "In this tutorial, you will walk through an end-to-end machine learning workflow using BigQuery DataFrames. You will load data, manipulate and prepare it for model training, build supervised and unsupervised models, and evaluate and save a model for future use, all using built-in BigQuery DataFrames functionality."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "08d289fa873f"
      },
      "source": [
        "### Dataset\n",
        "\n",
        "This tutorial uses the [`penguins` table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=ml_datasets&t=penguins) (a BigQuery public dataset), which contains data on a set of penguins including species, island of residence, weight, culmen length and depth, flipper length, and sex."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aed92deeb4a0"
      },
      "source": [
        "### Costs\n",
        "\n",
        "This tutorial uses billable components of Google Cloud:\n",
        "\n",
        "* BigQuery (storage and compute)\n",
        "* BigQuery ML\n",
        "\n",
        "Learn about [BigQuery storage pricing](https://cloud.google.com/bigquery/pricing#storage),\n",
        "[BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models),\n",
        "and [BigQuery ML pricing](https://cloud.google.com/bigquery/pricing#bqml),\n",
        "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n",
        "to generate a cost estimate based on your projected usage."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "i7EUnXsZhAGF"
      },
      "source": [
        "## Installation\n",
        "\n",
        "Depending on your Jupyter environment, you might have to install packages."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NRTcBQPZpKWd"
      },
      "source": [
        "**Vertex AI Workbench or Colab**\n",
        "\n",
        "No action is needed. The BigQuery DataFrames package is already installed."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bdOJtFo1pRnc"
      },
      "source": [
        "**Local JupyterLab instance**\n",
        "\n",
        "Uncomment and run the following cell:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "mfPoOwPLGpSr"
      },
      "outputs": [],
      "source": [
        "# !pip install bigframes"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BF1j6f9HApxa"
      },
      "source": [
        "## Before you begin\n",
        "\n",
        "Complete the tasks in this section to set up your environment."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Yq7zKYWelRQP"
      },
      "source": [
        "### Set up your Google Cloud project\n",
        "\n",
        "**The following steps are required, regardless of your notebook environment.**\n",
        "\n",
        "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 credit towards your compute/storage costs.\n",
        "\n",
        "2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n",
        "\n",
        "3. [Enable the BigQuery API](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com).\n",
        "\n",
        "4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WReHDGG5g0XY"
      },
      "source": [
        "#### Set your project ID\n",
        "\n",
        "If you don't know your project ID, try the following:\n",
        "* Run `gcloud config list`.\n",
        "* Run `gcloud projects list`.\n",
        "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "oM1iC_MfAts1"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Updated property [core/project].\n"
          ]
        }
      ],
      "source": [
        "PROJECT_ID = \"\"  # @param {type:\"string\"}\n",
        "\n",
        "# Set the project id\n",
        "! gcloud config set project {PROJECT_ID}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "region"
      },
      "source": [
        "#### Set the region\n",
        "\n",
        "You can also change the `REGION` variable used by BigQuery. Learn more about [BigQuery regions](https://cloud.google.com/bigquery/docs/locations#supported_locations)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "eF-Twtc4XGem"
      },
      "outputs": [],
      "source": [
        "REGION = \"US\"  # @param {type: \"string\"}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XcW9adriUQRc"
      },
      "source": [
        "#### Set the dataset ID\n",
        "\n",
        "As part of this notebook, you will save BigQuery ML models to your Google Cloud project, which requires a dataset. Create the dataset, if needed, and provide the ID here as the `DATASET` variable used by BigQuery. Learn how to create a [BigQuery dataset](https://cloud.google.com/bigquery/docs/datasets)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "id": "BbMh9JHvUHAn"
      },
      "outputs": [],
      "source": [
        "DATASET = \"\"  # @param {type: \"string\"}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sBCra4QMA2wR"
      },
      "source": [
        "### Authenticate your Google Cloud account\n",
        "\n",
        "Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "74ccc9e52986"
      },
      "source": [
        "**Vertex AI Workbench**\n",
        "\n",
        "No action is needed. You are already authenticated."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "de775a3773ba"
      },
      "source": [
        "**Local JupyterLab instance**\n",
        "\n",
        "Uncomment and run the following cell:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "id": "254614fa0c46"
      },
      "outputs": [],
      "source": [
        "# ! gcloud auth login"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ef21552ccea8"
      },
      "source": [
        "**Colab**\n",
        "\n",
        "Uncomment and run the following cell:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "id": "603adbbf0532"
      },
      "outputs": [],
      "source": [
        "# from google.colab import auth\n",
        "# auth.authenticate_user()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "960505627ddf"
      },
      "source": [
        "### Import libraries"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "id": "PyQmSRbKA8r-"
      },
      "outputs": [],
      "source": [
        "import bigframes.pandas as bpd"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "init_aip:mbsdk,all"
      },
      "source": [
        "### Set BigQuery DataFrames options"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "id": "NPPMuw2PXGeo"
      },
      "outputs": [],
      "source": [
        "# Note: The project option is not required in all environments.\n",
        "# On BigQuery Studio, the project ID is automatically detected.\n",
        "bpd.options.bigquery.project = PROJECT_ID\n",
        "\n",
        "# Note: The location option is not required.\n",
        "# It defaults to the location of the first table or query\n",
        "# passed to read_gbq(). For APIs where a location can't be\n",
        "# auto-detected, the location defaults to the \"US\" location.\n",
        "bpd.options.bigquery.location = REGION"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pDfrKwMKE_dK"
      },
      "source": [
        "To use a different location after DataFrame or Series objects have been created, reset the session by running `bpd.reset_session()`. You can then set `bpd.options.bigquery.location` to another location."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LjfRpSruzg5j"
      },
      "source": [
        "## Import data into BigQuery DataFrames\n",
        "\n",
        "You can create a DataFrame by reading data from a BigQuery table."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "id": "d86W4hNqzZJb"
      },
      "outputs": [],
      "source": [
        "df = bpd.read_gbq(\"bigquery-public-data.ml_datasets.penguins\")\n",
        "df = df.dropna()\n",
        "\n",
        "# BigQuery DataFrames creates a default numbered index, which we can give a name\n",
        "df.index.name = \"penguin_id\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pDfCJ6-LkRB1"
      },
      "source": [
        "Take a look at a few rows of the DataFrame:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "id": "arGaUZVWkSwT"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>species</th>\n",
              "      <th>island</th>\n",
              "      <th>culmen_length_mm</th>\n",
              "      <th>culmen_depth_mm</th>\n",
              "      <th>flipper_length_mm</th>\n",
              "      <th>body_mass_g</th>\n",
              "      <th>sex</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>penguin_id</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Gentoo penguin (Pygoscelis papua)</td>\n",
              "      <td>Biscoe</td>\n",
              "      <td>50.5</td>\n",
              "      <td>15.9</td>\n",
              "      <td>225.0</td>\n",
              "      <td>5400.0</td>\n",
              "      <td>MALE</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Gentoo penguin (Pygoscelis papua)</td>\n",
              "      <td>Biscoe</td>\n",
              "      <td>45.1</td>\n",
              "      <td>14.5</td>\n",
              "      <td>215.0</td>\n",
              "      <td>5000.0</td>\n",
              "      <td>FEMALE</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "      <td>Torgersen</td>\n",
              "      <td>41.4</td>\n",
              "      <td>18.5</td>\n",
              "      <td>202.0</td>\n",
              "      <td>3875.0</td>\n",
              "      <td>MALE</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "      <td>Torgersen</td>\n",
              "      <td>38.6</td>\n",
              "      <td>17.0</td>\n",
              "      <td>188.0</td>\n",
              "      <td>2900.0</td>\n",
              "      <td>FEMALE</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Gentoo penguin (Pygoscelis papua)</td>\n",
              "      <td>Biscoe</td>\n",
              "      <td>46.5</td>\n",
              "      <td>14.8</td>\n",
              "      <td>217.0</td>\n",
              "      <td>5200.0</td>\n",
              "      <td>FEMALE</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>5 rows × 7 columns</p>\n",
              "</div>[5 rows x 7 columns in total]"
            ],
            "text/plain": [
              "                                        species     island  culmen_length_mm  \\\n",
              "penguin_id                                                                     \n",
              "0             Gentoo penguin (Pygoscelis papua)     Biscoe              50.5   \n",
              "1             Gentoo penguin (Pygoscelis papua)     Biscoe              45.1   \n",
              "2           Adelie Penguin (Pygoscelis adeliae)  Torgersen              41.4   \n",
              "3           Adelie Penguin (Pygoscelis adeliae)  Torgersen              38.6   \n",
              "4             Gentoo penguin (Pygoscelis papua)     Biscoe              46.5   \n",
              "\n",
              "            culmen_depth_mm  flipper_length_mm  body_mass_g     sex  \n",
              "penguin_id                                                           \n",
              "0                      15.9              225.0       5400.0    MALE  \n",
              "1                      14.5              215.0       5000.0  FEMALE  \n",
              "2                      18.5              202.0       3875.0    MALE  \n",
              "3                      17.0              188.0       2900.0  FEMALE  \n",
              "4                      14.8              217.0       5200.0  FEMALE  \n",
              "\n",
              "[5 rows x 7 columns]"
            ]
          },
          "execution_count": 11,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "df.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WkUIcMXPkahu"
      },
      "source": [
        "## Clean and prepare data"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DScncEoDkiTG"
      },
      "source": [
        "We are going to start with supervised learning, where a linear regression model will learn to predict the body mass (output variable `y`) using input features such as flipper length, sex, species, and more (features `X`)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "id": "B9mW93o9z_-L"
      },
      "outputs": [],
      "source": [
        "# Isolate input features and output variable into DataFrames\n",
        "X = df[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex', 'species']]\n",
        "y = df[['body_mass_g']]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wkw0Cs62k_cl"
      },
      "source": [
        "Part of preparing data for a machine learning task is splitting it into subsets for training and testing, so that you can check that the model is not overfitting. By default, BigQuery ML (BQML) manages the data split for you. However, BQML also supports splitting the data manually.\n",
        "\n",
        "To split the data manually, use `bigframes.ml.model_selection.train_test_split`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {
        "id": "NysWAWmvlAxB"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "X_train shape: (267, 6)\n",
            "X_test shape: (67, 6)\n",
            "y_train shape: (267, 1)\n",
            "y_test shape: (67, 1)\n"
          ]
        }
      ],
      "source": [
        "from bigframes.ml.model_selection import train_test_split\n",
        "\n",
        "# Split X and y into training and test sets: 20% of the rows go to the test set\n",
        "# and the remaining 80% go to the training set.\n",
        "X_train, X_test, y_train, y_test = train_test_split(\n",
        "    X, y, test_size=0.2\n",
        ")\n",
        "\n",
        "# Show the shape of the data after the split\n",
        "print(f\"\"\"X_train shape: {X_train.shape}\n",
        "X_test shape: {X_test.shape}\n",
        "y_train shape: {y_train.shape}\n",
        "y_test shape: {y_test.shape}\"\"\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "faFnVnNolydu"
      },
      "source": [
        "If we look at the data, we can see that random rows were selected for\n",
        "each side of the split:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 14,
      "metadata": {
        "id": "f8bz1HwLlyLP"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>island</th>\n",
              "      <th>culmen_length_mm</th>\n",
              "      <th>culmen_depth_mm</th>\n",
              "      <th>flipper_length_mm</th>\n",
              "      <th>sex</th>\n",
              "      <th>species</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>penguin_id</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>188</th>\n",
              "      <td>Dream</td>\n",
              "      <td>51.5</td>\n",
              "      <td>18.7</td>\n",
              "      <td>187.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Chinstrap penguin (Pygoscelis antarctica)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>251</th>\n",
              "      <td>Biscoe</td>\n",
              "      <td>49.5</td>\n",
              "      <td>16.1</td>\n",
              "      <td>224.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Gentoo penguin (Pygoscelis papua)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>231</th>\n",
              "      <td>Biscoe</td>\n",
              "      <td>45.7</td>\n",
              "      <td>13.9</td>\n",
              "      <td>214.0</td>\n",
              "      <td>FEMALE</td>\n",
              "      <td>Gentoo penguin (Pygoscelis papua)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>271</th>\n",
              "      <td>Biscoe</td>\n",
              "      <td>59.6</td>\n",
              "      <td>17.0</td>\n",
              "      <td>230.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Gentoo penguin (Pygoscelis papua)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>128</th>\n",
              "      <td>Biscoe</td>\n",
              "      <td>38.8</td>\n",
              "      <td>17.2</td>\n",
              "      <td>180.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>5 rows × 6 columns</p>\n",
              "</div>[5 rows x 6 columns in total]"
            ],
            "text/plain": [
              "            island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \\\n",
              "penguin_id                                                                 \n",
              "188          Dream              51.5             18.7              187.0   \n",
              "251         Biscoe              49.5             16.1              224.0   \n",
              "231         Biscoe              45.7             13.9              214.0   \n",
              "271         Biscoe              59.6             17.0              230.0   \n",
              "128         Biscoe              38.8             17.2              180.0   \n",
              "\n",
              "               sex                                    species  \n",
              "penguin_id                                                     \n",
              "188           MALE  Chinstrap penguin (Pygoscelis antarctica)  \n",
              "251           MALE          Gentoo penguin (Pygoscelis papua)  \n",
              "231         FEMALE          Gentoo penguin (Pygoscelis papua)  \n",
              "271           MALE          Gentoo penguin (Pygoscelis papua)  \n",
              "128           MALE        Adelie Penguin (Pygoscelis adeliae)  \n",
              "\n",
              "[5 rows x 6 columns]"
            ]
          },
          "execution_count": 14,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "X_test.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "v4ic7GQEl67Y"
      },
      "source": [
        "Note that the `y_test` data matches the same rows in `X_test`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {
        "id": "PflbhKGkl8v2"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "Query job 6a87fcc2-f2d0-44f5-8ab2-08f109c2b70d is DONE. 28.9 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:6a87fcc2-f2d0-44f5-8ab2-08f109c2b70d&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job ed8e49f8-0f4c-4ef2-bbc2-b8c5ef9fd064 is DONE. 0 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:ed8e49f8-0f4c-4ef2-bbc2-b8c5ef9fd064&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 97fea642-03aa-49fd-943e-f4efa5a87f0f is DONE. 120 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:97fea642-03aa-49fd-943e-f4efa5a87f0f&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>body_mass_g</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>penguin_id</th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>188</th>\n",
              "      <td>3250.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>251</th>\n",
              "      <td>5650.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>231</th>\n",
              "      <td>4400.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>271</th>\n",
              "      <td>6050.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>128</th>\n",
              "      <td>3800.0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>5 rows × 1 columns</p>\n",
              "</div>[5 rows x 1 columns in total]"
            ],
            "text/plain": [
              "            body_mass_g\n",
              "penguin_id             \n",
              "188              3250.0\n",
              "251              5650.0\n",
              "231              4400.0\n",
              "271              6050.0\n",
              "128              3800.0\n",
              "\n",
              "[5 rows x 1 columns]"
            ]
          },
          "execution_count": 15,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "y_test.head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Dkf52IdvmSaj"
      },
      "source": [
        "## Estimators\n",
        "\n",
        "Following scikit-learn, all learning components are \"estimators\"; objects that can learn from training data and then apply themselves to new data. Estimators share the following patterns:\n",
        "\n",
        "- a constructor that takes a list of parameters\n",
        "- a standard string representation that shows the class name and all non-default parameters, e.g. `LinearRegression(fit_intercept=False)`\n",
        "- a `.fit(..)` method to fit the estimator to training data\n",
        "\n",
        "There estimators can be further broken down into two main subtypes:\n",
        " 1. Transformers\n",
        " 2. Predictors\n",
        "\n",
        "Let's walk through each of these with our example model."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "55oNSWQ2Q5te"
      },
      "source": [
        "### Transformers\n",
        "\n",
        "Transformers are estimators that are used to prepare data for consumption by other estimators ('preprocessing'). In addition to `.fit(...)`, the transformer implements a `.transform(...)` method, which will apply a transformation based on what was computed during `.fit(..)`. With this pattern dynamic preprocessing steps can be applied to both training and test/production data consistently.\n",
        "\n",
        "An example of a transformer is `bigframes.ml.preprocessing.StandardScaler`, which rescales a dataset to have a mean of zero and a standard deviation of one:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {
        "id": "yhATDMR-mkdF"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "Query job aee64759-42bb-44d6-b8c7-1c737cdd6eed is DONE. 28.9 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:aee64759-42bb-44d6-b8c7-1c737cdd6eed&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job acb29d04-a20d-4f1c-8d90-51c7e8ac9922 is DONE. 0 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:acb29d04-a20d-4f1c-8d90-51c7e8ac9922&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 2bd034db-7d9b-467c-be17-49bca094cceb is DONE. 28.9 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:2bd034db-7d9b-467c-be17-49bca094cceb&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 5dfb583a-1ced-4f2a-94b9-f1282263134d is DONE. 2.1 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:5dfb583a-1ced-4f2a-94b9-f1282263134d&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 8fe87288-4a95-49f4-9895-7c41c1004901 is DONE. 0 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:8fe87288-4a95-49f4-9895-7c41c1004901&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 7ebcecee-beff-402d-ac71-6384014a54da is DONE. 8.5 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:7ebcecee-beff-402d-ac71-6384014a54da&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>standard_scaled_culmen_length_mm</th>\n",
              "      <th>standard_scaled_culmen_depth_mm</th>\n",
              "      <th>standard_scaled_flipper_length_mm</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>penguin_id</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>1.20778</td>\n",
              "      <td>-0.651531</td>\n",
              "      <td>1.772656</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>-0.455602</td>\n",
              "      <td>0.662855</td>\n",
              "      <td>0.100476</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>-0.967412</td>\n",
              "      <td>-0.095445</td>\n",
              "      <td>-0.917372</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>0.476623</td>\n",
              "      <td>-1.207617</td>\n",
              "      <td>1.191028</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>-1.625454</td>\n",
              "      <td>0.359535</td>\n",
              "      <td>-0.626559</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>-0.345929</td>\n",
              "      <td>-1.86481</td>\n",
              "      <td>0.682104</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>8</th>\n",
              "      <td>0.842202</td>\n",
              "      <td>-1.561491</td>\n",
              "      <td>1.409139</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9</th>\n",
              "      <td>0.348671</td>\n",
              "      <td>0.865068</td>\n",
              "      <td>-0.263041</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>10</th>\n",
              "      <td>0.933596</td>\n",
              "      <td>1.218941</td>\n",
              "      <td>0.827511</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>11</th>\n",
              "      <td>-1.460943</td>\n",
              "      <td>-0.297658</td>\n",
              "      <td>-0.771966</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>12</th>\n",
              "      <td>1.317454</td>\n",
              "      <td>-0.449318</td>\n",
              "      <td>1.409139</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>13</th>\n",
              "      <td>-0.236255</td>\n",
              "      <td>-1.763704</td>\n",
              "      <td>0.900214</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>14</th>\n",
              "      <td>0.549739</td>\n",
              "      <td>-0.297658</td>\n",
              "      <td>-0.626559</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>16</th>\n",
              "      <td>0.970154</td>\n",
              "      <td>-1.005404</td>\n",
              "      <td>1.481842</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>17</th>\n",
              "      <td>-1.058807</td>\n",
              "      <td>-0.348211</td>\n",
              "      <td>-0.190338</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>18</th>\n",
              "      <td>1.354012</td>\n",
              "      <td>-1.510937</td>\n",
              "      <td>1.263732</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>19</th>\n",
              "      <td>-0.053466</td>\n",
              "      <td>-1.662597</td>\n",
              "      <td>1.191028</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>20</th>\n",
              "      <td>-0.199697</td>\n",
              "      <td>-1.510937</td>\n",
              "      <td>0.609401</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>21</th>\n",
              "      <td>1.152943</td>\n",
              "      <td>0.763962</td>\n",
              "      <td>-0.190338</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>22</th>\n",
              "      <td>-1.205038</td>\n",
              "      <td>0.308982</td>\n",
              "      <td>-0.699262</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>24</th>\n",
              "      <td>-0.784623</td>\n",
              "      <td>1.775028</td>\n",
              "      <td>-0.699262</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>25</th>\n",
              "      <td>-0.83946</td>\n",
              "      <td>1.724474</td>\n",
              "      <td>-0.771966</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>26</th>\n",
              "      <td>-0.620113</td>\n",
              "      <td>0.359535</td>\n",
              "      <td>-0.990076</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>27</th>\n",
              "      <td>0.330392</td>\n",
              "      <td>-0.095445</td>\n",
              "      <td>-0.408448</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>29</th>\n",
              "      <td>2.194842</td>\n",
              "      <td>-0.095445</td>\n",
              "      <td>1.990767</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>25 rows × 3 columns</p>\n",
              "</div>[267 rows x 3 columns in total]"
            ],
            "text/plain": [
              "            standard_scaled_culmen_length_mm  standard_scaled_culmen_depth_mm  \\\n",
              "penguin_id                                                                      \n",
              "0                                    1.20778                        -0.651531   \n",
              "2                                  -0.455602                         0.662855   \n",
              "3                                  -0.967412                        -0.095445   \n",
              "4                                   0.476623                        -1.207617   \n",
              "5                                  -1.625454                         0.359535   \n",
              "7                                  -0.345929                         -1.86481   \n",
              "8                                   0.842202                        -1.561491   \n",
              "9                                   0.348671                         0.865068   \n",
              "10                                  0.933596                         1.218941   \n",
              "11                                 -1.460943                        -0.297658   \n",
              "12                                  1.317454                        -0.449318   \n",
              "13                                 -0.236255                        -1.763704   \n",
              "14                                  0.549739                        -0.297658   \n",
              "16                                  0.970154                        -1.005404   \n",
              "17                                 -1.058807                        -0.348211   \n",
              "18                                  1.354012                        -1.510937   \n",
              "19                                 -0.053466                        -1.662597   \n",
              "20                                 -0.199697                        -1.510937   \n",
              "21                                  1.152943                         0.763962   \n",
              "22                                 -1.205038                         0.308982   \n",
              "24                                 -0.784623                         1.775028   \n",
              "25                                  -0.83946                         1.724474   \n",
              "26                                 -0.620113                         0.359535   \n",
              "27                                  0.330392                        -0.095445   \n",
              "29                                  2.194842                        -0.095445   \n",
              "\n",
              "            standard_scaled_flipper_length_mm  \n",
              "penguin_id                                     \n",
              "0                                    1.772656  \n",
              "2                                    0.100476  \n",
              "3                                   -0.917372  \n",
              "4                                    1.191028  \n",
              "5                                   -0.626559  \n",
              "7                                    0.682104  \n",
              "8                                    1.409139  \n",
              "9                                   -0.263041  \n",
              "10                                   0.827511  \n",
              "11                                  -0.771966  \n",
              "12                                   1.409139  \n",
              "13                                   0.900214  \n",
              "14                                  -0.626559  \n",
              "16                                   1.481842  \n",
              "17                                  -0.190338  \n",
              "18                                   1.263732  \n",
              "19                                   1.191028  \n",
              "20                                   0.609401  \n",
              "21                                  -0.190338  \n",
              "22                                  -0.699262  \n",
              "24                                  -0.699262  \n",
              "25                                  -0.771966  \n",
              "26                                  -0.990076  \n",
              "27                                  -0.408448  \n",
              "29                                   1.990767  \n",
              "...\n",
              "\n",
              "[267 rows x 3 columns]"
            ]
          },
          "execution_count": 16,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from bigframes.ml.preprocessing import StandardScaler\n",
        "\n",
        "# StandardScaler will only work on numeric columns\n",
        "numeric_columns = [\"culmen_length_mm\", \"culmen_depth_mm\", \"flipper_length_mm\"]\n",
        "\n",
        "scaler = StandardScaler()\n",
        "scaler.fit(X_train[numeric_columns])\n",
        "\n",
        "# Now, standardscaler should transform the numbers to have mean of zero\n",
        "# and standard deviation of one:\n",
        "scaler.transform(X_train[numeric_columns])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vhywHzH-ml-W"
      },
      "source": [
        "We can then repeat this transformation on the test data:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 17,
      "metadata": {
        "id": "TfwSLOTXmspI"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "Query job 6639e06d-3920-4c64-84d8-b40ce042188c is DONE. 28.9 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:6639e06d-3920-4c64-84d8-b40ce042188c&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 579dfb14-6d39-44c0-9b92-eb6a40c46df8 is DONE. 536 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:579dfb14-6d39-44c0-9b92-eb6a40c46df8&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 7f613d94-a68c-42d5-8afe-0413b32de3a0 is DONE. 0 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:7f613d94-a68c-42d5-8afe-0413b32de3a0&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 140e8b5f-a24b-43a3-831f-30a29a4bd7ea is DONE. 2.1 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:140e8b5f-a24b-43a3-831f-30a29a4bd7ea&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>standard_scaled_culmen_length_mm</th>\n",
              "      <th>standard_scaled_culmen_depth_mm</th>\n",
              "      <th>standard_scaled_flipper_length_mm</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>penguin_id</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>0.220718</td>\n",
              "      <td>-1.359277</td>\n",
              "      <td>1.045621</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>15</th>\n",
              "      <td>-0.510439</td>\n",
              "      <td>0.157322</td>\n",
              "      <td>-0.771966</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>28</th>\n",
              "      <td>-1.058807</td>\n",
              "      <td>0.713408</td>\n",
              "      <td>-0.771966</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>32</th>\n",
              "      <td>1.463685</td>\n",
              "      <td>1.168388</td>\n",
              "      <td>0.39129</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>33</th>\n",
              "      <td>-0.254534</td>\n",
              "      <td>0.056215</td>\n",
              "      <td>-0.990076</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>34</th>\n",
              "      <td>-0.510439</td>\n",
              "      <td>0.460642</td>\n",
              "      <td>0.318587</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>37</th>\n",
              "      <td>1.354012</td>\n",
              "      <td>0.511195</td>\n",
              "      <td>-0.263041</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>41</th>\n",
              "      <td>-0.674949</td>\n",
              "      <td>-0.095445</td>\n",
              "      <td>-1.789814</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>47</th>\n",
              "      <td>-1.168481</td>\n",
              "      <td>0.662855</td>\n",
              "      <td>-0.117634</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>52</th>\n",
              "      <td>0.458344</td>\n",
              "      <td>0.308982</td>\n",
              "      <td>-0.699262</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>56</th>\n",
              "      <td>-1.040528</td>\n",
              "      <td>0.460642</td>\n",
              "      <td>-1.135483</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>57</th>\n",
              "      <td>-0.967412</td>\n",
              "      <td>0.005662</td>\n",
              "      <td>-0.117634</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>62</th>\n",
              "      <td>0.988433</td>\n",
              "      <td>-0.752638</td>\n",
              "      <td>1.191028</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>65</th>\n",
              "      <td>1.756148</td>\n",
              "      <td>1.370601</td>\n",
              "      <td>0.318587</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>67</th>\n",
              "      <td>0.677691</td>\n",
              "      <td>-1.359277</td>\n",
              "      <td>1.045621</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>75</th>\n",
              "      <td>-1.113644</td>\n",
              "      <td>1.421155</td>\n",
              "      <td>-0.771966</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>81</th>\n",
              "      <td>0.677691</td>\n",
              "      <td>0.561748</td>\n",
              "      <td>-0.408448</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>89</th>\n",
              "      <td>-0.857739</td>\n",
              "      <td>0.713408</td>\n",
              "      <td>-0.771966</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>92</th>\n",
              "      <td>-0.802902</td>\n",
              "      <td>0.308982</td>\n",
              "      <td>-0.917372</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>93</th>\n",
              "      <td>-0.309371</td>\n",
              "      <td>1.168388</td>\n",
              "      <td>-0.263041</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>96</th>\n",
              "      <td>-0.309371</td>\n",
              "      <td>0.662855</td>\n",
              "      <td>-1.499</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>100</th>\n",
              "      <td>-0.912576</td>\n",
              "      <td>0.814515</td>\n",
              "      <td>-0.771966</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>101</th>\n",
              "      <td>0.549739</td>\n",
              "      <td>-1.308724</td>\n",
              "      <td>1.554546</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>102</th>\n",
              "      <td>-0.126582</td>\n",
              "      <td>0.662855</td>\n",
              "      <td>-0.626559</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>107</th>\n",
              "      <td>1.20778</td>\n",
              "      <td>-1.005404</td>\n",
              "      <td>1.118325</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>25 rows × 3 columns</p>\n",
              "</div>[67 rows x 3 columns in total]"
            ],
            "text/plain": [
              "            standard_scaled_culmen_length_mm  standard_scaled_culmen_depth_mm  \\\n",
              "penguin_id                                                                      \n",
              "1                                   0.220718                        -1.359277   \n",
              "15                                 -0.510439                         0.157322   \n",
              "28                                 -1.058807                         0.713408   \n",
              "32                                  1.463685                         1.168388   \n",
              "33                                 -0.254534                         0.056215   \n",
              "34                                 -0.510439                         0.460642   \n",
              "37                                  1.354012                         0.511195   \n",
              "41                                 -0.674949                        -0.095445   \n",
              "47                                 -1.168481                         0.662855   \n",
              "52                                  0.458344                         0.308982   \n",
              "56                                 -1.040528                         0.460642   \n",
              "57                                 -0.967412                         0.005662   \n",
              "62                                  0.988433                        -0.752638   \n",
              "65                                  1.756148                         1.370601   \n",
              "67                                  0.677691                        -1.359277   \n",
              "75                                 -1.113644                         1.421155   \n",
              "81                                  0.677691                         0.561748   \n",
              "89                                 -0.857739                         0.713408   \n",
              "92                                 -0.802902                         0.308982   \n",
              "93                                 -0.309371                         1.168388   \n",
              "96                                 -0.309371                         0.662855   \n",
              "100                                -0.912576                         0.814515   \n",
              "101                                 0.549739                        -1.308724   \n",
              "102                                -0.126582                         0.662855   \n",
              "107                                  1.20778                        -1.005404   \n",
              "\n",
              "            standard_scaled_flipper_length_mm  \n",
              "penguin_id                                     \n",
              "1                                    1.045621  \n",
              "15                                  -0.771966  \n",
              "28                                  -0.771966  \n",
              "32                                    0.39129  \n",
              "33                                  -0.990076  \n",
              "34                                   0.318587  \n",
              "37                                  -0.263041  \n",
              "41                                  -1.789814  \n",
              "47                                  -0.117634  \n",
              "52                                  -0.699262  \n",
              "56                                  -1.135483  \n",
              "57                                  -0.117634  \n",
              "62                                   1.191028  \n",
              "65                                   0.318587  \n",
              "67                                   1.045621  \n",
              "75                                  -0.771966  \n",
              "81                                  -0.408448  \n",
              "89                                  -0.771966  \n",
              "92                                  -0.917372  \n",
              "93                                  -0.263041  \n",
              "96                                     -1.499  \n",
              "100                                 -0.771966  \n",
              "101                                  1.554546  \n",
              "102                                 -0.626559  \n",
              "107                                  1.118325  \n",
              "...\n",
              "\n",
              "[67 rows x 3 columns]"
            ]
          },
          "execution_count": 17,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "scaler.transform(X_test[numeric_columns])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9enAdjzPmwmv"
      },
      "source": [
        "#### Composing transformers\n",
        "\n",
        "To process data where different columns need different preprocessors, use `bigframes.ml.compose.ColumnTransformer`.\n",
        "\n",
        "Let's create an aggregate transform that applies `StandardScaler` to the numeric columns and `OneHotEncoder` to the string columns."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 18,
      "metadata": {
        "id": "I8Wwx3emmz2J"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "Query job c16fdb5d-3f18-4f85-8a31-705ef4680be5 is DONE. 28.9 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:c16fdb5d-3f18-4f85-8a31-705ef4680be5&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 8c94a7c1-7f12-44be-b389-7c854ceead4b is DONE. 0 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:8c94a7c1-7f12-44be-b389-7c854ceead4b&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 1287628d-1380-4495-a5e9-6806440206bc is DONE. 22.9 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:1287628d-1380-4495-a5e9-6806440206bc&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 03163e1a-c789-4046-b71a-b4b4e7bbc043 is DONE. 2.1 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:03163e1a-c789-4046-b71a-b4b4e7bbc043&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 86f39b30-00db-4ada-8699-0fe49c94eb2d is DONE. 29.2 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:86f39b30-00db-4ada-8699-0fe49c94eb2d&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job d5b0e8b0-12cd-47f6-85d2-806b2c252d37 is DONE. 536 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:d5b0e8b0-12cd-47f6-85d2-806b2c252d37&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 459cdc90-d1f3-4580-9137-9b93d44ca991 is DONE. 0 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:459cdc90-d1f3-4580-9137-9b93d44ca991&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 80d10913-7263-44e6-89f7-719eac4158a3 is DONE. 21.4 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:80d10913-7263-44e6-89f7-719eac4158a3&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>onehotencoded_island</th>\n",
              "      <th>standard_scaled_culmen_length_mm</th>\n",
              "      <th>standard_scaled_culmen_depth_mm</th>\n",
              "      <th>standard_scaled_flipper_length_mm</th>\n",
              "      <th>onehotencoded_sex</th>\n",
              "      <th>onehotencoded_species</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>penguin_id</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>1.20778</td>\n",
              "      <td>-0.651531</td>\n",
              "      <td>1.772656</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>-0.455602</td>\n",
              "      <td>0.662855</td>\n",
              "      <td>0.100476</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>-0.967412</td>\n",
              "      <td>-0.095445</td>\n",
              "      <td>-0.917372</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>0.476623</td>\n",
              "      <td>-1.207617</td>\n",
              "      <td>1.191028</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>-1.625454</td>\n",
              "      <td>0.359535</td>\n",
              "      <td>-0.626559</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>-0.345929</td>\n",
              "      <td>-1.86481</td>\n",
              "      <td>0.682104</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>8</th>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>0.842202</td>\n",
              "      <td>-1.561491</td>\n",
              "      <td>1.409139</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9</th>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>0.348671</td>\n",
              "      <td>0.865068</td>\n",
              "      <td>-0.263041</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>10</th>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>0.933596</td>\n",
              "      <td>1.218941</td>\n",
              "      <td>0.827511</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>11</th>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>-1.460943</td>\n",
              "      <td>-0.297658</td>\n",
              "      <td>-0.771966</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>12</th>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>1.317454</td>\n",
              "      <td>-0.449318</td>\n",
              "      <td>1.409139</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>13</th>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>-0.236255</td>\n",
              "      <td>-1.763704</td>\n",
              "      <td>0.900214</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>14</th>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>0.549739</td>\n",
              "      <td>-0.297658</td>\n",
              "      <td>-0.626559</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>16</th>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>0.970154</td>\n",
              "      <td>-1.005404</td>\n",
              "      <td>1.481842</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>17</th>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>-1.058807</td>\n",
              "      <td>-0.348211</td>\n",
              "      <td>-0.190338</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>18</th>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>1.354012</td>\n",
              "      <td>-1.510937</td>\n",
              "      <td>1.263732</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>19</th>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>-0.053466</td>\n",
              "      <td>-1.662597</td>\n",
              "      <td>1.191028</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>20</th>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>-0.199697</td>\n",
              "      <td>-1.510937</td>\n",
              "      <td>0.609401</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>21</th>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>1.152943</td>\n",
              "      <td>0.763962</td>\n",
              "      <td>-0.190338</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>22</th>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-1.205038</td>\n",
              "      <td>0.308982</td>\n",
              "      <td>-0.699262</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>24</th>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>-0.784623</td>\n",
              "      <td>1.775028</td>\n",
              "      <td>-0.699262</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>25</th>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>-0.83946</td>\n",
              "      <td>1.724474</td>\n",
              "      <td>-0.771966</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>26</th>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>-0.620113</td>\n",
              "      <td>0.359535</td>\n",
              "      <td>-0.990076</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>27</th>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>0.330392</td>\n",
              "      <td>-0.095445</td>\n",
              "      <td>-0.408448</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>29</th>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>2.194842</td>\n",
              "      <td>-0.095445</td>\n",
              "      <td>1.990767</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>25 rows × 6 columns</p>\n",
              "</div>[267 rows x 6 columns in total]"
            ],
            "text/plain": [
              "                    onehotencoded_island  standard_scaled_culmen_length_mm  \\\n",
              "penguin_id                                                                   \n",
              "0           [{'index': 1, 'value': 1.0}]                           1.20778   \n",
              "2           [{'index': 3, 'value': 1.0}]                         -0.455602   \n",
              "3           [{'index': 3, 'value': 1.0}]                         -0.967412   \n",
              "4           [{'index': 1, 'value': 1.0}]                          0.476623   \n",
              "5           [{'index': 1, 'value': 1.0}]                         -1.625454   \n",
              "7           [{'index': 1, 'value': 1.0}]                         -0.345929   \n",
              "8           [{'index': 1, 'value': 1.0}]                          0.842202   \n",
              "9           [{'index': 3, 'value': 1.0}]                          0.348671   \n",
              "10          [{'index': 2, 'value': 1.0}]                          0.933596   \n",
              "11          [{'index': 3, 'value': 1.0}]                         -1.460943   \n",
              "12          [{'index': 1, 'value': 1.0}]                          1.317454   \n",
              "13          [{'index': 1, 'value': 1.0}]                         -0.236255   \n",
              "14          [{'index': 2, 'value': 1.0}]                          0.549739   \n",
              "16          [{'index': 1, 'value': 1.0}]                          0.970154   \n",
              "17          [{'index': 1, 'value': 1.0}]                         -1.058807   \n",
              "18          [{'index': 1, 'value': 1.0}]                          1.354012   \n",
              "19          [{'index': 1, 'value': 1.0}]                         -0.053466   \n",
              "20          [{'index': 1, 'value': 1.0}]                         -0.199697   \n",
              "21          [{'index': 2, 'value': 1.0}]                          1.152943   \n",
              "22          [{'index': 2, 'value': 1.0}]                         -1.205038   \n",
              "24          [{'index': 1, 'value': 1.0}]                         -0.784623   \n",
              "25          [{'index': 3, 'value': 1.0}]                          -0.83946   \n",
              "26          [{'index': 1, 'value': 1.0}]                         -0.620113   \n",
              "27          [{'index': 2, 'value': 1.0}]                          0.330392   \n",
              "29          [{'index': 1, 'value': 1.0}]                          2.194842   \n",
              "\n",
              "            standard_scaled_culmen_depth_mm  \\\n",
              "penguin_id                                    \n",
              "0                                 -0.651531   \n",
              "2                                  0.662855   \n",
              "3                                 -0.095445   \n",
              "4                                 -1.207617   \n",
              "5                                  0.359535   \n",
              "7                                  -1.86481   \n",
              "8                                 -1.561491   \n",
              "9                                  0.865068   \n",
              "10                                 1.218941   \n",
              "11                                -0.297658   \n",
              "12                                -0.449318   \n",
              "13                                -1.763704   \n",
              "14                                -0.297658   \n",
              "16                                -1.005404   \n",
              "17                                -0.348211   \n",
              "18                                -1.510937   \n",
              "19                                -1.662597   \n",
              "20                                -1.510937   \n",
              "21                                 0.763962   \n",
              "22                                 0.308982   \n",
              "24                                 1.775028   \n",
              "25                                 1.724474   \n",
              "26                                 0.359535   \n",
              "27                                -0.095445   \n",
              "29                                -0.095445   \n",
              "\n",
              "            standard_scaled_flipper_length_mm             onehotencoded_sex  \\\n",
              "penguin_id                                                                    \n",
              "0                                    1.772656  [{'index': 3, 'value': 1.0}]   \n",
              "2                                    0.100476  [{'index': 3, 'value': 1.0}]   \n",
              "3                                   -0.917372  [{'index': 2, 'value': 1.0}]   \n",
              "4                                    1.191028  [{'index': 2, 'value': 1.0}]   \n",
              "5                                   -0.626559  [{'index': 2, 'value': 1.0}]   \n",
              "7                                    0.682104  [{'index': 2, 'value': 1.0}]   \n",
              "8                                    1.409139  [{'index': 3, 'value': 1.0}]   \n",
              "9                                   -0.263041  [{'index': 3, 'value': 1.0}]   \n",
              "10                                   0.827511  [{'index': 3, 'value': 1.0}]   \n",
              "11                                  -0.771966  [{'index': 2, 'value': 1.0}]   \n",
              "12                                   1.409139  [{'index': 3, 'value': 1.0}]   \n",
              "13                                   0.900214  [{'index': 2, 'value': 1.0}]   \n",
              "14                                  -0.626559  [{'index': 2, 'value': 1.0}]   \n",
              "16                                   1.481842  [{'index': 3, 'value': 1.0}]   \n",
              "17                                  -0.190338  [{'index': 2, 'value': 1.0}]   \n",
              "18                                   1.263732  [{'index': 3, 'value': 1.0}]   \n",
              "19                                   1.191028  [{'index': 2, 'value': 1.0}]   \n",
              "20                                   0.609401  [{'index': 2, 'value': 1.0}]   \n",
              "21                                  -0.190338  [{'index': 2, 'value': 1.0}]   \n",
              "22                                  -0.699262  [{'index': 2, 'value': 1.0}]   \n",
              "24                                  -0.699262  [{'index': 2, 'value': 1.0}]   \n",
              "25                                  -0.771966  [{'index': 3, 'value': 1.0}]   \n",
              "26                                  -0.990076  [{'index': 2, 'value': 1.0}]   \n",
              "27                                  -0.408448  [{'index': 2, 'value': 1.0}]   \n",
              "29                                   1.990767  [{'index': 3, 'value': 1.0}]   \n",
              "\n",
              "                   onehotencoded_species  \n",
              "penguin_id                                \n",
              "0           [{'index': 3, 'value': 1.0}]  \n",
              "2           [{'index': 1, 'value': 1.0}]  \n",
              "3           [{'index': 1, 'value': 1.0}]  \n",
              "4           [{'index': 3, 'value': 1.0}]  \n",
              "5           [{'index': 1, 'value': 1.0}]  \n",
              "7           [{'index': 3, 'value': 1.0}]  \n",
              "8           [{'index': 3, 'value': 1.0}]  \n",
              "9           [{'index': 1, 'value': 1.0}]  \n",
              "10          [{'index': 2, 'value': 1.0}]  \n",
              "11          [{'index': 1, 'value': 1.0}]  \n",
              "12          [{'index': 3, 'value': 1.0}]  \n",
              "13          [{'index': 3, 'value': 1.0}]  \n",
              "14          [{'index': 2, 'value': 1.0}]  \n",
              "16          [{'index': 3, 'value': 1.0}]  \n",
              "17          [{'index': 1, 'value': 1.0}]  \n",
              "18          [{'index': 3, 'value': 1.0}]  \n",
              "19          [{'index': 3, 'value': 1.0}]  \n",
              "20          [{'index': 3, 'value': 1.0}]  \n",
              "21          [{'index': 2, 'value': 1.0}]  \n",
              "22          [{'index': 1, 'value': 1.0}]  \n",
              "24          [{'index': 1, 'value': 1.0}]  \n",
              "25          [{'index': 1, 'value': 1.0}]  \n",
              "26          [{'index': 1, 'value': 1.0}]  \n",
              "27          [{'index': 2, 'value': 1.0}]  \n",
              "29          [{'index': 3, 'value': 1.0}]  \n",
              "...\n",
              "\n",
              "[267 rows x 6 columns]"
            ]
          },
          "execution_count": 18,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from bigframes.ml.compose import ColumnTransformer\n",
        "from bigframes.ml.preprocessing import OneHotEncoder\n",
        "\n",
        "# Create an aggregate transform that applies StandardScaler to the numeric columns,\n",
        "# and OneHotEncoder to the string columns\n",
        "preproc = ColumnTransformer([\n",
        "    (\"scale\", StandardScaler(), [\"culmen_length_mm\", \"culmen_depth_mm\", \"flipper_length_mm\"]),\n",
        "    (\"encode\", OneHotEncoder(), [\"species\", \"sex\", \"island\"])])\n",
        "\n",
        "# Now we can fit all columns of the training data\n",
        "preproc.fit(X_train)\n",
        "\n",
        "processed_X_train = preproc.transform(X_train)\n",
        "processed_X_test = preproc.transform(X_test)\n",
        "\n",
        "# View the processed training data\n",
        "processed_X_train"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JhoO4fctm4Q5"
      },
      "source": [
        "### Predictors\n",
        "\n",
        "Predictors are estimators that learn from data and make predictions. In addition to `.fit(...)`, a predictor implements a `.predict(...)` method, which uses what was learned during `.fit(...)` to predict an output for new data.\n",
        "\n",
        "Predictors can be further broken down into two categories:\n",
        "* Supervised predictors\n",
        "* Unsupervised predictors"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TqLItVyjslP8"
      },
      "source": [
        "#### Supervised predictors\n",
        "\n",
        "Supervised learning is when we train a model on input-output pairs, and then ask it to predict the output for new inputs. An example of such a predictor is `bigframes.ml.linear_models.LinearRegression`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 19,
      "metadata": {
        "id": "ZeloMmopm8KI"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "Query job a59bf4cc-4c92-4a68-96b1-7465fbcb3ed0 is DONE. 21.4 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:a59bf4cc-4c92-4a68-96b1-7465fbcb3ed0&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 6860c534-a218-4a55-866d-a6e011399cd9 is DONE. 28.9 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:6860c534-a218-4a55-866d-a6e011399cd9&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 1b3e8da6-2d64-4337-872e-55b874f00596 is DONE. 0 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:1b3e8da6-2d64-4337-872e-55b874f00596&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job fc118469-8dd7-4187-a3c1-7c5c2f1c5e36 is DONE. 5.7 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:fc118469-8dd7-4187-a3c1-7c5c2f1c5e36&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 544c5453-cd10-4a08-a338-601d85142df8 is DONE. 536 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:544c5453-cd10-4a08-a338-601d85142df8&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 41c82cc9-7268-40ae-a736-f7a5f2c8b413 is DONE. 0 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:41c82cc9-7268-40ae-a736-f7a5f2c8b413&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job e9836f6b-160d-4ce4-88b6-0b04f40a1549 is DONE. 5.9 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:e9836f6b-160d-4ce4-88b6-0b04f40a1549&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>predicted_body_mass_g</th>\n",
              "      <th>onehotencoded_island</th>\n",
              "      <th>standard_scaled_culmen_length_mm</th>\n",
              "      <th>standard_scaled_culmen_depth_mm</th>\n",
              "      <th>standard_scaled_flipper_length_mm</th>\n",
              "      <th>onehotencoded_sex</th>\n",
              "      <th>onehotencoded_species</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>penguin_id</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>4772.376044</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>0.220718</td>\n",
              "      <td>-1.359277</td>\n",
              "      <td>1.045621</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>15</th>\n",
              "      <td>3883.373922</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-0.510439</td>\n",
              "      <td>0.157322</td>\n",
              "      <td>-0.771966</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>28</th>\n",
              "      <td>3479.709088</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-1.058807</td>\n",
              "      <td>0.713408</td>\n",
              "      <td>-0.771966</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>32</th>\n",
              "      <td>4223.853626</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>1.463685</td>\n",
              "      <td>1.168388</td>\n",
              "      <td>0.39129</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>33</th>\n",
              "      <td>3197.623474</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-0.254534</td>\n",
              "      <td>0.056215</td>\n",
              "      <td>-0.990076</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>34</th>\n",
              "      <td>4155.26742</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-0.510439</td>\n",
              "      <td>0.460642</td>\n",
              "      <td>0.318587</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>37</th>\n",
              "      <td>3991.314095</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>1.354012</td>\n",
              "      <td>0.511195</td>\n",
              "      <td>-0.263041</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>41</th>\n",
              "      <td>3232.648242</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>-0.674949</td>\n",
              "      <td>-0.095445</td>\n",
              "      <td>-1.789814</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>47</th>\n",
              "      <td>4017.740788</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-1.168481</td>\n",
              "      <td>0.662855</td>\n",
              "      <td>-0.117634</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>52</th>\n",
              "      <td>3365.080596</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>0.458344</td>\n",
              "      <td>0.308982</td>\n",
              "      <td>-0.699262</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>56</th>\n",
              "      <td>3791.332002</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>-1.040528</td>\n",
              "      <td>0.460642</td>\n",
              "      <td>-1.135483</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>57</th>\n",
              "      <td>3547.892992</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>-0.967412</td>\n",
              "      <td>0.005662</td>\n",
              "      <td>-0.117634</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>62</th>\n",
              "      <td>5372.087702</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>0.988433</td>\n",
              "      <td>-0.752638</td>\n",
              "      <td>1.191028</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>65</th>\n",
              "      <td>4263.232169</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>1.756148</td>\n",
              "      <td>1.370601</td>\n",
              "      <td>0.318587</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>67</th>\n",
              "      <td>5234.45894</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>0.677691</td>\n",
              "      <td>-1.359277</td>\n",
              "      <td>1.045621</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>75</th>\n",
              "      <td>3979.314516</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>-1.113644</td>\n",
              "      <td>1.421155</td>\n",
              "      <td>-0.771966</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>81</th>\n",
              "      <td>3481.331391</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>0.677691</td>\n",
              "      <td>0.561748</td>\n",
              "      <td>-0.408448</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>89</th>\n",
              "      <td>3915.240555</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-0.857739</td>\n",
              "      <td>0.713408</td>\n",
              "      <td>-0.771966</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>92</th>\n",
              "      <td>3425.563946</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-0.802902</td>\n",
              "      <td>0.308982</td>\n",
              "      <td>-0.917372</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>93</th>\n",
              "      <td>4141.497717</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>-0.309371</td>\n",
              "      <td>1.168388</td>\n",
              "      <td>-0.263041</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>96</th>\n",
              "      <td>3394.72289</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-0.309371</td>\n",
              "      <td>0.662855</td>\n",
              "      <td>-1.499</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>100</th>\n",
              "      <td>3507.226918</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-0.912576</td>\n",
              "      <td>0.814515</td>\n",
              "      <td>-0.771966</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>101</th>\n",
              "      <td>4922.286202</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>0.549739</td>\n",
              "      <td>-1.308724</td>\n",
              "      <td>1.554546</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>102</th>\n",
              "      <td>4016.243221</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-0.126582</td>\n",
              "      <td>0.662855</td>\n",
              "      <td>-0.626559</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>107</th>\n",
              "      <td>4933.655362</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>1.20778</td>\n",
              "      <td>-1.005404</td>\n",
              "      <td>1.118325</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>25 rows × 7 columns</p>\n",
              "</div>[67 rows x 7 columns in total]"
            ],
            "text/plain": [
              "            predicted_body_mass_g          onehotencoded_island  \\\n",
              "penguin_id                                                        \n",
              "1                     4772.376044  [{'index': 1, 'value': 1.0}]   \n",
              "15                    3883.373922  [{'index': 2, 'value': 1.0}]   \n",
              "28                    3479.709088  [{'index': 2, 'value': 1.0}]   \n",
              "32                    4223.853626  [{'index': 2, 'value': 1.0}]   \n",
              "33                    3197.623474  [{'index': 2, 'value': 1.0}]   \n",
              "34                     4155.26742  [{'index': 2, 'value': 1.0}]   \n",
              "37                    3991.314095  [{'index': 2, 'value': 1.0}]   \n",
              "41                    3232.648242  [{'index': 3, 'value': 1.0}]   \n",
              "47                    4017.740788  [{'index': 2, 'value': 1.0}]   \n",
              "52                    3365.080596  [{'index': 2, 'value': 1.0}]   \n",
              "56                    3791.332002  [{'index': 1, 'value': 1.0}]   \n",
              "57                    3547.892992  [{'index': 1, 'value': 1.0}]   \n",
              "62                    5372.087702  [{'index': 1, 'value': 1.0}]   \n",
              "65                    4263.232169  [{'index': 2, 'value': 1.0}]   \n",
              "67                     5234.45894  [{'index': 1, 'value': 1.0}]   \n",
              "75                    3979.314516  [{'index': 1, 'value': 1.0}]   \n",
              "81                    3481.331391  [{'index': 2, 'value': 1.0}]   \n",
              "89                    3915.240555  [{'index': 2, 'value': 1.0}]   \n",
              "92                    3425.563946  [{'index': 2, 'value': 1.0}]   \n",
              "93                    4141.497717  [{'index': 1, 'value': 1.0}]   \n",
              "96                     3394.72289  [{'index': 2, 'value': 1.0}]   \n",
              "100                   3507.226918  [{'index': 2, 'value': 1.0}]   \n",
              "101                   4922.286202  [{'index': 1, 'value': 1.0}]   \n",
              "102                   4016.243221  [{'index': 2, 'value': 1.0}]   \n",
              "107                   4933.655362  [{'index': 1, 'value': 1.0}]   \n",
              "\n",
              "            standard_scaled_culmen_length_mm  standard_scaled_culmen_depth_mm  \\\n",
              "penguin_id                                                                      \n",
              "1                                   0.220718                        -1.359277   \n",
              "15                                 -0.510439                         0.157322   \n",
              "28                                 -1.058807                         0.713408   \n",
              "32                                  1.463685                         1.168388   \n",
              "33                                 -0.254534                         0.056215   \n",
              "34                                 -0.510439                         0.460642   \n",
              "37                                  1.354012                         0.511195   \n",
              "41                                 -0.674949                        -0.095445   \n",
              "47                                 -1.168481                         0.662855   \n",
              "52                                  0.458344                         0.308982   \n",
              "56                                 -1.040528                         0.460642   \n",
              "57                                 -0.967412                         0.005662   \n",
              "62                                  0.988433                        -0.752638   \n",
              "65                                  1.756148                         1.370601   \n",
              "67                                  0.677691                        -1.359277   \n",
              "75                                 -1.113644                         1.421155   \n",
              "81                                  0.677691                         0.561748   \n",
              "89                                 -0.857739                         0.713408   \n",
              "92                                 -0.802902                         0.308982   \n",
              "93                                 -0.309371                         1.168388   \n",
              "96                                 -0.309371                         0.662855   \n",
              "100                                -0.912576                         0.814515   \n",
              "101                                 0.549739                        -1.308724   \n",
              "102                                -0.126582                         0.662855   \n",
              "107                                  1.20778                        -1.005404   \n",
              "\n",
              "            standard_scaled_flipper_length_mm             onehotencoded_sex  \\\n",
              "penguin_id                                                                    \n",
              "1                                    1.045621  [{'index': 2, 'value': 1.0}]   \n",
              "15                                  -0.771966  [{'index': 3, 'value': 1.0}]   \n",
              "28                                  -0.771966  [{'index': 2, 'value': 1.0}]   \n",
              "32                                    0.39129  [{'index': 3, 'value': 1.0}]   \n",
              "33                                  -0.990076  [{'index': 2, 'value': 1.0}]   \n",
              "34                                   0.318587  [{'index': 3, 'value': 1.0}]   \n",
              "37                                  -0.263041  [{'index': 3, 'value': 1.0}]   \n",
              "41                                  -1.789814  [{'index': 2, 'value': 1.0}]   \n",
              "47                                  -0.117634  [{'index': 3, 'value': 1.0}]   \n",
              "52                                  -0.699262  [{'index': 2, 'value': 1.0}]   \n",
              "56                                  -1.135483  [{'index': 3, 'value': 1.0}]   \n",
              "57                                  -0.117634  [{'index': 2, 'value': 1.0}]   \n",
              "62                                   1.191028  [{'index': 3, 'value': 1.0}]   \n",
              "65                                   0.318587  [{'index': 3, 'value': 1.0}]   \n",
              "67                                   1.045621  [{'index': 3, 'value': 1.0}]   \n",
              "75                                  -0.771966  [{'index': 3, 'value': 1.0}]   \n",
              "81                                  -0.408448  [{'index': 2, 'value': 1.0}]   \n",
              "89                                  -0.771966  [{'index': 3, 'value': 1.0}]   \n",
              "92                                  -0.917372  [{'index': 2, 'value': 1.0}]   \n",
              "93                                  -0.263041  [{'index': 3, 'value': 1.0}]   \n",
              "96                                     -1.499  [{'index': 2, 'value': 1.0}]   \n",
              "100                                 -0.771966  [{'index': 2, 'value': 1.0}]   \n",
              "101                                  1.554546  [{'index': 2, 'value': 1.0}]   \n",
              "102                                 -0.626559  [{'index': 3, 'value': 1.0}]   \n",
              "107                                  1.118325  [{'index': 2, 'value': 1.0}]   \n",
              "\n",
              "                   onehotencoded_species  \n",
              "penguin_id                                \n",
              "1           [{'index': 3, 'value': 1.0}]  \n",
              "15          [{'index': 1, 'value': 1.0}]  \n",
              "28          [{'index': 1, 'value': 1.0}]  \n",
              "32          [{'index': 2, 'value': 1.0}]  \n",
              "33          [{'index': 2, 'value': 1.0}]  \n",
              "34          [{'index': 1, 'value': 1.0}]  \n",
              "37          [{'index': 2, 'value': 1.0}]  \n",
              "41          [{'index': 1, 'value': 1.0}]  \n",
              "47          [{'index': 1, 'value': 1.0}]  \n",
              "52          [{'index': 2, 'value': 1.0}]  \n",
              "56          [{'index': 1, 'value': 1.0}]  \n",
              "57          [{'index': 1, 'value': 1.0}]  \n",
              "62          [{'index': 3, 'value': 1.0}]  \n",
              "65          [{'index': 2, 'value': 1.0}]  \n",
              "67          [{'index': 3, 'value': 1.0}]  \n",
              "75          [{'index': 1, 'value': 1.0}]  \n",
              "81          [{'index': 2, 'value': 1.0}]  \n",
              "89          [{'index': 1, 'value': 1.0}]  \n",
              "92          [{'index': 1, 'value': 1.0}]  \n",
              "93          [{'index': 1, 'value': 1.0}]  \n",
              "96          [{'index': 1, 'value': 1.0}]  \n",
              "100         [{'index': 1, 'value': 1.0}]  \n",
              "101         [{'index': 3, 'value': 1.0}]  \n",
              "102         [{'index': 1, 'value': 1.0}]  \n",
              "107         [{'index': 3, 'value': 1.0}]  \n",
              "\n",
              "[67 rows x 7 columns]"
            ]
          },
          "execution_count": 19,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from bigframes.ml.linear_model import LinearRegression\n",
        "\n",
        "linreg = LinearRegression()\n",
        "\n",
        "# Learn from the training data how to predict output y\n",
        "linreg.fit(processed_X_train, y_train)\n",
        "\n",
        "# Predict y for the test data\n",
        "predicted_y_test = linreg.predict(processed_X_test)\n",
        "\n",
        "# View predictions\n",
        "predicted_y_test"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "z42qesW_nAIf"
      },
      "source": [
        "#### Unsupervised predictors\n",
        "\n",
        "In unsupervised learning, there are no known outputs in the training data, instead the model learns on input data alone and predicts something else. An example of an unsupervised predictor is `bigframes.ml.cluster.KMeans`, which learns how to fit input data to a target number of clusters."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 20,
      "metadata": {
        "id": "M13zd02znCIg"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "Query job 728068d3-2349-4636-a030-016b500a9812 is DONE. 23.5 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:728068d3-2349-4636-a030-016b500a9812&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 37bac685-2afa-4ece-b3a3-e0b84a92c65f is DONE. 0 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:37bac685-2afa-4ece-b3a3-e0b84a92c65f&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 38416629-4615-45f5-9e27-d9164124f755 is DONE. 6.2 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:38416629-4615-45f5-9e27-d9164124f755&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 0241ea1c-8d96-418a-b3d6-08d819854954 is DONE. 536 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:0241ea1c-8d96-418a-b3d6-08d819854954&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 405bcf9b-d652-42f3-931e-12ca0310fe4f is DONE. 0 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:405bcf9b-d652-42f3-931e-12ca0310fe4f&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 21ca6f31-2ea2-4f71-b030-c738bf5afe27 is DONE. 10.2 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:21ca6f31-2ea2-4f71-b030-c738bf5afe27&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>CENTROID_ID</th>\n",
              "      <th>NEAREST_CENTROIDS_DISTANCE</th>\n",
              "      <th>onehotencoded_island</th>\n",
              "      <th>standard_scaled_culmen_length_mm</th>\n",
              "      <th>standard_scaled_culmen_depth_mm</th>\n",
              "      <th>standard_scaled_flipper_length_mm</th>\n",
              "      <th>onehotencoded_sex</th>\n",
              "      <th>onehotencoded_species</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>penguin_id</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>3</td>\n",
              "      <td>[{'CENTROID_ID': 3, 'DISTANCE': 0.857057881337...</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>0.220718</td>\n",
              "      <td>-1.359277</td>\n",
              "      <td>1.045621</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>15</th>\n",
              "      <td>4</td>\n",
              "      <td>[{'CENTROID_ID': 4, 'DISTANCE': 1.181613302004...</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-0.510439</td>\n",
              "      <td>0.157322</td>\n",
              "      <td>-0.771966</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>28</th>\n",
              "      <td>1</td>\n",
              "      <td>[{'CENTROID_ID': 1, 'DISTANCE': 1.006856853050...</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-1.058807</td>\n",
              "      <td>0.713408</td>\n",
              "      <td>-0.771966</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>32</th>\n",
              "      <td>2</td>\n",
              "      <td>[{'CENTROID_ID': 2, 'DISTANCE': 1.237504384283...</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>1.463685</td>\n",
              "      <td>1.168388</td>\n",
              "      <td>0.39129</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>33</th>\n",
              "      <td>2</td>\n",
              "      <td>[{'CENTROID_ID': 2, 'DISTANCE': 1.656439702919...</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-0.254534</td>\n",
              "      <td>0.056215</td>\n",
              "      <td>-0.990076</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>34</th>\n",
              "      <td>4</td>\n",
              "      <td>[{'CENTROID_ID': 4, 'DISTANCE': 1.343792119214...</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-0.510439</td>\n",
              "      <td>0.460642</td>\n",
              "      <td>0.318587</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>37</th>\n",
              "      <td>2</td>\n",
              "      <td>[{'CENTROID_ID': 2, 'DISTANCE': 0.816670297369...</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>1.354012</td>\n",
              "      <td>0.511195</td>\n",
              "      <td>-0.263041</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>41</th>\n",
              "      <td>1</td>\n",
              "      <td>[{'CENTROID_ID': 1, 'DISTANCE': 1.317560921596...</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>-0.674949</td>\n",
              "      <td>-0.095445</td>\n",
              "      <td>-1.789814</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>47</th>\n",
              "      <td>4</td>\n",
              "      <td>[{'CENTROID_ID': 4, 'DISTANCE': 1.135112005343...</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-1.168481</td>\n",
              "      <td>0.662855</td>\n",
              "      <td>-0.117634</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>52</th>\n",
              "      <td>2</td>\n",
              "      <td>[{'CENTROID_ID': 2, 'DISTANCE': 1.004096945181...</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>0.458344</td>\n",
              "      <td>0.308982</td>\n",
              "      <td>-0.699262</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>56</th>\n",
              "      <td>4</td>\n",
              "      <td>[{'CENTROID_ID': 4, 'DISTANCE': 1.218648668822...</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>-1.040528</td>\n",
              "      <td>0.460642</td>\n",
              "      <td>-1.135483</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>57</th>\n",
              "      <td>1</td>\n",
              "      <td>[{'CENTROID_ID': 1, 'DISTANCE': 1.238466630273...</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>-0.967412</td>\n",
              "      <td>0.005662</td>\n",
              "      <td>-0.117634</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>62</th>\n",
              "      <td>3</td>\n",
              "      <td>[{'CENTROID_ID': 3, 'DISTANCE': 0.876984617451...</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>0.988433</td>\n",
              "      <td>-0.752638</td>\n",
              "      <td>1.191028</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>65</th>\n",
              "      <td>2</td>\n",
              "      <td>[{'CENTROID_ID': 2, 'DISTANCE': 1.439604004538...</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>1.756148</td>\n",
              "      <td>1.370601</td>\n",
              "      <td>0.318587</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>67</th>\n",
              "      <td>3</td>\n",
              "      <td>[{'CENTROID_ID': 3, 'DISTANCE': 0.763112987694...</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>0.677691</td>\n",
              "      <td>-1.359277</td>\n",
              "      <td>1.045621</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>75</th>\n",
              "      <td>4</td>\n",
              "      <td>[{'CENTROID_ID': 4, 'DISTANCE': 1.075788925734...</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>-1.113644</td>\n",
              "      <td>1.421155</td>\n",
              "      <td>-0.771966</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>81</th>\n",
              "      <td>2</td>\n",
              "      <td>[{'CENTROID_ID': 2, 'DISTANCE': 0.777307801541...</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>0.677691</td>\n",
              "      <td>0.561748</td>\n",
              "      <td>-0.408448</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>89</th>\n",
              "      <td>4</td>\n",
              "      <td>[{'CENTROID_ID': 4, 'DISTANCE': 0.891303183824...</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-0.857739</td>\n",
              "      <td>0.713408</td>\n",
              "      <td>-0.771966</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>92</th>\n",
              "      <td>1</td>\n",
              "      <td>[{'CENTROID_ID': 1, 'DISTANCE': 0.934676470689...</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-0.802902</td>\n",
              "      <td>0.308982</td>\n",
              "      <td>-0.917372</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>93</th>\n",
              "      <td>4</td>\n",
              "      <td>[{'CENTROID_ID': 4, 'DISTANCE': 0.984620018517...</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>-0.309371</td>\n",
              "      <td>1.168388</td>\n",
              "      <td>-0.263041</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>96</th>\n",
              "      <td>1</td>\n",
              "      <td>[{'CENTROID_ID': 1, 'DISTANCE': 1.446939975674...</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-0.309371</td>\n",
              "      <td>0.662855</td>\n",
              "      <td>-1.499</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>100</th>\n",
              "      <td>1</td>\n",
              "      <td>[{'CENTROID_ID': 1, 'DISTANCE': 1.101117711572...</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-0.912576</td>\n",
              "      <td>0.814515</td>\n",
              "      <td>-0.771966</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>101</th>\n",
              "      <td>3</td>\n",
              "      <td>[{'CENTROID_ID': 3, 'DISTANCE': 0.823832007899...</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>0.549739</td>\n",
              "      <td>-1.308724</td>\n",
              "      <td>1.554546</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>102</th>\n",
              "      <td>4</td>\n",
              "      <td>[{'CENTROID_ID': 4, 'DISTANCE': 0.995348310182...</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>-0.126582</td>\n",
              "      <td>0.662855</td>\n",
              "      <td>-0.626559</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>107</th>\n",
              "      <td>3</td>\n",
              "      <td>[{'CENTROID_ID': 3, 'DISTANCE': 0.930021405831...</td>\n",
              "      <td>[{'index': 1, 'value': 1.0}]</td>\n",
              "      <td>1.20778</td>\n",
              "      <td>-1.005404</td>\n",
              "      <td>1.118325</td>\n",
              "      <td>[{'index': 2, 'value': 1.0}]</td>\n",
              "      <td>[{'index': 3, 'value': 1.0}]</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>25 rows × 8 columns</p>\n",
              "</div>[67 rows x 8 columns in total]"
            ],
            "text/plain": [
              "            CENTROID_ID                         NEAREST_CENTROIDS_DISTANCE  \\\n",
              "penguin_id                                                                   \n",
              "1                     3  [{'CENTROID_ID': 3, 'DISTANCE': 0.857057881337...   \n",
              "15                    4  [{'CENTROID_ID': 4, 'DISTANCE': 1.181613302004...   \n",
              "28                    1  [{'CENTROID_ID': 1, 'DISTANCE': 1.006856853050...   \n",
              "32                    2  [{'CENTROID_ID': 2, 'DISTANCE': 1.237504384283...   \n",
              "33                    2  [{'CENTROID_ID': 2, 'DISTANCE': 1.656439702919...   \n",
              "34                    4  [{'CENTROID_ID': 4, 'DISTANCE': 1.343792119214...   \n",
              "37                    2  [{'CENTROID_ID': 2, 'DISTANCE': 0.816670297369...   \n",
              "41                    1  [{'CENTROID_ID': 1, 'DISTANCE': 1.317560921596...   \n",
              "47                    4  [{'CENTROID_ID': 4, 'DISTANCE': 1.135112005343...   \n",
              "52                    2  [{'CENTROID_ID': 2, 'DISTANCE': 1.004096945181...   \n",
              "56                    4  [{'CENTROID_ID': 4, 'DISTANCE': 1.218648668822...   \n",
              "57                    1  [{'CENTROID_ID': 1, 'DISTANCE': 1.238466630273...   \n",
              "62                    3  [{'CENTROID_ID': 3, 'DISTANCE': 0.876984617451...   \n",
              "65                    2  [{'CENTROID_ID': 2, 'DISTANCE': 1.439604004538...   \n",
              "67                    3  [{'CENTROID_ID': 3, 'DISTANCE': 0.763112987694...   \n",
              "75                    4  [{'CENTROID_ID': 4, 'DISTANCE': 1.075788925734...   \n",
              "81                    2  [{'CENTROID_ID': 2, 'DISTANCE': 0.777307801541...   \n",
              "89                    4  [{'CENTROID_ID': 4, 'DISTANCE': 0.891303183824...   \n",
              "92                    1  [{'CENTROID_ID': 1, 'DISTANCE': 0.934676470689...   \n",
              "93                    4  [{'CENTROID_ID': 4, 'DISTANCE': 0.984620018517...   \n",
              "96                    1  [{'CENTROID_ID': 1, 'DISTANCE': 1.446939975674...   \n",
              "100                   1  [{'CENTROID_ID': 1, 'DISTANCE': 1.101117711572...   \n",
              "101                   3  [{'CENTROID_ID': 3, 'DISTANCE': 0.823832007899...   \n",
              "102                   4  [{'CENTROID_ID': 4, 'DISTANCE': 0.995348310182...   \n",
              "107                   3  [{'CENTROID_ID': 3, 'DISTANCE': 0.930021405831...   \n",
              "\n",
              "                    onehotencoded_island  standard_scaled_culmen_length_mm  \\\n",
              "penguin_id                                                                   \n",
              "1           [{'index': 1, 'value': 1.0}]                          0.220718   \n",
              "15          [{'index': 2, 'value': 1.0}]                         -0.510439   \n",
              "28          [{'index': 2, 'value': 1.0}]                         -1.058807   \n",
              "32          [{'index': 2, 'value': 1.0}]                          1.463685   \n",
              "33          [{'index': 2, 'value': 1.0}]                         -0.254534   \n",
              "34          [{'index': 2, 'value': 1.0}]                         -0.510439   \n",
              "37          [{'index': 2, 'value': 1.0}]                          1.354012   \n",
              "41          [{'index': 3, 'value': 1.0}]                         -0.674949   \n",
              "47          [{'index': 2, 'value': 1.0}]                         -1.168481   \n",
              "52          [{'index': 2, 'value': 1.0}]                          0.458344   \n",
              "56          [{'index': 1, 'value': 1.0}]                         -1.040528   \n",
              "57          [{'index': 1, 'value': 1.0}]                         -0.967412   \n",
              "62          [{'index': 1, 'value': 1.0}]                          0.988433   \n",
              "65          [{'index': 2, 'value': 1.0}]                          1.756148   \n",
              "67          [{'index': 1, 'value': 1.0}]                          0.677691   \n",
              "75          [{'index': 1, 'value': 1.0}]                         -1.113644   \n",
              "81          [{'index': 2, 'value': 1.0}]                          0.677691   \n",
              "89          [{'index': 2, 'value': 1.0}]                         -0.857739   \n",
              "92          [{'index': 2, 'value': 1.0}]                         -0.802902   \n",
              "93          [{'index': 1, 'value': 1.0}]                         -0.309371   \n",
              "96          [{'index': 2, 'value': 1.0}]                         -0.309371   \n",
              "100         [{'index': 2, 'value': 1.0}]                         -0.912576   \n",
              "101         [{'index': 1, 'value': 1.0}]                          0.549739   \n",
              "102         [{'index': 2, 'value': 1.0}]                         -0.126582   \n",
              "107         [{'index': 1, 'value': 1.0}]                           1.20778   \n",
              "\n",
              "            standard_scaled_culmen_depth_mm  \\\n",
              "penguin_id                                    \n",
              "1                                 -1.359277   \n",
              "15                                 0.157322   \n",
              "28                                 0.713408   \n",
              "32                                 1.168388   \n",
              "33                                 0.056215   \n",
              "34                                 0.460642   \n",
              "37                                 0.511195   \n",
              "41                                -0.095445   \n",
              "47                                 0.662855   \n",
              "52                                 0.308982   \n",
              "56                                 0.460642   \n",
              "57                                 0.005662   \n",
              "62                                -0.752638   \n",
              "65                                 1.370601   \n",
              "67                                -1.359277   \n",
              "75                                 1.421155   \n",
              "81                                 0.561748   \n",
              "89                                 0.713408   \n",
              "92                                 0.308982   \n",
              "93                                 1.168388   \n",
              "96                                 0.662855   \n",
              "100                                0.814515   \n",
              "101                               -1.308724   \n",
              "102                                0.662855   \n",
              "107                               -1.005404   \n",
              "\n",
              "            standard_scaled_flipper_length_mm             onehotencoded_sex  \\\n",
              "penguin_id                                                                    \n",
              "1                                    1.045621  [{'index': 2, 'value': 1.0}]   \n",
              "15                                  -0.771966  [{'index': 3, 'value': 1.0}]   \n",
              "28                                  -0.771966  [{'index': 2, 'value': 1.0}]   \n",
              "32                                    0.39129  [{'index': 3, 'value': 1.0}]   \n",
              "33                                  -0.990076  [{'index': 2, 'value': 1.0}]   \n",
              "34                                   0.318587  [{'index': 3, 'value': 1.0}]   \n",
              "37                                  -0.263041  [{'index': 3, 'value': 1.0}]   \n",
              "41                                  -1.789814  [{'index': 2, 'value': 1.0}]   \n",
              "47                                  -0.117634  [{'index': 3, 'value': 1.0}]   \n",
              "52                                  -0.699262  [{'index': 2, 'value': 1.0}]   \n",
              "56                                  -1.135483  [{'index': 3, 'value': 1.0}]   \n",
              "57                                  -0.117634  [{'index': 2, 'value': 1.0}]   \n",
              "62                                   1.191028  [{'index': 3, 'value': 1.0}]   \n",
              "65                                   0.318587  [{'index': 3, 'value': 1.0}]   \n",
              "67                                   1.045621  [{'index': 3, 'value': 1.0}]   \n",
              "75                                  -0.771966  [{'index': 3, 'value': 1.0}]   \n",
              "81                                  -0.408448  [{'index': 2, 'value': 1.0}]   \n",
              "89                                  -0.771966  [{'index': 3, 'value': 1.0}]   \n",
              "92                                  -0.917372  [{'index': 2, 'value': 1.0}]   \n",
              "93                                  -0.263041  [{'index': 3, 'value': 1.0}]   \n",
              "96                                     -1.499  [{'index': 2, 'value': 1.0}]   \n",
              "100                                 -0.771966  [{'index': 2, 'value': 1.0}]   \n",
              "101                                  1.554546  [{'index': 2, 'value': 1.0}]   \n",
              "102                                 -0.626559  [{'index': 3, 'value': 1.0}]   \n",
              "107                                  1.118325  [{'index': 2, 'value': 1.0}]   \n",
              "\n",
              "                   onehotencoded_species  \n",
              "penguin_id                                \n",
              "1           [{'index': 3, 'value': 1.0}]  \n",
              "15          [{'index': 1, 'value': 1.0}]  \n",
              "28          [{'index': 1, 'value': 1.0}]  \n",
              "32          [{'index': 2, 'value': 1.0}]  \n",
              "33          [{'index': 2, 'value': 1.0}]  \n",
              "34          [{'index': 1, 'value': 1.0}]  \n",
              "37          [{'index': 2, 'value': 1.0}]  \n",
              "41          [{'index': 1, 'value': 1.0}]  \n",
              "47          [{'index': 1, 'value': 1.0}]  \n",
              "52          [{'index': 2, 'value': 1.0}]  \n",
              "56          [{'index': 1, 'value': 1.0}]  \n",
              "57          [{'index': 1, 'value': 1.0}]  \n",
              "62          [{'index': 3, 'value': 1.0}]  \n",
              "65          [{'index': 2, 'value': 1.0}]  \n",
              "67          [{'index': 3, 'value': 1.0}]  \n",
              "75          [{'index': 1, 'value': 1.0}]  \n",
              "81          [{'index': 2, 'value': 1.0}]  \n",
              "89          [{'index': 1, 'value': 1.0}]  \n",
              "92          [{'index': 1, 'value': 1.0}]  \n",
              "93          [{'index': 1, 'value': 1.0}]  \n",
              "96          [{'index': 1, 'value': 1.0}]  \n",
              "100         [{'index': 1, 'value': 1.0}]  \n",
              "101         [{'index': 3, 'value': 1.0}]  \n",
              "102         [{'index': 1, 'value': 1.0}]  \n",
              "107         [{'index': 3, 'value': 1.0}]  \n",
              "\n",
              "[67 rows x 8 columns]"
            ]
          },
          "execution_count": 20,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from bigframes.ml.cluster import KMeans\n",
        "\n",
        "# Specify KMeans with four clusters\n",
        "kmeans = KMeans(n_clusters=4)\n",
        "\n",
        "# Fit data\n",
        "kmeans.fit(processed_X_train)\n",
        "\n",
        "# View predictions\n",
        "kmeans.predict(processed_X_test)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DFwsIbscnEvh"
      },
      "source": [
        "## Pipelines\n",
        "\n",
        "Transfomers and predictors can be chained into a single estimator component using `bigframes.ml.pipeline.Pipeline`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 21,
      "metadata": {
        "id": "Ku2OXqgJnEeR"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Pipeline(steps=[('preproc',\n",
              "                 ColumnTransformer(transformers=[('scale', StandardScaler(),\n",
              "                                                  ['culmen_length_mm',\n",
              "                                                   'culmen_depth_mm',\n",
              "                                                   'flipper_length_mm']),\n",
              "                                                 ('encode', OneHotEncoder(),\n",
              "                                                  ['species', 'sex',\n",
              "                                                   'island'])])),\n",
              "                ('linreg', LinearRegression())])"
            ]
          },
          "execution_count": 21,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from bigframes.ml.pipeline import Pipeline\n",
        "\n",
        "pipeline = Pipeline([\n",
        "  ('preproc', preproc),\n",
        "  ('linreg', linreg)\n",
        "])\n",
        "\n",
        "# Print our pipeline\n",
        "pipeline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cCQCY_6wnKz_"
      },
      "source": [
        "The pipeline simplifies the workflow by applying each of its component steps automatically:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "metadata": {
        "id": "hsF7FYagnMko"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "Query job 95b43592-b198-4f9e-a990-4e837b82121f is DONE. 24.8 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:95b43592-b198-4f9e-a990-4e837b82121f&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 615b2afb-0c76-45d6-82c7-bde7c8b2b3a4 is DONE. 8.5 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:615b2afb-0c76-45d6-82c7-bde7c8b2b3a4&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job cf2ed3ca-01bf-4cb6-a71a-d6e30a8428f6 is DONE. 0 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:cf2ed3ca-01bf-4cb6-a71a-d6e30a8428f6&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job d9780763-1d2b-494d-a778-20364c52bd08 is DONE. 29.6 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:d9780763-1d2b-494d-a778-20364c52bd08&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job f01296ba-7cd0-4d06-b25a-b5697e46bbf7 is DONE. 536 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:f01296ba-7cd0-4d06-b25a-b5697e46bbf7&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 5b6fe451-2f8e-471e-a6a0-00b9bffaa826 is DONE. 0 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:5b6fe451-2f8e-471e-a6a0-00b9bffaa826&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 6a81883b-0514-4251-9f63-490b6346bb8b is DONE. 6.1 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:6a81883b-0514-4251-9f63-490b6346bb8b&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>predicted_body_mass_g</th>\n",
              "      <th>island</th>\n",
              "      <th>culmen_length_mm</th>\n",
              "      <th>culmen_depth_mm</th>\n",
              "      <th>flipper_length_mm</th>\n",
              "      <th>sex</th>\n",
              "      <th>species</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>penguin_id</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>4772.374547</td>\n",
              "      <td>Biscoe</td>\n",
              "      <td>45.1</td>\n",
              "      <td>14.5</td>\n",
              "      <td>215.0</td>\n",
              "      <td>FEMALE</td>\n",
              "      <td>Gentoo penguin (Pygoscelis papua)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>15</th>\n",
              "      <td>3883.371052</td>\n",
              "      <td>Dream</td>\n",
              "      <td>41.1</td>\n",
              "      <td>17.5</td>\n",
              "      <td>190.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>28</th>\n",
              "      <td>3479.706166</td>\n",
              "      <td>Dream</td>\n",
              "      <td>38.1</td>\n",
              "      <td>18.6</td>\n",
              "      <td>190.0</td>\n",
              "      <td>FEMALE</td>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>32</th>\n",
              "      <td>4223.851137</td>\n",
              "      <td>Dream</td>\n",
              "      <td>51.9</td>\n",
              "      <td>19.5</td>\n",
              "      <td>206.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Chinstrap penguin (Pygoscelis antarctica)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>33</th>\n",
              "      <td>3197.620461</td>\n",
              "      <td>Dream</td>\n",
              "      <td>42.5</td>\n",
              "      <td>17.3</td>\n",
              "      <td>187.0</td>\n",
              "      <td>FEMALE</td>\n",
              "      <td>Chinstrap penguin (Pygoscelis antarctica)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>34</th>\n",
              "      <td>4155.265191</td>\n",
              "      <td>Dream</td>\n",
              "      <td>41.1</td>\n",
              "      <td>18.1</td>\n",
              "      <td>205.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>37</th>\n",
              "      <td>3991.311319</td>\n",
              "      <td>Dream</td>\n",
              "      <td>51.3</td>\n",
              "      <td>18.2</td>\n",
              "      <td>197.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Chinstrap penguin (Pygoscelis antarctica)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>41</th>\n",
              "      <td>3232.644783</td>\n",
              "      <td>Torgersen</td>\n",
              "      <td>40.2</td>\n",
              "      <td>17.0</td>\n",
              "      <td>176.0</td>\n",
              "      <td>FEMALE</td>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>47</th>\n",
              "      <td>4017.738303</td>\n",
              "      <td>Dream</td>\n",
              "      <td>37.5</td>\n",
              "      <td>18.5</td>\n",
              "      <td>199.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>52</th>\n",
              "      <td>3365.077659</td>\n",
              "      <td>Dream</td>\n",
              "      <td>46.4</td>\n",
              "      <td>17.8</td>\n",
              "      <td>191.0</td>\n",
              "      <td>FEMALE</td>\n",
              "      <td>Chinstrap penguin (Pygoscelis antarctica)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>56</th>\n",
              "      <td>3791.328893</td>\n",
              "      <td>Biscoe</td>\n",
              "      <td>38.2</td>\n",
              "      <td>18.1</td>\n",
              "      <td>185.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>57</th>\n",
              "      <td>3547.890609</td>\n",
              "      <td>Biscoe</td>\n",
              "      <td>38.6</td>\n",
              "      <td>17.2</td>\n",
              "      <td>199.0</td>\n",
              "      <td>FEMALE</td>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>62</th>\n",
              "      <td>5372.086117</td>\n",
              "      <td>Biscoe</td>\n",
              "      <td>49.3</td>\n",
              "      <td>15.7</td>\n",
              "      <td>217.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Gentoo penguin (Pygoscelis papua)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>65</th>\n",
              "      <td>4263.229571</td>\n",
              "      <td>Dream</td>\n",
              "      <td>53.5</td>\n",
              "      <td>19.9</td>\n",
              "      <td>205.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Chinstrap penguin (Pygoscelis antarctica)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>67</th>\n",
              "      <td>5234.457401</td>\n",
              "      <td>Biscoe</td>\n",
              "      <td>47.6</td>\n",
              "      <td>14.5</td>\n",
              "      <td>215.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Gentoo penguin (Pygoscelis papua)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>75</th>\n",
              "      <td>3979.311469</td>\n",
              "      <td>Biscoe</td>\n",
              "      <td>37.8</td>\n",
              "      <td>20.0</td>\n",
              "      <td>190.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>81</th>\n",
              "      <td>3481.328573</td>\n",
              "      <td>Dream</td>\n",
              "      <td>47.6</td>\n",
              "      <td>18.3</td>\n",
              "      <td>195.0</td>\n",
              "      <td>FEMALE</td>\n",
              "      <td>Chinstrap penguin (Pygoscelis antarctica)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>89</th>\n",
              "      <td>3915.237615</td>\n",
              "      <td>Dream</td>\n",
              "      <td>39.2</td>\n",
              "      <td>18.6</td>\n",
              "      <td>190.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>92</th>\n",
              "      <td>3425.560982</td>\n",
              "      <td>Dream</td>\n",
              "      <td>39.5</td>\n",
              "      <td>17.8</td>\n",
              "      <td>188.0</td>\n",
              "      <td>FEMALE</td>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>93</th>\n",
              "      <td>4141.494969</td>\n",
              "      <td>Biscoe</td>\n",
              "      <td>42.2</td>\n",
              "      <td>19.5</td>\n",
              "      <td>197.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>96</th>\n",
              "      <td>3394.719445</td>\n",
              "      <td>Dream</td>\n",
              "      <td>42.2</td>\n",
              "      <td>18.5</td>\n",
              "      <td>180.0</td>\n",
              "      <td>FEMALE</td>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>100</th>\n",
              "      <td>3507.223965</td>\n",
              "      <td>Dream</td>\n",
              "      <td>38.9</td>\n",
              "      <td>18.8</td>\n",
              "      <td>190.0</td>\n",
              "      <td>FEMALE</td>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>101</th>\n",
              "      <td>4922.284991</td>\n",
              "      <td>Biscoe</td>\n",
              "      <td>46.9</td>\n",
              "      <td>14.6</td>\n",
              "      <td>222.0</td>\n",
              "      <td>FEMALE</td>\n",
              "      <td>Gentoo penguin (Pygoscelis papua)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>102</th>\n",
              "      <td>4016.240318</td>\n",
              "      <td>Dream</td>\n",
              "      <td>43.2</td>\n",
              "      <td>18.5</td>\n",
              "      <td>192.0</td>\n",
              "      <td>MALE</td>\n",
              "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>107</th>\n",
              "      <td>4933.653758</td>\n",
              "      <td>Biscoe</td>\n",
              "      <td>50.5</td>\n",
              "      <td>15.2</td>\n",
              "      <td>216.0</td>\n",
              "      <td>FEMALE</td>\n",
              "      <td>Gentoo penguin (Pygoscelis papua)</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>25 rows × 7 columns</p>\n",
              "</div>[67 rows x 7 columns in total]"
            ],
            "text/plain": [
              "            predicted_body_mass_g     island  culmen_length_mm  \\\n",
              "penguin_id                                                       \n",
              "1                     4772.374547     Biscoe              45.1   \n",
              "15                    3883.371052      Dream              41.1   \n",
              "28                    3479.706166      Dream              38.1   \n",
              "32                    4223.851137      Dream              51.9   \n",
              "33                    3197.620461      Dream              42.5   \n",
              "34                    4155.265191      Dream              41.1   \n",
              "37                    3991.311319      Dream              51.3   \n",
              "41                    3232.644783  Torgersen              40.2   \n",
              "47                    4017.738303      Dream              37.5   \n",
              "52                    3365.077659      Dream              46.4   \n",
              "56                    3791.328893     Biscoe              38.2   \n",
              "57                    3547.890609     Biscoe              38.6   \n",
              "62                    5372.086117     Biscoe              49.3   \n",
              "65                    4263.229571      Dream              53.5   \n",
              "67                    5234.457401     Biscoe              47.6   \n",
              "75                    3979.311469     Biscoe              37.8   \n",
              "81                    3481.328573      Dream              47.6   \n",
              "89                    3915.237615      Dream              39.2   \n",
              "92                    3425.560982      Dream              39.5   \n",
              "93                    4141.494969     Biscoe              42.2   \n",
              "96                    3394.719445      Dream              42.2   \n",
              "100                   3507.223965      Dream              38.9   \n",
              "101                   4922.284991     Biscoe              46.9   \n",
              "102                   4016.240318      Dream              43.2   \n",
              "107                   4933.653758     Biscoe              50.5   \n",
              "\n",
              "            culmen_depth_mm  flipper_length_mm     sex  \\\n",
              "penguin_id                                               \n",
              "1                      14.5              215.0  FEMALE   \n",
              "15                     17.5              190.0    MALE   \n",
              "28                     18.6              190.0  FEMALE   \n",
              "32                     19.5              206.0    MALE   \n",
              "33                     17.3              187.0  FEMALE   \n",
              "34                     18.1              205.0    MALE   \n",
              "37                     18.2              197.0    MALE   \n",
              "41                     17.0              176.0  FEMALE   \n",
              "47                     18.5              199.0    MALE   \n",
              "52                     17.8              191.0  FEMALE   \n",
              "56                     18.1              185.0    MALE   \n",
              "57                     17.2              199.0  FEMALE   \n",
              "62                     15.7              217.0    MALE   \n",
              "65                     19.9              205.0    MALE   \n",
              "67                     14.5              215.0    MALE   \n",
              "75                     20.0              190.0    MALE   \n",
              "81                     18.3              195.0  FEMALE   \n",
              "89                     18.6              190.0    MALE   \n",
              "92                     17.8              188.0  FEMALE   \n",
              "93                     19.5              197.0    MALE   \n",
              "96                     18.5              180.0  FEMALE   \n",
              "100                    18.8              190.0  FEMALE   \n",
              "101                    14.6              222.0  FEMALE   \n",
              "102                    18.5              192.0    MALE   \n",
              "107                    15.2              216.0  FEMALE   \n",
              "\n",
              "                                              species  \n",
              "penguin_id                                             \n",
              "1                   Gentoo penguin (Pygoscelis papua)  \n",
              "15                Adelie Penguin (Pygoscelis adeliae)  \n",
              "28                Adelie Penguin (Pygoscelis adeliae)  \n",
              "32          Chinstrap penguin (Pygoscelis antarctica)  \n",
              "33          Chinstrap penguin (Pygoscelis antarctica)  \n",
              "34                Adelie Penguin (Pygoscelis adeliae)  \n",
              "37          Chinstrap penguin (Pygoscelis antarctica)  \n",
              "41                Adelie Penguin (Pygoscelis adeliae)  \n",
              "47                Adelie Penguin (Pygoscelis adeliae)  \n",
              "52          Chinstrap penguin (Pygoscelis antarctica)  \n",
              "56                Adelie Penguin (Pygoscelis adeliae)  \n",
              "57                Adelie Penguin (Pygoscelis adeliae)  \n",
              "62                  Gentoo penguin (Pygoscelis papua)  \n",
              "65          Chinstrap penguin (Pygoscelis antarctica)  \n",
              "67                  Gentoo penguin (Pygoscelis papua)  \n",
              "75                Adelie Penguin (Pygoscelis adeliae)  \n",
              "81          Chinstrap penguin (Pygoscelis antarctica)  \n",
              "89                Adelie Penguin (Pygoscelis adeliae)  \n",
              "92                Adelie Penguin (Pygoscelis adeliae)  \n",
              "93                Adelie Penguin (Pygoscelis adeliae)  \n",
              "96                Adelie Penguin (Pygoscelis adeliae)  \n",
              "100               Adelie Penguin (Pygoscelis adeliae)  \n",
              "101                 Gentoo penguin (Pygoscelis papua)  \n",
              "102               Adelie Penguin (Pygoscelis adeliae)  \n",
              "107                 Gentoo penguin (Pygoscelis papua)  \n",
              "\n",
              "[67 rows x 7 columns]"
            ]
          },
          "execution_count": 22,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "pipeline.fit(X_train, y_train)\n",
        "\n",
        "predicted_y_test = pipeline.predict(X_test)\n",
        "predicted_y_test"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SiLzpsg8nRXn"
      },
      "source": [
        "In the backend, a pipeline will actually be compiled into a single model with an embedded TRANSFORM step."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sTzAxTv1nUKZ"
      },
      "source": [
        "## Evaluating results\n",
        "\n",
        "Some models include a convenient `.score(X, y)` method for evaulation with a preset accuracy metric:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 23,
      "metadata": {
        "id": "Q8nR1ZqznU-B"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "Query job c098e1d1-b3ed-4ec5-94c7-6ba3b2b59e3f is DONE. 29.6 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:c098e1d1-b3ed-4ec5-94c7-6ba3b2b59e3f&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 035234b0-537a-44ce-adff-bb51c40b4ffa is DONE. 0 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:035234b0-537a-44ce-adff-bb51c40b4ffa&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job b4a2a367-3e06-4fa3-9f00-bdbca884cfdd is DONE. 48 Bytes processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:b4a2a367-3e06-4fa3-9f00-bdbca884cfdd&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>mean_absolute_error</th>\n",
              "      <th>mean_squared_error</th>\n",
              "      <th>mean_squared_log_error</th>\n",
              "      <th>median_absolute_error</th>\n",
              "      <th>r2_score</th>\n",
              "      <th>explained_variance</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>225.883512</td>\n",
              "      <td>77765.989281</td>\n",
              "      <td>0.004457</td>\n",
              "      <td>179.548041</td>\n",
              "      <td>0.873166</td>\n",
              "      <td>0.873315</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>1 rows × 6 columns</p>\n",
              "</div>[1 rows x 6 columns in total]"
            ],
            "text/plain": [
              "   mean_absolute_error  mean_squared_error  mean_squared_log_error  \\\n",
              "0           225.883512        77765.989281                0.004457   \n",
              "\n",
              "   median_absolute_error  r2_score  explained_variance  \n",
              "0             179.548041  0.873166            0.873315  \n",
              "\n",
              "[1 rows x 6 columns]"
            ]
          },
          "execution_count": 23,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# In the case of a pipeline, this will be equivalent to calling .score on the contained LinearRegression\n",
        "pipeline.score(X_test, y_test)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UHM7jls6nY8A"
      },
      "source": [
        "For a more general approach, the library `bigframes.ml.metrics` is provided:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 24,
      "metadata": {
        "id": "vdEN4Ob9nan4"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "Query job 20ec1716-3e8e-4d3f-ba08-1f7b9970ce3f is DONE. 28.9 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:20ec1716-3e8e-4d3f-ba08-1f7b9970ce3f&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job 6f628f3b-62df-4a5a-8e05-0b313db0ed07 is DONE. 28.9 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:6f628f3b-62df-4a5a-8e05-0b313db0ed07&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/html": [
              "Query job c4eee1e5-146f-4a52-8499-83fe5f701f53 is DONE. 30.0 kB processed. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:c4eee1e5-146f-4a52-8499-83fe5f701f53&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/plain": [
              "0.8731660699616813"
            ]
          },
          "execution_count": 24,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from bigframes.ml.metrics import r2_score\n",
        "\n",
        "r2_score(y_test, predicted_y_test[\"predicted_body_mass_g\"])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "opn4ycPyneVh"
      },
      "source": [
        "## Save to BigQuery\n",
        "\n",
        "Estimators can be saved to BigQuery as BQML models, and loaded again in future.\n",
        "\n",
        "Saving requires `bigquery.tables.create` permission, and loading requires `bigquery.models.getMetadata` permission.\n",
        "These permissions can be at project level or the dataset level.\n",
        "\n",
        "If you have those permissions, please go ahead and uncomment the code in the following cells and run."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 25,
      "metadata": {
        "id": "fb0HpkdpnigJ"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "Copy job 06c2b62d-a7aa-46a5-a04a-2f189bafc5ee is DONE. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:06c2b62d-a7aa-46a5-a04a-2f189bafc5ee&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/plain": [
              "Pipeline(steps=[('transform',\n",
              "                 ColumnTransformer(transformers=[('ont_hot_encoder',\n",
              "                                                  OneHotEncoder(max_categories=1000001,\n",
              "                                                                min_frequency=0),\n",
              "                                                  'island'),\n",
              "                                                 ('standard_scaler',\n",
              "                                                  StandardScaler(),\n",
              "                                                  'culmen_length_mm'),\n",
              "                                                 ('standard_scaler',\n",
              "                                                  StandardScaler(),\n",
              "                                                  'culmen_depth_mm'),\n",
              "                                                 ('standard_scaler',\n",
              "                                                  StandardScaler(),\n",
              "                                                  'flipper_length_mm'),\n",
              "                                                 ('ont_hot_encoder',\n",
              "                                                  OneHotEncoder(max_categories=1000001,\n",
              "                                                                min_frequency=0),\n",
              "                                                  'sex'),\n",
              "                                                 ('ont_hot_encoder',\n",
              "                                                  OneHotEncoder(max_categories=1000001,\n",
              "                                                                min_frequency=0),\n",
              "                                                  'species')])),\n",
              "                ('estimator',\n",
              "                 LinearRegression(optimize_strategy='NORMAL_EQUATION'))])"
            ]
          },
          "execution_count": 25,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "linreg.to_gbq(f\"{DATASET}.penguins_model\", replace=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 26,
      "metadata": {
        "id": "_zNOBlHdnkII"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Pipeline(steps=[('transform',\n",
              "                 ColumnTransformer(transformers=[('ont_hot_encoder',\n",
              "                                                  OneHotEncoder(max_categories=1000001,\n",
              "                                                                min_frequency=0),\n",
              "                                                  'island'),\n",
              "                                                 ('standard_scaler',\n",
              "                                                  StandardScaler(),\n",
              "                                                  'culmen_length_mm'),\n",
              "                                                 ('standard_scaler',\n",
              "                                                  StandardScaler(),\n",
              "                                                  'culmen_depth_mm'),\n",
              "                                                 ('standard_scaler',\n",
              "                                                  StandardScaler(),\n",
              "                                                  'flipper_length_mm'),\n",
              "                                                 ('ont_hot_encoder',\n",
              "                                                  OneHotEncoder(max_categories=1000001,\n",
              "                                                                min_frequency=0),\n",
              "                                                  'sex'),\n",
              "                                                 ('ont_hot_encoder',\n",
              "                                                  OneHotEncoder(max_categories=1000001,\n",
              "                                                                min_frequency=0),\n",
              "                                                  'species')])),\n",
              "                ('estimator',\n",
              "                 LinearRegression(optimize_strategy='NORMAL_EQUATION'))])"
            ]
          },
          "execution_count": 26,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "bpd.read_gbq_model(f\"{DATASET}.penguins_model\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RfV-du5uTcBB"
      },
      "source": [
        "You can also save the whole pipeline to BigQuery. BigQuery stores it as a single model, with the preprocessing steps embedded in the model's `TRANSFORM` clause, so the same transformations are applied automatically whenever the saved model is used:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 27,
      "metadata": {
        "id": "P76_TQ3IR6nB"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "Copy job a0ed8c1b-3a3f-4995-853c-e151d41560d7 is DONE. <a target=\"_blank\" href=\"https://console.cloud.google.com/bigquery?project=swast-scratch&j=bq:US:a0ed8c1b-3a3f-4995-853c-e151d41560d7&page=queryresults\">Open Job</a>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/plain": [
              "Pipeline(steps=[('transform',\n",
              "                 ColumnTransformer(transformers=[('ont_hot_encoder',\n",
              "                                                  OneHotEncoder(max_categories=1000001,\n",
              "                                                                min_frequency=0),\n",
              "                                                  'island'),\n",
              "                                                 ('standard_scaler',\n",
              "                                                  StandardScaler(),\n",
              "                                                  'culmen_length_mm'),\n",
              "                                                 ('standard_scaler',\n",
              "                                                  StandardScaler(),\n",
              "                                                  'culmen_depth_mm'),\n",
              "                                                 ('standard_scaler',\n",
              "                                                  StandardScaler(),\n",
              "                                                  'flipper_length_mm'),\n",
              "                                                 ('ont_hot_encoder',\n",
              "                                                  OneHotEncoder(max_categories=1000001,\n",
              "                                                                min_frequency=0),\n",
              "                                                  'sex'),\n",
              "                                                 ('ont_hot_encoder',\n",
              "                                                  OneHotEncoder(max_categories=1000001,\n",
              "                                                                min_frequency=0),\n",
              "                                                  'species')])),\n",
              "                ('estimator',\n",
              "                 LinearRegression(optimize_strategy='NORMAL_EQUATION'))])"
            ]
          },
          "execution_count": 27,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "pipeline.to_gbq(f\"{DATASET}.penguins_pipeline\", replace=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 28,
      "metadata": {
        "id": "GKvlKFjAbToJ"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Pipeline(steps=[('transform',\n",
              "                 ColumnTransformer(transformers=[('ont_hot_encoder',\n",
              "                                                  OneHotEncoder(max_categories=1000001,\n",
              "                                                                min_frequency=0),\n",
              "                                                  'island'),\n",
              "                                                 ('standard_scaler',\n",
              "                                                  StandardScaler(),\n",
              "                                                  'culmen_length_mm'),\n",
              "                                                 ('standard_scaler',\n",
              "                                                  StandardScaler(),\n",
              "                                                  'culmen_depth_mm'),\n",
              "                                                 ('standard_scaler',\n",
              "                                                  StandardScaler(),\n",
              "                                                  'flipper_length_mm'),\n",
              "                                                 ('ont_hot_encoder',\n",
              "                                                  OneHotEncoder(max_categories=1000001,\n",
              "                                                                min_frequency=0),\n",
              "                                                  'sex'),\n",
              "                                                 ('ont_hot_encoder',\n",
              "                                                  OneHotEncoder(max_categories=1000001,\n",
              "                                                                min_frequency=0),\n",
              "                                                  'species')])),\n",
              "                ('estimator',\n",
              "                 LinearRegression(optimize_strategy='NORMAL_EQUATION'))])"
            ]
          },
          "execution_count": 28,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "bpd.read_gbq_model(f\"{DATASET}.penguins_pipeline\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wCsmt0IwFkDy"
      },
      "source": [
        "## Summary and next steps\n",
        "\n",
        "You've completed an end-to-end machine learning workflow using the built-in capabilities of BigQuery DataFrames.\n",
        "\n",
        "Learn more about BigQuery DataFrames in the [documentation](https://cloud.google.com/python/docs/reference/bigframes/latest) and find more sample notebooks in the [GitHub repo](https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TpV-iwP9qw9c"
      },
      "source": [
        "### Cleaning up\n",
        "\n",
        "To clean up all Google Cloud resources used in this tutorial, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) that you used for it.\n",
        "\n",
        "Otherwise, uncomment the code in the following cell and run it to delete the individual resources you created in this tutorial:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 29,
      "metadata": {
        "id": "QwumLUKmVpuH"
      },
      "outputs": [],
      "source": [
        "# # Delete the BQML models\n",
        "# MODEL_NAME = f\"{PROJECT_ID}:{DATASET}.penguins_model\"\n",
        "# ! bq rm -f --model {MODEL_NAME}\n",
        "# PIPELINE_NAME = f\"{PROJECT_ID}:{DATASET}.penguins_pipeline\"\n",
        "# ! bq rm -f --model {PIPELINE_NAME}"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.1"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}