{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "id": "ur8xi4C7S06n" }, "outputs": [], "source": [ "# Copyright 2024 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "JAPoU8Sm5E6e" }, "source": [ "# Machine Learning Fundamentals with BigQuery DataFrames\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "
\n", " \n", " \"Colab Run in Colab\n", " \n", " \n", " \n", " \"GitHub\n", " View on GitHub\n", " \n", " \n", " \n", " \"Vertex\n", " Open in Vertex AI Workbench\n", " \n", " \n", " \n", " \"BQ\n", " Open in BQ Studio\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "24743cf4a1e1" }, "source": [ "**_NOTE_**: This notebook has been tested in the following environment:\n", "\n", "* Python version = 3.10" ] }, { "cell_type": "markdown", "metadata": { "id": "tvgnzT1CKxrO" }, "source": [ "## Overview\n", "\n", "The `bigframes.ml` module implements Scikit-Learn's machine learning API in\n", "BigQuery DataFrames. It exposes BigQuery's ML capabilities in a simple, popular\n", "API that works seamlessly with the rest of the BigQuery DataFrames API.\n", "\n", "Learn more about [BigQuery DataFrames](https://cloud.google.com/python/docs/reference/bigframes/latest)." ] }, { "cell_type": "markdown", "metadata": { "id": "d975e698c9a4" }, "source": [ "### Objective\n", "\n", "In this tutorial, you will walk through an end-to-end machine learning workflow using BigQuery DataFrames. You will load data, manipulate and prepare it for model training, build supervised and unsupervised models, and evaluate and save a model for future use; all using built-in BigQuery DataFrames functionality." ] }, { "cell_type": "markdown", "metadata": { "id": "08d289fa873f" }, "source": [ "### Dataset\n", "\n", "This tutorial uses the [```penguins``` table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=ml_datasets&t=penguins) (a BigQuery public dataset), which contains data on a set of penguins including species, island of residence, weight, culmen length and depth, flipper length, and sex." 
] }, { "cell_type": "markdown", "metadata": { "id": "aed92deeb4a0" }, "source": [ "### Costs\n", "\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "* BigQuery (storage and compute)\n", "* BigQuery ML\n", "\n", "Learn about [BigQuery storage pricing](https://cloud.google.com/bigquery/pricing#storage),\n", "[BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models),\n", "and [BigQuery ML pricing](https://cloud.google.com/bigquery/pricing#bqml),\n", "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n", "to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": { "id": "i7EUnXsZhAGF" }, "source": [ "## Installation\n", "\n", "Depending on your Jupyter environment, you might have to install packages." ] }, { "cell_type": "markdown", "metadata": { "id": "NRTcBQPZpKWd" }, "source": [ "**Vertex AI Workbench or Colab**\n", "\n", "Do nothing; the BigQuery DataFrames package is already installed." ] }, { "cell_type": "markdown", "metadata": { "id": "bdOJtFo1pRnc" }, "source": [ "**Local JupyterLab instance**\n", "\n", "Uncomment and run the following cell:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "mfPoOwPLGpSr" }, "outputs": [], "source": [ "# !pip install bigframes" ] }, { "cell_type": "markdown", "metadata": { "id": "BF1j6f9HApxa" }, "source": [ "## Before you begin\n", "\n", "Complete the tasks in this section to set up your environment." ] }, { "cell_type": "markdown", "metadata": { "id": "Yq7zKYWelRQP" }, "source": [ "### Set up your Google Cloud project\n", "\n", "**The following steps are required, regardless of your notebook environment.**\n", "\n", "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 credit towards your compute/storage costs.\n", "\n", "2. 
[Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n", "\n", "3. [Click here](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com) to enable the BigQuery API.\n", "\n", "4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk)." ] }, { "cell_type": "markdown", "metadata": { "id": "WReHDGG5g0XY" }, "source": [ "#### Set your project ID\n", "\n", "If you don't know your project ID, try the following:\n", "* Run `gcloud config list`.\n", "* Run `gcloud projects list`.\n", "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "oM1iC_MfAts1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Updated property [core/project].\n" ] } ], "source": [ "PROJECT_ID = \"\" # @param {type:\"string\"}\n", "\n", "# Set the project id\n", "! gcloud config set project {PROJECT_ID}" ] }, { "cell_type": "markdown", "metadata": { "id": "region" }, "source": [ "#### Set the region\n", "\n", "You can also change the `REGION` variable used by BigQuery. Learn more about [BigQuery regions](https://cloud.google.com/bigquery/docs/locations#supported_locations)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "eF-Twtc4XGem" }, "outputs": [], "source": [ "REGION = \"US\" # @param {type: \"string\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "XcW9adriUQRc" }, "source": [ "#### Set the dataset ID\n", "\n", "As part of this notebook, you will save BigQuery ML models to your Google Cloud project, which requires a dataset. Create the dataset, if needed, and provide the ID here as the `DATASET` variable used by BigQuery. Learn how to create a [BigQuery dataset](https://cloud.google.com/bigquery/docs/datasets)." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "BbMh9JHvUHAn" }, "outputs": [], "source": [ "DATASET = \"\" # @param {type: \"string\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "NwxfWoR5UGwO" }, "source": [] }, { "cell_type": "markdown", "metadata": { "id": "sBCra4QMA2wR" }, "source": [ "### Authenticate your Google Cloud account\n", "\n", "Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below." ] }, { "cell_type": "markdown", "metadata": { "id": "74ccc9e52986" }, "source": [ "**Vertex AI Workbench**\n", "\n", "Do nothing, you are already authenticated." ] }, { "cell_type": "markdown", "metadata": { "id": "de775a3773ba" }, "source": [ "**Local JupyterLab instance**\n", "\n", "Uncomment and run the following cell:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "254614fa0c46" }, "outputs": [], "source": [ "# ! gcloud auth login" ] }, { "cell_type": "markdown", "metadata": { "id": "ef21552ccea8" }, "source": [ "**Colab**\n", "\n", "Uncomment and run the following cell:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "603adbbf0532" }, "outputs": [], "source": [ "# from google.colab import auth\n", "# auth.authenticate_user()" ] }, { "cell_type": "markdown", "metadata": { "id": "960505627ddf" }, "source": [ "### Import libraries" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "PyQmSRbKA8r-" }, "outputs": [], "source": [ "import bigframes.pandas as bpd" ] }, { "cell_type": "markdown", "metadata": { "id": "init_aip:mbsdk,all" }, "source": [ "\n", "### Set BigQuery DataFrames options" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "NPPMuw2PXGeo" }, "outputs": [], "source": [ "# Note: The project option is not required in all environments.\n", "# On BigQuery Studio, the project ID is automatically detected.\n", "bpd.options.bigquery.project = PROJECT_ID\n", "\n", "# Note: The location 
option is not required.\n", "# It defaults to the location of the first table or query\n", "# passed to read_gbq(). For APIs where a location can't be\n", "# auto-detected, the location defaults to the \"US\" location.\n", "bpd.options.bigquery.location = REGION" ] }, { "cell_type": "markdown", "metadata": { "id": "pDfrKwMKE_dK" }, "source": [ "If you want to reset the location of the created DataFrame or Series objects, reset the session by executing `bpd.reset_session()`. After that, you can reuse `bpd.options.bigquery.location` to specify another location." ] }, { "cell_type": "markdown", "metadata": { "id": "LjfRpSruzg5j" }, "source": [ "## Import data into BigQuery DataFrames\n", "\n", "You can create a DataFrame by reading data from a BigQuery table." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "d86W4hNqzZJb" }, "outputs": [], "source": [ "df = bpd.read_gbq(\"bigquery-public-data.ml_datasets.penguins\")\n", "df = df.dropna()\n", "\n", "# BigQuery DataFrames creates a default numbered index, which we can give a name\n", "df.index.name = \"penguin_id\"" ] }, { "cell_type": "markdown", "metadata": { "id": "pDfCJ6-LkRB1" }, "source": [ "Take a look at a few rows of the DataFrame:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "arGaUZVWkSwT" }, "outputs": [ { "data": { "text/html": [ "Query job d3acda60-1059-4bb0-9912-ed374491c5c3 is DONE. 28.9 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 51c6aa1c-ff98-4805-921e-00830e125e56 is DONE. 0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 01e2cb6d-604b-4cdd-afb0-8f515a9da951 is DONE. 501 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesislandculmen_length_mmculmen_depth_mmflipper_length_mmbody_mass_gsex
penguin_id
0Gentoo penguin (Pygoscelis papua)Biscoe50.515.9225.05400.0MALE
1Gentoo penguin (Pygoscelis papua)Biscoe45.114.5215.05000.0FEMALE
2Adelie Penguin (Pygoscelis adeliae)Torgersen41.418.5202.03875.0MALE
3Adelie Penguin (Pygoscelis adeliae)Torgersen38.617.0188.02900.0FEMALE
4Gentoo penguin (Pygoscelis papua)Biscoe46.514.8217.05200.0FEMALE
\n", "

5 rows × 7 columns

\n", "
[5 rows x 7 columns in total]" ], "text/plain": [ " species island culmen_length_mm \\\n", "penguin_id \n", "0 Gentoo penguin (Pygoscelis papua) Biscoe 50.5 \n", "1 Gentoo penguin (Pygoscelis papua) Biscoe 45.1 \n", "2 Adelie Penguin (Pygoscelis adeliae) Torgersen 41.4 \n", "3 Adelie Penguin (Pygoscelis adeliae) Torgersen 38.6 \n", "4 Gentoo penguin (Pygoscelis papua) Biscoe 46.5 \n", "\n", " culmen_depth_mm flipper_length_mm body_mass_g sex \n", "penguin_id \n", "0 15.9 225.0 5400.0 MALE \n", "1 14.5 215.0 5000.0 FEMALE \n", "2 18.5 202.0 3875.0 MALE \n", "3 17.0 188.0 2900.0 FEMALE \n", "4 14.8 217.0 5200.0 FEMALE \n", "\n", "[5 rows x 7 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "WkUIcMXPkahu" }, "source": [ "## Clean and prepare data" ] }, { "cell_type": "markdown", "metadata": { "id": "DScncEoDkiTG" }, "source": [ "We're going to start with supervised learning, where a linear regression model will learn to predict the body mass (output variable `y`) using input features such as flipper length, sex, species, and more (features `X`)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "B9mW93o9z_-L" }, "outputs": [], "source": [ "# Isolate input features and output variable into DataFrames\n", "X = df[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex', 'species']]\n", "y = df[['body_mass_g']]" ] }, { "cell_type": "markdown", "metadata": { "id": "wkw0Cs62k_cl" }, "source": [ "Part of preparing data for a machine learning task is splitting it into subsets for training and testing, so that you can check whether the model is overfitting. By default, BQML will automatically manage splitting the data for you. 
However, BQML also supports manually splitting out your training data.\n", "\n", "Performing a manual data split can be done with `bigframes.ml.model_selection.train_test_split` like so:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "NysWAWmvlAxB" }, "outputs": [ { "data": { "text/html": [ "Query job 7bd14e04-b3b4-4281-b5be-187f7baad62f is DONE. 28.9 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 240cc7db-19ac-4bd3-8e76-a79f75ded077 is DONE. 28.9 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 91194fee-d9b9-4cb9-a469-e49e9d77c624 is DONE. 28.9 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 84c71647-956b-4385-8dce-c8bc70a917c8 is DONE. 28.9 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 9c94600b-2231-4d04-8e3a-fb46f8892b6a is DONE. 28.9 kB processed. 
Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "X_train shape: (267, 6)\n", "X_test shape: (67, 6)\n", "y_train shape: (267, 1)\n", "y_test shape: (67, 1)\n" ] } ], "source": [ "from bigframes.ml.model_selection import train_test_split\n", "\n", "# This will split X and y into test and training sets, with 20% of the rows in the test set,\n", "# and the rest in the training set\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.2)\n", "\n", "# Show the shape of the data after the split\n", "print(f\"\"\"X_train shape: {X_train.shape}\n", "X_test shape: {X_test.shape}\n", "y_train shape: {y_train.shape}\n", "y_test shape: {y_test.shape}\"\"\")" ] }, { "cell_type": "markdown", "metadata": { "id": "faFnVnNolydu" }, "source": [ "If we look at the data, we can see that random rows were selected for\n", "each side of the split:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "f8bz1HwLlyLP" }, "outputs": [ { "data": { "text/html": [ "Query job 8ad534c1-eb49-4616-b7a6-f7d8b044b8bf is DONE. 28.9 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 3793de66-fb3c-4ca4-a337-aa708c718cc5 is DONE. 0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 66524afb-4509-4927-8902-4a72826e83c4 is DONE. 456 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
islandculmen_length_mmculmen_depth_mmflipper_length_mmsexspecies
penguin_id
188Dream51.518.7187.0MALEChinstrap penguin (Pygoscelis antarctica)
251Biscoe49.516.1224.0MALEGentoo penguin (Pygoscelis papua)
231Biscoe45.713.9214.0FEMALEGentoo penguin (Pygoscelis papua)
271Biscoe59.617.0230.0MALEGentoo penguin (Pygoscelis papua)
128Biscoe38.817.2180.0MALEAdelie Penguin (Pygoscelis adeliae)
\n", "

5 rows × 6 columns

\n", "
[5 rows x 6 columns in total]" ], "text/plain": [ " island culmen_length_mm culmen_depth_mm flipper_length_mm \\\n", "penguin_id \n", "188 Dream 51.5 18.7 187.0 \n", "251 Biscoe 49.5 16.1 224.0 \n", "231 Biscoe 45.7 13.9 214.0 \n", "271 Biscoe 59.6 17.0 230.0 \n", "128 Biscoe 38.8 17.2 180.0 \n", "\n", " sex species \n", "penguin_id \n", "188 MALE Chinstrap penguin (Pygoscelis antarctica) \n", "251 MALE Gentoo penguin (Pygoscelis papua) \n", "231 FEMALE Gentoo penguin (Pygoscelis papua) \n", "271 MALE Gentoo penguin (Pygoscelis papua) \n", "128 MALE Adelie Penguin (Pygoscelis adeliae) \n", "\n", "[5 rows x 6 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_test.head(5)" ] }, { "cell_type": "markdown", "metadata": { "id": "v4ic7GQEl67Y" }, "source": [ "Note that the `y_test` data matches the same rows in `X_test`:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "id": "PflbhKGkl8v2" }, "outputs": [ { "data": { "text/html": [ "Query job 6a87fcc2-f2d0-44f5-8ab2-08f109c2b70d is DONE. 28.9 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job ed8e49f8-0f4c-4ef2-bbc2-b8c5ef9fd064 is DONE. 0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 97fea642-03aa-49fd-943e-f4efa5a87f0f is DONE. 120 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
body_mass_g
penguin_id
1883250.0
2515650.0
2314400.0
2716050.0
1283800.0
\n", "

5 rows × 1 columns

\n", "
[5 rows x 1 columns in total]" ], "text/plain": [ " body_mass_g\n", "penguin_id \n", "188 3250.0\n", "251 5650.0\n", "231 4400.0\n", "271 6050.0\n", "128 3800.0\n", "\n", "[5 rows x 1 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_test.head(5)" ] }, { "cell_type": "markdown", "metadata": { "id": "Dkf52IdvmSaj" }, "source": [ "## Estimators\n", "\n", "Following scikit-learn, all learning components are \"estimators\": objects that can learn from training data and then apply themselves to new data. Estimators share the following patterns:\n", "\n", "- a constructor that takes a list of parameters\n", "- a standard string representation that shows the class name and all non-default parameters, e.g. `LinearRegression(fit_intercept=False)`\n", "- a `.fit(...)` method to fit the estimator to training data\n", "\n", "These estimators can be further broken down into two main subtypes:\n", " 1. Transformers\n", " 2. Predictors\n", "\n", "Let's walk through each of these with our example model." ] }, { "cell_type": "markdown", "metadata": { "id": "55oNSWQ2Q5te" }, "source": [ "### Transformers\n", "\n", "Transformers are estimators that are used to prepare data for consumption by other estimators ('preprocessing'). In addition to `.fit(...)`, a transformer implements a `.transform(...)` method, which applies a transformation based on what was computed during `.fit(...)`. With this pattern, preprocessing steps learned from the training data can be applied consistently to both training and test/production data.\n", "\n", "An example of a transformer is `bigframes.ml.preprocessing.StandardScaler`, which rescales a dataset to have a mean of zero and a standard deviation of one:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "id": "yhATDMR-mkdF" }, "outputs": [ { "data": { "text/html": [ "Query job aee64759-42bb-44d6-b8c7-1c737cdd6eed is DONE. 28.9 kB processed. 
Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job acb29d04-a20d-4f1c-8d90-51c7e8ac9922 is DONE. 0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 2bd034db-7d9b-467c-be17-49bca094cceb is DONE. 28.9 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 5dfb583a-1ced-4f2a-94b9-f1282263134d is DONE. 2.1 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 8fe87288-4a95-49f4-9895-7c41c1004901 is DONE. 0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 7ebcecee-beff-402d-ac71-6384014a54da is DONE. 8.5 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
standard_scaled_culmen_length_mmstandard_scaled_culmen_depth_mmstandard_scaled_flipper_length_mm
penguin_id
01.20778-0.6515311.772656
2-0.4556020.6628550.100476
3-0.967412-0.095445-0.917372
40.476623-1.2076171.191028
5-1.6254540.359535-0.626559
7-0.345929-1.864810.682104
80.842202-1.5614911.409139
90.3486710.865068-0.263041
100.9335961.2189410.827511
11-1.460943-0.297658-0.771966
121.317454-0.4493181.409139
13-0.236255-1.7637040.900214
140.549739-0.297658-0.626559
160.970154-1.0054041.481842
17-1.058807-0.348211-0.190338
181.354012-1.5109371.263732
19-0.053466-1.6625971.191028
20-0.199697-1.5109370.609401
211.1529430.763962-0.190338
22-1.2050380.308982-0.699262
24-0.7846231.775028-0.699262
25-0.839461.724474-0.771966
26-0.6201130.359535-0.990076
270.330392-0.095445-0.408448
292.194842-0.0954451.990767
\n", "

25 rows × 3 columns

\n", "
[267 rows x 3 columns in total]" ], "text/plain": [ " standard_scaled_culmen_length_mm standard_scaled_culmen_depth_mm \\\n", "penguin_id \n", "0 1.20778 -0.651531 \n", "2 -0.455602 0.662855 \n", "3 -0.967412 -0.095445 \n", "4 0.476623 -1.207617 \n", "5 -1.625454 0.359535 \n", "7 -0.345929 -1.86481 \n", "8 0.842202 -1.561491 \n", "9 0.348671 0.865068 \n", "10 0.933596 1.218941 \n", "11 -1.460943 -0.297658 \n", "12 1.317454 -0.449318 \n", "13 -0.236255 -1.763704 \n", "14 0.549739 -0.297658 \n", "16 0.970154 -1.005404 \n", "17 -1.058807 -0.348211 \n", "18 1.354012 -1.510937 \n", "19 -0.053466 -1.662597 \n", "20 -0.199697 -1.510937 \n", "21 1.152943 0.763962 \n", "22 -1.205038 0.308982 \n", "24 -0.784623 1.775028 \n", "25 -0.83946 1.724474 \n", "26 -0.620113 0.359535 \n", "27 0.330392 -0.095445 \n", "29 2.194842 -0.095445 \n", "\n", " standard_scaled_flipper_length_mm \n", "penguin_id \n", "0 1.772656 \n", "2 0.100476 \n", "3 -0.917372 \n", "4 1.191028 \n", "5 -0.626559 \n", "7 0.682104 \n", "8 1.409139 \n", "9 -0.263041 \n", "10 0.827511 \n", "11 -0.771966 \n", "12 1.409139 \n", "13 0.900214 \n", "14 -0.626559 \n", "16 1.481842 \n", "17 -0.190338 \n", "18 1.263732 \n", "19 1.191028 \n", "20 0.609401 \n", "21 -0.190338 \n", "22 -0.699262 \n", "24 -0.699262 \n", "25 -0.771966 \n", "26 -0.990076 \n", "27 -0.408448 \n", "29 1.990767 \n", "...\n", "\n", "[267 rows x 3 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from bigframes.ml.preprocessing import StandardScaler\n", "\n", "# StandardScaler only works on numeric columns\n", "numeric_columns = [\"culmen_length_mm\", \"culmen_depth_mm\", \"flipper_length_mm\"]\n", "\n", "# Learn each column's mean and standard deviation from the training data\n", "scaler = StandardScaler()\n", "scaler.fit(X_train[numeric_columns])\n", "\n", "# The fitted StandardScaler rescales each column to have a mean of zero\n", "# and a standard deviation of one:\n", "scaler.transform(X_train[numeric_columns])" ] }, { "cell_type": "markdown", "metadata": { "id": 
"vhywHzH-ml-W" }, "source": [ "We can then repeat this transformation on the test data:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "id": "TfwSLOTXmspI" }, "outputs": [ { "data": { "text/html": [ "Query job 6639e06d-3920-4c64-84d8-b40ce042188c is DONE. 28.9 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 579dfb14-6d39-44c0-9b92-eb6a40c46df8 is DONE. 536 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 7f613d94-a68c-42d5-8afe-0413b32de3a0 is DONE. 0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 140e8b5f-a24b-43a3-831f-30a29a4bd7ea is DONE. 2.1 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
standard_scaled_culmen_length_mmstandard_scaled_culmen_depth_mmstandard_scaled_flipper_length_mm
penguin_id
10.220718-1.3592771.045621
15-0.5104390.157322-0.771966
28-1.0588070.713408-0.771966
321.4636851.1683880.39129
33-0.2545340.056215-0.990076
34-0.5104390.4606420.318587
371.3540120.511195-0.263041
41-0.674949-0.095445-1.789814
47-1.1684810.662855-0.117634
520.4583440.308982-0.699262
56-1.0405280.460642-1.135483
57-0.9674120.005662-0.117634
620.988433-0.7526381.191028
651.7561481.3706010.318587
670.677691-1.3592771.045621
75-1.1136441.421155-0.771966
810.6776910.561748-0.408448
89-0.8577390.713408-0.771966
92-0.8029020.308982-0.917372
93-0.3093711.168388-0.263041
96-0.3093710.662855-1.499
100-0.9125760.814515-0.771966
1010.549739-1.3087241.554546
102-0.1265820.662855-0.626559
1071.20778-1.0054041.118325
\n", "

25 rows × 3 columns

\n", "
[67 rows x 3 columns in total]" ], "text/plain": [ " standard_scaled_culmen_length_mm standard_scaled_culmen_depth_mm \\\n", "penguin_id \n", "1 0.220718 -1.359277 \n", "15 -0.510439 0.157322 \n", "28 -1.058807 0.713408 \n", "32 1.463685 1.168388 \n", "33 -0.254534 0.056215 \n", "34 -0.510439 0.460642 \n", "37 1.354012 0.511195 \n", "41 -0.674949 -0.095445 \n", "47 -1.168481 0.662855 \n", "52 0.458344 0.308982 \n", "56 -1.040528 0.460642 \n", "57 -0.967412 0.005662 \n", "62 0.988433 -0.752638 \n", "65 1.756148 1.370601 \n", "67 0.677691 -1.359277 \n", "75 -1.113644 1.421155 \n", "81 0.677691 0.561748 \n", "89 -0.857739 0.713408 \n", "92 -0.802902 0.308982 \n", "93 -0.309371 1.168388 \n", "96 -0.309371 0.662855 \n", "100 -0.912576 0.814515 \n", "101 0.549739 -1.308724 \n", "102 -0.126582 0.662855 \n", "107 1.20778 -1.005404 \n", "\n", " standard_scaled_flipper_length_mm \n", "penguin_id \n", "1 1.045621 \n", "15 -0.771966 \n", "28 -0.771966 \n", "32 0.39129 \n", "33 -0.990076 \n", "34 0.318587 \n", "37 -0.263041 \n", "41 -1.789814 \n", "47 -0.117634 \n", "52 -0.699262 \n", "56 -1.135483 \n", "57 -0.117634 \n", "62 1.191028 \n", "65 0.318587 \n", "67 1.045621 \n", "75 -0.771966 \n", "81 -0.408448 \n", "89 -0.771966 \n", "92 -0.917372 \n", "93 -0.263041 \n", "96 -1.499 \n", "100 -0.771966 \n", "101 1.554546 \n", "102 -0.626559 \n", "107 1.118325 \n", "...\n", "\n", "[67 rows x 3 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scaler.transform(X_test[numeric_columns])" ] }, { "cell_type": "markdown", "metadata": { "id": "9enAdjzPmwmv" }, "source": [ "#### Composing transformers\n", "\n", "When different columns need different preprocessors, use `bigframes.ml.compose.ColumnTransformer`.\n", "\n", "Let's create an aggregate transform that applies `StandardScaler` to the numeric columns and `OneHotEncoder` to the string columns." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "id": "I8Wwx3emmz2J" }, "outputs": [ { "data": { "text/html": [ "Query job c16fdb5d-3f18-4f85-8a31-705ef4680be5 is DONE. 28.9 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 8c94a7c1-7f12-44be-b389-7c854ceead4b is DONE. 0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 1287628d-1380-4495-a5e9-6806440206bc is DONE. 22.9 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 03163e1a-c789-4046-b71a-b4b4e7bbc043 is DONE. 2.1 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 86f39b30-00db-4ada-8699-0fe49c94eb2d is DONE. 29.2 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job d5b0e8b0-12cd-47f6-85d2-806b2c252d37 is DONE. 536 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 459cdc90-d1f3-4580-9137-9b93d44ca991 is DONE. 0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 80d10913-7263-44e6-89f7-719eac4158a3 is DONE. 21.4 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onehotencoded_islandstandard_scaled_culmen_length_mmstandard_scaled_culmen_depth_mmstandard_scaled_flipper_length_mmonehotencoded_sexonehotencoded_species
penguin_id
0[{'index': 1, 'value': 1.0}]1.20778-0.6515311.772656[{'index': 3, 'value': 1.0}][{'index': 3, 'value': 1.0}]
2[{'index': 3, 'value': 1.0}]-0.4556020.6628550.100476[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
3[{'index': 3, 'value': 1.0}]-0.967412-0.095445-0.917372[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
4[{'index': 1, 'value': 1.0}]0.476623-1.2076171.191028[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
5[{'index': 1, 'value': 1.0}]-1.6254540.359535-0.626559[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
7[{'index': 1, 'value': 1.0}]-0.345929-1.864810.682104[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
8[{'index': 1, 'value': 1.0}]0.842202-1.5614911.409139[{'index': 3, 'value': 1.0}][{'index': 3, 'value': 1.0}]
9[{'index': 3, 'value': 1.0}]0.3486710.865068-0.263041[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
10[{'index': 2, 'value': 1.0}]0.9335961.2189410.827511[{'index': 3, 'value': 1.0}][{'index': 2, 'value': 1.0}]
11[{'index': 3, 'value': 1.0}]-1.460943-0.297658-0.771966[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
12[{'index': 1, 'value': 1.0}]1.317454-0.4493181.409139[{'index': 3, 'value': 1.0}][{'index': 3, 'value': 1.0}]
13[{'index': 1, 'value': 1.0}]-0.236255-1.7637040.900214[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
14[{'index': 2, 'value': 1.0}]0.549739-0.297658-0.626559[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
16[{'index': 1, 'value': 1.0}]0.970154-1.0054041.481842[{'index': 3, 'value': 1.0}][{'index': 3, 'value': 1.0}]
17[{'index': 1, 'value': 1.0}]-1.058807-0.348211-0.190338[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
18[{'index': 1, 'value': 1.0}]1.354012-1.5109371.263732[{'index': 3, 'value': 1.0}][{'index': 3, 'value': 1.0}]
19[{'index': 1, 'value': 1.0}]-0.053466-1.6625971.191028[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
20[{'index': 1, 'value': 1.0}]-0.199697-1.5109370.609401[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
21[{'index': 2, 'value': 1.0}]1.1529430.763962-0.190338[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
22[{'index': 2, 'value': 1.0}]-1.2050380.308982-0.699262[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
24[{'index': 1, 'value': 1.0}]-0.7846231.775028-0.699262[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
25[{'index': 3, 'value': 1.0}]-0.839461.724474-0.771966[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
26[{'index': 1, 'value': 1.0}]-0.6201130.359535-0.990076[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
27[{'index': 2, 'value': 1.0}]0.330392-0.095445-0.408448[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
29[{'index': 1, 'value': 1.0}]2.194842-0.0954451.990767[{'index': 3, 'value': 1.0}][{'index': 3, 'value': 1.0}]
\n", "

25 rows × 6 columns

\n", "
[267 rows x 6 columns in total]" ], "text/plain": [ " onehotencoded_island standard_scaled_culmen_length_mm \\\n", "penguin_id \n", "0 [{'index': 1, 'value': 1.0}] 1.20778 \n", "2 [{'index': 3, 'value': 1.0}] -0.455602 \n", "3 [{'index': 3, 'value': 1.0}] -0.967412 \n", "4 [{'index': 1, 'value': 1.0}] 0.476623 \n", "5 [{'index': 1, 'value': 1.0}] -1.625454 \n", "7 [{'index': 1, 'value': 1.0}] -0.345929 \n", "8 [{'index': 1, 'value': 1.0}] 0.842202 \n", "9 [{'index': 3, 'value': 1.0}] 0.348671 \n", "10 [{'index': 2, 'value': 1.0}] 0.933596 \n", "11 [{'index': 3, 'value': 1.0}] -1.460943 \n", "12 [{'index': 1, 'value': 1.0}] 1.317454 \n", "13 [{'index': 1, 'value': 1.0}] -0.236255 \n", "14 [{'index': 2, 'value': 1.0}] 0.549739 \n", "16 [{'index': 1, 'value': 1.0}] 0.970154 \n", "17 [{'index': 1, 'value': 1.0}] -1.058807 \n", "18 [{'index': 1, 'value': 1.0}] 1.354012 \n", "19 [{'index': 1, 'value': 1.0}] -0.053466 \n", "20 [{'index': 1, 'value': 1.0}] -0.199697 \n", "21 [{'index': 2, 'value': 1.0}] 1.152943 \n", "22 [{'index': 2, 'value': 1.0}] -1.205038 \n", "24 [{'index': 1, 'value': 1.0}] -0.784623 \n", "25 [{'index': 3, 'value': 1.0}] -0.83946 \n", "26 [{'index': 1, 'value': 1.0}] -0.620113 \n", "27 [{'index': 2, 'value': 1.0}] 0.330392 \n", "29 [{'index': 1, 'value': 1.0}] 2.194842 \n", "\n", " standard_scaled_culmen_depth_mm \\\n", "penguin_id \n", "0 -0.651531 \n", "2 0.662855 \n", "3 -0.095445 \n", "4 -1.207617 \n", "5 0.359535 \n", "7 -1.86481 \n", "8 -1.561491 \n", "9 0.865068 \n", "10 1.218941 \n", "11 -0.297658 \n", "12 -0.449318 \n", "13 -1.763704 \n", "14 -0.297658 \n", "16 -1.005404 \n", "17 -0.348211 \n", "18 -1.510937 \n", "19 -1.662597 \n", "20 -1.510937 \n", "21 0.763962 \n", "22 0.308982 \n", "24 1.775028 \n", "25 1.724474 \n", "26 0.359535 \n", "27 -0.095445 \n", "29 -0.095445 \n", "\n", " standard_scaled_flipper_length_mm onehotencoded_sex \\\n", "penguin_id \n", "0 1.772656 [{'index': 3, 'value': 1.0}] \n", "2 0.100476 [{'index': 3, 'value': 
1.0}] \n", "3 -0.917372 [{'index': 2, 'value': 1.0}] \n", "4 1.191028 [{'index': 2, 'value': 1.0}] \n", "5 -0.626559 [{'index': 2, 'value': 1.0}] \n", "7 0.682104 [{'index': 2, 'value': 1.0}] \n", "8 1.409139 [{'index': 3, 'value': 1.0}] \n", "9 -0.263041 [{'index': 3, 'value': 1.0}] \n", "10 0.827511 [{'index': 3, 'value': 1.0}] \n", "11 -0.771966 [{'index': 2, 'value': 1.0}] \n", "12 1.409139 [{'index': 3, 'value': 1.0}] \n", "13 0.900214 [{'index': 2, 'value': 1.0}] \n", "14 -0.626559 [{'index': 2, 'value': 1.0}] \n", "16 1.481842 [{'index': 3, 'value': 1.0}] \n", "17 -0.190338 [{'index': 2, 'value': 1.0}] \n", "18 1.263732 [{'index': 3, 'value': 1.0}] \n", "19 1.191028 [{'index': 2, 'value': 1.0}] \n", "20 0.609401 [{'index': 2, 'value': 1.0}] \n", "21 -0.190338 [{'index': 2, 'value': 1.0}] \n", "22 -0.699262 [{'index': 2, 'value': 1.0}] \n", "24 -0.699262 [{'index': 2, 'value': 1.0}] \n", "25 -0.771966 [{'index': 3, 'value': 1.0}] \n", "26 -0.990076 [{'index': 2, 'value': 1.0}] \n", "27 -0.408448 [{'index': 2, 'value': 1.0}] \n", "29 1.990767 [{'index': 3, 'value': 1.0}] \n", "\n", " onehotencoded_species \n", "penguin_id \n", "0 [{'index': 3, 'value': 1.0}] \n", "2 [{'index': 1, 'value': 1.0}] \n", "3 [{'index': 1, 'value': 1.0}] \n", "4 [{'index': 3, 'value': 1.0}] \n", "5 [{'index': 1, 'value': 1.0}] \n", "7 [{'index': 3, 'value': 1.0}] \n", "8 [{'index': 3, 'value': 1.0}] \n", "9 [{'index': 1, 'value': 1.0}] \n", "10 [{'index': 2, 'value': 1.0}] \n", "11 [{'index': 1, 'value': 1.0}] \n", "12 [{'index': 3, 'value': 1.0}] \n", "13 [{'index': 3, 'value': 1.0}] \n", "14 [{'index': 2, 'value': 1.0}] \n", "16 [{'index': 3, 'value': 1.0}] \n", "17 [{'index': 1, 'value': 1.0}] \n", "18 [{'index': 3, 'value': 1.0}] \n", "19 [{'index': 3, 'value': 1.0}] \n", "20 [{'index': 3, 'value': 1.0}] \n", "21 [{'index': 2, 'value': 1.0}] \n", "22 [{'index': 1, 'value': 1.0}] \n", "24 [{'index': 1, 'value': 1.0}] \n", "25 [{'index': 1, 'value': 1.0}] \n", "26 [{'index': 1, 
'value': 1.0}] \n", "27 [{'index': 2, 'value': 1.0}] \n", "29 [{'index': 3, 'value': 1.0}] \n", "...\n", "\n", "[267 rows x 6 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from bigframes.ml.compose import ColumnTransformer\n", "from bigframes.ml.preprocessing import OneHotEncoder\n", "\n", "# Create an aggregate transform that applies StandardScaler to the numeric columns,\n", "# and OneHotEncoder to the string columns\n", "preproc = ColumnTransformer([\n", " (\"scale\", StandardScaler(), [\"culmen_length_mm\", \"culmen_depth_mm\", \"flipper_length_mm\"]),\n", " (\"encode\", OneHotEncoder(), [\"species\", \"sex\", \"island\"])])\n", "\n", "# Now we can fit all columns of the training data\n", "preproc.fit(X_train)\n", "\n", "processed_X_train = preproc.transform(X_train)\n", "processed_X_test = preproc.transform(X_test)\n", "\n", "# View the processed training data\n", "processed_X_train" ] }, { "cell_type": "markdown", "metadata": { "id": "JhoO4fctm4Q5" }, "source": [ "### Predictors\n", "\n", "Predictors are estimators that learn and make predictions. In addition to `.fit(...)`, the predictor implements a `.predict(...)` method, which will use what was learned during `.fit(...)` to predict some output.\n", "\n", "Predictors can be further broken down into two categories:\n", "* Supervised predictors\n", "* Unsupervised predictors" ] }, { "cell_type": "markdown", "metadata": { "id": "TqLItVyjslP8" }, "source": [ "#### Supervised predictors\n", "\n", "Supervised learning is when we train a model on input-output pairs, and then ask it to predict the output for new inputs. An example of such a predictor is `bigframes.ml.linear_model.LinearRegression`." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "id": "ZeloMmopm8KI" }, "outputs": [ { "data": { "text/html": [ "Query job a59bf4cc-4c92-4a68-96b1-7465fbcb3ed0 is DONE. 21.4 kB processed. 
Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 6860c534-a218-4a55-866d-a6e011399cd9 is DONE. 28.9 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 1b3e8da6-2d64-4337-872e-55b874f00596 is DONE. 0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job fc118469-8dd7-4187-a3c1-7c5c2f1c5e36 is DONE. 5.7 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 544c5453-cd10-4a08-a338-601d85142df8 is DONE. 536 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 41c82cc9-7268-40ae-a736-f7a5f2c8b413 is DONE. 0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job e9836f6b-160d-4ce4-88b6-0b04f40a1549 is DONE. 5.9 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
predicted_body_mass_gonehotencoded_islandstandard_scaled_culmen_length_mmstandard_scaled_culmen_depth_mmstandard_scaled_flipper_length_mmonehotencoded_sexonehotencoded_species
penguin_id
14772.376044[{'index': 1, 'value': 1.0}]0.220718-1.3592771.045621[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
153883.373922[{'index': 2, 'value': 1.0}]-0.5104390.157322-0.771966[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
283479.709088[{'index': 2, 'value': 1.0}]-1.0588070.713408-0.771966[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
324223.853626[{'index': 2, 'value': 1.0}]1.4636851.1683880.39129[{'index': 3, 'value': 1.0}][{'index': 2, 'value': 1.0}]
333197.623474[{'index': 2, 'value': 1.0}]-0.2545340.056215-0.990076[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
344155.26742[{'index': 2, 'value': 1.0}]-0.5104390.4606420.318587[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
373991.314095[{'index': 2, 'value': 1.0}]1.3540120.511195-0.263041[{'index': 3, 'value': 1.0}][{'index': 2, 'value': 1.0}]
413232.648242[{'index': 3, 'value': 1.0}]-0.674949-0.095445-1.789814[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
474017.740788[{'index': 2, 'value': 1.0}]-1.1684810.662855-0.117634[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
523365.080596[{'index': 2, 'value': 1.0}]0.4583440.308982-0.699262[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
563791.332002[{'index': 1, 'value': 1.0}]-1.0405280.460642-1.135483[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
573547.892992[{'index': 1, 'value': 1.0}]-0.9674120.005662-0.117634[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
625372.087702[{'index': 1, 'value': 1.0}]0.988433-0.7526381.191028[{'index': 3, 'value': 1.0}][{'index': 3, 'value': 1.0}]
654263.232169[{'index': 2, 'value': 1.0}]1.7561481.3706010.318587[{'index': 3, 'value': 1.0}][{'index': 2, 'value': 1.0}]
675234.45894[{'index': 1, 'value': 1.0}]0.677691-1.3592771.045621[{'index': 3, 'value': 1.0}][{'index': 3, 'value': 1.0}]
753979.314516[{'index': 1, 'value': 1.0}]-1.1136441.421155-0.771966[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
813481.331391[{'index': 2, 'value': 1.0}]0.6776910.561748-0.408448[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
893915.240555[{'index': 2, 'value': 1.0}]-0.8577390.713408-0.771966[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
923425.563946[{'index': 2, 'value': 1.0}]-0.8029020.308982-0.917372[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
934141.497717[{'index': 1, 'value': 1.0}]-0.3093711.168388-0.263041[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
963394.72289[{'index': 2, 'value': 1.0}]-0.3093710.662855-1.499[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
1003507.226918[{'index': 2, 'value': 1.0}]-0.9125760.814515-0.771966[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
1014922.286202[{'index': 1, 'value': 1.0}]0.549739-1.3087241.554546[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
1024016.243221[{'index': 2, 'value': 1.0}]-0.1265820.662855-0.626559[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
1074933.655362[{'index': 1, 'value': 1.0}]1.20778-1.0054041.118325[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
\n", "

25 rows × 7 columns

\n", "
[67 rows x 7 columns in total]" ], "text/plain": [ " predicted_body_mass_g onehotencoded_island \\\n", "penguin_id \n", "1 4772.376044 [{'index': 1, 'value': 1.0}] \n", "15 3883.373922 [{'index': 2, 'value': 1.0}] \n", "28 3479.709088 [{'index': 2, 'value': 1.0}] \n", "32 4223.853626 [{'index': 2, 'value': 1.0}] \n", "33 3197.623474 [{'index': 2, 'value': 1.0}] \n", "34 4155.26742 [{'index': 2, 'value': 1.0}] \n", "37 3991.314095 [{'index': 2, 'value': 1.0}] \n", "41 3232.648242 [{'index': 3, 'value': 1.0}] \n", "47 4017.740788 [{'index': 2, 'value': 1.0}] \n", "52 3365.080596 [{'index': 2, 'value': 1.0}] \n", "56 3791.332002 [{'index': 1, 'value': 1.0}] \n", "57 3547.892992 [{'index': 1, 'value': 1.0}] \n", "62 5372.087702 [{'index': 1, 'value': 1.0}] \n", "65 4263.232169 [{'index': 2, 'value': 1.0}] \n", "67 5234.45894 [{'index': 1, 'value': 1.0}] \n", "75 3979.314516 [{'index': 1, 'value': 1.0}] \n", "81 3481.331391 [{'index': 2, 'value': 1.0}] \n", "89 3915.240555 [{'index': 2, 'value': 1.0}] \n", "92 3425.563946 [{'index': 2, 'value': 1.0}] \n", "93 4141.497717 [{'index': 1, 'value': 1.0}] \n", "96 3394.72289 [{'index': 2, 'value': 1.0}] \n", "100 3507.226918 [{'index': 2, 'value': 1.0}] \n", "101 4922.286202 [{'index': 1, 'value': 1.0}] \n", "102 4016.243221 [{'index': 2, 'value': 1.0}] \n", "107 4933.655362 [{'index': 1, 'value': 1.0}] \n", "\n", " standard_scaled_culmen_length_mm standard_scaled_culmen_depth_mm \\\n", "penguin_id \n", "1 0.220718 -1.359277 \n", "15 -0.510439 0.157322 \n", "28 -1.058807 0.713408 \n", "32 1.463685 1.168388 \n", "33 -0.254534 0.056215 \n", "34 -0.510439 0.460642 \n", "37 1.354012 0.511195 \n", "41 -0.674949 -0.095445 \n", "47 -1.168481 0.662855 \n", "52 0.458344 0.308982 \n", "56 -1.040528 0.460642 \n", "57 -0.967412 0.005662 \n", "62 0.988433 -0.752638 \n", "65 1.756148 1.370601 \n", "67 0.677691 -1.359277 \n", "75 -1.113644 1.421155 \n", "81 0.677691 0.561748 \n", "89 -0.857739 0.713408 \n", "92 -0.802902 0.308982 \n", "93 
-0.309371 1.168388 \n", "96 -0.309371 0.662855 \n", "100 -0.912576 0.814515 \n", "101 0.549739 -1.308724 \n", "102 -0.126582 0.662855 \n", "107 1.20778 -1.005404 \n", "\n", " standard_scaled_flipper_length_mm onehotencoded_sex \\\n", "penguin_id \n", "1 1.045621 [{'index': 2, 'value': 1.0}] \n", "15 -0.771966 [{'index': 3, 'value': 1.0}] \n", "28 -0.771966 [{'index': 2, 'value': 1.0}] \n", "32 0.39129 [{'index': 3, 'value': 1.0}] \n", "33 -0.990076 [{'index': 2, 'value': 1.0}] \n", "34 0.318587 [{'index': 3, 'value': 1.0}] \n", "37 -0.263041 [{'index': 3, 'value': 1.0}] \n", "41 -1.789814 [{'index': 2, 'value': 1.0}] \n", "47 -0.117634 [{'index': 3, 'value': 1.0}] \n", "52 -0.699262 [{'index': 2, 'value': 1.0}] \n", "56 -1.135483 [{'index': 3, 'value': 1.0}] \n", "57 -0.117634 [{'index': 2, 'value': 1.0}] \n", "62 1.191028 [{'index': 3, 'value': 1.0}] \n", "65 0.318587 [{'index': 3, 'value': 1.0}] \n", "67 1.045621 [{'index': 3, 'value': 1.0}] \n", "75 -0.771966 [{'index': 3, 'value': 1.0}] \n", "81 -0.408448 [{'index': 2, 'value': 1.0}] \n", "89 -0.771966 [{'index': 3, 'value': 1.0}] \n", "92 -0.917372 [{'index': 2, 'value': 1.0}] \n", "93 -0.263041 [{'index': 3, 'value': 1.0}] \n", "96 -1.499 [{'index': 2, 'value': 1.0}] \n", "100 -0.771966 [{'index': 2, 'value': 1.0}] \n", "101 1.554546 [{'index': 2, 'value': 1.0}] \n", "102 -0.626559 [{'index': 3, 'value': 1.0}] \n", "107 1.118325 [{'index': 2, 'value': 1.0}] \n", "\n", " onehotencoded_species \n", "penguin_id \n", "1 [{'index': 3, 'value': 1.0}] \n", "15 [{'index': 1, 'value': 1.0}] \n", "28 [{'index': 1, 'value': 1.0}] \n", "32 [{'index': 2, 'value': 1.0}] \n", "33 [{'index': 2, 'value': 1.0}] \n", "34 [{'index': 1, 'value': 1.0}] \n", "37 [{'index': 2, 'value': 1.0}] \n", "41 [{'index': 1, 'value': 1.0}] \n", "47 [{'index': 1, 'value': 1.0}] \n", "52 [{'index': 2, 'value': 1.0}] \n", "56 [{'index': 1, 'value': 1.0}] \n", "57 [{'index': 1, 'value': 1.0}] \n", "62 [{'index': 3, 'value': 1.0}] \n", "65 
[{'index': 2, 'value': 1.0}] \n", "67 [{'index': 3, 'value': 1.0}] \n", "75 [{'index': 1, 'value': 1.0}] \n", "81 [{'index': 2, 'value': 1.0}] \n", "89 [{'index': 1, 'value': 1.0}] \n", "92 [{'index': 1, 'value': 1.0}] \n", "93 [{'index': 1, 'value': 1.0}] \n", "96 [{'index': 1, 'value': 1.0}] \n", "100 [{'index': 1, 'value': 1.0}] \n", "101 [{'index': 3, 'value': 1.0}] \n", "102 [{'index': 1, 'value': 1.0}] \n", "107 [{'index': 3, 'value': 1.0}] \n", "\n", "[67 rows x 7 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from bigframes.ml.linear_model import LinearRegression\n", "\n", "linreg = LinearRegression()\n", "\n", "# Learn from the training data how to predict output y\n", "linreg.fit(processed_X_train, y_train)\n", "\n", "# Predict y for the test data\n", "predicted_y_test = linreg.predict(processed_X_test)\n", "\n", "# View predictions\n", "predicted_y_test" ] }, { "cell_type": "markdown", "metadata": { "id": "z42qesW_nAIf" }, "source": [ "#### Unsupervised predictors\n", "\n", "In unsupervised learning, there are no known outputs in the training data; instead, the model learns from the input data alone and predicts something else. An example of an unsupervised predictor is `bigframes.ml.cluster.KMeans`, which learns how to fit input data to a target number of clusters." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "id": "M13zd02znCIg" }, "outputs": [ { "data": { "text/html": [ "Query job 728068d3-2349-4636-a030-016b500a9812 is DONE. 23.5 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 37bac685-2afa-4ece-b3a3-e0b84a92c65f is DONE. 0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 38416629-4615-45f5-9e27-d9164124f755 is DONE. 6.2 kB processed. 
Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 0241ea1c-8d96-418a-b3d6-08d819854954 is DONE. 536 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 405bcf9b-d652-42f3-931e-12ca0310fe4f is DONE. 0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 21ca6f31-2ea2-4f71-b030-c738bf5afe27 is DONE. 10.2 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CENTROID_IDNEAREST_CENTROIDS_DISTANCEonehotencoded_islandstandard_scaled_culmen_length_mmstandard_scaled_culmen_depth_mmstandard_scaled_flipper_length_mmonehotencoded_sexonehotencoded_species
penguin_id
13[{'CENTROID_ID': 3, 'DISTANCE': 0.857057881337...[{'index': 1, 'value': 1.0}]0.220718-1.3592771.045621[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
154[{'CENTROID_ID': 4, 'DISTANCE': 1.181613302004...[{'index': 2, 'value': 1.0}]-0.5104390.157322-0.771966[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
281[{'CENTROID_ID': 1, 'DISTANCE': 1.006856853050...[{'index': 2, 'value': 1.0}]-1.0588070.713408-0.771966[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
322[{'CENTROID_ID': 2, 'DISTANCE': 1.237504384283...[{'index': 2, 'value': 1.0}]1.4636851.1683880.39129[{'index': 3, 'value': 1.0}][{'index': 2, 'value': 1.0}]
332[{'CENTROID_ID': 2, 'DISTANCE': 1.656439702919...[{'index': 2, 'value': 1.0}]-0.2545340.056215-0.990076[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
344[{'CENTROID_ID': 4, 'DISTANCE': 1.343792119214...[{'index': 2, 'value': 1.0}]-0.5104390.4606420.318587[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
372[{'CENTROID_ID': 2, 'DISTANCE': 0.816670297369...[{'index': 2, 'value': 1.0}]1.3540120.511195-0.263041[{'index': 3, 'value': 1.0}][{'index': 2, 'value': 1.0}]
411[{'CENTROID_ID': 1, 'DISTANCE': 1.317560921596...[{'index': 3, 'value': 1.0}]-0.674949-0.095445-1.789814[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
474[{'CENTROID_ID': 4, 'DISTANCE': 1.135112005343...[{'index': 2, 'value': 1.0}]-1.1684810.662855-0.117634[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
522[{'CENTROID_ID': 2, 'DISTANCE': 1.004096945181...[{'index': 2, 'value': 1.0}]0.4583440.308982-0.699262[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
564[{'CENTROID_ID': 4, 'DISTANCE': 1.218648668822...[{'index': 1, 'value': 1.0}]-1.0405280.460642-1.135483[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
571[{'CENTROID_ID': 1, 'DISTANCE': 1.238466630273...[{'index': 1, 'value': 1.0}]-0.9674120.005662-0.117634[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
623[{'CENTROID_ID': 3, 'DISTANCE': 0.876984617451...[{'index': 1, 'value': 1.0}]0.988433-0.7526381.191028[{'index': 3, 'value': 1.0}][{'index': 3, 'value': 1.0}]
652[{'CENTROID_ID': 2, 'DISTANCE': 1.439604004538...[{'index': 2, 'value': 1.0}]1.7561481.3706010.318587[{'index': 3, 'value': 1.0}][{'index': 2, 'value': 1.0}]
673[{'CENTROID_ID': 3, 'DISTANCE': 0.763112987694...[{'index': 1, 'value': 1.0}]0.677691-1.3592771.045621[{'index': 3, 'value': 1.0}][{'index': 3, 'value': 1.0}]
754[{'CENTROID_ID': 4, 'DISTANCE': 1.075788925734...[{'index': 1, 'value': 1.0}]-1.1136441.421155-0.771966[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
812[{'CENTROID_ID': 2, 'DISTANCE': 0.777307801541...[{'index': 2, 'value': 1.0}]0.6776910.561748-0.408448[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
894[{'CENTROID_ID': 4, 'DISTANCE': 0.891303183824...[{'index': 2, 'value': 1.0}]-0.8577390.713408-0.771966[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
921[{'CENTROID_ID': 1, 'DISTANCE': 0.934676470689...[{'index': 2, 'value': 1.0}]-0.8029020.308982-0.917372[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
934[{'CENTROID_ID': 4, 'DISTANCE': 0.984620018517...[{'index': 1, 'value': 1.0}]-0.3093711.168388-0.263041[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
961[{'CENTROID_ID': 1, 'DISTANCE': 1.446939975674...[{'index': 2, 'value': 1.0}]-0.3093710.662855-1.499[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
1001[{'CENTROID_ID': 1, 'DISTANCE': 1.101117711572...[{'index': 2, 'value': 1.0}]-0.9125760.814515-0.771966[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
1013[{'CENTROID_ID': 3, 'DISTANCE': 0.823832007899...[{'index': 1, 'value': 1.0}]0.549739-1.3087241.554546[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
1024[{'CENTROID_ID': 4, 'DISTANCE': 0.995348310182...[{'index': 2, 'value': 1.0}]-0.1265820.662855-0.626559[{'index': 3, 'value': 1.0}][{'index': 1, 'value': 1.0}]
1073[{'CENTROID_ID': 3, 'DISTANCE': 0.930021405831...[{'index': 1, 'value': 1.0}]1.20778-1.0054041.118325[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
\n", "

25 rows × 8 columns

\n", "
[67 rows x 8 columns in total]" ], "text/plain": [ " CENTROID_ID NEAREST_CENTROIDS_DISTANCE \\\n", "penguin_id \n", "1 3 [{'CENTROID_ID': 3, 'DISTANCE': 0.857057881337... \n", "15 4 [{'CENTROID_ID': 4, 'DISTANCE': 1.181613302004... \n", "28 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.006856853050... \n", "32 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.237504384283... \n", "33 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.656439702919... \n", "34 4 [{'CENTROID_ID': 4, 'DISTANCE': 1.343792119214... \n", "37 2 [{'CENTROID_ID': 2, 'DISTANCE': 0.816670297369... \n", "41 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.317560921596... \n", "47 4 [{'CENTROID_ID': 4, 'DISTANCE': 1.135112005343... \n", "52 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.004096945181... \n", "56 4 [{'CENTROID_ID': 4, 'DISTANCE': 1.218648668822... \n", "57 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.238466630273... \n", "62 3 [{'CENTROID_ID': 3, 'DISTANCE': 0.876984617451... \n", "65 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.439604004538... \n", "67 3 [{'CENTROID_ID': 3, 'DISTANCE': 0.763112987694... \n", "75 4 [{'CENTROID_ID': 4, 'DISTANCE': 1.075788925734... \n", "81 2 [{'CENTROID_ID': 2, 'DISTANCE': 0.777307801541... \n", "89 4 [{'CENTROID_ID': 4, 'DISTANCE': 0.891303183824... \n", "92 1 [{'CENTROID_ID': 1, 'DISTANCE': 0.934676470689... \n", "93 4 [{'CENTROID_ID': 4, 'DISTANCE': 0.984620018517... \n", "96 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.446939975674... \n", "100 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.101117711572... \n", "101 3 [{'CENTROID_ID': 3, 'DISTANCE': 0.823832007899... \n", "102 4 [{'CENTROID_ID': 4, 'DISTANCE': 0.995348310182... \n", "107 3 [{'CENTROID_ID': 3, 'DISTANCE': 0.930021405831... 
\n", "\n", " onehotencoded_island standard_scaled_culmen_length_mm \\\n", "penguin_id \n", "1 [{'index': 1, 'value': 1.0}] 0.220718 \n", "15 [{'index': 2, 'value': 1.0}] -0.510439 \n", "28 [{'index': 2, 'value': 1.0}] -1.058807 \n", "32 [{'index': 2, 'value': 1.0}] 1.463685 \n", "33 [{'index': 2, 'value': 1.0}] -0.254534 \n", "34 [{'index': 2, 'value': 1.0}] -0.510439 \n", "37 [{'index': 2, 'value': 1.0}] 1.354012 \n", "41 [{'index': 3, 'value': 1.0}] -0.674949 \n", "47 [{'index': 2, 'value': 1.0}] -1.168481 \n", "52 [{'index': 2, 'value': 1.0}] 0.458344 \n", "56 [{'index': 1, 'value': 1.0}] -1.040528 \n", "57 [{'index': 1, 'value': 1.0}] -0.967412 \n", "62 [{'index': 1, 'value': 1.0}] 0.988433 \n", "65 [{'index': 2, 'value': 1.0}] 1.756148 \n", "67 [{'index': 1, 'value': 1.0}] 0.677691 \n", "75 [{'index': 1, 'value': 1.0}] -1.113644 \n", "81 [{'index': 2, 'value': 1.0}] 0.677691 \n", "89 [{'index': 2, 'value': 1.0}] -0.857739 \n", "92 [{'index': 2, 'value': 1.0}] -0.802902 \n", "93 [{'index': 1, 'value': 1.0}] -0.309371 \n", "96 [{'index': 2, 'value': 1.0}] -0.309371 \n", "100 [{'index': 2, 'value': 1.0}] -0.912576 \n", "101 [{'index': 1, 'value': 1.0}] 0.549739 \n", "102 [{'index': 2, 'value': 1.0}] -0.126582 \n", "107 [{'index': 1, 'value': 1.0}] 1.20778 \n", "\n", " standard_scaled_culmen_depth_mm \\\n", "penguin_id \n", "1 -1.359277 \n", "15 0.157322 \n", "28 0.713408 \n", "32 1.168388 \n", "33 0.056215 \n", "34 0.460642 \n", "37 0.511195 \n", "41 -0.095445 \n", "47 0.662855 \n", "52 0.308982 \n", "56 0.460642 \n", "57 0.005662 \n", "62 -0.752638 \n", "65 1.370601 \n", "67 -1.359277 \n", "75 1.421155 \n", "81 0.561748 \n", "89 0.713408 \n", "92 0.308982 \n", "93 1.168388 \n", "96 0.662855 \n", "100 0.814515 \n", "101 -1.308724 \n", "102 0.662855 \n", "107 -1.005404 \n", "\n", " standard_scaled_flipper_length_mm onehotencoded_sex \\\n", "penguin_id \n", "1 1.045621 [{'index': 2, 'value': 1.0}] \n", "15 -0.771966 [{'index': 3, 'value': 1.0}] \n", "28 -0.771966 
[{'index': 2, 'value': 1.0}] \n", "32 0.39129 [{'index': 3, 'value': 1.0}] \n", "33 -0.990076 [{'index': 2, 'value': 1.0}] \n", "34 0.318587 [{'index': 3, 'value': 1.0}] \n", "37 -0.263041 [{'index': 3, 'value': 1.0}] \n", "41 -1.789814 [{'index': 2, 'value': 1.0}] \n", "47 -0.117634 [{'index': 3, 'value': 1.0}] \n", "52 -0.699262 [{'index': 2, 'value': 1.0}] \n", "56 -1.135483 [{'index': 3, 'value': 1.0}] \n", "57 -0.117634 [{'index': 2, 'value': 1.0}] \n", "62 1.191028 [{'index': 3, 'value': 1.0}] \n", "65 0.318587 [{'index': 3, 'value': 1.0}] \n", "67 1.045621 [{'index': 3, 'value': 1.0}] \n", "75 -0.771966 [{'index': 3, 'value': 1.0}] \n", "81 -0.408448 [{'index': 2, 'value': 1.0}] \n", "89 -0.771966 [{'index': 3, 'value': 1.0}] \n", "92 -0.917372 [{'index': 2, 'value': 1.0}] \n", "93 -0.263041 [{'index': 3, 'value': 1.0}] \n", "96 -1.499 [{'index': 2, 'value': 1.0}] \n", "100 -0.771966 [{'index': 2, 'value': 1.0}] \n", "101 1.554546 [{'index': 2, 'value': 1.0}] \n", "102 -0.626559 [{'index': 3, 'value': 1.0}] \n", "107 1.118325 [{'index': 2, 'value': 1.0}] \n", "\n", " onehotencoded_species \n", "penguin_id \n", "1 [{'index': 3, 'value': 1.0}] \n", "15 [{'index': 1, 'value': 1.0}] \n", "28 [{'index': 1, 'value': 1.0}] \n", "32 [{'index': 2, 'value': 1.0}] \n", "33 [{'index': 2, 'value': 1.0}] \n", "34 [{'index': 1, 'value': 1.0}] \n", "37 [{'index': 2, 'value': 1.0}] \n", "41 [{'index': 1, 'value': 1.0}] \n", "47 [{'index': 1, 'value': 1.0}] \n", "52 [{'index': 2, 'value': 1.0}] \n", "56 [{'index': 1, 'value': 1.0}] \n", "57 [{'index': 1, 'value': 1.0}] \n", "62 [{'index': 3, 'value': 1.0}] \n", "65 [{'index': 2, 'value': 1.0}] \n", "67 [{'index': 3, 'value': 1.0}] \n", "75 [{'index': 1, 'value': 1.0}] \n", "81 [{'index': 2, 'value': 1.0}] \n", "89 [{'index': 1, 'value': 1.0}] \n", "92 [{'index': 1, 'value': 1.0}] \n", "93 [{'index': 1, 'value': 1.0}] \n", "96 [{'index': 1, 'value': 1.0}] \n", "100 [{'index': 1, 'value': 1.0}] \n", "101 [{'index': 3, 'value': 
1.0}] \n", "102 [{'index': 1, 'value': 1.0}] \n", "107 [{'index': 3, 'value': 1.0}] \n", "\n", "[67 rows x 8 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from bigframes.ml.cluster import KMeans\n", "\n", "# Specify KMeans with four clusters\n", "kmeans = KMeans(n_clusters=4)\n", "\n", "# Fit data\n", "kmeans.fit(processed_X_train)\n", "\n", "# View predictions\n", "kmeans.predict(processed_X_test)" ] }, { "cell_type": "markdown", "metadata": { "id": "DFwsIbscnEvh" }, "source": [ "## Pipelines\n", "\n", "Transformers and predictors can be chained into a single estimator component using `bigframes.ml.pipeline.Pipeline`:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "id": "Ku2OXqgJnEeR" }, "outputs": [ { "data": { "text/plain": [ "Pipeline(steps=[('preproc',\n", " ColumnTransformer(transformers=[('scale', StandardScaler(),\n", " ['culmen_length_mm',\n", " 'culmen_depth_mm',\n", " 'flipper_length_mm']),\n", " ('encode', OneHotEncoder(),\n", " ['species', 'sex',\n", " 'island'])])),\n", " ('linreg', LinearRegression())])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from bigframes.ml.pipeline import Pipeline\n", "\n", "pipeline = Pipeline([\n", " ('preproc', preproc),\n", " ('linreg', linreg)\n", "])\n", "\n", "# Print our pipeline\n", "pipeline" ] }, { "cell_type": "markdown", "metadata": { "id": "cCQCY_6wnKz_" }, "source": [ "The pipeline simplifies the workflow by applying each of its component steps automatically:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "id": "hsF7FYagnMko" }, "outputs": [ { "data": { "text/html": [ "Query job 95b43592-b198-4f9e-a990-4e837b82121f is DONE. 24.8 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 615b2afb-0c76-45d6-82c7-bde7c8b2b3a4 is DONE. 8.5 kB processed. 
Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job cf2ed3ca-01bf-4cb6-a71a-d6e30a8428f6 is DONE. 0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job d9780763-1d2b-494d-a778-20364c52bd08 is DONE. 29.6 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job f01296ba-7cd0-4d06-b25a-b5697e46bbf7 is DONE. 536 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 5b6fe451-2f8e-471e-a6a0-00b9bffaa826 is DONE. 0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 6a81883b-0514-4251-9f63-490b6346bb8b is DONE. 6.1 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
predicted_body_mass_gislandculmen_length_mmculmen_depth_mmflipper_length_mmsexspecies
penguin_id
14772.374547Biscoe45.114.5215.0FEMALEGentoo penguin (Pygoscelis papua)
153883.371052Dream41.117.5190.0MALEAdelie Penguin (Pygoscelis adeliae)
283479.706166Dream38.118.6190.0FEMALEAdelie Penguin (Pygoscelis adeliae)
324223.851137Dream51.919.5206.0MALEChinstrap penguin (Pygoscelis antarctica)
333197.620461Dream42.517.3187.0FEMALEChinstrap penguin (Pygoscelis antarctica)
344155.265191Dream41.118.1205.0MALEAdelie Penguin (Pygoscelis adeliae)
373991.311319Dream51.318.2197.0MALEChinstrap penguin (Pygoscelis antarctica)
413232.644783Torgersen40.217.0176.0FEMALEAdelie Penguin (Pygoscelis adeliae)
474017.738303Dream37.518.5199.0MALEAdelie Penguin (Pygoscelis adeliae)
523365.077659Dream46.417.8191.0FEMALEChinstrap penguin (Pygoscelis antarctica)
563791.328893Biscoe38.218.1185.0MALEAdelie Penguin (Pygoscelis adeliae)
573547.890609Biscoe38.617.2199.0FEMALEAdelie Penguin (Pygoscelis adeliae)
625372.086117Biscoe49.315.7217.0MALEGentoo penguin (Pygoscelis papua)
654263.229571Dream53.519.9205.0MALEChinstrap penguin (Pygoscelis antarctica)
675234.457401Biscoe47.614.5215.0MALEGentoo penguin (Pygoscelis papua)
753979.311469Biscoe37.820.0190.0MALEAdelie Penguin (Pygoscelis adeliae)
813481.328573Dream47.618.3195.0FEMALEChinstrap penguin (Pygoscelis antarctica)
893915.237615Dream39.218.6190.0MALEAdelie Penguin (Pygoscelis adeliae)
923425.560982Dream39.517.8188.0FEMALEAdelie Penguin (Pygoscelis adeliae)
934141.494969Biscoe42.219.5197.0MALEAdelie Penguin (Pygoscelis adeliae)
963394.719445Dream42.218.5180.0FEMALEAdelie Penguin (Pygoscelis adeliae)
1003507.223965Dream38.918.8190.0FEMALEAdelie Penguin (Pygoscelis adeliae)
1014922.284991Biscoe46.914.6222.0FEMALEGentoo penguin (Pygoscelis papua)
1024016.240318Dream43.218.5192.0MALEAdelie Penguin (Pygoscelis adeliae)
1074933.653758Biscoe50.515.2216.0FEMALEGentoo penguin (Pygoscelis papua)
\n", "

25 rows × 7 columns

\n", "
[67 rows x 7 columns in total]" ], "text/plain": [ " predicted_body_mass_g island culmen_length_mm \\\n", "penguin_id \n", "1 4772.374547 Biscoe 45.1 \n", "15 3883.371052 Dream 41.1 \n", "28 3479.706166 Dream 38.1 \n", "32 4223.851137 Dream 51.9 \n", "33 3197.620461 Dream 42.5 \n", "34 4155.265191 Dream 41.1 \n", "37 3991.311319 Dream 51.3 \n", "41 3232.644783 Torgersen 40.2 \n", "47 4017.738303 Dream 37.5 \n", "52 3365.077659 Dream 46.4 \n", "56 3791.328893 Biscoe 38.2 \n", "57 3547.890609 Biscoe 38.6 \n", "62 5372.086117 Biscoe 49.3 \n", "65 4263.229571 Dream 53.5 \n", "67 5234.457401 Biscoe 47.6 \n", "75 3979.311469 Biscoe 37.8 \n", "81 3481.328573 Dream 47.6 \n", "89 3915.237615 Dream 39.2 \n", "92 3425.560982 Dream 39.5 \n", "93 4141.494969 Biscoe 42.2 \n", "96 3394.719445 Dream 42.2 \n", "100 3507.223965 Dream 38.9 \n", "101 4922.284991 Biscoe 46.9 \n", "102 4016.240318 Dream 43.2 \n", "107 4933.653758 Biscoe 50.5 \n", "\n", " culmen_depth_mm flipper_length_mm sex \\\n", "penguin_id \n", "1 14.5 215.0 FEMALE \n", "15 17.5 190.0 MALE \n", "28 18.6 190.0 FEMALE \n", "32 19.5 206.0 MALE \n", "33 17.3 187.0 FEMALE \n", "34 18.1 205.0 MALE \n", "37 18.2 197.0 MALE \n", "41 17.0 176.0 FEMALE \n", "47 18.5 199.0 MALE \n", "52 17.8 191.0 FEMALE \n", "56 18.1 185.0 MALE \n", "57 17.2 199.0 FEMALE \n", "62 15.7 217.0 MALE \n", "65 19.9 205.0 MALE \n", "67 14.5 215.0 MALE \n", "75 20.0 190.0 MALE \n", "81 18.3 195.0 FEMALE \n", "89 18.6 190.0 MALE \n", "92 17.8 188.0 FEMALE \n", "93 19.5 197.0 MALE \n", "96 18.5 180.0 FEMALE \n", "100 18.8 190.0 FEMALE \n", "101 14.6 222.0 FEMALE \n", "102 18.5 192.0 MALE \n", "107 15.2 216.0 FEMALE \n", "\n", " species \n", "penguin_id \n", "1 Gentoo penguin (Pygoscelis papua) \n", "15 Adelie Penguin (Pygoscelis adeliae) \n", "28 Adelie Penguin (Pygoscelis adeliae) \n", "32 Chinstrap penguin (Pygoscelis antarctica) \n", "33 Chinstrap penguin (Pygoscelis antarctica) \n", "34 Adelie Penguin (Pygoscelis adeliae) \n", "37 Chinstrap penguin 
(Pygoscelis antarctica) \n", "41 Adelie Penguin (Pygoscelis adeliae) \n", "47 Adelie Penguin (Pygoscelis adeliae) \n", "52 Chinstrap penguin (Pygoscelis antarctica) \n", "56 Adelie Penguin (Pygoscelis adeliae) \n", "57 Adelie Penguin (Pygoscelis adeliae) \n", "62 Gentoo penguin (Pygoscelis papua) \n", "65 Chinstrap penguin (Pygoscelis antarctica) \n", "67 Gentoo penguin (Pygoscelis papua) \n", "75 Adelie Penguin (Pygoscelis adeliae) \n", "81 Chinstrap penguin (Pygoscelis antarctica) \n", "89 Adelie Penguin (Pygoscelis adeliae) \n", "92 Adelie Penguin (Pygoscelis adeliae) \n", "93 Adelie Penguin (Pygoscelis adeliae) \n", "96 Adelie Penguin (Pygoscelis adeliae) \n", "100 Adelie Penguin (Pygoscelis adeliae) \n", "101 Gentoo penguin (Pygoscelis papua) \n", "102 Adelie Penguin (Pygoscelis adeliae) \n", "107 Gentoo penguin (Pygoscelis papua) \n", "\n", "[67 rows x 7 columns]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipeline.fit(X_train, y_train)\n", "\n", "predicted_y_test = pipeline.predict(X_test)\n", "predicted_y_test" ] }, { "cell_type": "markdown", "metadata": { "id": "SiLzpsg8nRXn" }, "source": [ "Behind the scenes, the pipeline is compiled into a single model with an embedded TRANSFORM step." ] }, { "cell_type": "markdown", "metadata": { "id": "sTzAxTv1nUKZ" }, "source": [ "## Evaluating results\n", "\n", "Some models include a convenient `.score(X, y)` method for evaluation with a preset group of metrics:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "id": "Q8nR1ZqznU-B" }, "outputs": [ { "data": { "text/html": [ "Query job c098e1d1-b3ed-4ec5-94c7-6ba3b2b59e3f is DONE. 29.6 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 035234b0-537a-44ce-adff-bb51c40b4ffa is DONE. 0 Bytes processed. 
Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job b4a2a367-3e06-4fa3-9f00-bdbca884cfdd is DONE. 48 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_absolute_errormean_squared_errormean_squared_log_errormedian_absolute_errorr2_scoreexplained_variance
0225.88351277765.9892810.004457179.5480410.8731660.873315
\n", "

1 rows × 6 columns

\n", "
[1 rows x 6 columns in total]" ], "text/plain": [ " mean_absolute_error mean_squared_error mean_squared_log_error \\\n", "0 225.883512 77765.989281 0.004457 \n", "\n", " median_absolute_error r2_score explained_variance \n", "0 179.548041 0.873166 0.873315 \n", "\n", "[1 rows x 6 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# In the case of a pipeline, this will be equivalent to calling .score on the contained LinearRegression\n", "pipeline.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": { "id": "UHM7jls6nY8A" }, "source": [ "For a more general approach, the library `bigframes.ml.metrics` is provided:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "id": "vdEN4Ob9nan4" }, "outputs": [ { "data": { "text/html": [ "Query job 20ec1716-3e8e-4d3f-ba08-1f7b9970ce3f is DONE. 28.9 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job 6f628f3b-62df-4a5a-8e05-0b313db0ed07 is DONE. 28.9 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job c4eee1e5-146f-4a52-8499-83fe5f701f53 is DONE. 30.0 kB processed. 
Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.8731660699616813" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from bigframes.ml.metrics import r2_score\n", "\n", "r2_score(y_test, predicted_y_test[\"predicted_body_mass_g\"])" ] }, { "cell_type": "markdown", "metadata": { "id": "opn4ycPyneVh" }, "source": [ "## Save to BigQuery\n", "\n", "Estimators can be saved to BigQuery as BQML models and loaded again later.\n", "\n", "Saving requires the `bigquery.tables.create` permission, and loading requires the `bigquery.models.getMetadata` permission.\n", "These permissions can be granted at the project level or the dataset level.\n", "\n", "If you have those permissions, run the following cells." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "id": "fb0HpkdpnigJ" }, "outputs": [ { "data": { "text/html": [ "Copy job 06c2b62d-a7aa-46a5-a04a-2f189bafc5ee is DONE. 
Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Pipeline(steps=[('transform',\n", " ColumnTransformer(transformers=[('ont_hot_encoder',\n", " OneHotEncoder(max_categories=1000001,\n", " min_frequency=0),\n", " 'island'),\n", " ('standard_scaler',\n", " StandardScaler(),\n", " 'culmen_length_mm'),\n", " ('standard_scaler',\n", " StandardScaler(),\n", " 'culmen_depth_mm'),\n", " ('standard_scaler',\n", " StandardScaler(),\n", " 'flipper_length_mm'),\n", " ('ont_hot_encoder',\n", " OneHotEncoder(max_categories=1000001,\n", " min_frequency=0),\n", " 'sex'),\n", " ('ont_hot_encoder',\n", " OneHotEncoder(max_categories=1000001,\n", " min_frequency=0),\n", " 'species')])),\n", " ('estimator',\n", " LinearRegression(optimize_strategy='NORMAL_EQUATION'))])" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "linreg.to_gbq(f\"{DATASET}.penguins_model\", replace=True)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "id": "_zNOBlHdnkII" }, "outputs": [ { "data": { "text/plain": [ "Pipeline(steps=[('transform',\n", " ColumnTransformer(transformers=[('ont_hot_encoder',\n", " OneHotEncoder(max_categories=1000001,\n", " min_frequency=0),\n", " 'island'),\n", " ('standard_scaler',\n", " StandardScaler(),\n", " 'culmen_length_mm'),\n", " ('standard_scaler',\n", " StandardScaler(),\n", " 'culmen_depth_mm'),\n", " ('standard_scaler',\n", " StandardScaler(),\n", " 'flipper_length_mm'),\n", " ('ont_hot_encoder',\n", " OneHotEncoder(max_categories=1000001,\n", " min_frequency=0),\n", " 'sex'),\n", " ('ont_hot_encoder',\n", " OneHotEncoder(max_categories=1000001,\n", " min_frequency=0),\n", " 'species')])),\n", " ('estimator',\n", " LinearRegression(optimize_strategy='NORMAL_EQUATION'))])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bpd.read_gbq_model(f\"{DATASET}.penguins_model\")" ] }, { "cell_type": "markdown", 
"metadata": { "id": "RfV-du5uTcBB" }, "source": [ "We can also save the pipeline to BigQuery. BigQuery will save this as a single model, with the pre-processing steps embedded in the TRANSFORM property:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "id": "P76_TQ3IR6nB" }, "outputs": [ { "data": { "text/html": [ "Copy job a0ed8c1b-3a3f-4995-853c-e151d41560d7 is DONE. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Pipeline(steps=[('transform',\n", " ColumnTransformer(transformers=[('ont_hot_encoder',\n", " OneHotEncoder(max_categories=1000001,\n", " min_frequency=0),\n", " 'island'),\n", " ('standard_scaler',\n", " StandardScaler(),\n", " 'culmen_length_mm'),\n", " ('standard_scaler',\n", " StandardScaler(),\n", " 'culmen_depth_mm'),\n", " ('standard_scaler',\n", " StandardScaler(),\n", " 'flipper_length_mm'),\n", " ('ont_hot_encoder',\n", " OneHotEncoder(max_categories=1000001,\n", " min_frequency=0),\n", " 'sex'),\n", " ('ont_hot_encoder',\n", " OneHotEncoder(max_categories=1000001,\n", " min_frequency=0),\n", " 'species')])),\n", " ('estimator',\n", " LinearRegression(optimize_strategy='NORMAL_EQUATION'))])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipeline.to_gbq(f\"{DATASET}.penguins_pipeline\", replace=True)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "id": "GKvlKFjAbToJ" }, "outputs": [ { "data": { "text/plain": [ "Pipeline(steps=[('transform',\n", " ColumnTransformer(transformers=[('ont_hot_encoder',\n", " OneHotEncoder(max_categories=1000001,\n", " min_frequency=0),\n", " 'island'),\n", " ('standard_scaler',\n", " StandardScaler(),\n", " 'culmen_length_mm'),\n", " ('standard_scaler',\n", " StandardScaler(),\n", " 'culmen_depth_mm'),\n", " ('standard_scaler',\n", " StandardScaler(),\n", " 'flipper_length_mm'),\n", " ('ont_hot_encoder',\n", " OneHotEncoder(max_categories=1000001,\n", " 
min_frequency=0),\n", " 'sex'),\n", " ('ont_hot_encoder',\n", " OneHotEncoder(max_categories=1000001,\n", " min_frequency=0),\n", " 'species')])),\n", " ('estimator',\n", " LinearRegression(optimize_strategy='NORMAL_EQUATION'))])" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bpd.read_gbq_model(f\"{DATASET}.penguins_pipeline\")" ] }, { "cell_type": "markdown", "metadata": { "id": "wCsmt0IwFkDy" }, "source": [ "## Summary and next steps\n", "\n", "You've completed an end-to-end machine learning workflow using the built-in capabilities of BigQuery DataFrames.\n", "\n", "Learn more about BigQuery DataFrames in the [documentation](https://cloud.google.com/python/docs/reference/bigframes/latest) and find more sample notebooks in the [GitHub repo](https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks)." ] }, { "cell_type": "markdown", "metadata": { "id": "TpV-iwP9qw9c" }, "source": [ "### Cleaning up\n", "\n", "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n", "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", "\n", "Otherwise, you can uncomment the remaining cells and run them to delete the individual resources you created in this tutorial:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "id": "QwumLUKmVpuH" }, "outputs": [], "source": [ "# # Delete the BQML models\n", "# MODEL_NAME = f\"{PROJECT_ID}:{DATASET}.penguins_model\"\n", "# ! bq rm -f --model {MODEL_NAME}\n", "# PIPELINE_NAME = f\"{PROJECT_ID}:{DATASET}.penguins_pipeline\"\n", "# ! 
bq rm -f --model {PIPELINE_NAME}" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.1" } }, "nbformat": 4, "nbformat_minor": 0 }