{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Using ML - Easy linear regression\n", "\n", "This demo shows BigQuery DataFrames ML providing an SKLearn-like experience for\n", "training a linear regression model.\n", "\n", "In this \"easy\" version of linear regression, we use a couple of BQML features to simplify our code:\n", "\n", "- We rely on automatic preprocessing to encode string values and scale numeric values\n", "- We rely on automatic data split & evaluation to test the model\n", "\n", "This example is adapted from the [BQML linear regression tutorial](https://cloud.google.com/bigquery-ml/docs/linear-regression-tutorial)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Init & load data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the `bigframes.pandas` module and get the default session." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "import bigframes.pandas\n", "session = bigframes.pandas.get_global_session()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a dataset for storing the BQML model, and create it if it does not exist." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset = f\"{session.bqclient.project}.bqml_tutorial\"\n", "session.bqclient.create_dataset(dataset, exists_ok=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a model path." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "penguins_model = f\"{dataset}.penguins_model\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the penguins data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# read a BigQuery table into a BigQuery DataFrame\n", "df = bigframes.pandas.read_gbq(\"bigquery-public-data.ml_datasets.penguins\")\n", "\n", "# take a peek at the dataframe\n", "df" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Data cleaning / prep" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# filter down to the data we want to analyze\n", "adelie_data = df[df.species == \"Adelie Penguin (Pygoscelis adeliae)\"]\n", "\n", "# drop the columns we don't care about\n", "adelie_data = adelie_data.drop(columns=[\"species\"])\n", "\n", "# drop rows with nulls to get our training data\n", "training_data = adelie_data.dropna()\n", "\n", "# take a peek at the training data\n", "training_data" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# pick feature columns and label column\n", "feature_columns = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]\n", "label_columns = training_data[['body_mass_g']]\n", "\n", "# also get the rows that we want to make predictions for (i.e. where the label column is null)\n", "missing_body_mass = adelie_data[adelie_data.body_mass_g.isnull()]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Create, fit, score, predict" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bigframes.ml.linear_model import LinearRegression\n", "\n", "model = LinearRegression()\n", "\n", "# Here we pass the feature columns without transforms - BQML will then use\n", "# automatic preprocessing to encode these columns\n", "model.fit(feature_columns, label_columns)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check how the model performed\n", "model.score(feature_columns, label_columns)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use the model to predict the missing labels\n", "model.predict(missing_body_mass)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Save in BigQuery" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# save the model to a permanent location in BigQuery, so we can use it in future sessions (and elsewhere in BQ)\n", "model.to_gbq(penguins_model, replace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. 
Reload from BigQuery" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# WARNING - until b/281709360 is fixed & pipeline is updated, pipelines will load as models,\n", "# and details of their transform steps will be lost (the loaded model will behave the same)\n", "bigframes.pandas.read_gbq_model(penguins_model)" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "a850322d07d9bdc9ec5f301d307e048bcab2390ae395e1cbce9335f4e081e5e2" } } }, "nbformat": 4, "nbformat_minor": 2 }