{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using ML - Easy linear regression\n",
    "\n",
    "This demo shows BigQuery DataFrames ML providing an SKLearn-like experience for\n",
    "training a linear regression model.\n",
    "\n",
    "In this \"easy\" version of linear regression, we use a couple of BQML features to simplify our code:\n",
    "\n",
    "- We rely on automatic preprocessing to encode string values and scale numeric values\n",
    "- We rely on automatic data split & evaluation to test the model\n",
    "\n",
    "This example is adapted from the [BQML linear regression tutorial](https://cloud.google.com/bigquery-ml/docs/linear-regression-tutorial)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Init & load data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import `bigframes.pandas` module and get the default session"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "import bigframes.pandas\n",
    "session = bigframes.pandas.get_global_session()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define a dataset for storing BQML model, and create it if it does not exist."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = f\"{session.bqclient.project}.bqml_tutorial\"\n",
    "session.bqclient.create_dataset(dataset, exists_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define a model path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "penguins_model = f\"{dataset}.penguins_model\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read the penguins data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# read a BigQuery table to a BigQuery DataFrame\n",
    "df = bigframes.pandas.read_gbq(f\"bigquery-public-data.ml_datasets.penguins\")\n",
    "\n",
    "# take a peek at the dataframe\n",
    "df"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Data cleaning / prep"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# filter down to the data we want to analyze\n",
    "adelie_data = df[df.species == \"Adelie Penguin (Pygoscelis adeliae)\"]\n",
    "\n",
    "# drop the columns we don't care about\n",
    "adelie_data = adelie_data.drop(columns=[\"species\"])\n",
    "\n",
    "# drop rows with nulls to get our training data\n",
    "training_data = adelie_data.dropna()\n",
    "\n",
    "# take a peek at the training data\n",
    "training_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pick feature columns and label column\n",
    "feature_columns = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]\n",
    "label_columns = training_data[['body_mass_g']]                               \n",
    "\n",
    "# also get the rows that we want to make predictions for (i.e. where the feature column is null)\n",
    "missing_body_mass = adelie_data[adelie_data.body_mass_g.isnull()]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Create, score, fit, predict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bigframes.ml.linear_model import LinearRegression\n",
    "\n",
    "model = LinearRegression()\n",
    "\n",
    "# Here we pass the feature columns without transforms - BQML will then use\n",
    "# automatic preprocessing to encode these columns\n",
    "model.fit(feature_columns, label_columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check how the model performed\n",
    "model.score(feature_columns, label_columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# use the model to predict the missing labels\n",
    "model.predict(missing_body_mass)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Save in BigQuery"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save the model to a permanent location in BigQuery, so we can use it in future sessions (and elsewhere in BQ)\n",
    "model.to_gbq(penguins_model, replace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Reload from BigQuery"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# WARNING - until b/281709360 is fixed & pipeline is updated, pipelines will load as models,\n",
    "# and details of their transform steps will be lost (the loaded model will behave the same)\n",
    "bigframes.pandas.read_gbq_model(penguins_model)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "a850322d07d9bdc9ec5f301d307e048bcab2390ae395e1cbce9335f4e081e5e2"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}