{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "id": "ur8xi4C7S06n" }, "outputs": [], "source": [ "# Copyright 2023 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "JAPoU8Sm5E6e" }, "source": [ "# BigQuery DataFrames Quickstart Guide\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "
\n", " \n", " \"Colab Run in Colab\n", " \n", " \n", " \n", " \"GitHub\n", " View on GitHub\n", " \n", " \n", " \n", " \"Vertex\n", " Open in Vertex AI Workbench\n", " \n", " \n", " \n", " \"BQ\n", " Open in BQ Studio\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "24743cf4a1e1" }, "source": [ "**_NOTE_**: This notebook has been tested in the following environment:\n", "\n", "* Python version = 3.12" ] }, { "cell_type": "markdown", "metadata": { "id": "tvgnzT1CKxrO" }, "source": [ "## Overview\n", "\n", "In this guide, you learn how to install BigQuery DataFrames, load data into a BigQuery DataFrames DataFrame, and inspect and manipulate the data using pandas and a custom Python function, running at BigQuery scale.\n", "\n", "The steps include:\n", "\n", "- Installing the BigQuery Dataframes package.\n", "- Setting up the environment.\n", "- Creating a BigQuery DataFrames DataFrame: Access data from a local CSV to create a BigQuery DataFrames DataFrame.\n", "- Inspecting and manipulating data: Use pandas to perform data cleaning and preparation on the DataFrame.\n", "- Deploying a custom function: Deploy a [remote function ](https://cloud.google.com/bigquery/docs/remote-functions)that runs a scalar Python function at BigQuery scale.\n", "\n", "You can learn more about [BigQuery DataFrames](https://cloud.google.com/python/docs/reference/bigframes/latest)." ] }, { "cell_type": "markdown", "metadata": { "id": "08d289fa873f" }, "source": [ "### Dataset\n", "\n", "This tutorial uses the [```penguins``` table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=ml_datasets&t=penguins) (a BigQuery public dataset), which contains data on a set of penguins including species, island of residence, weight, culmen length and depth, flipper length, and sex.\n", "\n", "The same dataset is also stored in a public Cloud Storage bucket as a CSV file so that you can use it to try ingesting data from a local environment." ] }, { "cell_type": "markdown", "metadata": { "id": "aed92deeb4a0" }, "source": [ "### Costs\n", "\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "* BigQuery (storage and compute)\n", "* Cloud Functions\n", "\n", "Learn about [BigQuery storage pricing](https://cloud.google.com/bigquery/pricing#storage),\n", "[BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models),\n", "and [Cloud Functions pricing](https://cloud.google.com/functions/pricing),\n", "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n", "to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": { "id": "i7EUnXsZhAGF" }, "source": [ "## Installation\n", "\n", "Install the following packages, which are required to run this notebook:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "mfPoOwPLGpSr" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: bigframes in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (2.17.0)\n", "Requirement already satisfied: cloudpickle>=2.0.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (3.1.1)\n", "Requirement already satisfied: fsspec>=2023.3.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (2025.9.0)\n", "Requirement already satisfied: gcsfs!=2025.5.0,>=2023.3.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (2025.9.0)\n", "Requirement already satisfied: geopandas>=0.12.2 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (1.1.1)\n", "Requirement already satisfied: google-auth<3.0,>=2.15.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (2.40.3)\n", "Requirement already satisfied: google-cloud-bigquery>=3.36.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from google-cloud-bigquery[bqstorage,pandas]>=3.36.0->bigframes) (3.36.0)\n", "Requirement already satisfied: google-cloud-bigquery-storage<3.0.0,>=2.30.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (2.33.0)\n", "Requirement already satisfied: google-cloud-functions>=1.12.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (1.20.4)\n", "Requirement already satisfied: google-cloud-bigquery-connection>=1.12.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (1.18.3)\n", "Requirement already satisfied: google-cloud-resource-manager>=1.10.3 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (1.14.2)\n", "Requirement already satisfied: google-cloud-storage>=2.0.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (3.3.1)\n", "Requirement already satisfied: grpc-google-iam-v1>=0.14.2 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (0.14.2)\n", "Requirement already satisfied: numpy>=1.24.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (2.2.6)\n", "Requirement already satisfied: pandas>=1.5.3 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (2.3.2)\n", "Requirement already satisfied: pandas-gbq>=0.26.1 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (0.29.2)\n", "Requirement already satisfied: pyarrow>=15.0.2 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (21.0.0)\n", "Requirement already satisfied: pydata-google-auth>=1.8.2 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (1.9.1)\n", "Requirement already satisfied: requests>=2.27.1 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (2.32.5)\n", "Requirement already satisfied: shapely>=1.8.5 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (2.1.1)\n", "Requirement already satisfied: sqlglot>=23.6.3 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (27.11.0)\n", "Requirement already satisfied: tabulate>=0.9 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (0.9.0)\n", "Requirement already satisfied: ipywidgets>=7.7.1 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (8.1.7)\n", "Requirement already satisfied: humanize>=4.6.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (4.13.0)\n", "Requirement already satisfied: matplotlib>=3.7.1 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (3.10.6)\n", "Requirement already satisfied: db-dtypes>=1.4.2 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (1.4.3)\n", "Requirement already satisfied: atpublic<6,>=2.3 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (5.1)\n", "Requirement already satisfied: python-dateutil<3,>=2.8.2 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (2.9.0.post0)\n", "Requirement already satisfied: pytz>=2022.7 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (2025.2)\n", "Requirement already satisfied: toolz<2,>=0.11 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (1.0.0)\n", "Requirement already satisfied: typing-extensions<5,>=4.5.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (4.15.0)\n", "Requirement already satisfied: rich<14,>=12.4.4 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from bigframes) (13.9.4)\n", "Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from google-auth<3.0,>=2.15.0->bigframes) (5.5.2)\n", "Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from google-auth<3.0,>=2.15.0->bigframes) (0.4.2)\n", "Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from google-auth<3.0,>=2.15.0->bigframes) (4.9.1)\n", "Requirement already satisfied: google-api-core!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0,>=1.34.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0,>=1.34.0->google-cloud-bigquery-storage<3.0.0,>=2.30.0->bigframes) (2.25.1)\n", "Requirement already satisfied: proto-plus<2.0.0,>=1.22.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from google-cloud-bigquery-storage<3.0.0,>=2.30.0->bigframes) (1.26.1)\n", "Requirement already satisfied: protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<7.0.0,>=3.20.2 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from google-cloud-bigquery-storage<3.0.0,>=2.30.0->bigframes) (6.32.0)\n", "Requirement already satisfied: googleapis-common-protos<2.0.0,>=1.56.2 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0,>=1.34.0->google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0,>=1.34.0->google-cloud-bigquery-storage<3.0.0,>=2.30.0->bigframes) (1.70.0)\n", "Requirement already satisfied: grpcio<2.0.0,>=1.33.2 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0,>=1.34.0->google-cloud-bigquery-storage<3.0.0,>=2.30.0->bigframes) (1.74.0)\n", "Requirement already satisfied: grpcio-status<2.0.0,>=1.33.2 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0,>=1.34.0->google-cloud-bigquery-storage<3.0.0,>=2.30.0->bigframes) (1.74.0)\n", "Requirement already satisfied: six>=1.5 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from python-dateutil<3,>=2.8.2->bigframes) (1.17.0)\n", "Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from requests>=2.27.1->bigframes) (3.4.3)\n", "Requirement already satisfied: idna<4,>=2.5 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from requests>=2.27.1->bigframes) (3.10)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from requests>=2.27.1->bigframes) (2.5.0)\n", "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from requests>=2.27.1->bigframes) (2025.8.3)\n", "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from rich<14,>=12.4.4->bigframes) (4.0.0)\n", "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from rich<14,>=12.4.4->bigframes) (2.19.2)\n", "Requirement already satisfied: pyasn1>=0.1.3 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from rsa<5,>=3.1.4->google-auth<3.0,>=2.15.0->bigframes) (0.6.1)\n", "Requirement already satisfied: packaging>=24.2.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from db-dtypes>=1.4.2->bigframes) (25.0)\n", "Requirement already satisfied: aiohttp!=4.0.0a0,!=4.0.0a1 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from gcsfs!=2025.5.0,>=2023.3.0->bigframes) (3.12.15)\n", "Requirement already satisfied: decorator>4.1.2 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from gcsfs!=2025.5.0,>=2023.3.0->bigframes) (5.2.1)\n", "Requirement already satisfied: google-auth-oauthlib in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from gcsfs!=2025.5.0,>=2023.3.0->bigframes) (1.2.2)\n", "Requirement already satisfied: aiohappyeyeballs>=2.5.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->gcsfs!=2025.5.0,>=2023.3.0->bigframes) (2.6.1)\n", "Requirement already satisfied: aiosignal>=1.4.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->gcsfs!=2025.5.0,>=2023.3.0->bigframes) (1.4.0)\n", "Requirement already satisfied: async-timeout<6.0,>=4.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->gcsfs!=2025.5.0,>=2023.3.0->bigframes) (5.0.1)\n", "Requirement already satisfied: attrs>=17.3.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->gcsfs!=2025.5.0,>=2023.3.0->bigframes) (25.3.0)\n", "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->gcsfs!=2025.5.0,>=2023.3.0->bigframes) (1.7.0)\n", "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->gcsfs!=2025.5.0,>=2023.3.0->bigframes) (6.6.4)\n", "Requirement already satisfied: propcache>=0.2.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->gcsfs!=2025.5.0,>=2023.3.0->bigframes) (0.3.2)\n", "Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->gcsfs!=2025.5.0,>=2023.3.0->bigframes) (1.20.1)\n", "Requirement already satisfied: pyogrio>=0.7.2 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from geopandas>=0.12.2->bigframes) (0.11.1)\n", "Requirement already satisfied: pyproj>=3.5.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from geopandas>=0.12.2->bigframes) (3.7.1)\n", "Requirement already satisfied: google-cloud-core<3.0.0,>=2.4.1 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from google-cloud-bigquery>=3.36.0->google-cloud-bigquery[bqstorage,pandas]>=3.36.0->bigframes) (2.4.3)\n", "Requirement already satisfied: google-resumable-media<3.0.0,>=2.0.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from google-cloud-bigquery>=3.36.0->google-cloud-bigquery[bqstorage,pandas]>=3.36.0->bigframes) (2.7.2)\n", "Requirement already satisfied: google-crc32c<2.0dev,>=1.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from google-resumable-media<3.0.0,>=2.0.0->google-cloud-bigquery>=3.36.0->google-cloud-bigquery[bqstorage,pandas]>=3.36.0->bigframes) (1.7.1)\n", "Requirement already satisfied: comm>=0.1.3 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from ipywidgets>=7.7.1->bigframes) (0.2.3)\n", "Requirement already satisfied: ipython>=6.1.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from ipywidgets>=7.7.1->bigframes) (8.37.0)\n", "Requirement already satisfied: traitlets>=4.3.1 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from ipywidgets>=7.7.1->bigframes) (5.14.3)\n", "Requirement already satisfied: widgetsnbextension~=4.0.14 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from ipywidgets>=7.7.1->bigframes) (4.0.14)\n", "Requirement already satisfied: jupyterlab_widgets~=3.0.15 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from ipywidgets>=7.7.1->bigframes) (3.0.15)\n", "Requirement already satisfied: exceptiongroup in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from ipython>=6.1.0->ipywidgets>=7.7.1->bigframes) (1.3.0)\n", "Requirement already satisfied: jedi>=0.16 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from ipython>=6.1.0->ipywidgets>=7.7.1->bigframes) (0.19.2)\n", "Requirement already satisfied: matplotlib-inline in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from ipython>=6.1.0->ipywidgets>=7.7.1->bigframes) (0.1.7)\n", "Requirement already satisfied: pexpect>4.3 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from ipython>=6.1.0->ipywidgets>=7.7.1->bigframes) (4.9.0)\n", "Requirement already satisfied: prompt_toolkit<3.1.0,>=3.0.41 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from ipython>=6.1.0->ipywidgets>=7.7.1->bigframes) (3.0.52)\n", "Requirement already satisfied: stack_data in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from ipython>=6.1.0->ipywidgets>=7.7.1->bigframes) (0.6.3)\n", "Requirement already satisfied: wcwidth in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from prompt_toolkit<3.1.0,>=3.0.41->ipython>=6.1.0->ipywidgets>=7.7.1->bigframes) (0.2.13)\n", "Requirement already satisfied: parso<0.9.0,>=0.8.4 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from jedi>=0.16->ipython>=6.1.0->ipywidgets>=7.7.1->bigframes) (0.8.5)\n", "Requirement already satisfied: mdurl~=0.1 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from markdown-it-py>=2.2.0->rich<14,>=12.4.4->bigframes) (0.1.2)\n", "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from matplotlib>=3.7.1->bigframes) (1.3.2)\n", "Requirement already satisfied: cycler>=0.10 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from matplotlib>=3.7.1->bigframes) (0.12.1)\n", "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from matplotlib>=3.7.1->bigframes) (4.59.2)\n", "Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from matplotlib>=3.7.1->bigframes) (1.4.9)\n", "Requirement already satisfied: pillow>=8 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from matplotlib>=3.7.1->bigframes) (11.3.0)\n", "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from matplotlib>=3.7.1->bigframes) (3.2.3)\n", "Requirement already satisfied: tzdata>=2022.7 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from pandas>=1.5.3->bigframes) (2025.2)\n", "Requirement already satisfied: setuptools in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from pandas-gbq>=0.26.1->bigframes) (65.5.0)\n", "Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from google-auth-oauthlib->gcsfs!=2025.5.0,>=2023.3.0->bigframes) (2.0.0)\n", "Requirement already satisfied: ptyprocess>=0.5 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from pexpect>4.3->ipython>=6.1.0->ipywidgets>=7.7.1->bigframes) (0.7.0)\n", "Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib->gcsfs!=2025.5.0,>=2023.3.0->bigframes) (3.3.1)\n", "Requirement already satisfied: executing>=1.2.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from stack_data->ipython>=6.1.0->ipywidgets>=7.7.1->bigframes) (2.2.1)\n", "Requirement already satisfied: asttokens>=2.1.0 in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from stack_data->ipython>=6.1.0->ipywidgets>=7.7.1->bigframes) (3.0.0)\n", "Requirement already satisfied: pure-eval in /usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/venv/lib/python3.10/site-packages (from stack_data->ipython>=6.1.0->ipywidgets>=7.7.1->bigframes) (0.2.3)\n" ] } ], "source": [ "!pip install bigframes" ] }, { "cell_type": "markdown", "metadata": { "id": "58707a750154" }, "source": [ "### Colab only\n", "\n", "Uncomment and run the following cell to restart the kernel:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "f200f10a1da3" }, "outputs": [], "source": [ "# Automatically restart kernel after installs so that your environment can access the new packages\n", "# import IPython\n", "\n", "# app = IPython.Application.instance()\n", "# app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": { "id": "BF1j6f9HApxa" }, "source": [ "## Environment setup\n", "\n", "Complete the tasks in this section to set up your environment." ] }, { "cell_type": "markdown", "metadata": { "id": "Yq7zKYWelRQP" }, "source": [ "### Set up your Google Cloud project\n", "\n", "**The following steps are required, regardless of your notebook environment.**\n", "\n", "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 credit towards your compute/storage costs.\n", "\n", "2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n", "\n", "3. [Click here](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com,bigqueryconnection.googleapis.com,cloudfunctions.googleapis.com,run.googleapis.com,artifactregistry.googleapis.com,cloudbuild.googleapis.com,cloudresourcemanager.googleapis.com) to enable the following APIs:\n", "\n", " * BigQuery API\n", " * BigQuery Connection API\n", " * Cloud Functions API\n", " * Cloud Run API\n", " * Artifact Registry API\n", " * Cloud Build API\n", " * Cloud Resource Manager API\n", "\n", "4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk)." ] }, { "cell_type": "markdown", "metadata": { "id": "WReHDGG5g0XY" }, "source": [ "#### Set your project ID\n", "\n", "If you don't know your project ID, try the following:\n", "* Run `gcloud config list`.\n", "* Run `gcloud projects list`.\n", "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "oM1iC_MfAts1" }, "outputs": [], "source": [ "PROJECT_ID = \"\" # @param {type:\"string\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "region" }, "source": [ "#### Set the region\n", "\n", "You can also change the `REGION` variable used by BigQuery. Learn more about [BigQuery regions](https://cloud.google.com/bigquery/docs/locations#supported_locations)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "eF-Twtc4XGem" }, "outputs": [], "source": [ "REGION = \"US\" # @param {type: \"string\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "sBCra4QMA2wR" }, "source": [ "### Authenticate your Google Cloud account\n", "\n", "Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below." ] }, { "cell_type": "markdown", "metadata": { "id": "74ccc9e52986" }, "source": [ "**Vertex AI Workbench**\n", "\n", "Do nothing, you are already authenticated." ] }, { "cell_type": "markdown", "metadata": { "id": "de775a3773ba" }, "source": [ "**Local JupyterLab instance**\n", "\n", "Uncomment and run the following cell:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "254614fa0c46" }, "outputs": [], "source": [ "# ! gcloud auth login" ] }, { "cell_type": "markdown", "metadata": { "id": "ef21552ccea8" }, "source": [ "**Colab**\n", "\n", "Uncomment and run the following cell:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "603adbbf0532" }, "outputs": [], "source": [ "# from google.colab import auth\n", "# auth.authenticate_user()" ] }, { "cell_type": "markdown", "metadata": { "id": "960505627ddf" }, "source": [ "### Import libraries" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "PyQmSRbKA8r-" }, "outputs": [], "source": [ "import bigframes.pandas as bpd" ] }, { "cell_type": "markdown", "metadata": { "id": "init_aip:mbsdk,all" }, "source": [ "\n", "### Set BigQuery DataFrames options" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "NPPMuw2PXGeo" }, "outputs": [], "source": [ "# Note: The project option is not required in all environments.\n", "# On BigQuery Studio, the project ID is automatically detected.\n", "bpd.options.bigquery.project = PROJECT_ID\n", "\n", "# Note: The location option is not required.\n", "# It defaults to the location of the first table or query\n", "# passed to read_gbq(). For APIs where a location can't be\n", "# auto-detected, the location defaults to the \"US\" location.\n", "bpd.options.bigquery.location = REGION\n", "\n", "# Note: By default BigQuery DataFrames emits out BigQuery job metadata via a\n", "# progress bar. But in this notebook let's disable the progress bar to keep the\n", "# experience less verbose. If you would like the default behavior, please\n", "# comment out the following expression. \n", "bpd.options.display.progress_bar = None" ] }, { "cell_type": "markdown", "metadata": { "id": "pDfrKwMKE_dK" }, "source": [ "If you want to reset the location of the created DataFrame or Series objects, reset the session by executing `bpd.close_session()`. After that, you can reuse `bpd.options.bigquery.location` to specify another location." ] }, { "cell_type": "markdown", "metadata": { "id": "-19Uiwoo9pP4" }, "source": [ "## See the power of BigQuery DataFrames first-hand\n", "\n", "BigQuery DataFrames enables you to interact with datasets of any size, so that you can explore, transform, and understand even your biggest datasets using familiar tools like pandas and scikit-learn." ] }, { "cell_type": "markdown", "metadata": { "id": "KMX4D2uoBwM0" }, "source": [ "For example, take the BigQuery sample table `bigquery-samples.wikipedia_pageviews.200809h`, which is ~60 GB is size. This is not a dataset you'd likely be able process in pandas without extra infrastructure.\n", "\n", "With BigQuery DataFrames, however, computation is handled by BigQuery's highly scalable compute engine, meaning you can focus on doing data science without hitting size limitations." ] }, { "cell_type": "markdown", "metadata": { "id": "i98c46p1CXoV" }, "source": [ "If you'd like to try creating a BigQuery DataFrames DataFrame from this table, uncomment and run the next cell to load the table using the `read_gbq` method.\n", "\n", "> Note: Keep in mind that running these operations will count against your monthly [free tier allowance in BigQuery](https://cloud.google.com/bigquery/pricing#free-tier)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Vyex9BQI-BNa" }, "outputs": [], "source": [ "# bq_df_sample = bpd.read_gbq(\"bigquery-samples.wikipedia_pageviews.200809h\")" ] }, { "cell_type": "markdown", "metadata": { "id": "gE6CEALjDZZV" }, "source": [ "No problem! BigQuery DataFrames makes a DataFrame, `bq_df_sample`, containing the entirety of the source table of data." ] }, { "cell_type": "markdown", "metadata": { "id": "T6lAIeelDwLz" }, "source": [ "Uncomment and run the following cell to see pandas in action over your new BigQuery DataFrames DataFrame.\n", "\n", "This code uses regex to filter the DataFrame to include only rows with Wikipedia page titles containing the word \"Google\", sums the total views by page title, and then returns the top 10 results." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XfGq5apK-D_e" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titleviews
21911Google1414560
27669Google_Chrome962482
28394Google_Earth383566
29184Google_Maps205089
27251Google_Android99450
33900Google_search97665
31825Google_chrome78399
30204Google_Street_View71580
40798Image:Google_Chrome.png60746
35222Googleplex53848
\n", "

10 rows × 2 columns

\n", "
[10 rows x 2 columns in total]" ], "text/plain": [ " title views\n", "21911 Google 1414560\n", "27669 Google_Chrome 962482\n", "28394 Google_Earth 383566\n", "29184 Google_Maps 205089\n", "27251 Google_Android 99450\n", "33900 Google_search 97665\n", "31825 Google_chrome 78399\n", "30204 Google_Street_View 71580\n", "40798 Image:Google_Chrome.png 60746\n", "35222 Googleplex 53848\n", "\n", "[10 rows x 2 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# bq_df_sample[bq_df_sample.title.str.contains(r\"[Gg]oogle\")]\\\n", "# .groupby(['title'], as_index=False)['views'].sum(numeric_only=True)\\\n", "# .sort_values('views', ascending=False)\\\n", "# .head(10)" ] }, { "cell_type": "markdown", "metadata": { "id": "i6XV-HTN-IFF" }, "source": [ "In addition to giving you access to pandas, BigQuery DataFrames also enables you to build ML models, run inference, and deploy and run your own Python functions at scale. You'll see examples throughout this and other notebooks in this GitHub repo.\n", "\n", "Now you'll move to the smaller `penguins` dataset for the remainder of this getting started guide." ] }, { "cell_type": "markdown", "metadata": { "id": "9EMAqR37AfLS" }, "source": [ "## Create a BigQuery DataFrames DataFrame\n", "\n", "You can create a BigQuery DataFrames DataFrame by reading data from any of the following locations:\n", "\n", "* A local data file\n", "* Data stored in a BigQuery table\n", "* A data file stored in Cloud Storage\n", "* An in-memory pandas DataFrame\n", "\n", "The following sections show how to use the first two options." ] }, { "cell_type": "markdown", "metadata": { "id": "iZDjzglh9eWZ" }, "source": [ "### Create a DataFrame from a local file\n", "\n", "Use the instructions in the following sections to create a BigQuery DataFrames DataFrame from a local file.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "8Jry3NoFv3Wm" }, "source": [ "#### Get the CSV file\n", "\n", "First, copy and paste the following link into a new browser window to download the CSV file of the penguin data to your local machine:\n", "\n", "> http://storage.googleapis.com/cloud-samples-data/vertex-ai/bigframe/penguins.csv\n", "\n", "Next, upload the local CSV file to your notebook environment, using the relevant instructions for your environment:\n", "\n", "**Vertex AI Workbench or a local JupyterLab instance**\n", "\n", "1. Follow these [directions](https://jupyterlab.readthedocs.io/en/latest/user/files.html#uploading-and-downloading) to upload the file from your machine to your notebook environment by using the UI.\n", "2. Uncomment the next cell, set the variable `fn` to match the path to your file, and then run the cell." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "SvyXzkRl783u" }, "outputs": [], "source": [ "# BigQuery DataFrames can read directly from GCS.\n", "fn = 'gs://cloud-samples-data/vertex-ai/bigframe/penguins.csv'\n", "\n", "# Or from a local file.\n", "# fn = 'penguins.csv'" ] }, { "cell_type": "markdown", "metadata": { "id": "yqcuF1JNvFse" }, "source": [ "**Colab**\n", "\n", "Uncomment and run the following cell:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "3QHQYlnoBLpt" }, "outputs": [], "source": [ "# from google.colab import files\n", "# uploaded = files.upload()\n", "# for fn in uploaded.keys():\n", "# print('User uploaded file \"{name}\" with length {length} bytes'.format(\n", "# name=fn, length=len(uploaded[fn])))" ] }, { "cell_type": "markdown", "metadata": { "id": "sJsrwAQY_H6g" }, "source": [ "#### Create a DataFrame\n", "\n", "Create a BigQuery DataFrames DataFrame from the uploaded CSV file:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "EDAaIwHpQCDZ" }, "outputs": [], "source": [ "# If order is not important, use the \"bigquery\" engine to\n", "# allow BigQuery DataFrames to read directly from GCS.\n", "df_from_local = bpd.read_csv(fn, engine=\"bigquery\")" ] }, { "cell_type": "markdown", "metadata": { "id": "U-RVfNCu_h_h" }, "source": [ "Take a look at the rows randomly sampled from the DataFrame:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "id": "_gPD0Zn1Stdb" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesislandculmen_length_mmculmen_depth_mmflipper_length_mmbody_mass_gsex
41Gentoo penguin (Pygoscelis papua)Biscoe49.816.82305700MALE
73Gentoo penguin (Pygoscelis papua)Biscoe46.816.12155500MALE
75Gentoo penguin (Pygoscelis papua)Biscoe49.616.02255700MALE
93Adelie Penguin (Pygoscelis adeliae)Biscoe35.516.21953350FEMALE
299Chinstrap penguin (Pygoscelis antarctica)Dream52.018.12014050MALE
\n", "
" ], "text/plain": [ " species island culmen_length_mm \\\n", "41 Gentoo penguin (Pygoscelis papua) Biscoe 49.8 \n", "73 Gentoo penguin (Pygoscelis papua) Biscoe 46.8 \n", "75 Gentoo penguin (Pygoscelis papua) Biscoe 49.6 \n", "93 Adelie Penguin (Pygoscelis adeliae) Biscoe 35.5 \n", "299 Chinstrap penguin (Pygoscelis antarctica) Dream 52.0 \n", "\n", " culmen_depth_mm flipper_length_mm body_mass_g sex \n", "41 16.8 230 5700 MALE \n", "73 16.1 215 5500 MALE \n", "75 16.0 225 5700 MALE \n", "93 16.2 195 3350 FEMALE \n", "299 18.1 201 4050 MALE " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_from_local.peek()" ] }, { "cell_type": "markdown", "metadata": { "id": "rK0lNJmz_xkA" }, "source": [ "### Ingest data from a DataFrame to a BigQuery table\n", "\n", "BigQuery DataFrames lets you create a BigQuery table from a BigQuery DataFrames DataFrame on-the-fly." ] }, { "cell_type": "markdown", "metadata": { "id": "V1DWpmSCAEql" }, "source": [ "First, create a BigQuery dataset to house the table. Choose a name for your dataset, or keep the suggestion of `birds`." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "id": "ZSP7gt13QrQt" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset birds created.\n" ] } ], "source": [ "DATASET_ID = \"birds\" # @param {type:\"string\"}\n", "\n", "from google.cloud import bigquery\n", "client = bigquery.Client(project=PROJECT_ID)\n", "dataset = bigquery.Dataset(PROJECT_ID + \".\" + DATASET_ID)\n", "dataset.location = REGION\n", "dataset = client.create_dataset(dataset, exists_ok=True)\n", "print(f\"Dataset {dataset.dataset_id} created.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "Jd0dFISwAPPa" }, "source": [ "Next, use the `to_gbq` method to create a BigQuery table from the DataFrame:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "id": "oP1NIAmUBjop" }, "outputs": [ { "data": { "text/plain": [ "'bigframes-dev.birds.penguins'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_from_local.to_gbq(\n", " f\"{PROJECT_ID}.{DATASET_ID}.penguins\",\n", " if_exists=\"replace\",\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "kfF6fnmmAZEK" }, "source": [ "### Create a DataFrame from BigQuery data\n", "You can create a BigQuery DataFrames DataFrame from a BigQuery table by using the `read_gbq` method and referencing either an entire table or a SQL query." ] }, { "cell_type": "markdown", "metadata": { "id": "TEy5jHJDD6hx" }, "source": [ "Create a BigQuery DataFrames DataFrame from the BigQuery table you created in the previous section, and view a few rows:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "id": "IBuo-d6dWfsA" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesislandculmen_length_mmculmen_depth_mmflipper_length_mmbody_mass_gsex
79Gentoo penguin (Pygoscelis papua)Biscoe43.314.02084575FEMALE
118Adelie Penguin (Pygoscelis adeliae)Biscoe40.618.61833550MALE
213Adelie Penguin (Pygoscelis adeliae)Torgersen42.119.11954000MALE
315Adelie Penguin (Pygoscelis adeliae)Torgersen38.719.01953450FEMALE
338Chinstrap penguin (Pygoscelis antarctica)Dream40.916.61873200FEMALE
\n", "
" ], "text/plain": [ " species island culmen_length_mm \\\n", "79 Gentoo penguin (Pygoscelis papua) Biscoe 43.3 \n", "118 Adelie Penguin (Pygoscelis adeliae) Biscoe 40.6 \n", "213 Adelie Penguin (Pygoscelis adeliae) Torgersen 42.1 \n", "315 Adelie Penguin (Pygoscelis adeliae) Torgersen 38.7 \n", "338 Chinstrap penguin (Pygoscelis antarctica) Dream 40.9 \n", "\n", " culmen_depth_mm flipper_length_mm body_mass_g sex \n", "79 14.0 208 4575 FEMALE \n", "118 18.6 183 3550 MALE \n", "213 19.1 195 4000 MALE \n", "315 19.0 195 3450 FEMALE \n", "338 16.6 187 3200 FEMALE " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_or_table = f\"{PROJECT_ID}.{DATASET_ID}.penguins\"\n", "bq_df = bpd.read_gbq(query_or_table)\n", "bq_df.peek()" ] }, { "cell_type": "markdown", "metadata": { "id": "rwPLjqW2Ajzh" }, "source": [ "## Inspect and manipulate data in BigQuery DataFrames" ] }, { "cell_type": "markdown", "metadata": { "id": "bExmYlL_ELtV" }, "source": [ "### Using pandas\n", "\n", "You can use pandas as you normally would on the BigQuery DataFrames DataFrame, but calculations happen in the BigQuery query engine instead of your local environment. There are 150+ pandas functions supported in BigQuery DataFrames. You can view the list in [the documentation](https://cloud.google.com/python/docs/reference/bigframes/latest)." ] }, { "cell_type": "markdown", "metadata": { "id": "ZHFUc3Q_FHc1" }, "source": [ "To see this in action, inspect one of the columns (or series) of the BigQuery DataFrames DataFrame:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "id": "6i6HkFJZa8na" }, "outputs": [ { "data": { "text/plain": [ "133 \n", "279 3150\n", "34 3400\n", "96 3600\n", "208 3950\n", "18 3800\n", "64 2850\n", "310 3175\n", "118 3550\n", "2 3075\n", "Name: body_mass_g, dtype: Int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bq_df[\"body_mass_g\"].peek(10)" ] }, { "cell_type": "markdown", "metadata": { "id": "EJIZJaNXFQzh" }, "source": [ "Compute the mean of this series:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "id": "YKwCW7Nsavap" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "average_body_mass: 4201.754385964914\n" ] } ], "source": [ "average_body_mass = bq_df[\"body_mass_g\"].mean()\n", "print(f\"average_body_mass: {average_body_mass}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "DSs1cnca-MOU" }, "source": [ "Calculate the mean `body_mass_g` by `species` using the `groupby` operation:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "id": "4PyKMR61-Mjy" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
body_mass_g
species
Adelie Penguin (Pygoscelis adeliae)3700.662252
Chinstrap penguin (Pygoscelis antarctica)3733.088235
Gentoo penguin (Pygoscelis papua)5076.01626
\n", "

3 rows × 1 columns

\n", "
[3 rows x 1 columns in total]" ], "text/plain": [ " body_mass_g\n", "species \n", "Adelie Penguin (Pygoscelis adeliae) 3700.662252\n", "Chinstrap penguin (Pygoscelis antarctica) 3733.088235\n", "Gentoo penguin (Pygoscelis papua) 5076.01626\n", "\n", "[3 rows x 1 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bq_df[[\"species\", \"body_mass_g\"]].groupby(by=bq_df[\"species\"]).mean(numeric_only=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "6sf9kZ2C9Ixe" }, "source": [ "You can confirm that the calculations were run in BigQuery by clicking \"Open job\" from the previous cells' output. This takes you to the BigQuery console to view the SQL statement and job details." ] }, { "cell_type": "markdown", "metadata": { "id": "cWVNZ8D_FUtT" }, "source": [ "### Using custom functions\n", "\n", "Running your own Python functions (or being able to bring your packages) and using them at scale is a challenge many data scientists face. BigQuery DataFrames makes it easy to deploy [remote functions](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.pandas#bigframes_pandas_remote_function) that run scalar Python functions at BigQuery scale. These functions are persisted as [BigQuery remote functions](https://cloud.google.com/bigquery/docs/remote-functions) that you can then re-use." ] }, { "cell_type": "markdown", "metadata": { "id": "zjw8toUbHuRD" }, "source": [ "Running the cell below creates a custom function using the `remote_function` method. This function categorizes a value into one of two buckets: >= 3500 or <3500.\n", "\n", "> Note: Creating a function requires a [BigQuery connection](https://cloud.google.com/bigquery/docs/remote-functions#create_a_remote_function). This code assumes a pre-created connection named `bigframes-default-connection`. If\n", "the connection is not already created, BigQuery DataFrames attempts to create one assuming the [necessary APIs\n", "and IAM permissions](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.pandas#bigframes_pandas_remote_function) are set up in the project.\n", "\n", "This cell takes a few minutes to run because it creates the BigQuery connection (if applicable) and deploys the Cloud Function." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "id": "rSWTOG-vb2Fc" }, "outputs": [], "source": [ "@bpd.remote_function(cloud_function_service_account=\"default\")\n", "def get_bucket(num: float) -> str:\n", " if not num: return \"NA\"\n", " boundary = 3500\n", " return \"at_or_above_3500\" if num >= boundary else \"below_3500\"" ] }, { "cell_type": "markdown", "metadata": { "id": "N7JH0BI5IOpK" }, "source": [ "The custom function is deployed as a Cloud Function, and is then integrated with BigQuery as a remote function.\n", "\n", "Save both of the function names so that you can clean them up at the end of this notebook." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "id": "6ejPXoyEQpWE" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cloud Function Name projects/bigframes-dev/locations/us-central1/functions/bigframes-sessioncf7a5d-aa59468b9d6c757c1256e46c9f71ebe3\n", "Remote Function Name bigframes-dev._63cfa399614a54153cc386c27d6c0c6fdb249f9e.bigframes_sessioncf7a5d_aa59468b9d6c757c1256e46c9f71ebe3\n" ] } ], "source": [ "CLOUD_FUNCTION_NAME = format(get_bucket.bigframes_cloud_function)\n", "print(\"Cloud Function Name \" + CLOUD_FUNCTION_NAME)\n", "REMOTE_FUNCTION_NAME = format(get_bucket.bigframes_remote_function)\n", "print(\"Remote Function Name \" + REMOTE_FUNCTION_NAME)" ] }, { "cell_type": "markdown", "metadata": { "id": "vHV3JqKjJHsH" }, "source": [ "Apply the custom function to the BigQuery DataFrames DataFrame to bucketize the `body_mass_g` value of the penguins:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "id": "NxSd9WZFcIji" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
body_mass_gbody_mass_bucket
133<NA>NA
2793150below_3500
343400below_3500
963600at_or_above_3500
2083950at_or_above_3500
183800at_or_above_3500
642850below_3500
3103175below_3500
1183550at_or_above_3500
23075below_3500
\n", "
" ], "text/plain": [ " body_mass_g body_mass_bucket\n", "133 NA\n", "279 3150 below_3500\n", "34 3400 below_3500\n", "96 3600 at_or_above_3500\n", "208 3950 at_or_above_3500\n", "18 3800 at_or_above_3500\n", "64 2850 below_3500\n", "310 3175 below_3500\n", "118 3550 at_or_above_3500\n", "2 3075 below_3500" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bq_df = bq_df.assign(body_mass_bucket=bq_df['body_mass_g'].apply(get_bucket))\n", "bq_df[['body_mass_g', 'body_mass_bucket']].peek(10)" ] }, { "cell_type": "markdown", "metadata": { "id": "wCsmt0IwFkDy" }, "source": [ "## Summary and next steps\n", "\n", "You've created BigQuery DataFrames DataFrames, and inspected and manipulated data with pandas and custom remote functions at BigQuery scale and speed.\n", "\n", "Learn more about BigQuery DataFrames in the [documentation](https://cloud.google.com/python/docs/reference/bigframes/latest) and find more sample notebooks in the [GitHub repo](https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks), including an introductory notebook for `bigframes.ml`." ] }, { "cell_type": "markdown", "metadata": { "id": "TpV-iwP9qw9c" }, "source": [ "### Cleaning up\n", "\n", "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n", "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", "\n", "Otherwise, you can uncomment the remaining cells and run them to delete the individual resources you created in this tutorial:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# Delete the temporary cloud artifacts created during the bigframes session \n", "bpd.close_session()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "id": "sx_vKniMq9ZX" }, "outputs": [], "source": [ "# # Delete the BigQuery dataset\n", "# from google.cloud import bigquery\n", "# client = bigquery.Client(project=PROJECT_ID)\n", "# client.delete_dataset(\n", "# DATASET_ID, delete_contents=True, not_found_ok=True\n", "# )\n", "# print(\"Deleted dataset '{}'.\".format(DATASET_ID))" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "id": "_dTCXvCxtPw9" }, "outputs": [], "source": [ "# # Delete the BigQuery Connection\n", "# from google.cloud import bigquery_connection_v1 as bq_connection\n", "# client = bq_connection.ConnectionServiceClient()\n", "# CONNECTION_ID = f\"projects/{PROJECT_ID}/locations/{REGION}/connections/bigframes-default-connection\"\n", "# client.delete_connection(name=CONNECTION_ID)\n", "# print(\"Deleted connection '{}'.\".format(CONNECTION_ID))" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "id": "EDAIIfcpwNOF" }, "outputs": [], "source": [ "# # Delete the Cloud Function\n", "# ! gcloud functions delete {CLOUD_FUNCTION_NAME} --quiet" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "id": "QwumLUKmVpuH" }, "outputs": [], "source": [ "# # Delete the Remote Function\n", "# REMOTE_FUNCTION_NAME = REMOTE_FUNCTION_NAME.replace(PROJECT_ID + \".\", \"\")\n", "# ! bq rm --routine --force=true {REMOTE_FUNCTION_NAME}" ] } ], "metadata": { "colab": { "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 0 }