{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Copyright 2025 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "YOrUAvz6DMw-" }, "source": [ "# BigFrames Multimodal DataFrame\n", "\n", "\n", "\n", " \n", " \n", " \n", "
\n", " \n", " \"Colab Run in Colab\n", " \n", " \n", " \n", " \"GitHub\n", " View on GitHub\n", " \n", " \n", " \n", " \"BQ\n", " Open in BQ Studio\n", " \n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is introducing BigFrames Multimodal features:\n", "1. Create Multimodal DataFrame\n", "2. Combine unstructured data with structured data\n", "3. Conduct image transformations\n", "4. Use LLM models to ask questions and generate embeddings on images\n", "5. PDF chunking function\n", "6. Transcribe audio\n", "7. Extract EXIF metadata from images" ] }, { "cell_type": "markdown", "metadata": { "id": "PEAJQQ6AFg-n" }, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Install the latest bigframes package if bigframes version < 2.4.0" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# !pip install bigframes --upgrade" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bGyhLnfEeB0X", "outputId": "83ac8b64-3f44-4d43-d089-28a5026cbb42" }, "outputs": [], "source": [ "PROJECT = \"bigframes-dev\" # replace with your project. \n", "# Refer to https://cloud.google.com/bigquery/docs/multimodal-data-dataframes-tutorial#required_roles for your required permissions\n", "\n", "LOCATION = \"us\" # replace with your location.\n", "\n", "# Dataset where the UDF will be created.\n", "DATASET_ID = \"bigframes_samples\" # replace with your dataset ID.\n", "\n", "OUTPUT_BUCKET = \"bigframes_blob_test\" # replace with your GCS bucket. \n", "# The connection (or bigframes-default-connection of the project) must have read/write permission to the bucket. \n", "# Refer to https://cloud.google.com/bigquery/docs/multimodal-data-dataframes-tutorial#grant-permissions for setting up connection service account permissions.\n", "# In this Notebook it uses bigframes-default-connection by default. 
You can also pass your own connection to each method.\n", "\n", "import bigframes\n", "# Setup project\n", "bigframes.options.bigquery.project = PROJECT\n", "bigframes.options.bigquery.location = LOCATION\n", "\n", "# Display options\n", "bigframes.options.display.blob_display_width = 300\n", "bigframes.options.display.progress_bar = None\n", "\n", "import bigframes.pandas as bpd\n", "import bigframes.bigquery as bbq" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import bigframes.bigquery as bbq\n", "\n", "def get_runtime_json_str(series, mode=\"R\", with_metadata=False):\n", " \"\"\"\n", " Get the runtime object (which contains a signed URL to access the GCS data)\n", " and apply the to_json_string transformation.\n", " \n", " Args:\n", " series: bigframes.series.Series to operate on.\n", " mode: \"R\" for read, \"RW\" for read/write.\n", " with_metadata: Whether to fetch and include blob metadata.\n", " \"\"\"\n", " # 1. Optionally fetch metadata\n", " s = (\n", " bbq.obj.fetch_metadata(series)\n", " if with_metadata\n", " else series\n", " )\n", " \n", " # 2. Retrieve the access URL runtime object\n", " runtime = bbq.obj.get_access_url(s, mode=mode)\n", " \n", " # 3. 
Convert the runtime object to a JSON string\n", " return bbq.to_json_string(runtime)\n", "\n", "def get_metadata(series):\n", " # Fetch metadata and extract GCS metadata from the details JSON field\n", " metadata_obj = bbq.obj.fetch_metadata(series)\n", " return bbq.json_query(metadata_obj.struct.field(\"details\"), \"$.gcs_metadata\")\n", "\n", "def get_content_type(series):\n", " return bbq.json_value(get_metadata(series), \"$.content_type\")\n", "\n", "def get_size(series):\n", " return bbq.json_value(get_metadata(series), \"$.size\").astype(\"Int64\")\n", "\n", "def get_updated(series):\n", " return bpd.to_datetime(bbq.json_value(get_metadata(series), \"$.updated\").astype(\"Int64\"), unit=\"us\", utc=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "ifKOq7VZGtZy" }, "source": [ "### 1. Create Multimodal DataFrame\n\nThere are several ways to create a multimodal DataFrame. The easiest is from wildcard paths." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "fx6YcZJbeYru", "outputId": "d707954a-0dd0-4c50-b7bf-36b140cf76cf" }, "outputs": [], "source": [ "# Create blob columns from wildcard path.\n", "df_image = bpd.from_glob_path(\n", " \"gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/*\", name=\"image\"\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 487 }, "id": "HhCb8jRsLe9B", "outputId": "03081cf9-3a22-42c9-b38f-649f592fdada" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/dtypes.py:990: JSONDtypeWarning: JSON columns will be represented as pandas.ArrowDtype(pyarrow.json_())\n", "instead of using `db_dtypes` in the future when available in pandas\n", "(https://github.com/pandas-dev/pandas/issues/60958) and pyarrow.\n", " warnings.warn(msg, 
bigframes.exceptions.JSONDtypeWarning)\n", "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/core/logging/log_adapter.py:229: ApiDeprecationWarning: The blob accessor is deprecated and will be removed in a future release. Use bigframes.bigquery.obj functions instead.\n", " return prop(*args, **kwargs)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
image
0
1
2
3
4
\n", "

5 rows × 1 columns

\n", "
[5 rows x 1 columns in total]" ], "text/plain": [ " image\n", "0 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3...\n", "1 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3...\n", "2 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3...\n", "3 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3...\n", "4 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3...\n", "\n", "[5 rows x 1 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keep only the first 5 images and preview the contents of the multimodal DataFrame\n", "df_image = df_image.head(5)\n", "df_image" ] }, { "cell_type": "markdown", "metadata": { "id": "b6RRZb3qPi_T" }, "source": [ "### 2. Combine unstructured data with structured data" ] }, { "cell_type": "markdown", "metadata": { "id": "4YJCdmLtR-qu" }, "source": [ "Now you can add more information to the table to describe the files, such as author info from your inputs, or other metadata from the GCS objects themselves." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "YYYVn7NDH0Me" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/dtypes.py:990: JSONDtypeWarning: JSON columns will be represented as pandas.ArrowDtype(pyarrow.json_())\n", "instead of using `db_dtypes` in the future when available in pandas\n", "(https://github.com/pandas-dev/pandas/issues/60958) and pyarrow.\n", " warnings.warn(msg, bigframes.exceptions.JSONDtypeWarning)\n", "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/core/logging/log_adapter.py:229: ApiDeprecationWarning: The blob accessor is deprecated and will be removed in a future release. Use bigframes.bigquery.obj functions instead.\n", " return prop(*args, **kwargs)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
imageauthorcontent_typesizeupdated
0aliceimage/png15912402025-03-20 17:45:04+00:00
1bobimage/png11829512025-03-20 17:45:02+00:00
2bobimage/png15208842025-03-20 17:44:55+00:00
3aliceimage/png12354012025-03-20 17:45:19+00:00
4bobimage/png15919232025-03-20 17:44:47+00:00
\n", "

5 rows × 5 columns

\n", "
[5 rows x 5 columns in total]" ], "text/plain": [ " image author content_type \\\n", "0 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3... alice image/png \n", "1 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3... bob image/png \n", "2 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3... bob image/png \n", "3 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3... alice image/png \n", "4 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3... bob image/png \n", "\n", " size updated \n", "0 1591240 2025-03-20 17:45:04+00:00 \n", "1 1182951 2025-03-20 17:45:02+00:00 \n", "2 1520884 2025-03-20 17:44:55+00:00 \n", "3 1235401 2025-03-20 17:45:19+00:00 \n", "4 1591923 2025-03-20 17:44:47+00:00 \n", "\n", "[5 rows x 5 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Combine unstructured data with structured data\n", "df_image = df_image.head(5)\n", "df_image[\"author\"] = [\"alice\", \"bob\", \"bob\", \"alice\", \"bob\"] # type: ignore\n", "df_image[\"content_type\"] = get_content_type(df_image[\"image\"])\n", "df_image[\"size\"] = get_size(df_image[\"image\"])\n", "df_image[\"updated\"] = get_updated(df_image[\"image\"])\n", "df_image" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Conduct image transformations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section demonstrates how to perform image transformations like blur, resize, and normalize using custom BigQuery Python UDFs and the `opencv-python` library." 
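, "\n", "\n", "The UDF below receives each blob's runtime as a JSON string containing signed access URLs and the object URI. A minimal sketch of parsing that payload (the field values here are hypothetical placeholders, not real signed URLs):\n", "\n", "```python\n", "import json\n", "\n", "# Hypothetical runtime payload (shape only; real URLs are signed GCS URLs)\n", "src_rt = json.dumps({\n", "    \"access_urls\": {\"read_url\": \"https://example/read\", \"write_url\": \"https://example/write\"},\n", "    \"objectref\": {\"uri\": \"gs://bucket/images/pet.png\"},\n", "})\n", "\n", "obj = json.loads(src_rt)\n", "read_url = obj[\"access_urls\"][\"read_url\"]\n", "uri = obj[\"objectref\"][\"uri\"]\n", "```"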
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 487 }, "id": "HhCb8jRsLe9B", "outputId": "03081cf9-3a22-42c9-b38f-649f592fdada" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/pandas/__init__.py:151: PreviewWarning: udf is in preview.\n", " return global_session.with_default_session(\n", "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/dataframe.py:4655: FunctionAxisOnePreviewWarning: DataFrame.apply with parameter axis=1 scenario is in preview.\n", " warnings.warn(msg, category=bfe.FunctionAxisOnePreviewWarning)\n", "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/dtypes.py:990: JSONDtypeWarning: JSON columns will be represented as pandas.ArrowDtype(pyarrow.json_())\n", "instead of using `db_dtypes` in the future when available in pandas\n", "(https://github.com/pandas-dev/pandas/issues/60958) and pyarrow.\n", " warnings.warn(msg, bigframes.exceptions.JSONDtypeWarning)\n", "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/core/logging/log_adapter.py:229: ApiDeprecationWarning: The blob accessor is deprecated and will be removed in a future release. Use bigframes.bigquery.obj functions instead.\n", " return prop(*args, **kwargs)\n", "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/core/logging/log_adapter.py:229: ApiDeprecationWarning: The blob accessor is deprecated and will be removed in a future release. Use bigframes.bigquery.obj functions instead.\n", " return prop(*args, **kwargs)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
imageblurred
0
1
2
3
4
\n", "

5 rows × 2 columns

\n", "
[5 rows x 2 columns in total]" ], "text/plain": [ " image \\\n", "0 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3... \n", "1 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3... \n", "2 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3... \n", "3 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3... \n", "4 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3... \n", "\n", " blurred \n", "0 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3... \n", "1 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3... \n", "2 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3... \n", "3 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3... \n", "4 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:3... \n", "\n", "[5 rows x 2 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Construct the canonical connection ID\n", "FULL_CONNECTION_ID = f\"{PROJECT}.{LOCATION}.bigframes-default-connection\"\n", "\n", "@bpd.udf(\n", " input_types=[str, str, int, int],\n", " output_type=str,\n", " dataset=DATASET_ID,\n", " name=\"image_blur\",\n", " bigquery_connection=FULL_CONNECTION_ID,\n", " packages=[\"opencv-python\", \"numpy\", \"requests\"],\n", ")\n", "def image_blur(src_rt: str, dst_rt: str, kx: int, ky: int) -> str:\n", " import json\n", " import cv2 as cv\n", " import numpy as np\n", " import requests\n", " import base64\n", "\n", " src_obj = json.loads(src_rt)\n", " src_url = src_obj[\"access_urls\"][\"read_url\"]\n", " \n", " response = requests.get(src_url, timeout=30)\n", " response.raise_for_status()\n", " \n", " img = cv.imdecode(np.frombuffer(response.content, np.uint8), cv.IMREAD_UNCHANGED)\n", " if img is None:\n", " raise ValueError(\"cv.imdecode failed\")\n", " \n", " kx, ky = int(kx), int(ky)\n", " img_blurred = cv.blur(img, ksize=(kx, ky))\n", " \n", " success, encoded = cv.imencode(\".jpeg\", img_blurred)\n", " if not success:\n", " raise ValueError(\"cv.imencode failed\")\n", " \n", " # Handle two output 
modes\n", " if dst_rt: # GCS/Series output mode\n", " dst_obj = json.loads(dst_rt)\n", " dst_url = dst_obj[\"access_urls\"][\"write_url\"]\n", " \n", " requests.put(dst_url, data=encoded.tobytes(), headers={\"Content-Type\": \"image/jpeg\"}, timeout=30).raise_for_status()\n", " \n", " uri = dst_obj[\"objectref\"][\"uri\"]\n", " return uri\n", " \n", " else: # BigQuery bytes output mode \n", " image_bytes = encoded.tobytes()\n", " return base64.b64encode(image_bytes).decode()\n", "\n", "def apply_transformation(series, dst_folder, udf, *args, verbose=False):\n", " import os\n", " dst_folder = os.path.join(dst_folder, \"\")\n", " # Fetch metadata to get the URI\n", " metadata = bbq.obj.fetch_metadata(series)\n", " current_uri = metadata.struct.field(\"uri\")\n", " dst_uri = current_uri.str.replace(r\"^.*\\/(.*)$\", rf\"{dst_folder}\\1\", regex=True)\n", " dst_blob = dst_uri.str.to_blob(connection=FULL_CONNECTION_ID)\n", " df_transform = bpd.DataFrame({\n", " \"src_rt\": get_runtime_json_str(series, mode=\"R\"),\n", " \"dst_rt\": get_runtime_json_str(dst_blob, mode=\"RW\"),\n", " })\n", " res = df_transform[[\"src_rt\", \"dst_rt\"]].apply(\n", " udf, axis=1, args=args\n", " )\n", " return res if verbose else res.str.to_blob(connection=FULL_CONNECTION_ID)\n", "\n", "# Apply transformations\n", "df_image[\"blurred\"] = apply_transformation(\n", " df_image[\"image\"], f\"gs://{OUTPUT_BUCKET}/image_blur_transformed/\",\n", " image_blur, 20, 20\n", ")\n", "df_image[[\"image\", \"blurred\"]]" ] }, { "cell_type": "markdown", "metadata": { "id": "Euk5saeVVdTP" }, "source": [ "### 4. 
Use LLM models to ask questions and generate embeddings on images" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "mRUGfcaFVW-3" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/core/logging/log_adapter.py:183: FutureWarning: Since upgrading the default model can cause unintended breakages, the\n", "default model will be removed in BigFrames 3.0. Please supply an\n", "explicit model to avoid this message.\n", " return method(*args, **kwargs)\n" ] } ], "source": [ "from bigframes.ml import llm\n", "gemini = llm.GeminiTextGenerator()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 657 }, "id": "DNFP7CbjWdR9", "outputId": "3f90a062-0abc-4bce-f53c-db57b06a14b9" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/dtypes.py:990: JSONDtypeWarning: JSON columns will be represented as pandas.ArrowDtype(pyarrow.json_())\n", "instead of using `db_dtypes` in the future when available in pandas\n", "(https://github.com/pandas-dev/pandas/issues/60958) and pyarrow.\n", " warnings.warn(msg, bigframes.exceptions.JSONDtypeWarning)\n", "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/core/logging/log_adapter.py:229: ApiDeprecationWarning: The blob accessor is deprecated and will be removed in a future release. 
Use bigframes.bigquery.obj functions instead.\n", " return prop(*args, **kwargs)\n", "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/dtypes.py:990: JSONDtypeWarning: JSON columns will be represented as pandas.ArrowDtype(pyarrow.json_())\n", "instead of using `db_dtypes` in the future when available in pandas\n", "(https://github.com/pandas-dev/pandas/issues/60958) and pyarrow.\n", " warnings.warn(msg, bigframes.exceptions.JSONDtypeWarning)\n", "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/core/logging/log_adapter.py:229: ApiDeprecationWarning: The blob accessor is deprecated and will be removed in a future release. Use bigframes.bigquery.obj functions instead.\n", " return prop(*args, **kwargs)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ml_generate_text_llm_resultimage
0The item is a container of K9 Guard Dog Paw Balm.
1The item is K9 Guard Dog Hot Spot Spray.
2The image contains three bags of food, likely for small animals like rabbits or guinea pigs. They are labeled \"Timoth Hay Lend Variety Plend\", \"Herbal Greeıs Mix Variety Blend\", and \"Berry & Blossom Treat Blend\", all under the brand \"Fluffy Buns.\" The bags are yellow, green, and purple, respectively. Each bag has a pile of its contents beneath it.
3The item is a cat tree.\\n
4The item is a bag of bird seed. Specifically, it's labeled \"Chirpy Seed\", \"Deluxe Bird Food\".\\n
\n", "

5 rows × 2 columns

\n", "
[5 rows x 2 columns in total]" ], "text/plain": [ " ml_generate_text_llm_result \\\n", "0 The item is a container of K9 Guard Dog Paw Balm. \n", "1 The item is K9 Guard Dog Hot Spot Spray. \n", "2 The image contains three bags of food, likely ... \n", "3 The item is a cat tree.\\n \n", "4 The item is a bag of bird seed. Specifically, ... \n", "\n", " image \n", "0 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4... \n", "1 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4... \n", "2 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4... \n", "3 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4... \n", "4 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4... \n", "\n", "[5 rows x 2 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Ask the same question on the images\n", "answer = gemini.predict(df_image, prompt=[\"what item is it?\", df_image[\"image\"]])\n", "answer[[\"ml_generate_text_llm_result\", \"image\"]]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "IG3J3HsKhyBY" }, "outputs": [], "source": [ "# Ask different questions\n", "df_image[\"question\"] = [\n", " \"what item is it?\",\n", " \"what color is the picture?\",\n", " \"what is the product name?\",\n", " \"is it for pets?\",\n", " \"what is the weight of the product?\",\n", "]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 657 }, "id": "qKOb765IiVuD", "outputId": "731bafad-ea29-463f-c8c1-cb7acfd70e5d" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/dtypes.py:990: JSONDtypeWarning: JSON columns will be represented as pandas.ArrowDtype(pyarrow.json_())\n", "instead of using `db_dtypes` in the future when available in pandas\n", "(https://github.com/pandas-dev/pandas/issues/60958) and pyarrow.\n", " warnings.warn(msg, 
bigframes.exceptions.JSONDtypeWarning)\n", "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/core/logging/log_adapter.py:229: ApiDeprecationWarning: The blob accessor is deprecated and will be removed in a future release. Use bigframes.bigquery.obj functions instead.\n", " return prop(*args, **kwargs)\n", "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/dtypes.py:990: JSONDtypeWarning: JSON columns will be represented as pandas.ArrowDtype(pyarrow.json_())\n", "instead of using `db_dtypes` in the future when available in pandas\n", "(https://github.com/pandas-dev/pandas/issues/60958) and pyarrow.\n", " warnings.warn(msg, bigframes.exceptions.JSONDtypeWarning)\n", "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/core/logging/log_adapter.py:229: ApiDeprecationWarning: The blob accessor is deprecated and will be removed in a future release. Use bigframes.bigquery.obj functions instead.\n", " return prop(*args, **kwargs)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ml_generate_text_llm_resultimage
0The item is a container of Dog Paw Balm.
1The picture contains many colors, including white, black, green, and a bright blue. The product label predominantly features a bright blue hue. The background is a solid gray.
2Here are the product names from the image:\\n\\n* **Timoth Hay Lend Variety Plend** is the product in the yellow bag.\\n* **Herbal Greeıs Mix Variety Blend** is the product in the green bag.\\n* **Berry & Blossom Treat Blend** is the product in the purple bag.
3Yes, it is for pets. It appears to be a cat tree or scratching post.\\n
4The image shows that the weight of the product is 15 oz/ 257g.
\n", "

5 rows × 2 columns

\n", "
[5 rows x 2 columns in total]" ], "text/plain": [ " ml_generate_text_llm_result \\\n", "0 The item is a container of Dog Paw Balm. \n", "1 The picture contains many colors, including wh... \n", "2 Here are the product names from the image:\\n\\n... \n", "3 Yes, it is for pets. It appears to be a cat tr... \n", "4 The image shows that the weight of the product... \n", "\n", " image \n", "0 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4... \n", "1 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4... \n", "2 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4... \n", "3 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4... \n", "4 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4... \n", "\n", "[5 rows x 2 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "answer_alt = gemini.predict(df_image, prompt=[df_image[\"question\"], df_image[\"image\"]])\n", "answer_alt[[\"ml_generate_text_llm_result\", \"image\"]]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 300 }, "id": "KATVv2CO5RT1", "outputId": "6ec01f27-70b6-4f69-c545-e5e3c879480c" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/core/logging/log_adapter.py:183: FutureWarning: Since upgrading the default model can cause unintended breakages, the\n", "default model will be removed in BigFrames 3.0. 
Please supply an\n", "explicit model to avoid this message.\n", " return method(*args, **kwargs)\n", "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/dtypes.py:990: JSONDtypeWarning: JSON columns will be represented as pandas.ArrowDtype(pyarrow.json_())\n", "instead of using `db_dtypes` in the future when available in pandas\n", "(https://github.com/pandas-dev/pandas/issues/60958) and pyarrow.\n", " warnings.warn(msg, bigframes.exceptions.JSONDtypeWarning)\n", "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/core/logging/log_adapter.py:229: ApiDeprecationWarning: The blob accessor is deprecated and will be removed in a future release. Use bigframes.bigquery.obj functions instead.\n", " return prop(*args, **kwargs)\n", "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/dtypes.py:990: JSONDtypeWarning: JSON columns will be represented as pandas.ArrowDtype(pyarrow.json_())\n", "instead of using `db_dtypes` in the future when available in pandas\n", "(https://github.com/pandas-dev/pandas/issues/60958) and pyarrow.\n", " warnings.warn(msg, bigframes.exceptions.JSONDtypeWarning)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ml_generate_embedding_resultml_generate_embedding_statusml_generate_embedding_start_secml_generate_embedding_end_seccontent
0[ 0.00638822 0.01666385 0.00451817 ... -0.02...<NA><NA>{\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4...
1[ 0.00973976 0.02148137 0.0024429 ... 0.00...<NA><NA>{\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4...
2[ 0.01195884 0.02139394 0.05968047 ... -0.01...<NA><NA>{\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4...
3[-0.02621161 0.02797648 0.04416926 ... -0.01...<NA><NA>{\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4...
4[ 0.05918628 0.0125137 0.01907336 ... 0.01...<NA><NA>{\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4...
\n", "

5 rows × 5 columns

\n", "
[5 rows x 5 columns in total]" ], "text/plain": [ " ml_generate_embedding_result \\\n", "0 [ 0.00638822 0.01666385 0.00451817 ... -0.02... \n", "1 [ 0.00973976 0.02148137 0.0024429 ... 0.00... \n", "2 [ 0.01195884 0.02139394 0.05968047 ... -0.01... \n", "3 [-0.02621161 0.02797648 0.04416926 ... -0.01... \n", "4 [ 0.05918628 0.0125137 0.01907336 ... 0.01... \n", "\n", " ml_generate_embedding_status ml_generate_embedding_start_sec \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "\n", " ml_generate_embedding_end_sec \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "\n", " content \n", "0 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4... \n", "1 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4... \n", "2 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4... \n", "3 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4... \n", "4 {\"access_urls\":{\"expiry_time\":\"2026-02-21T01:4... \n", "\n", "[5 rows x 5 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Generate embeddings.\n", "embed_model = llm.MultimodalEmbeddingGenerator()\n", "embeddings = embed_model.predict(df_image[\"image\"])\n", "embeddings" ] }, { "cell_type": "markdown", "metadata": { "id": "iRUi8AjG7cIf" }, "source": [ "### 5. PDF extraction and chunking function\n", "\n", "This section demonstrates how to extract text and chunk text from PDF files using custom BigQuery Python UDFs and the `pypdf` library." 
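, "\n", "\n", "The chunking approach can be sketched on a plain string, independent of GCS and BigQuery (a simplified local illustration; here the trailing characters of each emitted chunk are carried into the next chunk as overlap):\n", "\n", "```python\n", "def chunk_text(text: str, chunk_size: int, overlap_size: int) -> list[str]:\n", "    # Split at the last space before chunk_size and carry overlap_size\n", "    # trailing characters of each emitted chunk into the next one.\n", "    chunks, curr = [], text\n", "    while len(curr) >= chunk_size:\n", "        split_idx = curr.rfind(\" \", 0, chunk_size)\n", "        if split_idx == -1:\n", "            split_idx = chunk_size\n", "        chunk = curr[:split_idx]\n", "        chunks.append(chunk)\n", "        overlap = chunk[-overlap_size:] if overlap_size > 0 else \"\"\n", "        curr = overlap + curr[split_idx + 1 :]\n", "    if curr:\n", "        chunks.append(curr)\n", "    return chunks\n", "```"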
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/pandas/__init__.py:151: PreviewWarning: udf is in preview.\n", " return global_session.with_default_session(\n" ] } ], "source": [ "# Construct the canonical connection ID\n", "FULL_CONNECTION_ID = f\"{PROJECT}.{LOCATION}.bigframes-default-connection\"\n", "\n", "@bpd.udf(\n", " input_types=[str],\n", " output_type=str,\n", " dataset=DATASET_ID,\n", " name=\"pdf_extract\",\n", " bigquery_connection=FULL_CONNECTION_ID,\n", " packages=[\"pypdf\", \"requests\", \"cryptography\"],\n", ")\n", "def pdf_extract(src_obj_ref_rt: str) -> str:\n", " import io\n", " import json\n", " from pypdf import PdfReader\n", " import requests\n", " src_obj_ref_rt_json = json.loads(src_obj_ref_rt)\n", " src_url = src_obj_ref_rt_json[\"access_urls\"][\"read_url\"]\n", " response = requests.get(src_url, timeout=30, stream=True)\n", " response.raise_for_status()\n", " pdf_bytes = response.content\n", " pdf_file = io.BytesIO(pdf_bytes)\n", " reader = PdfReader(pdf_file, strict=False)\n", " all_text = \"\"\n", " for page in reader.pages:\n", " page_extract_text = page.extract_text()\n", " if page_extract_text:\n", " all_text += page_extract_text\n", " return all_text\n", "\n", "@bpd.udf(\n", " input_types=[str, int, int],\n", " output_type=list[str],\n", " dataset=DATASET_ID,\n", " name=\"pdf_chunk\",\n", " bigquery_connection=FULL_CONNECTION_ID,\n", " packages=[\"pypdf\", \"requests\", \"cryptography\"],\n", ")\n", "def pdf_chunk(src_obj_ref_rt: str, chunk_size: int, overlap_size: int) -> list[str]:\n", " import io\n", " import json\n", " from pypdf import PdfReader\n", " import requests\n", " src_obj_ref_rt_json = json.loads(src_obj_ref_rt)\n", " src_url = src_obj_ref_rt_json[\"access_urls\"][\"read_url\"]\n", " response = requests.get(src_url, timeout=30, stream=True)\n", " 
response.raise_for_status()\n", " pdf_bytes = response.content\n", " pdf_file = io.BytesIO(pdf_bytes)\n", " reader = PdfReader(pdf_file, strict=False)\n", " all_text_chunks = []\n", " curr_chunk = \"\"\n", " for page in reader.pages:\n", " page_text = page.extract_text()\n", " if page_text:\n", " curr_chunk += page_text\n", " while len(curr_chunk) >= chunk_size:\n", " split_idx = curr_chunk.rfind(\" \", 0, chunk_size)\n", " if split_idx == -1:\n", " split_idx = chunk_size\n", " actual_chunk = curr_chunk[:split_idx]\n", " all_text_chunks.append(actual_chunk)\n", " # Carry the tail of the emitted chunk forward so consecutive chunks overlap\n", " overlap = actual_chunk[-overlap_size:] if overlap_size > 0 else \"\"\n", " curr_chunk = overlap + curr_chunk[split_idx + 1 :]\n", " if curr_chunk:\n", " all_text_chunks.append(curr_chunk)\n", " return all_text_chunks" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
extracted_textchunked
0CritterCuisine Pro 5000 - Automatic Pet Feeder...[\"CritterCuisine Pro 5000 - Automatic Pet Feed...
\n", "

1 rows × 2 columns

\n", "
[1 rows x 2 columns in total]" ], "text/plain": [ " extracted_text \\\n", "0 CritterCuisine Pro 5000 - Automatic Pet Feeder... \n", "\n", " chunked \n", "0 [\"CritterCuisine Pro 5000 - Automatic Pet Feed... \n", "\n", "[1 rows x 2 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pdf = bpd.from_glob_path(\"gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/documents/*\", name=\"pdf\")\n", "\n", "# Generate a JSON string containing the runtime information (including signed read URLs)\n", "access_urls = get_runtime_json_str(df_pdf[\"pdf\"], mode=\"R\")\n", "\n", "# Apply PDF extraction\n", "df_pdf[\"extracted_text\"] = access_urls.apply(pdf_extract)\n", "\n", "# Apply PDF chunking\n", "df_pdf[\"chunked\"] = access_urls.apply(pdf_chunk, args=(2000, 200))\n", "\n", "df_pdf[[\"extracted_text\", \"chunked\"]]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
0    CritterCuisine Pro 5000 - Automatic Pet Feeder...\n",
       "0    on a level, stable surface to prevent tipping....\n",
       "0    included)\\nto maintain the schedule during pow...\n",
       "0    digits for Meal 1 will flash.\\n\u0000. Use the UP/D...\n",
       "0    paperclip) for 5\\nseconds. This will reset all...\n",
       "0    unit with a damp cloth. Do not immerse the bas...\n",
       "0    continues,\\ncontact customer support.\\nE2: Foo...
" ], "text/plain": [ "0 CritterCuisine Pro 5000 - Automatic Pet Feeder...\n", "0 on a level, stable surface to prevent tipping....\n", "0 included)\\nto maintain the schedule during pow...\n", "0 digits for Meal 1 will flash.\\n\u0000. Use the UP/D...\n", "0 paperclip) for 5\\nseconds. This will reset all...\n", "0 unit with a damp cloth. Do not immerse the bas...\n", "0 continues,\\ncontact customer support.\\nE2: Foo...\n", "Name: chunked, dtype: string" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Explode the chunks to see each chunk as a separate row\n", "chunked = df_pdf[\"chunked\"].explode()\n", "chunked" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6. Audio transcribe" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "audio_gcs_path = \"gs://bigframes_blob_test/audio/*\"\n", "df = bpd.from_glob_path(audio_gcs_path, name=\"audio\")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/dtypes.py:990: JSONDtypeWarning: JSON columns will be represented as pandas.ArrowDtype(pyarrow.json_())\n", "instead of using `db_dtypes` in the future when available in pandas\n", "(https://github.com/pandas-dev/pandas/issues/60958) and pyarrow.\n", " warnings.warn(msg, bigframes.exceptions.JSONDtypeWarning)\n" ] }, { "data": { "text/html": [ "
0    Now, as all books, not primarily intended as p...
" ], "text/plain": [ "0 Now, as all books, not primarily intended as p...\n", "Name: transcribed_content, dtype: string" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The audio_transcribe function is a convenience wrapper around bigframes.bigquery.ai.generate.\n", "# Here's how to perform the same operation directly:\n", "\n", "audio_series = df[\"audio\"]\n", "prompt_text = (\n", " \"**Task:** Transcribe the provided audio. **Instructions:** - Your response \"\n", " \"must contain only the verbatim transcription of the audio. - Do not include \"\n", " \"any introductory text, summaries, or conversational filler in your response. \"\n", " \"The output should begin directly with the first word of the audio.\"\n", ")\n", "\n", "# Convert the audio series to the runtime representation required by the model.\n", "# This involves fetching metadata and getting a signed access URL.\n", "audio_metadata = bbq.obj.fetch_metadata(audio_series)\n", "audio_runtime = bbq.obj.get_access_url(audio_metadata, mode=\"R\")\n", "\n", "transcribed_results = bbq.ai.generate(\n", " prompt=(prompt_text, audio_runtime),\n", " endpoint=\"gemini-2.0-flash-001\",\n", " model_params={\"generationConfig\": {\"temperature\": 0.0}},\n", ")\n", "\n", "transcribed_series = transcribed_results.struct.field(\"result\").rename(\"transcribed_content\")\n", "transcribed_series" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
0    {'status': '', 'content': 'Now, as all books, ...
" ], "text/plain": [ "0 {'status': '', 'content': 'Now, as all books, ...\n", "Name: transcription_results, dtype: struct[pyarrow]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# To get verbose results (including status), we can extract both fields from the result struct.\n", "transcribed_content_series = transcribed_results.struct.field(\"result\")\n", "transcribed_status_series = transcribed_results.struct.field(\"status\")\n", "\n", "transcribed_series_verbose = bpd.DataFrame(\n", " {\n", " \"status\": transcribed_status_series,\n", " \"content\": transcribed_content_series,\n", " }\n", ")\n", "# Package as a struct for consistent display\n", "transcribed_series_verbose = bbq.struct(transcribed_series_verbose).rename(\"transcription_results\")\n", "transcribed_series_verbose" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7. Extract EXIF metadata from images" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section demonstrates how to extract EXIF metadata from images using a custom BigQuery Python UDF and the `Pillow` library." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/pandas/__init__.py:151: PreviewWarning: udf is in preview.\n", " return global_session.with_default_session(\n" ] } ], "source": [ "# Construct the canonical connection ID\n", "FULL_CONNECTION_ID = f\"{PROJECT}.{LOCATION}.bigframes-default-connection\"\n", "\n", "@bpd.udf(\n", " input_types=[str],\n", " output_type=str,\n", " dataset=DATASET_ID,\n", " name=\"extract_exif\",\n", " bigquery_connection=FULL_CONNECTION_ID,\n", " packages=[\"pillow\", \"requests\"],\n", " max_batching_rows=8192,\n", " container_cpu=0.33,\n", " container_memory=\"512Mi\"\n", ")\n", "def extract_exif(src_obj_ref_rt: str) -> str:\n", " import io\n", " import json\n", " from PIL import ExifTags, Image\n", " import requests\n", " src_obj_ref_rt_json = json.loads(src_obj_ref_rt)\n", " src_url = src_obj_ref_rt_json[\"access_urls\"][\"read_url\"]\n", " response = requests.get(src_url, timeout=30)\n", " bts = response.content\n", " image = Image.open(io.BytesIO(bts))\n", " exif_data = image.getexif()\n", " exif_dict = {}\n", " if exif_data:\n", " for tag, value in exif_data.items():\n", " tag_name = ExifTags.TAGS.get(tag, tag)\n", " exif_dict[tag_name] = value\n", " return json.dumps(exif_dict)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/google/home/shuowei/src/python-bigquery-dataframes/bigframes/core/utils.py:228: PreviewWarning: The JSON-related API `parse_json` is in preview. Its behavior may\n", "change in future versions.\n", " warnings.warn(bfe.format_message(msg), category=bfe.PreviewWarning)\n" ] }, { "data": { "text/html": [ "
0    {\"ExifOffset\":47,\"Make\":\"MyCamera\"}
" ], "text/plain": [ "0 {\"ExifOffset\":47,\"Make\":\"MyCamera\"}\n", "Name: blob_col, dtype: extension>[pyarrow]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a Multimodal DataFrame from the sample image URIs\n", "exif_image_df = bpd.from_glob_path(\n", " \"gs://bigframes_blob_test/images_exif/*\",\n", " name=\"blob_col\",\n", ")\n", "\n", "# Generate a JSON string containing the runtime information (including signed read URLs)\n", "# This allows the UDF to download the images from Google Cloud Storage\n", "access_urls = get_runtime_json_str(exif_image_df[\"blob_col\"], mode=\"R\")\n", "\n", "# Apply the BigQuery Python UDF to the runtime JSON strings\n", "# We cast to string to ensure the input matches the UDF's signature\n", "exif_json = access_urls.astype(str).apply(extract_exif)\n", "\n", "# Parse the resulting JSON strings back into a structured JSON type for easier access\n", "exif_data = bbq.parse_json(exif_json)\n", "\n", "exif_data" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "venv (3.13.0)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" } }, "nbformat": 4, "nbformat_minor": 0 }