{ "cells": [ { "cell_type": "markdown", "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "194%" } } } }, "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "# Creating a searchable index of the National Jukebox\n", "\n", "_Extracting text from audio and indexing it with BigQuery DataFrames_\n", "\n", "* Tim Swena (formerly Swast)\n", "* swast@google.com\n", "* https://vis.social/@timswast on Mastodon\n", "\n", "This notebook lives in\n", "\n", "* https://github.com/tswast/code-snippets\n", "* at https://github.com/tswast/code-snippets/blob/main/2025/national-jukebox/transcribe_songs.ipynb\n", "\n", "To follow along, you'll need a Google Cloud project.\n", "\n", "* Go to https://cloud.google.com/free to start a free trial." ] }, { "cell_type": "markdown", "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "z-index": "0", "zoom": "216%" } } } }, "slideshow": { "slide_type": "slide" } }, "source": [ "The National Jukebox is a project of the U.S. Library of Congress that provides access to thousands of acoustic sound recordings from the earliest days of the commercial record industry.\n", "\n", "* Learn more at https://www.loc.gov/collections/national-jukebox/about-this-collection/" ] }, { "cell_type": "markdown", "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "z-index": "0", "zoom": "181%" } } } }, "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "To search the National Jukebox, we combine three features of BigQuery:\n", "\n", "1. Integrations with multimodal AI models to extract information from unstructured data, in this case audio files.\n", "\n", " https://cloud.google.com/bigquery/docs/multimodal-data-dataframes-tutorial\n", " \n", "2. 
Vector search to find similar text using embedding models.\n", "\n", " https://cloud.google.com/bigquery/docs/vector-index-text-search-tutorial\n", "\n", "3. BigQuery DataFrames to use Python instead of SQL.\n", "\n", " https://cloud.google.com/bigquery/docs/bigquery-dataframes-introduction" ] }, { "cell_type": "markdown", "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "275%" } } } }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Getting started with BigQuery DataFrames (bigframes)\n", "\n", "Install the bigframes package." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "214%" } } } }, "execution": { "iopub.execute_input": "2025-08-14T15:53:02.494188Z", "iopub.status.busy": "2025-08-14T15:53:02.493469Z", "iopub.status.idle": "2025-08-14T15:53:08.492291Z", "shell.execute_reply": "2025-08-14T15:53:08.491183Z", "shell.execute_reply.started": "2025-08-14T15:53:02.494152Z" }, "trusted": true }, "outputs": [], "source": [ "%pip install --upgrade bigframes google-cloud-automl google-cloud-translate google-ai-generativelanguage tensorflow " ] }, { "cell_type": "markdown", "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "z-index": "4", "zoom": "236%" } } } } }, "source": [ "**Important:** restart the kernel by going to \"Run -> Restart & clear cell outputs\" before continuing.\n", "\n", "Configure bigframes to use your GCP project. First, go to \"Add-ons -> Google Cloud SDK\" and click the \"Attach\" button. 
Then, run the following cell to load the attached credential:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2025-08-14T15:53:08.494636Z", "iopub.status.busy": "2025-08-14T15:53:08.494313Z", "iopub.status.idle": "2025-08-14T15:53:08.609706Z", "shell.execute_reply": "2025-08-14T15:53:08.608705Z", "shell.execute_reply.started": "2025-08-14T15:53:08.494604Z" }, "trusted": true }, "outputs": [], "source": [ "# Fetch the Google Cloud credential attached through the Kaggle add-on.\n", "from kaggle_secrets import UserSecretsClient\n", "user_secrets = UserSecretsClient()\n", "user_credential = user_secrets.get_gcloud_credential()\n", "user_secrets.set_tensorflow_credential(user_credential)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "193%" } } } }, "execution": { "iopub.execute_input": "2025-08-14T15:53:08.610982Z", "iopub.status.busy": "2025-08-14T15:53:08.610686Z", "iopub.status.idle": "2025-08-14T15:53:17.658993Z", "shell.execute_reply": "2025-08-14T15:53:17.657745Z", "shell.execute_reply.started": "2025-08-14T15:53:08.610961Z" }, "trusted": true }, "outputs": [], "source": [ "import bigframes.pandas as bpd\n", "\n", "bpd.options.bigquery.location = \"US\"\n", "\n", "# Replace with your own GCP project ID.\n", "bpd.options.bigquery.project = \"swast-scratch\"" ] }, { "cell_type": "markdown", "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "207%" } } } }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Reading data\n", "\n", "BigQuery DataFrames can read data from BigQuery, Google Cloud Storage (GCS), or even local sources. With `engine=\"bigquery\"`, BigQuery's distributed processing reads the file directly, so the data never has to pass through your local Python environment." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "225%" } } } }, "execution": { "iopub.execute_input": "2025-08-14T15:53:17.662234Z", "iopub.status.busy": "2025-08-14T15:53:17.661901Z", "iopub.status.idle": "2025-08-14T15:53:34.486799Z", "shell.execute_reply": "2025-08-14T15:53:34.485777Z", "shell.execute_reply.started": "2025-08-14T15:53:17.662207Z" }, "trusted": true }, "outputs": [], "source": [ "df = bpd.read_json(\n", " \"gs://cloud-samples-data/third-party/usa-loc-national-jukebox/jukebox.jsonl\",\n", " engine=\"bigquery\",\n", " orient=\"records\",\n", " lines=True,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "122%" } } } }, "execution": { "iopub.execute_input": "2025-08-14T15:53:34.488610Z", "iopub.status.busy": "2025-08-14T15:53:34.488332Z", "iopub.status.idle": "2025-08-14T15:53:40.347014Z", "shell.execute_reply": "2025-08-14T15:53:40.345773Z", "shell.execute_reply.started": "2025-08-14T15:53:34.488589Z" }, "slideshow": { "slide_type": "slide" }, "trusted": true }, "outputs": [], "source": [ "# Use `peek()` instead of `head()` to see arbitrary rows rather than the \"first\" rows.\n", "df.peek()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "134%" } } } }, "execution": { "iopub.execute_input": "2025-08-14T15:53:40.348376Z", "iopub.status.busy": "2025-08-14T15:53:40.348021Z", "iopub.status.idle": "2025-08-14T15:53:40.364129Z", "shell.execute_reply": "2025-08-14T15:53:40.363204Z", "shell.execute_reply.started": "2025-08-14T15:53:40.348351Z" }, "trusted": true }, "outputs": [], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { 
"iopub.execute_input": "2025-08-14T15:55:55.448664Z", "iopub.status.busy": "2025-08-14T15:55:55.448310Z", "iopub.status.idle": "2025-08-14T15:55:59.440964Z", "shell.execute_reply": "2025-08-14T15:55:59.439988Z", "shell.execute_reply.started": "2025-08-14T15:55:55.448637Z" }, "trusted": true }, "outputs": [], "source": [ "# For the purposes of a demo, select only a subset of rows.\n", "df = df.sample(n=250)\n", "df.cache()\n", "df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "161%" } } } }, "execution": { "iopub.execute_input": "2025-08-14T15:56:02.040804Z", "iopub.status.busy": "2025-08-14T15:56:02.040450Z", "iopub.status.idle": "2025-08-14T15:56:06.544384Z", "shell.execute_reply": "2025-08-14T15:56:06.543240Z", "shell.execute_reply.started": "2025-08-14T15:56:02.040777Z" }, "slideshow": { "slide_type": "slide" }, "trusted": true }, "outputs": [], "source": [ "# As a side effect of how I extracted the song information from the HTML DOM,\n", "# we ended up with lists in places where we only expect one item.\n", "#\n", "# We can \"explode\" to flatten these lists.\n", "flattened = df.explode([\n", " \"Recording Repository\",\n", " \"Recording Label\",\n", " \"Recording Take Number\",\n", " \"Recording Date\",\n", " \"Recording Matrix Number\",\n", " \"Recording Catalog Number\",\n", " \"Media Size\",\n", " \"Recording Location\",\n", " \"Summary\",\n", " \"Rights Advisory\",\n", " \"Title\",\n", "])\n", "flattened.peek()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2025-08-14T15:56:06.546531Z", "iopub.status.busy": "2025-08-14T15:56:06.546140Z", "iopub.status.idle": "2025-08-14T15:56:06.566005Z", "shell.execute_reply": "2025-08-14T15:56:06.564355Z", "shell.execute_reply.started": "2025-08-14T15:56:06.546494Z" }, "trusted": true }, "outputs": [], "source": [ "flattened.shape" ] 
}, { "cell_type": "markdown", "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "216%" } } } }, "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "To access unstructured data from BigQuery, create a URI pointing to a file in Google Cloud Storage (GCS). Then, construct a \"blob\" (also known as an \"Object Ref\" in BigQuery terms) so that BigQuery can read from GCS." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "211%" } } } }, "editable": true, "execution": { "iopub.execute_input": "2025-08-14T15:56:07.394879Z", "iopub.status.busy": "2025-08-14T15:56:07.394509Z", "iopub.status.idle": "2025-08-14T15:56:12.217017Z", "shell.execute_reply": "2025-08-14T15:56:12.215852Z", "shell.execute_reply.started": "2025-08-14T15:56:07.394853Z" }, "slideshow": { "slide_type": "" }, "tags": [], "trusted": true }, "outputs": [], "source": [ "flattened = flattened.assign(**{\n", " \"GCS Prefix\": \"gs://cloud-samples-data/third-party/usa-loc-national-jukebox/\",\n", " \"GCS Stub\": flattened['URL'].str.extract(r'/(jukebox-[0-9]+)/'),\n", "})\n", "flattened[\"GCS URI\"] = flattened[\"GCS Prefix\"] + flattened[\"GCS Stub\"] + \".mp3\"\n", "flattened[\"GCS Blob\"] = flattened[\"GCS URI\"].str.to_blob()" ] }, { "cell_type": "markdown", "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "317%" } } } }, "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "BigQuery (and BigQuery DataFrames) provide access to powerful models and multimodal capabilities. Here, we transcribe audio to text." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "execution": { "iopub.execute_input": "2025-08-14T15:56:20.908198Z", "iopub.status.busy": "2025-08-14T15:56:20.907791Z", "iopub.status.idle": "2025-08-14T15:58:45.909086Z", "shell.execute_reply": "2025-08-14T15:58:45.908060Z", "shell.execute_reply.started": "2025-08-14T15:56:20.908170Z" }, "slideshow": { "slide_type": "" }, "tags": [], "trusted": true }, "outputs": [], "source": [ "flattened[\"Transcription\"] = flattened[\"GCS Blob\"].blob.audio_transcribe(\n", " model_name=\"gemini-2.0-flash-001\",\n", " verbose=True,\n", ")\n", "flattened[\"Transcription\"]" ] }, { "cell_type": "markdown", "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "229%" } } } }, "slideshow": { "slide_type": "slide" } }, "source": [ "Sometimes the model has transient errors. Check the status column to see if there are errors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "177%" } } } }, "editable": true, "execution": { "iopub.execute_input": "2025-08-14T15:59:43.609239Z", "iopub.status.busy": "2025-08-14T15:59:43.607976Z", "iopub.status.idle": "2025-08-14T15:59:44.515118Z", "shell.execute_reply": "2025-08-14T15:59:44.514275Z", "shell.execute_reply.started": "2025-08-14T15:59:43.609201Z" }, "slideshow": { "slide_type": "" }, "tags": [], "trusted": true }, "outputs": [], "source": [ "print(f\"Successful rows: {(flattened['Transcription'].struct.field('status') == '').sum()}\")\n", "print(f\"Failed rows: {(flattened['Transcription'].struct.field('status') != '').sum()}\")\n", "flattened.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "141%" } } } }, "execution": { "iopub.execute_input": 
"2025-08-14T15:59:44.820256Z", "iopub.status.busy": "2025-08-14T15:59:44.819926Z", "iopub.status.idle": "2025-08-14T15:59:53.147159Z", "shell.execute_reply": "2025-08-14T15:59:53.146281Z", "shell.execute_reply.started": "2025-08-14T15:59:44.820232Z" }, "trusted": true }, "outputs": [], "source": [ "# Show transcribed lyrics.\n", "flattened[\"Transcription\"].struct.field(\"content\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "152%" } } } }, "execution": { "iopub.execute_input": "2025-08-14T15:59:53.149222Z", "iopub.status.busy": "2025-08-14T15:59:53.148783Z", "iopub.status.idle": "2025-08-14T15:59:58.868959Z", "shell.execute_reply": "2025-08-14T15:59:58.867804Z", "shell.execute_reply.started": "2025-08-14T15:59:53.149198Z" }, "slideshow": { "slide_type": "slide" }, "trusted": true }, "outputs": [], "source": [ "# Find all instrumentatal songs\n", "instrumental = flattened[flattened[\"Transcription\"].struct.field(\"content\") == \"\"]\n", "print(instrumental.shape)\n", "song = instrumental.peek(1)\n", "song" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "152%" } } } }, "editable": true, "execution": { "iopub.execute_input": "2025-08-14T15:59:58.870143Z", "iopub.status.busy": "2025-08-14T15:59:58.869868Z", "iopub.status.idle": "2025-08-14T16:00:15.502470Z", "shell.execute_reply": "2025-08-14T16:00:15.500813Z", "shell.execute_reply.started": "2025-08-14T15:59:58.870123Z" }, "slideshow": { "slide_type": "" }, "tags": [], "trusted": true }, "outputs": [], "source": [ "import gcsfs\n", "import IPython.display\n", "\n", "fs = gcsfs.GCSFileSystem(project='bigframes-dev')\n", "with fs.open(song[\"GCS URI\"].iloc[0]) as song_file:\n", " song_bytes = song_file.read()\n", "\n", "IPython.display.Audio(song_bytes)" ] }, { 
"cell_type": "markdown", "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "181%" } } } }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Creating a searchable index\n", "\n", "To be able to search by semantics rather than just text, generate embeddings and then create an index to efficiently search these.\n", "\n", "See also, this example: https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/generative_ai/bq_dataframes_llm_vector_search.ipynb" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "163%" } } } }, "execution": { "iopub.execute_input": "2025-08-14T16:00:15.506380Z", "iopub.status.busy": "2025-08-14T16:00:15.505775Z", "iopub.status.idle": "2025-08-14T16:00:25.134987Z", "shell.execute_reply": "2025-08-14T16:00:25.134124Z", "shell.execute_reply.started": "2025-08-14T16:00:15.506337Z" }, "trusted": true }, "outputs": [], "source": [ "from bigframes.ml.llm import TextEmbeddingGenerator\n", "\n", "text_model = TextEmbeddingGenerator(model_name=\"text-multilingual-embedding-002\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "125%" } } } }, "execution": { "iopub.execute_input": "2025-08-14T16:00:25.136017Z", "iopub.status.busy": "2025-08-14T16:00:25.135744Z", "iopub.status.idle": "2025-08-14T16:00:34.860878Z", "shell.execute_reply": "2025-08-14T16:00:34.859925Z", "shell.execute_reply.started": "2025-08-14T16:00:25.135997Z" }, "trusted": true }, "outputs": [], "source": [ "df_to_index = (\n", " flattened\n", " .assign(content=flattened[\"Transcription\"].struct.field(\"content\"))\n", " [flattened[\"Transcription\"].struct.field(\"content\") != \"\"]\n", ")\n", "embedding = text_model.predict(df_to_index)\n", 
"embedding.peek(1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "178%" } } } }, "editable": true, "execution": { "iopub.execute_input": "2025-08-14T16:01:20.816923Z", "iopub.status.busy": "2025-08-14T16:01:20.816523Z", "iopub.status.idle": "2025-08-14T16:01:22.480554Z", "shell.execute_reply": "2025-08-14T16:01:22.479604Z", "shell.execute_reply.started": "2025-08-14T16:01:20.816894Z" }, "slideshow": { "slide_type": "slide" }, "tags": [], "trusted": true }, "outputs": [], "source": [ "# Check the status column to look for errors.\n", "print(f\"Successful rows: {(embedding['ml_generate_embedding_status'] == '').sum()}\")\n", "print(f\"Failed rows: {(embedding['ml_generate_embedding_status'] != '').sum()}\")\n", "embedding.shape" ] }, { "cell_type": "markdown", "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "224%" } } } } }, "source": [ "We're now ready to save this to a table." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "172%" } } } }, "execution": { "iopub.execute_input": "2025-08-14T16:03:43.611592Z", "iopub.status.busy": "2025-08-14T16:03:43.611265Z", "iopub.status.idle": "2025-08-14T16:03:47.459025Z", "shell.execute_reply": "2025-08-14T16:03:47.458079Z", "shell.execute_reply.started": "2025-08-14T16:03:43.611568Z" }, "trusted": true }, "outputs": [], "source": [ "embedding_table_id = f\"{bpd.options.bigquery.project}.kaggle.national_jukebox\"\n", "embedding.to_gbq(embedding_table_id, if_exists=\"replace\")" ] }, { "cell_type": "markdown", "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "183%" } } } }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Searching the database\n", "\n", "To search by semantics, we:\n", "\n", "1. Turn our search string into an embedding using the same model as our index.\n", "2. Find the closest matches to the search string." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "92%" } } } }, "execution": { "iopub.execute_input": "2025-08-14T16:03:52.674429Z", "iopub.status.busy": "2025-08-14T16:03:52.673629Z", "iopub.status.idle": "2025-08-14T16:03:59.962635Z", "shell.execute_reply": "2025-08-14T16:03:59.961482Z", "shell.execute_reply.started": "2025-08-14T16:03:52.674399Z" }, "slideshow": { "slide_type": "skip" }, "trusted": true }, "outputs": [], "source": [ "import bigframes.pandas as bpd\n", "\n", "df_written = bpd.read_gbq(embedding_table_id)\n", "df_written.peek(1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "127%" } } } }, "execution": { "iopub.execute_input": "2025-08-14T16:03:59.964634Z", "iopub.status.busy": "2025-08-14T16:03:59.964268Z", "iopub.status.idle": "2025-08-14T16:04:55.051531Z", "shell.execute_reply": "2025-08-14T16:04:55.050393Z", "shell.execute_reply.started": "2025-08-14T16:03:59.964598Z" }, "trusted": true }, "outputs": [], "source": [ "from bigframes.ml.llm import TextEmbeddingGenerator\n", "\n", "search_string = \"walking home\"\n", "\n", "text_model = TextEmbeddingGenerator(model_name=\"text-multilingual-embedding-002\")\n", "search_df = bpd.DataFrame([search_string], columns=['search_string'])\n", "search_embedding = text_model.predict(search_df)\n", "search_embedding" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "175%" } } } }, "editable": true, "execution": { "iopub.execute_input": "2025-08-14T16:05:46.473357Z", "iopub.status.busy": "2025-08-14T16:05:46.473056Z", "iopub.status.idle": "2025-08-14T16:05:50.564470Z", "shell.execute_reply": "2025-08-14T16:05:50.563277Z", 
"shell.execute_reply.started": "2025-08-14T16:05:46.473336Z" }, "slideshow": { "slide_type": "slide" }, "tags": [], "trusted": true }, "outputs": [], "source": [ "import bigframes.bigquery as bbq\n", "\n", "vector_search_results = bbq.vector_search(\n", " base_table=f\"swast-scratch.scipy2025.national_jukebox\",\n", " column_to_search=\"ml_generate_embedding_result\",\n", " query=search_embedding,\n", " distance_type=\"COSINE\",\n", " query_column_to_search=\"ml_generate_embedding_result\",\n", " top_k=5,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2025-08-14T16:05:50.566930Z", "iopub.status.busy": "2025-08-14T16:05:50.566422Z", "iopub.status.idle": "2025-08-14T16:05:50.576293Z", "shell.execute_reply": "2025-08-14T16:05:50.575186Z", "shell.execute_reply.started": "2025-08-14T16:05:50.566893Z" }, "trusted": true }, "outputs": [], "source": [ "vector_search_results.dtypes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "158%" } } } }, "execution": { "iopub.execute_input": "2025-08-14T16:05:54.787080Z", "iopub.status.busy": "2025-08-14T16:05:54.786649Z", "iopub.status.idle": "2025-08-14T16:05:55.581285Z", "shell.execute_reply": "2025-08-14T16:05:55.580012Z", "shell.execute_reply.started": "2025-08-14T16:05:54.787054Z" }, "slideshow": { "slide_type": "slide" }, "trusted": true }, "outputs": [], "source": [ "results = vector_search_results[[\"Title\", \"Summary\", \"Names\", \"GCS URI\", \"Transcription\", \"distance\"]].sort_values(\"distance\").to_pandas()\n", "results" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "@deathbeds/jupyterlab-fonts": { "styles": { "": { "body[data-jp-deck-mode='presenting'] &": { "zoom": "138%" } } } }, "execution": { "iopub.execute_input": "2025-08-14T16:05:56.142373Z", "iopub.status.busy": "2025-08-14T16:05:56.142038Z", 
"iopub.status.idle": "2025-08-14T16:05:56.149020Z", "shell.execute_reply": "2025-08-14T16:05:56.147966Z", "shell.execute_reply.started": "2025-08-14T16:05:56.142350Z" }, "trusted": true }, "outputs": [], "source": [ "print(results[\"Transcription\"].struct.field(\"content\").iloc[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "execution": { "iopub.execute_input": "2025-08-14T16:06:04.542878Z", "iopub.status.busy": "2025-08-14T16:06:04.542537Z", "iopub.status.idle": "2025-08-14T16:06:04.843052Z", "shell.execute_reply": "2025-08-14T16:06:04.841220Z", "shell.execute_reply.started": "2025-08-14T16:06:04.542854Z" }, "scrolled": true, "slideshow": { "slide_type": "" }, "tags": [], "trusted": true }, "outputs": [], "source": [ "import gcsfs\n", "import IPython.display\n", "\n", "fs = gcsfs.GCSFileSystem(project='bigframes-dev')\n", "with fs.open(results[\"GCS URI\"].iloc[0]) as song_file:\n", " song_bytes = song_file.read()\n", "\n", "IPython.display.Audio(song_bytes)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "trusted": true }, "outputs": [], "source": [] } ], "metadata": { "kaggle": { "accelerator": "none", "dataSources": [ { "databundleVersionId": 13238728, "sourceId": 110281, "sourceType": "competition" } ], "dockerImageVersionId": 31089, "isGpuEnabled": false, "isInternetEnabled": true, "language": "python", "sourceType": "notebook" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.13" } }, "nbformat": 4, "nbformat_minor": 4 }