Creating a searchable index of the National Jukebox#

Extracting text from audio and indexing it with BigQuery DataFrames

  • Tim Swena (formerly, Swast)

  • swast@google.com

  • https://vis.social/@timswast on Mastodon

This notebook lives in

  • https://github.com/tswast/code-snippets

  • at https://github.com/tswast/code-snippets/blob/main/2025/national-jukebox/transcribe_songs.ipynb

To follow along, you’ll need a Google Cloud project

  • Go to https://cloud.google.com/free to start a free trial.

The National Jukebox is a project of the U.S. Library of Congress that provides access to thousands of acoustic sound recordings from the earliest days of the commercial record industry.

  • Learn more at https://www.loc.gov/collections/national-jukebox/about-this-collection/

(Image: a recording session from 100+ years ago)

To search the National Jukebox, we combine powerful features of BigQuery:

  1. Integrations with multi-modal AI models to extract information from unstructured data, in this case audio files.

    https://cloud.google.com/bigquery/docs/multimodal-data-dataframes-tutorial

  2. Vector search to find similar text using embedding models.

    https://cloud.google.com/bigquery/docs/vector-index-text-search-tutorial

  3. BigQuery DataFrames to use Python instead of SQL.

    https://cloud.google.com/bigquery/docs/bigquery-dataframes-introduction

Getting started with BigQuery DataFrames (bigframes)#

Install the bigframes package.

%pip install --upgrade bigframes google-cloud-automl google-cloud-translate google-ai-generativelanguage tensorflow 

Important: restart the kernel by going to “Run -> Restart & clear cell outputs” before continuing.

Configure bigframes to use your GCP project. First, go to “Add-ons -> Google Cloud SDK” and click the “Attach” button. Then, load your credentials and set the project:

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
user_credential = user_secrets.get_gcloud_credential()
user_secrets.set_tensorflow_credential(user_credential)
import bigframes.pandas as bpd

bpd.options.bigquery.location = "US"

# Set to your GCP project ID.
bpd.options.bigquery.project = "swast-scratch"

Reading data#

BigQuery DataFrames can read data from BigQuery, GCS, or even local sources. With engine="bigquery", BigQuery’s distributed processing reads the file directly, so the data never has to pass through your local Python environment.

df = bpd.read_json(
    "gs://cloud-samples-data/third-party/usa-loc-national-jukebox/jukebox.jsonl",
    engine="bigquery",
    orient="records",
    lines=True,
)
# Use `peek()` instead of `head()` to see arbitrary rows rather than the "first" rows.
df.peek()
df.shape
# For the purposes of a demo, select only a subset of rows.
df = df.sample(n=250)
df.cache()
df.shape
# As a side effect of how I extracted the song information from the HTML DOM,
# we ended up with lists in places where we only expect one item.
#
# We can "explode" to flatten these lists.
flattened = df.explode([
    "Recording Repository",
    "Recording Label",
    "Recording Take Number",
    "Recording Date",
    "Recording Matrix Number",
    "Recording Catalog Number",
    "Media Size",
    "Recording Location",
    "Summary",
    "Rights Advisory",
    "Title",
])
flattened.peek()
flattened.shape
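BigQuery DataFrames mirrors the pandas `explode` API, so the flattening above can be sketched locally with plain pandas. A minimal sketch with a made-up two-row stand-in for the jukebox metadata (the column names match the real data; the values are invented):

```python
import pandas as pd

# Tiny stand-in: list-valued columns where we expect one item per row.
df = pd.DataFrame({
    "Title": [["At the jazz band ball"], ["Livery stable blues"]],
    "Recording Date": [["1917-03-01"], ["1917-02-26"]],
})

# Multi-column explode emits one row per list element, keeping columns aligned.
flattened = df.explode(["Title", "Recording Date"])
print(flattened.shape)  # (2, 2): single-item lists flatten to scalar cells
```

Because every list here has exactly one element, the row count is unchanged; rows whose lists held several elements would be duplicated once per element.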

To access unstructured data from BigQuery, create a URI pointing to a file in Google Cloud Storage (GCS). Then, construct a “blob” (also known as an “Object Ref” in BigQuery terms) so that BigQuery can read from GCS.

flattened = flattened.assign(**{
    "GCS Prefix": "gs://cloud-samples-data/third-party/usa-loc-national-jukebox/",
    "GCS Stub": flattened['URL'].str.extract(r'/(jukebox-[0-9]+)/'),
})
flattened["GCS URI"] = flattened["GCS Prefix"] + flattened["GCS Stub"] + ".mp3"
flattened["GCS Blob"] = flattened["GCS URI"].str.to_blob()

BigQuery (and BigQuery DataFrames) provide access to powerful models and multimodal capabilities. Here, we transcribe audio to text.

flattened["Transcription"] = flattened["GCS Blob"].blob.audio_transcribe(
    model_name="gemini-2.0-flash-001",
    verbose=True,
)
flattened["Transcription"]

Sometimes the model hits transient errors. Check the status column to see which rows failed.

print(f"Successful rows: {(flattened['Transcription'].struct.field('status') == '').sum()}")
print(f"Failed rows: {(flattened['Transcription'].struct.field('status') != '').sum()}")
flattened.shape
# Show transcribed lyrics.
flattened["Transcription"].struct.field("content")
# Find instrumental songs: no lyrics means the transcription content is empty.
instrumental = flattened[flattened["Transcription"].struct.field("content") == ""]
print(instrumental.shape)
song = instrumental.peek(1)
song
import gcsfs
import IPython.display

fs = gcsfs.GCSFileSystem(project=bpd.options.bigquery.project)
with fs.open(song["GCS URI"].iloc[0]) as song_file:
    song_bytes = song_file.read()

IPython.display.Audio(song_bytes)

Creating a searchable index#

To search by semantics rather than exact text matches, generate an embedding for each transcription, then store the embeddings in a table that vector search can query efficiently.

See also, this example: https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/generative_ai/bq_dataframes_llm_vector_search.ipynb

from bigframes.ml.llm import TextEmbeddingGenerator

text_model = TextEmbeddingGenerator(model_name="text-multilingual-embedding-002")
df_to_index = (
    flattened
    .assign(content=flattened["Transcription"].struct.field("content"))
    [flattened["Transcription"].struct.field("content") != ""]
)
embedding = text_model.predict(df_to_index)
embedding.peek(1)
# Check the status column to look for errors.
print(f"Successful rows: {(embedding['ml_generate_embedding_status'] == '').sum()}")
print(f"Failed rows: {(embedding['ml_generate_embedding_status'] != '').sum()}")
embedding.shape

We’re now ready to save this to a table.

embedding_table_id = f"{bpd.options.bigquery.project}.kaggle.national_jukebox"
embedding.to_gbq(embedding_table_id, if_exists="replace")
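For larger tables, BigQuery can accelerate vector search with a vector index built via `CREATE VECTOR INDEX` DDL. Note that BigQuery requires a minimum number of rows in the indexed table (on the order of 5,000), so our 250-row demo table would fall back to exact, brute-force search; the statement below is a sketch of what the DDL would look like, with a hypothetical index name and an example table id matching the pattern above:

```python
# Example value following the project.dataset.table pattern used earlier.
embedding_table_id = "swast-scratch.kaggle.national_jukebox"

# Hypothetical index name; IVF is BigQuery's approximate-nearest-neighbor index type.
create_index_ddl = f"""
CREATE OR REPLACE VECTOR INDEX jukebox_index
ON `{embedding_table_id}`(ml_generate_embedding_result)
OPTIONS(index_type = 'IVF', distance_type = 'COSINE')
"""
print(create_index_ddl)
```

`vector_search` works with or without an index, so skipping this step only affects performance, not correctness.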

Searching the database#

To search by semantics, we:

  1. Turn our search string into an embedding using the same model as our index.

  2. Find the closest matches to the search string.
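Conceptually, step 2 is a nearest-neighbor lookup: compute the distance between the query embedding and every stored embedding, then keep the smallest. A minimal NumPy sketch with made-up 4-dimensional vectors (the real embeddings from text-multilingual-embedding-002 are 768-dimensional):

```python
import numpy as np

# Made-up stored embeddings (one row per document) and a query embedding.
stored = np.array([
    [0.1, 0.9, 0.0, 0.2],
    [0.8, 0.1, 0.3, 0.0],
    [0.2, 0.8, 0.1, 0.1],
])
query = np.array([0.1, 1.0, 0.0, 0.1])

# Cosine distance = 1 - cosine similarity.
sims = stored @ query / (np.linalg.norm(stored, axis=1) * np.linalg.norm(query))
distances = 1.0 - sims

# Indices of the 2 closest rows, smallest distance first.
top_k = np.argsort(distances)[:2]
print(top_k)  # [0 2]
```

BigQuery's `vector_search` does this at scale, using an index (when one exists) to avoid scanning every row.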

import bigframes.pandas as bpd

df_written = bpd.read_gbq(embedding_table_id)
df_written.peek(1)
from bigframes.ml.llm import TextEmbeddingGenerator

search_string = "walking home"

text_model = TextEmbeddingGenerator(model_name="text-multilingual-embedding-002")
search_df = bpd.DataFrame([search_string], columns=['search_string'])
search_embedding = text_model.predict(search_df)
search_embedding
import bigframes.bigquery as bbq

vector_search_results = bbq.vector_search(
    base_table=embedding_table_id,
    column_to_search="ml_generate_embedding_result",
    query=search_embedding,
    distance_type="COSINE",
    query_column_to_search="ml_generate_embedding_result",
    top_k=5,
)
vector_search_results.dtypes
results = (
    vector_search_results[["Title", "Summary", "Names", "GCS URI", "Transcription", "distance"]]
    .sort_values("distance")
    .to_pandas()
)
results
print(results["Transcription"].struct.field("content").iloc[0])
import gcsfs
import IPython.display

fs = gcsfs.GCSFileSystem(project=bpd.options.bigquery.project)
with fs.open(results["GCS URI"].iloc[0]) as song_file:
    song_bytes = song_file.read()

IPython.display.Audio(song_bytes)