# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Build a Vector Search application using BigQuery DataFrames (aka BigFrames)#

Colab logo Run in Colab GitHub logo View on GitHub Vertex AI logo Open in Vertex AI Workbench BQ logo Open in BQ Studio

Author: Sudipto Guha (Google)

Last updated: March 16th 2025

Overview#

This notebook will guide you through a practical example of using BigFrames to perform vector search and analysis on a patent dataset within BigQuery. We will leverage Python and BigFrames to efficiently process, analyze, and gain insights from a large-scale dataset without moving data from BigQuery.

Here’s a breakdown of what we’ll cover:

  1. Data Ingestion and Embedding Generation: We will start by reading a public patent dataset directly from BigQuery into a BigFrames DataFrame. We’ll demonstrate how to use BigFrames’ TextEmbeddingGenerator to create text embeddings for the patent abstracts. This process converts the textual data into numerical vectors that capture the semantic meaning of each abstract. We’ll show how BigFrames efficiently performs this embedding generation within BigQuery, avoiding data transfer to the client-side. Finally, we’ll store the generated embeddings back into a new BigQuery table for subsequent analysis.

  2. Indexing and Similarity Search: Here we’ll create a vector index using BigFrames to enable fast and scalable similarity searches. We’ll demonstrate how to create an IVF index for efficient approximate nearest neighbor searches. We’ll then perform a vector search using a sample query string to find patents that are semantically similar to the query. This showcases how vector search goes beyond keyword matching to find relevant results based on meaning.

  3. AI-Powered Summarization with Retrieval Augmented Generation (RAG): To further enhance the analysis, we’ll implement a RAG pipeline. We’ll retrieve the top most similar patents based on the vector search results from step 2. We’ll use BigFrames’ GeminiTextGenerator to create a prompt for an LLM to generate a concise summary of the retrieved patents. This demonstrates how to combine vector search with generative AI to extract and synthesize meaningful insights from complex patent data.

We will tie these pieces together in Python using BigQuery DataFrames. Click here to learn more about BigQuery DataFrames!

Dataset#

This notebook uses the BQ Patents Public Dataset.

Costs#

This tutorial uses billable components of Google Cloud:

  • BigQuery (compute)

  • BigQuery ML

  • Generative AI support on Vertex AI

Learn about BigQuery compute pricing, Generative AI support on Vertex AI pricing, and BigQuery ML pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.

Setup & initialization#

Make sure you have the required roles and permissions listed below:

For Vector embedding generation

For Vector Index creation

Before you begin#

Complete the tasks in this section to set up your environment.

Set up your Google Cloud project#

The following steps are required, regardless of your notebook environment.

  1. Select or create a Google Cloud project. When you first create an account, you get a $300 credit towards your compute/storage costs.

  2. Make sure that billing is enabled for your project.

  3. Click here to enable the following APIs:

  • BigQuery API

  • BigQuery Connection API

  • Vertex AI API

  1. If you are running this notebook locally, install the Cloud SDK.

Set your project ID#

If you don’t know your project ID, see the support page: Locate the project ID

# set your project ID below
PROJECT_ID = "bigframes-dev"  # @param {type:"string"}

# set your region
REGION = "US"  # @param {type: "string"}

# Set the project id in gcloud
#! gcloud config set project {PROJECT_ID}

Authenticate your Google Cloud account#

Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below.

Vertex AI Workbench

Do nothing, you are already authenticated.

Local JupyterLab instance

Uncomment and run the following cell:

# ! gcloud auth login

Colab

Uncomment and run the following cell:

# from google.colab import auth
# auth.authenticate_user()

Now we are ready to use BigQuery DataFrames!

Step 1: Data Ingestion and Embedding Generation#

Install libraries

import bigframes.pandas as bf
import bigframes.ml as bf_ml
import bigframes.bigquery as bf_bq
import bigframes.ml.llm as bf_llm


from google.cloud import bigquery
from google.cloud import storage

# Construct a BigQuery client object.
client = bigquery.Client()

import pandas as pd
from IPython.display import Image, display
from PIL import Image as PILImage
import io

import json
from IPython.display import Markdown

# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bf.options.bigquery.project = PROJECT_ID
bf.options.bigquery.location = REGION

Partial ordering mode allows BigQuery DataFrames to push down many more row and column filters. On large clustered and partitioned tables, this can greatly reduce the number of bytes scanned and computation slots used. This blog post goes over it in more detail.

bf.options.bigquery.ordering_mode = "partial"

If you want to reset the location of the created DataFrame or Series objects, reset the session by executing bf.close_session(). After that, you can reuse bf.options.bigquery.location to specify another location.

Data Input - read the data from a publicly available BigQuery dataset

publications = bf.read_gbq('patents-public-data.google_patents_research.publications')
## create patents base table (subset of 10k out of ~110M records)

keep = (publications.embedding_v1.str.len() > 0) & (publications.title.str.len() > 0) & (publications.abstract.str.len() > 30)

## Choose 10000 random rows to analyze
publications = publications[keep].peek(10000)
## take a look at the sample dataset

publications.head(5)
publication_number title title_translated abstract abstract_translated cpc cpc_low cpc_inventive_low top_terms similar url country publication_description cited_by embedding_v1
0 WO-2007022924-B1 Pharmaceutical compositions with melting point... False The invention relates to the use of chemical f... False [{'code': 'A61K47/32', 'inventive': True, 'fir... ['A61K47/32' 'A61K47/30' 'A61K47/00' 'A61K' 'A... ['A61K47/32' 'A61K47/30' 'A61K47/00' 'A61K' 'A... ['composition' 'mucosa' 'melting point' 'agent... [{'publication_number': 'WO-2007022924-B1', 'a... https://patents.google.com/patent/WO2007022924B1 WIPO (PCT) Amended claims [] [ 5.3550040e-02 -9.3632710e-02 1.4337189e-02 ...
1 WO-03043855-B1 Convenience lighting for interior and exterior... False A lighting apparatus for a vehicle(21) include... False [{'code': 'B60Q1/247', 'inventive': True, 'fir... ['B60Q1/247' 'B60Q1/24' 'B60Q1/02' 'B60Q1/00' ... ['B60Q1/247' 'B60Q1/24' 'B60Q1/02' 'B60Q1/00' ... ['vehicle' 'light' 'apparatus defined' 'pillar... [{'publication_number': 'WO-03043855-B1', 'app... https://patents.google.com/patent/WO2003043855B1 WIPO (PCT) Amended claims [] [ 0.00484032 -0.02695554 -0.20798226 -0.207528...
2 AU-2020396918-A2 Shot detection and verification system False A shot detection system for a projectile weapo... False [{'code': 'F41A19/01', 'inventive': True, 'fir... ['F41A19/01' 'F41A19/00' 'F41A' 'F41' 'F' 'H04... ['F41A19/01' 'F41A19/00' 'F41A' 'F41' 'F' 'H04... ['interest' 'region' 'property' 'shot' 'test' ... [{'publication_number': 'US-2023228510-A1', 'a... https://patents.google.com/patent/AU2020396918A2 Australia Amended post open to public inspection [] [-1.49729420e-02 -2.27105440e-01 -2.68012730e-...
3 PL-347539-A1 Concrete mix of increased fire resistance False The burning resistance of concrete containing ... False [{'code': 'Y02W30/91', 'inventive': False, 'fi... ['Y02W30/91' 'Y02W30/50' 'Y02W30/00' 'Y02W' 'Y... ['Y02W30/91' 'Y02W30/50' 'Y02W30/00' 'Y02W' 'Y... ['fire resistance' 'concrete mix' 'increased f... [{'publication_number': 'DK-1564194-T3', 'appl... https://patents.google.com/patent/PL347539A1 Poland Application [] [ 0.01849568 -0.05340371 -0.19257502 -0.174919...
4 AU-PS049302-A0 Methods and systems (ap53) False A charging stand for charging a mobile phone, ... False [{'code': 'H02J7/00', 'inventive': True, 'firs... ['H02J7/00' 'H02J' 'H02' 'H' 'H04B1/40' 'H04B1... ['H02J7/00' 'H02J' 'H02' 'H' 'H04B1/40' 'H04B1... ['connection pin' 'mobile phone' 'cartridge' '... [{'publication_number': 'AU-PS049302-A0', 'app... https://patents.google.com/patent/AUPS049302A0 Australia Application filed, as announced in the Gazette... [] [ 0.00064732 -0.2136009 0.0040593 -0.024562...

Generate the text embeddings

from bigframes.ml.llm import TextEmbeddingGenerator

text_model = TextEmbeddingGenerator(
    model_name="text-embedding-005",
    # No connection id needed
)
Query job 0e9d9117-4981-4f5c-b785-ed831c08e7aa is DONE. 0 Bytes processed. Open Job
Query job fa4f1a54-85d4-4030-992e-fddda5edf3e3 is DONE. 0 Bytes processed. Open Job
## rename abstract column to content as the desired column on which embedding will be generated
publications = publications[["publication_number", "title", "abstract"]].rename(columns={'abstract': 'content'})

## generate the embeddings
## takes ~2-3 mins to run
embedding = text_model.predict(publications)[["publication_number", "title", "content", "ml_generate_embedding_result","ml_generate_embedding_status"]]

## filter out rows where the embedding generation failed. the embedding status value is empty if the embedding generation was successful
embedding = embedding[~embedding["ml_generate_embedding_status"].isnull()]
Load job 70377d71-bb13-46af-80c1-71ef16bf2949 is DONE. Open Job
Query job cc3b609d-b6b7-404f-9447-c76d3a52698b is DONE. 9.5 MB processed. Open Job
/usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/bigframes/core/array_value.py:109: PreviewWarning: JSON column interpretation as a custom PyArrow extention in
`db_dtypes` is a preview feature and subject to change.
  warnings.warn(msg, bfe.PreviewWarning)
embedding.head(5)
Query job 5b15fc4a-fa9a-4608-825f-be5af9953a38 is DONE. 71.0 MB processed. Open Job
publication_number title content ml_generate_embedding_result ml_generate_embedding_status
5611 WO-2014005277-A1 Resource management in a cloud computing envir... Technologies and implementations for managing ... [-2.92946529e-02 -1.24640828e-02 1.27173709e-...
6895 AU-2011325479-B2 7-([1,2,3]triazol-4-yl)-pyrrolo[2,3-b]pyrazine... Compounds of formula I, in which R [-6.45397678e-02 1.19616119e-02 -9.85191786e-...
6 IL-45347-A 7h-indolizino(5,6,7-ij)isoquinoline derivative... Compounds of the formula:\n[US3946019A] [-3.82784344e-02 -2.31682733e-02 -4.35006060e-...
5923 WO-2005111625-A3 Method to predict prostate cancer A method for predicting the probability or ris... [ 0.02480386 -0.01648765 0.03873815 -0.025998...
6370 US-7868678-B2 Configurable differential lines Embodiments related to configurable differenti... [ 2.71715336e-02 -1.93733890e-02 2.82729534e-...

5 rows × 5 columns

[5 rows x 5 columns in total]
# store embeddings in a BQ table
DATASET_ID = "my_dataset"  # @param {type:"string"}
TEXT_EMBEDDING_TABLE_ID = "my_embeddings_table" # @param {type:"string"}
embedding.to_gbq(f"{DATASET_ID}.{TEXT_EMBEDDING_TABLE_ID}", if_exists='replace')
Query job 06ce090b-e3f9-4252-b847-45c2a296ca61 is DONE. 70.9 MB processed. Open Job
'my_dataset.my_embeddings_table'

Step 3: AI-Powered Summarization with Retrieval Augmented Generation (RAG)#

Patent documents can be dense and time-consuming to digest. AI-Powered Patent Summarization utilizes Retrieval Augmented Generation (RAG) to streamline this process. By retrieving relevant patent information through vector search and then synthesizing it with a large language model, we can generate concise, human-readable summaries, saving valuable time and effort. The code sample below walks through how to set this up continuing with the same user query as the previous use case.

## gemini model

llm_model = bf_llm.GeminiTextGenerator(model_name = "gemini-2.0-flash-001") ## replace with other model as needed
Query job 3fabe659-f95b-49cb-b0c7-9d32b09177bf is DONE. 0 Bytes processed. Open Job

We will use the same user query from Section 2, and pass the list of abstracts returned by the vector search into the prompt for the RAG application

TEMPERATURE = 0.4
# Extract strings into a list of JSON strings
json_strings = [json.dumps({'abstract': s}) for s in vector_search_results['content_1']]
ALL_ABSTRACTS = json_strings

# Print the result (optional)
print(ALL_ABSTRACTS)
['{"abstract": "A chip package is provided, the chip package including: a chip carrier; a chip disposed over and electrically connected to a chip carrier top side; an electrically insulating material disposed over and at least partially surrounding the chip; one or more electrically conductive contact regions formed over the electrically insulating material and in electrical connection with the chip; and another electrically insulating material disposed over a chip carrier bottom side. An electrically conductive contact region on the chip carrier bottom side is released from the further electrically insulating material."}', '{"abstract": "A method of forming a microelectronic assembly includes positioning a support structure adjacent to an active region of a device but not extending onto the active region. The support structure has planar sections. Each planar section has a substantially uniform composition. The composition of at least one of the planar sections differs from the composition of at least one of the other planar sections. A lid is positioned in contact with the support structure and extends over the active region. The support structure is bonded to the device and to the lid."}', '{"abstract": "To provide a substrate for a semiconductor mounting device capable of obtaining high reliability.   In a semiconductor mounting device substrate of the present invention, a semiconductor chip can be surface-mounted by a flip chip connection method on a semiconductor chip mounting region of a first main surface of a multilayer wiring substrate. A plurality of second main surface side solder bumps 52 forming a plate-like component mounting region 53 are formed at a location immediately below the semiconductor chip 21 on the second main surface 13 of the multilayer wiring board 11. A plate-like component 101 mainly composed of an inorganic material is surface-mounted on the multilayer wiring board 11 by a flip chip connection method via a plurality of second main surface side solder bumps 52. A plurality of second main surface side solder bumps 52 are sealed by a second main surface side underfill 107 provided in the gap S <b> 2 between the second main surface 13 and the plate-like component 101.   [Selection] Figure 1"}', '{"abstract": "A ceramic electronic component includes an electronic component body, an inner electrode, and an outer electrode. The outer electrode includes a fired electrode layer and first and second plated layers. The fired electrode layer is disposed on the electronic component body. The first plated layer is disposed on the fired electrode layer. The thickness of the first plated layer is about 3 \\u03bcm to about 8 \\u03bcm, for example. The first plated layer contains nickel. The second plated layer is disposed on the first plated layer. The thickness of the second plated layer is about 0.025 \\u03bcm to about 1 \\u03bcm, for example. The second plated layer contains lead."}', '{"abstract": "Processes of assembling microelectronic packages with lead frames and/or other suitable substrates are described herein. In one embodiment, a method for fabricating a semiconductor assembly includes forming an attachment area and a non-attachment area on a lead finger of a lead frame. The attachment area is more wettable to the solder ball than the non-attachment area during reflow. The method also includes contacting a solder ball carried by a semiconductor die with the attachment area of the lead finger, reflowing the solder ball while the solder ball is in contact with the attachment area of the lead finger, and controllably collapsing the solder ball to establish an electrical connection between the semiconductor die and the lead finger of the lead frame."}']
## Setup the LLM prompt

prompt = f"""
You are an expert patent analyst. I will provide you the abstracts of the top 5 patents in json format retrieved by a vector search based on a user's query.
Your task is to analyze these abstracts and generate a concise, coherent summary that encapsulates the core innovations and concepts shared among them.

In your output, share the original user query.
Then output the concise, coherent summary that encapsulates the core innovations and concepts shared among the top 5 abstracts. The heading for this section should
be : Summary of the top 5 abstracts that are semantically closest to the user query.

User Query: {TEXT_SEARCH_STRING}
Top 5 abstracts: {ALL_ABSTRACTS}

Instructions:

Focus on identifying the common themes and key technological advancements described in the abstracts.
Synthesize the information into a clear and concise summary, approximately 150-200 words.
Avoid simply copying phrases from the abstracts. Instead, aim to provide a cohesive overview of the shared concepts.
Highlight the potential applications and benefits of the described inventions.
Maintain a professional and objective tone.
Do not mention the individual patents by number, focus on summarizing the shared concepts.
"""

print(prompt)
You are an expert patent analyst. I will provide you the abstracts of the top 5 patents in json format retrieved by a vector search based on a user's query.
Your task is to analyze these abstracts and generate a concise, coherent summary that encapsulates the core innovations and concepts shared among them.

In your output, share the original user query.
Then output the concise, coherent summary that encapsulates the core innovations and concepts shared among the top 5 abstracts. The heading for this section should
be : Summary of the top 5 abstracts that are semantically closest to the user query.

User Query: Chip assemblies employing solder bonds to back-side lands including an electrolytic nickel layer
Top 5 abstracts: ['{"abstract": "A chip package is provided, the chip package including: a chip carrier; a chip disposed over and electrically connected to a chip carrier top side; an electrically insulating material disposed over and at least partially surrounding the chip; one or more electrically conductive contact regions formed over the electrically insulating material and in electrical connection with the chip; and another electrically insulating material disposed over a chip carrier bottom side. An electrically conductive contact region on the chip carrier bottom side is released from the further electrically insulating material."}', '{"abstract": "A method of forming a microelectronic assembly includes positioning a support structure adjacent to an active region of a device but not extending onto the active region. The support structure has planar sections. Each planar section has a substantially uniform composition. The composition of at least one of the planar sections differs from the composition of at least one of the other planar sections. A lid is positioned in contact with the support structure and extends over the active region. The support structure is bonded to the device and to the lid."}', '{"abstract": "To provide a substrate for a semiconductor mounting device capable of obtaining high reliability.   In a semiconductor mounting device substrate of the present invention, a semiconductor chip can be surface-mounted by a flip chip connection method on a semiconductor chip mounting region of a first main surface of a multilayer wiring substrate. A plurality of second main surface side solder bumps 52 forming a plate-like component mounting region 53 are formed at a location immediately below the semiconductor chip 21 on the second main surface 13 of the multilayer wiring board 11. A plate-like component 101 mainly composed of an inorganic material is surface-mounted on the multilayer wiring board 11 by a flip chip connection method via a plurality of second main surface side solder bumps 52. A plurality of second main surface side solder bumps 52 are sealed by a second main surface side underfill 107 provided in the gap S <b> 2 between the second main surface 13 and the plate-like component 101.   [Selection] Figure 1"}', '{"abstract": "A ceramic electronic component includes an electronic component body, an inner electrode, and an outer electrode. The outer electrode includes a fired electrode layer and first and second plated layers. The fired electrode layer is disposed on the electronic component body. The first plated layer is disposed on the fired electrode layer. The thickness of the first plated layer is about 3 \\u03bcm to about 8 \\u03bcm, for example. The first plated layer contains nickel. The second plated layer is disposed on the first plated layer. The thickness of the second plated layer is about 0.025 \\u03bcm to about 1 \\u03bcm, for example. The second plated layer contains lead."}', '{"abstract": "Processes of assembling microelectronic packages with lead frames and/or other suitable substrates are described herein. In one embodiment, a method for fabricating a semiconductor assembly includes forming an attachment area and a non-attachment area on a lead finger of a lead frame. The attachment area is more wettable to the solder ball than the non-attachment area during reflow. The method also includes contacting a solder ball carried by a semiconductor die with the attachment area of the lead finger, reflowing the solder ball while the solder ball is in contact with the attachment area of the lead finger, and controllably collapsing the solder ball to establish an electrical connection between the semiconductor die and the lead finger of the lead frame."}']

Instructions:

Focus on identifying the common themes and key technological advancements described in the abstracts.
Synthesize the information into a clear and concise summary, approximately 150-200 words.
Avoid simply copying phrases from the abstracts. Instead, aim to provide a cohesive overview of the shared concepts.
Highlight the potential applications and benefits of the described inventions.
Maintain a professional and objective tone.
Do not mention the individual patents by number, focus on summarizing the shared concepts.
## Define a function that will take the input propmpt and run the LLM

def predict(prompt: str, temperature: float = TEMPERATURE) -> str:
    # Create dataframe
    input = bf.DataFrame(
        {
            "prompt": [prompt],
        }
    )

    # Return response
    return llm_model.predict(input, temperature=temperature).ml_generate_text_llm_result.iloc[0]
# Invoke LLM with prompt
response = predict(prompt, temperature = TEMPERATURE)

# Print results as Markdown
Markdown(response)
Load job 34f3b649-6e45-46db-a6e5-405ae0a8bf69 is DONE. Open Job
Query job a574725f-64ae-4a19-aac0-959bec0bffeb is DONE. 5.0 kB processed. Open Job
/usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/bigframes/core/array_value.py:109: PreviewWarning: JSON column interpretation as a custom PyArrow extention in
`db_dtypes` is a preview feature and subject to change.
  warnings.warn(msg, bfe.PreviewWarning)

User Query: Chip assemblies employing solder bonds to back-side lands including an electrolytic nickel layer

Summary of the top 5 abstracts that are semantically closest to the user query:

The abstracts describe various aspects of microelectronic assembly and packaging, with a focus on enhancing reliability and electrical connectivity. A common theme is the use of solder bumps or balls for creating electrical connections between different components, such as semiconductor chips and substrates or lead frames. Several abstracts highlight methods for improving the solderability and wettability of contact regions, often involving the use of multiple layers with differing compositions. The use of electrically insulating materials to provide support and protection to the chip and electrical connections is also described. One abstract specifically mentions a nickel-containing plated layer as part of an outer electrode, suggesting its role in improving the electrical or mechanical properties of the connection. The innovations aim to improve the reliability and performance of microelectronic devices through optimized material selection, assembly processes, and structural designs.

Summary and next steps#

Ready to dive deeper and explore the endless possibilities? Start building your own vector search applications with BigFrames and BigQuery today! Check out our documentation, explore our sample notebooks, and unleash the power of vector analytics on your data. The BigFrames team would also love to hear from you. If you would like to reach out, please send an email to: bigframes-feedback@google.com or by filing an issue at the open source BigFrames repository. To receive updates about BigFrames, subscribe to the BigFrames email list.