{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "id": "TpJu6BBeooES" }, "outputs": [], "source": [ "# Copyright 2023 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "EQbZKS7_ooET" }, "source": [ "# Build a Vector Search application using BigQuery DataFrames (aka BigFrames)", "\n", "\n", "\n", " \n", " \n", " \n", " \n", "
\n" ] }, { "cell_type": "markdown", "metadata": { "id": "vFMjpPBo9aVv" }, "source": [ "**Author:** Sudipto Guha (Google)\n", "\n", "**Last updated:** March 16th 2025" ] }, { "cell_type": "markdown", "metadata": { "id": "SHQ3Gx-oooEU" }, "source": [ "## Overview\n", "\n", "This notebook will guide you through a practical example of using [BigFrames](https://github.com/googleapis/python-bigquery-dataframes/issues) to perform [vector search](https://cloud.google.com/bigquery/docs/vector-search-intro) and analysis on a patent dataset within BigQuery. We will leverage Python and BigFrames to efficiently process, analyze, and gain insights from a large-scale dataset without moving data from BigQuery.\n", "\n", "Here's a breakdown of what we'll cover:\n", "\n", "1. **Data Ingestion and Embedding Generation:**\n", "We will start by reading a public patent dataset directly from BigQuery into a BigFrames DataFrame.\n", "We'll demonstrate how to use BigFrames' `TextEmbeddingGenerator` to create text embeddings for the patent abstracts. This process converts the textual data into numerical vectors that capture the semantic meaning of each abstract.\n", "We'll show how BigFrames efficiently performs this embedding generation within BigQuery, avoiding data transfer to the client-side.\n", "Finally, we'll store the generated embeddings back into a new BigQuery table for subsequent analysis.\n", "\n", "2. **Indexing and Similarity Search:**\n", "Here we'll create a vector index using BigFrames to enable fast and scalable similarity searches.\n", "We'll demonstrate how to create an IVF index for efficient approximate nearest neighbor searches.\n", "We'll then perform a vector search using a sample query string to find patents that are semantically similar to the query. This showcases how vector search goes beyond keyword matching to find relevant results based on meaning.\n", "\n", "3. 
**AI-Powered Summarization with Retrieval Augmented Generation (RAG):**\n", "To further enhance the analysis, we'll implement a RAG pipeline.\n", "We'll retrieve the patents most similar to the query, based on the vector search results from step 2.\n", "We'll then pass them in a prompt to an LLM through BigFrames' `GeminiTextGenerator` to generate a concise summary of the retrieved patents.\n", "This demonstrates how to combine vector search with generative AI to extract and synthesize meaningful insights from complex patent data.\n", "\n", "\n", "We will tie these pieces together in Python using BigQuery DataFrames. [Click here](https://cloud.google.com/bigquery/docs/dataframes-quickstart) to learn more about BigQuery DataFrames!" ] }, { "cell_type": "markdown", "metadata": { "id": "EHjmqb-0ooEU" }, "source": [ "### Dataset\n", "\n", "This notebook uses the [BQ Patents Public Dataset](https://bigquery.cloud.google.com/dataset/patents-public-data:patentsview)." ] }, { "cell_type": "markdown", "metadata": { "id": "AqdihIDJooEU" }, "source": [ "### Costs\n", "\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "* BigQuery (compute)\n", "* BigQuery ML\n", "* Generative AI support on Vertex AI\n", "\n", "Learn about [BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models), [Generative AI support on Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing#generative_ai_models),\n", "and [BigQuery ML pricing](https://cloud.google.com/bigquery/pricing#bqml),\n", "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n", "to generate a cost estimate based on your projected usage."
] }, { "cell_type": "markdown", "metadata": { "id": "GqLjnm1hsKGU" }, "source": [ "## Setup & initialization\n", "\n", "Make sure you have the required roles and permissions listed below:\n", "\n", "For [Vector embedding generation](https://cloud.google.com/bigquery/docs/generate-text-embedding#required_roles)\n", "\n", "For [Vector Index creation](https://cloud.google.com/bigquery/docs/vector-index#roles_and_permissions)" ] }, { "cell_type": "markdown", "metadata": { "id": "Z-mvYJUCooEV" }, "source": [ "## Before you begin\n", "\n", "Complete the tasks in this section to set up your environment." ] }, { "cell_type": "markdown", "metadata": { "id": "xn-v3mSvooEV" }, "source": [ "### Set up your Google Cloud project\n", "\n", "**The following steps are required, regardless of your notebook environment.**\n", "\n", "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 credit towards your compute/storage costs.\n", "\n", "2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n", "\n", "3. [Click here](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com,bigqueryconnection.googleapis.com,aiplatform.googleapis.com) to enable the following APIs:\n", "\n", " * BigQuery API\n", " * BigQuery Connection API\n", " * Vertex AI API\n", "\n", "4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk)." 
] }, { "cell_type": "markdown", "metadata": { "id": "Ioydzb_8ooEV" }, "source": [ "#### Set your project ID\n", "\n", "**If you don't know your project ID**, see the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1742191597773, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "b8bKCfIiooEV" }, "outputs": [], "source": [ "# set your project ID below\n", "PROJECT_ID = \"bigframes-dev\" # @param {type:\"string\"}\n", "\n", "# set your region\n", "REGION = \"US\" # @param {type: \"string\"}\n", "\n", "# Set the project id in gcloud\n", "#! gcloud config set project {PROJECT_ID}" ] }, { "cell_type": "markdown", "metadata": { "id": "GbUgWr6LooEV" }, "source": [ "#### Authenticate your Google Cloud account\n", "\n", "Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below." ] }, { "cell_type": "markdown", "metadata": { "id": "U7ChP8jUooEV" }, "source": [ "**Vertex AI Workbench**\n", "\n", "Do nothing, you are already authenticated." ] }, { "cell_type": "markdown", "metadata": { "id": "VfHOYcZZooEW" }, "source": [ "**Local JupyterLab instance**\n", "\n", "Uncomment and run the following cell:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "3cGhUVM0ooEW" }, "outputs": [], "source": [ "# ! 
gcloud auth login" ] }, { "cell_type": "markdown", "metadata": { "id": "AoHnXlg-ooEW" }, "source": [ "**Colab**\n", "\n", "Uncomment and run the following cell:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1742191608487, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "j3lmnsh7ooEW", "outputId": "eb68daf5-5558-487a-91d2-4b4f9e476da0" }, "outputs": [], "source": [ "# from google.colab import auth\n", "# auth.authenticate_user()" ] }, { "cell_type": "markdown", "metadata": { "id": "a9gsyttuooEW" }, "source": [ "Now we are ready to use BigQuery DataFrames!" ] }, { "cell_type": "markdown", "metadata": { "id": "xckgWno6ouHY" }, "source": [ "## Step 1: Data Ingestion and Embedding Generation" ] }, { "cell_type": "markdown", "metadata": { "id": "Hjg9jDN-ooEW" }, "source": [ "Import libraries and configure the BigQuery DataFrames session" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "executionInfo": { "elapsed": 947, "status": "ok", "timestamp": 1742195413800, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "R7STCS8xB5d2" }, "outputs": [], "source": [ "import bigframes.pandas as bf\n", "import bigframes.ml as bf_ml\n", "import bigframes.bigquery as bf_bq\n", "import bigframes.ml.llm as bf_llm\n", "\n", "\n", "from google.cloud import bigquery\n", "from google.cloud import storage\n", "\n", "# Construct a BigQuery client object.\n", "client = bigquery.Client()\n", "\n", "import pandas as pd\n", "from IPython.display import Image, display\n", "from PIL import Image as PILImage\n", "import io\n", "\n", "import json\n", "from IPython.display import Markdown\n", "\n", "# Note: The project option is not required in all environments.\n", "# On BigQuery Studio, the project ID is automatically detected.\n", "bf.options.bigquery.project = PROJECT_ID\n", "bf.options.bigquery.location = REGION\n", "\n" ] }, { "cell_type": "markdown", 
"metadata": { "id": "iOFF9hrvs5WE" }, "source": [ "Partial ordering mode allows BigQuery DataFrames to push down many more row and column filters. On large clustered and partitioned tables, this can greatly reduce the number of bytes scanned and computation slots used. This [blog post](https://medium.com/google-cloud/introducing-partial-ordering-mode-for-bigquery-dataframes-bigframes-ec35841d95c0) goes over it in more detail." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1742191620533, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "9Gil1Oaas7KA" }, "outputs": [], "source": [ "bf.options.bigquery.ordering_mode = \"partial\"" ] }, { "cell_type": "markdown", "metadata": { "id": "XGaGyyZsooEW" }, "source": [ "If you want to reset the location of the created DataFrame or Series objects, reset the session by executing `bf.close_session()`. After that, you can reuse `bf.options.bigquery.location` to specify another location." 
] }, { "cell_type": "markdown", "metadata": { "id": "v6FGschEowht" }, "source": [ "Data Input - read the data from a publicly available BigQuery dataset" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 468, "status": "ok", "timestamp": 1742192516923, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "zDSwoBo1CU3G", "outputId": "83edbc2f-5a23-407b-8890-f968eb31be44" }, "outputs": [], "source": [ "publications = bf.read_gbq('patents-public-data.google_patents_research.publications')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "executionInfo": { "elapsed": 6697, "status": "ok", "timestamp": 1742192524632, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "tYDoaKgJChiq", "outputId": "9174da29-a051-4a99-e38f-6a2b09cfe4e9" }, "outputs": [], "source": [ "## create patents base table (subset of 10k out of ~110M records)\n", "\n", "## keep only rows that have an existing embedding, a title, and an abstract longer than 30 characters\n", "keep = (publications.embedding_v1.str.len() > 0) & (publications.title.str.len() > 0) & (publications.abstract.str.len() > 30)\n", "\n", "## Take 10,000 arbitrary rows to analyze (peek does not guarantee a random sample)\n", "publications = publications[keep].peek(10000)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 556 }, "executionInfo": { "elapsed": 6, "status": "ok", "timestamp": 1742191801044, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "XmqdJInztzPl", "outputId": "ae05f3a6-edeb-423a-c061-c416717e1ec5" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
|  | publication_number | title | title_translated | abstract | abstract_translated | cpc | cpc_low | cpc_inventive_low | top_terms | similar | url | country | publication_description | cited_by | embedding_v1 |
| 0 | WO-2007022924-B1 | Pharmaceutical compositions with melting point... | False | The invention relates to the use of chemical f... | False | [{'code': 'A61K47/32', 'inventive': True, 'fir... | ['A61K47/32' 'A61K47/30' 'A61K47/00' 'A61K' 'A... | ['A61K47/32' 'A61K47/30' 'A61K47/00' 'A61K' 'A... | ['composition' 'mucosa' 'melting point' 'agent... | [{'publication_number': 'WO-2007022924-B1', 'a... | https://patents.google.com/patent/WO2007022924B1 | WIPO (PCT) | Amended claims | [] | [ 5.3550040e-02 -9.3632710e-02 1.4337189e-02 ... |
| 1 | WO-03043855-B1 | Convenience lighting for interior and exterior... | False | A lighting apparatus for a vehicle(21) include... | False | [{'code': 'B60Q1/247', 'inventive': True, 'fir... | ['B60Q1/247' 'B60Q1/24' 'B60Q1/02' 'B60Q1/00' ... | ['B60Q1/247' 'B60Q1/24' 'B60Q1/02' 'B60Q1/00' ... | ['vehicle' 'light' 'apparatus defined' 'pillar... | [{'publication_number': 'WO-03043855-B1', 'app... | https://patents.google.com/patent/WO2003043855B1 | WIPO (PCT) | Amended claims | [] | [ 0.00484032 -0.02695554 -0.20798226 -0.207528... |
| 2 | AU-2020396918-A2 | Shot detection and verification system | False | A shot detection system for a projectile weapo... | False | [{'code': 'F41A19/01', 'inventive': True, 'fir... | ['F41A19/01' 'F41A19/00' 'F41A' 'F41' 'F' 'H04... | ['F41A19/01' 'F41A19/00' 'F41A' 'F41' 'F' 'H04... | ['interest' 'region' 'property' 'shot' 'test' ... | [{'publication_number': 'US-2023228510-A1', 'a... | https://patents.google.com/patent/AU2020396918A2 | Australia | Amended post open to public inspection | [] | [-1.49729420e-02 -2.27105440e-01 -2.68012730e-... |
| 3 | PL-347539-A1 | Concrete mix of increased fire resistance | False | The burning resistance of concrete containing ... | False | [{'code': 'Y02W30/91', 'inventive': False, 'fi... | ['Y02W30/91' 'Y02W30/50' 'Y02W30/00' 'Y02W' 'Y... | ['Y02W30/91' 'Y02W30/50' 'Y02W30/00' 'Y02W' 'Y... | ['fire resistance' 'concrete mix' 'increased f... | [{'publication_number': 'DK-1564194-T3', 'appl... | https://patents.google.com/patent/PL347539A1 | Poland | Application | [] | [ 0.01849568 -0.05340371 -0.19257502 -0.174919... |
| 4 | AU-PS049302-A0 | Methods and systems (ap53) | False | A charging stand for charging a mobile phone, ... | False | [{'code': 'H02J7/00', 'inventive': True, 'firs... | ['H02J7/00' 'H02J' 'H02' 'H' 'H04B1/40' 'H04B1... | ['H02J7/00' 'H02J' 'H02' 'H' 'H04B1/40' 'H04B1... | ['connection pin' 'mobile phone' 'cartridge' '... | [{'publication_number': 'AU-PS049302-A0', 'app... | https://patents.google.com/patent/AUPS049302A0 | Australia | Application filed, as announced in the Gazette... | [] | [ 0.00064732 -0.2136009 0.0040593 -0.024562... |
\n", "
" ], "text/plain": [ " publication_number title \\\n", "0 WO-2007022924-B1 Pharmaceutical compositions with melting point... \n", "1 WO-03043855-B1 Convenience lighting for interior and exterior... \n", "2 AU-2020396918-A2 Shot detection and verification system \n", "3 PL-347539-A1 Concrete mix of increased fire resistance \n", "4 AU-PS049302-A0 Methods and systems (ap53) \n", "\n", " title_translated abstract \\\n", "0 False The invention relates to the use of chemical f... \n", "1 False A lighting apparatus for a vehicle(21) include... \n", "2 False A shot detection system for a projectile weapo... \n", "3 False The burning resistance of concrete containing ... \n", "4 False A charging stand for charging a mobile phone, ... \n", "\n", " abstract_translated cpc \\\n", "0 False [{'code': 'A61K47/32', 'inventive': True, 'fir... \n", "1 False [{'code': 'B60Q1/247', 'inventive': True, 'fir... \n", "2 False [{'code': 'F41A19/01', 'inventive': True, 'fir... \n", "3 False [{'code': 'Y02W30/91', 'inventive': False, 'fi... \n", "4 False [{'code': 'H02J7/00', 'inventive': True, 'firs... \n", "\n", " cpc_low \\\n", "0 ['A61K47/32' 'A61K47/30' 'A61K47/00' 'A61K' 'A... \n", "1 ['B60Q1/247' 'B60Q1/24' 'B60Q1/02' 'B60Q1/00' ... \n", "2 ['F41A19/01' 'F41A19/00' 'F41A' 'F41' 'F' 'H04... \n", "3 ['Y02W30/91' 'Y02W30/50' 'Y02W30/00' 'Y02W' 'Y... \n", "4 ['H02J7/00' 'H02J' 'H02' 'H' 'H04B1/40' 'H04B1... \n", "\n", " cpc_inventive_low \\\n", "0 ['A61K47/32' 'A61K47/30' 'A61K47/00' 'A61K' 'A... \n", "1 ['B60Q1/247' 'B60Q1/24' 'B60Q1/02' 'B60Q1/00' ... \n", "2 ['F41A19/01' 'F41A19/00' 'F41A' 'F41' 'F' 'H04... \n", "3 ['Y02W30/91' 'Y02W30/50' 'Y02W30/00' 'Y02W' 'Y... \n", "4 ['H02J7/00' 'H02J' 'H02' 'H' 'H04B1/40' 'H04B1... \n", "\n", " top_terms \\\n", "0 ['composition' 'mucosa' 'melting point' 'agent... \n", "1 ['vehicle' 'light' 'apparatus defined' 'pillar... \n", "2 ['interest' 'region' 'property' 'shot' 'test' ... \n", "3 ['fire resistance' 'concrete mix' 'increased f... 
\n", "4 ['connection pin' 'mobile phone' 'cartridge' '... \n", "\n", " similar \\\n", "0 [{'publication_number': 'WO-2007022924-B1', 'a... \n", "1 [{'publication_number': 'WO-03043855-B1', 'app... \n", "2 [{'publication_number': 'US-2023228510-A1', 'a... \n", "3 [{'publication_number': 'DK-1564194-T3', 'appl... \n", "4 [{'publication_number': 'AU-PS049302-A0', 'app... \n", "\n", " url country \\\n", "0 https://patents.google.com/patent/WO2007022924B1 WIPO (PCT) \n", "1 https://patents.google.com/patent/WO2003043855B1 WIPO (PCT) \n", "2 https://patents.google.com/patent/AU2020396918A2 Australia \n", "3 https://patents.google.com/patent/PL347539A1 Poland \n", "4 https://patents.google.com/patent/AUPS049302A0 Australia \n", "\n", " publication_description cited_by \\\n", "0 Amended claims [] \n", "1 Amended claims [] \n", "2 Amended post open to public inspection [] \n", "3 Application [] \n", "4 Application filed, as announced in the Gazette... [] \n", "\n", " embedding_v1 \n", "0 [ 5.3550040e-02 -9.3632710e-02 1.4337189e-02 ... \n", "1 [ 0.00484032 -0.02695554 -0.20798226 -0.207528... \n", "2 [-1.49729420e-02 -2.27105440e-01 -2.68012730e-... \n", "3 [ 0.01849568 -0.05340371 -0.19257502 -0.174919... \n", "4 [ 0.00064732 -0.2136009 0.0040593 -0.024562... " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## take a look at the sample dataset\n", "\n", "publications.head(5)" ] }, { "cell_type": "markdown", "metadata": { "id": "Wl2o-NYMoygb" }, "source": [ "Generate the text embeddings" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "executionInfo": { "elapsed": 4528, "status": "ok", "timestamp": 1742192047236, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "li38q8FzDDMu", "outputId": "b8c1bd38-b484-4f71-bd38-927c8677d0c5" }, "outputs": [ { "data": { "text/html": [ "Query job 0e9d9117-4981-4f5c-b785-ed831c08e7aa is DONE. 
0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job fa4f1a54-85d4-4030-992e-fddda5edf3e3 is DONE. 0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from bigframes.ml.llm import TextEmbeddingGenerator\n", "\n", "text_model = TextEmbeddingGenerator(\n", " model_name=\"text-embedding-005\",\n", " # No connection id needed\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 139 }, "executionInfo": { "elapsed": 126632, "status": "ok", "timestamp": 1742192656608, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "b5HHZob_u61B", "outputId": "c9ecc5fd-5d11-4fd8-f59b-9dce4e12e371" }, "outputs": [ { "data": { "text/html": [ "Load job 70377d71-bb13-46af-80c1-71ef16bf2949 is DONE. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job cc3b609d-b6b7-404f-9447-c76d3a52698b is DONE. 9.5 MB processed. 
Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/bigframes/core/array_value.py:109: PreviewWarning: JSON column interpretation as a custom PyArrow extention in\n", "`db_dtypes` is a preview feature and subject to change.\n", " warnings.warn(msg, bfe.PreviewWarning)\n" ] } ], "source": [ "## rename the abstract column to content, the column the embeddings are generated from\n", "publications = publications[[\"publication_number\", \"title\", \"abstract\"]].rename(columns={'abstract': 'content'})\n", "\n", "## generate the embeddings\n", "## takes ~2-3 mins to run\n", "embedding = text_model.predict(publications)[[\"publication_number\", \"title\", \"content\", \"ml_generate_embedding_result\",\"ml_generate_embedding_status\"]]\n", "\n", "## keep only rows where the embedding generation succeeded: the status value is an empty string on success\n", "embedding = embedding[embedding[\"ml_generate_embedding_status\"] == \"\"]\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 464 }, "executionInfo": { "elapsed": 6715, "status": "ok", "timestamp": 1742192727525, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "OIT5FbqAwqG5", "outputId": "d04c994a-a0c8-44b0-e897-d871036eeb1f" }, "outputs": [ { "data": { "text/html": [ "Query job 5b15fc4a-fa9a-4608-825f-be5af9953a38 is DONE. 71.0 MB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
|  | publication_number | title | content | ml_generate_embedding_result | ml_generate_embedding_status |
| 5611 | WO-2014005277-A1 | Resource management in a cloud computing envir... | Technologies and implementations for managing ... | [-2.92946529e-02 -1.24640828e-02 1.27173709e-... |  |
| 6895 | AU-2011325479-B2 | 7-([1,2,3]triazol-4-yl)-pyrrolo[2,3-b]pyrazine... | Compounds of formula I, in which R | [-6.45397678e-02 1.19616119e-02 -9.85191786e-... |  |
| 6 | IL-45347-A | 7h-indolizino(5,6,7-ij)isoquinoline derivative... | Compounds of the formula:\\n[US3946019A] | [-3.82784344e-02 -2.31682733e-02 -4.35006060e-... |  |
| 5923 | WO-2005111625-A3 | Method to predict prostate cancer | A method for predicting the probability or ris... | [ 0.02480386 -0.01648765 0.03873815 -0.025998... |  |
| 6370 | US-7868678-B2 | Configurable differential lines | Embodiments related to configurable differenti... | [ 2.71715336e-02 -1.93733890e-02 2.82729534e-... |  |
\n", "

5 rows × 5 columns

\n", "
[5 rows x 5 columns in total]" ], "text/plain": [ " publication_number title \\\n", "5611 WO-2014005277-A1 Resource management in a cloud computing envir... \n", "6895 AU-2011325479-B2 7-([1,2,3]triazol-4-yl)-pyrrolo[2,3-b]pyrazine... \n", "6 IL-45347-A 7h-indolizino(5,6,7-ij)isoquinoline derivative... \n", "5923 WO-2005111625-A3 Method to predict prostate cancer \n", "6370 US-7868678-B2 Configurable differential lines \n", "\n", " content \\\n", "5611 Technologies and implementations for managing ... \n", "6895 Compounds of formula I, in which R \n", "6 Compounds of the formula:\\n[US3946019A] \n", "5923 A method for predicting the probability or ris... \n", "6370 Embodiments related to configurable differenti... \n", "\n", " ml_generate_embedding_result \\\n", "5611 [-2.92946529e-02 -1.24640828e-02 1.27173709e-... \n", "6895 [-6.45397678e-02 1.19616119e-02 -9.85191786e-... \n", "6 [-3.82784344e-02 -2.31682733e-02 -4.35006060e-... \n", "5923 [ 0.02480386 -0.01648765 0.03873815 -0.025998... \n", "6370 [ 2.71715336e-02 -1.93733890e-02 2.82729534e-... \n", "\n", " ml_generate_embedding_status \n", "5611 \n", "6895 \n", "6 \n", "5923 \n", "6370 \n", "\n", "[5 rows x 5 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embedding.head(5)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 53 }, "executionInfo": { "elapsed": 6590, "status": "ok", "timestamp": 1742192833667, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "GP3ZqX_bxLGq", "outputId": "fb823ea2-e47c-415f-84d4-543dd3291e15" }, "outputs": [ { "data": { "text/html": [ "Query job 06ce090b-e3f9-4252-b847-45c2a296ca61 is DONE. 70.9 MB processed. 
Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'my_dataset.my_embeddings_table'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# store embeddings in a BQ table\n", "DATASET_ID = \"my_dataset\" # @param {type:\"string\"}\n", "TEXT_EMBEDDING_TABLE_ID = \"my_embeddings_table\" # @param {type:\"string\"}\n", "embedding.to_gbq(f\"{DATASET_ID}.{TEXT_EMBEDDING_TABLE_ID}\", if_exists='replace')" ] }, { "cell_type": "markdown", "metadata": { "id": "OUZ3NNbzo1Tb" }, "source": [ "## Step 2: Indexing and Similarity Search" ] }, { "cell_type": "markdown", "metadata": { "id": "mvJH2FCmynMm" }, "source": [ "### [Create a Vector Index](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_create_vector_index) using BigFrames\n", "\n", "\n", "**Index Type**\n", "\n", "The algorithm to use to build the vector index.\n", "The supported values are IVF and TREE_AH." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "executionInfo": { "elapsed": 3882, "status": "ok", "timestamp": 1742193028877, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "6SBVdv6gyU5A", "outputId": "6583e113-de27-4b44-972d-c1cc061e3c76" }, "outputs": [], "source": [ "## create vector index (note: only works on tables with >5000 rows)\n", "\n", "bf_bq.create_vector_index(\n", "    table_id=f\"{DATASET_ID}.{TEXT_EMBEDDING_TABLE_ID}\",\n", "    column_name=\"ml_generate_embedding_result\",\n", "    replace=True,\n", "    index_name=\"bf_python_index\",\n", "    distance_type=\"cosine\",\n", "    index_type=\"ivf\",\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "bo8mBbRLzCOA" }, "source": [ "### Vector Search (semantic search) using Vector Index\n", "\n", "Run an approximate nearest neighbor (ANN) search using the vector index created above." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "executionInfo": { "elapsed": 639, "status": "ok", "timestamp": 1742194606771, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "v19BJm_wzPdZ" }, "outputs": [], "source": [ "## Set variables for vector search\n", "\n", "TEXT_SEARCH_STRING = \"Chip assemblies employing solder bonds to back-side lands including an electrolytic nickel layer\" ## replace with whatever search string you want to use for the vector search\n", "FRACTION_LISTS_TO_SEARCH = 0.01" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 121 }, "executionInfo": { "elapsed": 6927, "status": "ok", "timestamp": 1742194625774, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "pAQY1ejpzPap", "outputId": "485698ad-ac6e-4c93-844e-5d0f30aff13a" }, "outputs": [ { "data": { "text/html": [ "Query job 016ad678-9609-4c78-8f07-3f9887ce67ac is DONE. 0 Bytes processed. 
Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/bigframes/core/array_value.py:109: PreviewWarning: JSON column interpretation as a custom PyArrow extention in\n", "`db_dtypes` is a preview feature and subject to change.\n", " warnings.warn(msg, bfe.PreviewWarning)\n" ] } ], "source": [ "# convert search string to dataframe\n", "TEXT_SEARCH_DF = bf.DataFrame([TEXT_SEARCH_STRING], columns=['search_string'])\n", "\n", "#generate embedding of search query\n", "search_query = bf.DataFrame(text_model.predict(TEXT_SEARCH_DF))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 104 }, "executionInfo": { "elapsed": 5110, "status": "ok", "timestamp": 1742194670801, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "sx0AGAdn5FYX", "outputId": "551ebac3-594f-4303-ca97-5301dfee72bb" }, "outputs": [], "source": [ "## search the base table for the user's query\n", "\n", "vector_search_results = bf_bq.vector_search(\n", " base_table=f\"{DATASET_ID}.{TEXT_EMBEDDING_TABLE_ID}\",\n", " column_to_search=\"ml_generate_embedding_result\",\n", " query=search_query,\n", " distance_type=\"cosine\",\n", " query_column_to_search=\"ml_generate_embedding_result\",\n", " top_k=5,\n", ")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 270 }, "executionInfo": { "elapsed": 3511, "status": "ok", "timestamp": 1742195090670, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "px1v4iJM5L0c", "outputId": "d107b6e3-a362-42db-c0c2-084d02acd244" }, "outputs": [ { "data": { "text/html": [ "Load job b6b88844-9ed7-4c92-8984-556414592f0b is DONE. 
Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job aa95f59c-7229-4e76-bd2c-3a63deea3285 is DONE. 4.7 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
|  | query | publication_number | title (relevant match) | abstract (relevant match) | distance |
| 0 | Chip assemblies employing solder bonds to back... | CN-103515336-A | Chip package, chip arrangement, circuit board ... | A chip package is provided, the chip package i... | 0.287274 |
| 0 | Chip assemblies employing solder bonds to back... | US-9548145-B2 | Microelectronic assembly with multi-layer supp... | A method of forming a microelectronic assembly... | 0.290519 |
| 0 | Chip assemblies employing solder bonds to back... | JP-2012074505-A | Semiconductor mounting device substrate, semic... | To provide a substrate for a semiconductor mou... | 0.294241 |
| 0 | Chip assemblies employing solder bonds to back... | US-2015380164-A1 | Ceramic electronic component | A ceramic electronic component includes an ele... | 0.295716 |
| 0 | Chip assemblies employing solder bonds to back... | US-2012153447-A1 | Microelectronic flip chip packages with solder... | Processes of assembling microelectronic packag... | 0.300337 |
\n", "

5 rows × 5 columns

\n", "
[5 rows x 5 columns in total]" ], "text/plain": [ " query publication_number \\\n", "0 Chip assemblies employing solder bonds to back... CN-103515336-A \n", "0 Chip assemblies employing solder bonds to back... US-9548145-B2 \n", "0 Chip assemblies employing solder bonds to back... JP-2012074505-A \n", "0 Chip assemblies employing solder bonds to back... US-2015380164-A1 \n", "0 Chip assemblies employing solder bonds to back... US-2012153447-A1 \n", "\n", " title (relevant match) \\\n", "0 Chip package, chip arrangement, circuit board ... \n", "0 Microelectronic assembly with multi-layer supp... \n", "0 Semiconductor mounting device substrate, semic... \n", "0 Ceramic electronic component \n", "0 Microelectronic flip chip packages with solder... \n", "\n", " abstract (relevant match) distance \n", "0 A chip package is provided, the chip package i... 0.287274 \n", "0 A method of forming a microelectronic assembly... 0.290519 \n", "0 To provide a substrate for a semiconductor mou... 0.294241 \n", "0 A ceramic electronic component includes an ele... 0.295716 \n", "0 Processes of assembling microelectronic packag... 
0.300337 \n", "\n", "[5 rows x 5 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## View the returned results, ranked by similarity to the user's query\n", "\n", "vector_search_results[\n", " [\n", " 'content',\n", " 'publication_number',\n", " 'title',\n", " 'content_1',\n", " 'distance',\n", " ]\n", "].rename(columns={\n", " 'content': 'query',\n", " 'content_1': 'abstract (relevant match)',\n", " 'title': 'title (relevant match)',\n", "})" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "executionInfo": { "elapsed": 1622, "status": "ok", "timestamp": 1742195139318, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "5fb_O-ne5cvH" }, "outputs": [], "source": [ "## Brute force result (for comparison)\n", "\n", "\n", "brute_force_result = bf_bq.vector_search(\n", " base_table=f\"{DATASET_ID}.{TEXT_EMBEDDING_TABLE_ID}\",\n", " column_to_search=\"ml_generate_embedding_result\",\n", " query=search_query,\n", " top_k=5,\n", " distance_type=\"cosine\",\n", " use_brute_force=True,\n", ")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "21rNsFMHo8hO" }, "source": [ "## Step 3: AI-Powered Summarization with Retrieval Augmented Generation (RAG)" ] }, { "cell_type": "markdown", "metadata": { "id": "K3pIQrzB7T_G" }, "source": [ "Patent documents can be dense and time-consuming to digest. AI-powered patent summarization uses Retrieval Augmented Generation (RAG) to streamline this process. By retrieving relevant patent information through vector search and then synthesizing it with a large language model, we can generate concise, human-readable summaries. The code below walks through the setup, continuing with the same user query as in the previous step."
] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "executionInfo": { "elapsed": 4827, "status": "ok", "timestamp": 1742195565658, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "jb5rueqU7T5J", "outputId": "43732836-ebae-4fb3-b28e-bfea51146c72" }, "outputs": [ { "data": { "text/html": [ "Query job 3fabe659-f95b-49cb-b0c7-9d32b09177bf is DONE. 0 Bytes processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "## gemini model\n", "\n", "llm_model = bf_llm.GeminiTextGenerator(model_name = \"gemini-2.0-flash-001\") ## replace with other model as needed" ] }, { "cell_type": "markdown", "metadata": { "id": "41e12JTf70sr" }, "source": [ "We will reuse the user query from Step 2 and pass the list of abstracts returned by the vector search into the prompt for the RAG application." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "executionInfo": { "elapsed": 1474, "status": "ok", "timestamp": 1742195536109, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "EyP-ZFJK8h-2" }, "outputs": [], "source": [ "TEMPERATURE = 0.4" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 72 }, "executionInfo": { "elapsed": 3371, "status": "ok", "timestamp": 1742195421813, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "eP99R6SV7Tug", "outputId": "c34bc931-5be8-410e-ac1f-604df31ef533" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['{\"abstract\": \"A chip package is provided, the chip package including: a chip carrier; a chip disposed over and electrically connected to a chip carrier top side; an electrically insulating material disposed over and at least partially surrounding the chip; one or more electrically conductive contact regions formed over the electrically insulating 
material and in electrical connection with the chip; and another electrically insulating material disposed over a chip carrier bottom side. An electrically conductive contact region on the chip carrier bottom side is released from the further electrically insulating material.\"}', '{\"abstract\": \"A method of forming a microelectronic assembly includes positioning a support structure adjacent to an active region of a device but not extending onto the active region. The support structure has planar sections. Each planar section has a substantially uniform composition. The composition of at least one of the planar sections differs from the composition of at least one of the other planar sections. A lid is positioned in contact with the support structure and extends over the active region. The support structure is bonded to the device and to the lid.\"}', '{\"abstract\": \"To provide a substrate for a semiconductor mounting device capable of obtaining high reliability. In a semiconductor mounting device substrate of the present invention, a semiconductor chip can be surface-mounted by a flip chip connection method on a semiconductor chip mounting region of a first main surface of a multilayer wiring substrate. A plurality of second main surface side solder bumps 52 forming a plate-like component mounting region 53 are formed at a location immediately below the semiconductor chip 21 on the second main surface 13 of the multilayer wiring board 11. A plate-like component 101 mainly composed of an inorganic material is surface-mounted on the multilayer wiring board 11 by a flip chip connection method via a plurality of second main surface side solder bumps 52. A plurality of second main surface side solder bumps 52 are sealed by a second main surface side underfill 107 provided in the gap S <b> 2 between the second main surface 13 and the plate-like component 101. 
[Selection] Figure 1\"}', '{\"abstract\": \"A ceramic electronic component includes an electronic component body, an inner electrode, and an outer electrode. The outer electrode includes a fired electrode layer and first and second plated layers. The fired electrode layer is disposed on the electronic component body. The first plated layer is disposed on the fired electrode layer. The thickness of the first plated layer is about 3 \\\\u03bcm to about 8 \\\\u03bcm, for example. The first plated layer contains nickel. The second plated layer is disposed on the first plated layer. The thickness of the second plated layer is about 0.025 \\\\u03bcm to about 1 \\\\u03bcm, for example. The second plated layer contains lead.\"}', '{\"abstract\": \"Processes of assembling microelectronic packages with lead frames and/or other suitable substrates are described herein. In one embodiment, a method for fabricating a semiconductor assembly includes forming an attachment area and a non-attachment area on a lead finger of a lead frame. The attachment area is more wettable to the solder ball than the non-attachment area during reflow. 
The method also includes contacting a solder ball carried by a semiconductor die with the attachment area of the lead finger, reflowing the solder ball while the solder ball is in contact with the attachment area of the lead finger, and controllably collapsing the solder ball to establish an electrical connection between the semiconductor die and the lead finger of the lead frame.\"}']\n" ] } ], "source": [ "# Extract strings into a list of JSON strings\n", "json_strings = [json.dumps({'abstract': s}) for s in vector_search_results['content_1']]\n", "ALL_ABSTRACTS = json_strings\n", "\n", "# Print the result (optional)\n", "print(ALL_ABSTRACTS)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "collapsed": true, "executionInfo": { "elapsed": 1620, "status": "ok", "timestamp": 1742195587180, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "kSNSi1GV8OAD", "outputId": "37fbc822-1160-4fbd-c7d6-ecb4a16db394" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "You are an expert patent analyst. I will provide you the abstracts of the top 5 patents in json format retrieved by a vector search based on a user's query.\n", "Your task is to analyze these abstracts and generate a concise, coherent summary that encapsulates the core innovations and concepts shared among them.\n", "\n", "In your output, share the original user query.\n", "Then output the concise, coherent summary that encapsulates the core innovations and concepts shared among the top 5 abstracts. 
The heading for this section should\n", "be : Summary of the top 5 abstracts that are semantically closest to the user query.\n", "\n", "User Query: Chip assemblies employing solder bonds to back-side lands including an electrolytic nickel layer\n", "Top 5 abstracts: ['{\"abstract\": \"A chip package is provided, the chip package including: a chip carrier; a chip disposed over and electrically connected to a chip carrier top side; an electrically insulating material disposed over and at least partially surrounding the chip; one or more electrically conductive contact regions formed over the electrically insulating material and in electrical connection with the chip; and another electrically insulating material disposed over a chip carrier bottom side. An electrically conductive contact region on the chip carrier bottom side is released from the further electrically insulating material.\"}', '{\"abstract\": \"A method of forming a microelectronic assembly includes positioning a support structure adjacent to an active region of a device but not extending onto the active region. The support structure has planar sections. Each planar section has a substantially uniform composition. The composition of at least one of the planar sections differs from the composition of at least one of the other planar sections. A lid is positioned in contact with the support structure and extends over the active region. The support structure is bonded to the device and to the lid.\"}', '{\"abstract\": \"To provide a substrate for a semiconductor mounting device capable of obtaining high reliability. In a semiconductor mounting device substrate of the present invention, a semiconductor chip can be surface-mounted by a flip chip connection method on a semiconductor chip mounting region of a first main surface of a multilayer wiring substrate. 
A plurality of second main surface side solder bumps 52 forming a plate-like component mounting region 53 are formed at a location immediately below the semiconductor chip 21 on the second main surface 13 of the multilayer wiring board 11. A plate-like component 101 mainly composed of an inorganic material is surface-mounted on the multilayer wiring board 11 by a flip chip connection method via a plurality of second main surface side solder bumps 52. A plurality of second main surface side solder bumps 52 are sealed by a second main surface side underfill 107 provided in the gap S <b> 2 between the second main surface 13 and the plate-like component 101. [Selection] Figure 1\"}', '{\"abstract\": \"A ceramic electronic component includes an electronic component body, an inner electrode, and an outer electrode. The outer electrode includes a fired electrode layer and first and second plated layers. The fired electrode layer is disposed on the electronic component body. The first plated layer is disposed on the fired electrode layer. The thickness of the first plated layer is about 3 \\\\u03bcm to about 8 \\\\u03bcm, for example. The first plated layer contains nickel. The second plated layer is disposed on the first plated layer. The thickness of the second plated layer is about 0.025 \\\\u03bcm to about 1 \\\\u03bcm, for example. The second plated layer contains lead.\"}', '{\"abstract\": \"Processes of assembling microelectronic packages with lead frames and/or other suitable substrates are described herein. In one embodiment, a method for fabricating a semiconductor assembly includes forming an attachment area and a non-attachment area on a lead finger of a lead frame. The attachment area is more wettable to the solder ball than the non-attachment area during reflow. 
The method also includes contacting a solder ball carried by a semiconductor die with the attachment area of the lead finger, reflowing the solder ball while the solder ball is in contact with the attachment area of the lead finger, and controllably collapsing the solder ball to establish an electrical connection between the semiconductor die and the lead finger of the lead frame.\"}']\n", "\n", "Instructions:\n", "\n", "Focus on identifying the common themes and key technological advancements described in the abstracts.\n", "Synthesize the information into a clear and concise summary, approximately 150-200 words.\n", "Avoid simply copying phrases from the abstracts. Instead, aim to provide a cohesive overview of the shared concepts.\n", "Highlight the potential applications and benefits of the described inventions.\n", "Maintain a professional and objective tone.\n", "Do not mention the individual patents by number, focus on summarizing the shared concepts.\n", "\n" ] } ], "source": [ "## Setup the LLM prompt\n", "\n", "prompt = f\"\"\"\n", "You are an expert patent analyst. I will provide you the abstracts of the top 5 patents in json format retrieved by a vector search based on a user's query.\n", "Your task is to analyze these abstracts and generate a concise, coherent summary that encapsulates the core innovations and concepts shared among them.\n", "\n", "In your output, share the original user query.\n", "Then output the concise, coherent summary that encapsulates the core innovations and concepts shared among the top 5 abstracts. 
The heading for this section should\n", "be : Summary of the top 5 abstracts that are semantically closest to the user query.\n", "\n", "User Query: {TEXT_SEARCH_STRING}\n", "Top 5 abstracts: {ALL_ABSTRACTS}\n", "\n", "Instructions:\n", "\n", "Focus on identifying the common themes and key technological advancements described in the abstracts.\n", "Synthesize the information into a clear and concise summary, approximately 150-200 words.\n", "Avoid simply copying phrases from the abstracts. Instead, aim to provide a cohesive overview of the shared concepts.\n", "Highlight the potential applications and benefits of the described inventions.\n", "Maintain a professional and objective tone.\n", "Do not mention the individual patents by number, focus on summarizing the shared concepts.\n", "\"\"\"\n", "\n", "print(prompt)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "executionInfo": { "elapsed": 1, "status": "ok", "timestamp": 1742195567707, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "njiQdfkT8Y7V" }, "outputs": [], "source": [ "## Define a function that takes the input prompt and runs the LLM\n", "\n", "def predict(prompt: str, temperature: float = TEMPERATURE) -> str:\n", " # Wrap the prompt in a single-row dataframe (named input_df to avoid shadowing the built-in input)\n", " input_df = bf.DataFrame(\n", " {\n", " \"prompt\": [prompt],\n", " }\n", " )\n", "\n", " # Return the generated text from the first (only) row\n", " return llm_model.predict(input_df, temperature=temperature).ml_generate_text_llm_result.iloc[0]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 426 }, "executionInfo": { "elapsed": 14425, "status": "ok", "timestamp": 1742195608280, "user": { "displayName": "", "userId": "" }, "user_tz": -480 }, "id": "OYYkVYbs8Y0P", "outputId": "def839e3-3dee-4320-9cb5-cac855ddea6b" }, "outputs": [ { "data": { "text/html": [ "Load job 34f3b649-6e45-46db-a6e5-405ae0a8bf69 is DONE. 
Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Query job a574725f-64ae-4a19-aac0-959bec0bffeb is DONE. 5.0 kB processed. Open Job" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/google/home/swast/src/github.com/googleapis/python-bigquery-dataframes-2/bigframes/core/array_value.py:109: PreviewWarning: JSON column interpretation as a custom PyArrow extention in\n", "`db_dtypes` is a preview feature and subject to change.\n", " warnings.warn(msg, bfe.PreviewWarning)\n" ] }, { "data": { "text/markdown": [ "User Query: Chip assemblies employing solder bonds to back-side lands including an electrolytic nickel layer\n", "\n", "Summary of the top 5 abstracts that are semantically closest to the user query:\n", "\n", "The abstracts describe various aspects of microelectronic assembly and packaging, with a focus on enhancing reliability and electrical connectivity. A common theme is the use of solder bumps or balls for creating electrical connections between different components, such as semiconductor chips and substrates or lead frames. Several abstracts highlight methods for improving the solderability and wettability of contact regions, often involving the use of multiple layers with differing compositions. The use of electrically insulating materials to provide support and protection to the chip and electrical connections is also described. One abstract specifically mentions a nickel-containing plated layer as part of an outer electrode, suggesting its role in improving the electrical or mechanical properties of the connection. 
The innovations aim to improve the reliability and performance of microelectronic devices through optimized material selection, assembly processes, and structural designs.\n" ], "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Invoke LLM with prompt\n", "response = predict(prompt, temperature = TEMPERATURE)\n", "\n", "# Print results as Markdown\n", "Markdown(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "sy82XLDfooEb" }, "source": [ "# Summary and next steps\n", "\n", "Ready to go further? Start building your own vector search applications with BigFrames and BigQuery. Check out our [documentation](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_vector_search) and explore our sample [notebooks](https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks).\n", "The BigFrames team would also love to hear from you. To reach out, email bigframes-feedback@google.com or file an issue at the [open source BigFrames repository](https://github.com/googleapis/python-bigquery-dataframes/issues). To receive updates about BigFrames, subscribe to the BigFrames email list." ] } ], "metadata": { "colab": { "name": "bq_dataframes_llm_kmeans", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 0 }