bigframes.bigquery.vector_search
- bigframes.bigquery.vector_search(base_table: str, column_to_search: str, query: dataframe.DataFrame | series.Series, *, query_column_to_search: str | None = None, top_k: int | None = None, distance_type: Literal['euclidean', 'cosine', 'dot_product'] | None = None, fraction_lists_to_search: float | None = None, use_brute_force: bool | None = None, allow_large_results: bool | None = None) -> dataframe.DataFrame
Conduct a vector search, which searches embeddings to find semantically similar entities.
This method calls the VECTOR_SEARCH() SQL function.
Examples:
>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
DataFrame embeddings for which to find nearest neighbors. The ARRAY<FLOAT64> column is used as the search query:

>>> search_query = bpd.DataFrame({"query_id": ["dog", "cat"],
...                              "embedding": [[1.0, 2.0], [3.0, 5.2]]})
>>> bbq.vector_search(
...     base_table="bigframes-dev.bigframes_tests_sys.base_table",
...     column_to_search="my_embedding",
...     query=search_query,
...     top_k=2).sort_values("id")
  query_id  embedding  id my_embedding  distance
0      dog    [1. 2.]   1      [1. 2.]       0.0
1      cat   [3. 5.2]   2      [2. 4.]   1.56205
0      dog    [1. 2.]   4     [1. 3.2]       1.2
1      cat   [3. 5.2]   5     [5. 5.4]  2.009975

[4 rows x 5 columns]
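As a rough mental model of what the example above returns, the following pure-Python sketch ranks a handful of base rows by Euclidean distance to each query vector and keeps the closest `top_k`. It is an illustration only, not how BigQuery executes `VECTOR_SEARCH()`. The `BASE_ROWS` dictionary is an assumption reconstructed from the `id` and `my_embedding` values shown in the example outputs on this page, and `brute_force_top_k` is a hypothetical helper name.

```python
import math

# Assumed base rows, inferred from the example outputs on this page.
# This is not a dump of the real base table.
BASE_ROWS = {
    1: [1.0, 2.0],
    2: [2.0, 4.0],
    3: [1.5, 7.0],
    4: [1.0, 3.2],
    5: [5.0, 5.4],
}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_top_k(query_vec, base_rows, top_k):
    """Hypothetical helper: return the top_k (id, distance) pairs nearest to query_vec."""
    scored = [(row_id, euclidean(query_vec, vec)) for row_id, vec in base_rows.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:top_k]


queries = {"dog": [1.0, 2.0], "cat": [3.0, 5.2]}
results = {
    name: brute_force_top_k(vec, BASE_ROWS, top_k=2)
    for name, vec in queries.items()
}

for name, neighbors in results.items():
    for row_id, distance in neighbors:
        print(name, row_id, round(distance, 6))
```

Under these assumptions, each query keeps the same two neighbor ids that appear in the example output, with matching distances.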
Series embeddings for which to find nearest neighbors:
>>> search_query = bpd.Series([[1.0, 2.0], [3.0, 5.2]],
...                           index=["dog", "cat"],
...                           name="embedding")
>>> bbq.vector_search(
...     base_table="bigframes-dev.bigframes_tests_sys.base_table",
...     column_to_search="my_embedding",
...     query=search_query,
...     top_k=2,
...     use_brute_force=True).sort_values("id")
     embedding  id my_embedding  distance
dog    [1. 2.]   1      [1. 2.]       0.0
cat   [3. 5.2]   2      [2. 4.]   1.56205
dog    [1. 2.]   4     [1. 3.2]       1.2
cat   [3. 5.2]   5     [5. 5.4]  2.009975

[4 rows x 4 columns]
You can specify the distance type and the name of the query DataFrame column that holds the embeddings. If you specify query_column_to_search, that column supplies the embeddings for which to find nearest neighbors. Otherwise, the column named by column_to_search is used.
>>> search_query = bpd.DataFrame({"query_id": ["dog", "cat"],
...                              "embedding": [[1.0, 2.0], [3.0, 5.2]],
...                              "another_embedding": [[0.7, 2.2], [3.3, 5.2]]})
>>> bbq.vector_search(
...     base_table="bigframes-dev.bigframes_tests_sys.base_table",
...     column_to_search="my_embedding",
...     query=search_query,
...     distance_type="cosine",
...     query_column_to_search="another_embedding",
...     top_k=2).sort_values("id")
  query_id  embedding another_embedding  id my_embedding  distance
1      cat   [3. 5.2]         [3.3 5.2]   1      [1. 2.]  0.005181
1      cat   [3. 5.2]         [3.3 5.2]   2      [2. 4.]  0.005181
0      dog    [1. 2.]         [0.7 2.2]   3     [1.5 7. ]  0.004697
0      dog    [1. 2.]         [0.7 2.2]   4     [1. 3.2]  0.000013

[4 rows x 6 columns]
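The `distance` values in the cosine example can be checked by hand. The sketch below assumes that cosine distance is defined as one minus the cosine similarity, which is consistent with the numbers shown but is not stated on this page. `cosine_distance` is a hypothetical helper, and the vectors are copied from the `another_embedding` and `my_embedding` columns of the output.

```python
import math


def cosine_distance(a, b):
    """Assumed definition: 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Query vectors come from another_embedding, base vectors from my_embedding.
d_cat_1 = cosine_distance([3.3, 5.2], [1.0, 2.0])  # cat vs. id 1
d_cat_2 = cosine_distance([3.3, 5.2], [2.0, 4.0])  # cat vs. id 2
d_dog_3 = cosine_distance([0.7, 2.2], [1.5, 7.0])  # dog vs. id 3
d_dog_4 = cosine_distance([0.7, 2.2], [1.0, 3.2])  # dog vs. id 4

print(round(d_cat_1, 6), round(d_cat_2, 6), round(d_dog_3, 6), round(d_dog_4, 6))
```

Rows 1 and 2 of the base table point in the same direction (`[2, 4]` is `[1, 2]` scaled by two), which is why they share the same cosine distance to the "cat" query even though their Euclidean distances differ.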
- Parameters:
base_table (str) – The table to search for nearest neighbor embeddings.
column_to_search (str) – The name of the base table column to search for nearest neighbor embeddings. The column must have a type of ARRAY<FLOAT64>. All elements in the array must be non-NULL.
query (bigframes.dataframe.DataFrame | bigframes.series.Series) – A Series or DataFrame that provides the embeddings for which to find nearest neighbors.
query_column_to_search (str) – Specifies the name of the column in the query that contains the embeddings for which to find nearest neighbors. The column must have a type of ARRAY<FLOAT64>. All elements in the array must be non-NULL, and all values in the column must have the same array dimensions as the values in the column_to_search column. Can only be set when query is a DataFrame.
top_k (int) – Specifies the number of nearest neighbors to return. Defaults to 10.
distance_type (str, default "euclidean") – Specifies the type of metric to use to compute the distance between two vectors. Possible values are "euclidean", "cosine" and "dot_product". Defaults to "euclidean".
fraction_lists_to_search (float, range in [0.0, 1.0]) – Specifies the fraction of lists to search. A higher value leads to higher recall and slower performance; a lower value does the opposite. It is only used when a vector index is also used. You can only specify fraction_lists_to_search when use_brute_force is set to False.
use_brute_force (bool) – Determines whether to use brute force search by skipping the vector index if one is available. Defaults to False.
allow_large_results (bool, optional) – Whether to allow large query results. If True, the query results can be larger than the maximum response size. Defaults to bpd.options.compute.allow_large_results.
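The constraints described in the parameter list can be summarized as a small pre-flight check. This is a sketch under stated assumptions, not the library's own validation: `check_vector_search_args` and its `query_is_dataframe` flag are hypothetical names introduced here so that the example runs without a BigQuery session.

```python
ALLOWED_DISTANCE_TYPES = {"euclidean", "cosine", "dot_product"}


def check_vector_search_args(
    *,
    query_is_dataframe=True,
    query_column_to_search=None,
    distance_type=None,
    fraction_lists_to_search=None,
    use_brute_force=None,
):
    """Hypothetical helper (not part of bigframes) mirroring the documented rules."""
    # query_column_to_search can only be set when query is a DataFrame.
    if query_column_to_search is not None and not query_is_dataframe:
        raise ValueError("query_column_to_search requires a DataFrame query")

    # distance_type must be one of the documented values when given.
    if distance_type is not None and distance_type not in ALLOWED_DISTANCE_TYPES:
        raise ValueError(f"unsupported distance_type: {distance_type!r}")

    if fraction_lists_to_search is not None:
        # The documented range is [0.0, 1.0].
        if not 0.0 <= fraction_lists_to_search <= 1.0:
            raise ValueError("fraction_lists_to_search must be in [0.0, 1.0]")
        # It cannot be combined with brute force search.
        if use_brute_force:
            raise ValueError(
                "fraction_lists_to_search requires use_brute_force to be False"
            )


# Passes silently: a valid combination of arguments.
check_vector_search_args(distance_type="cosine", fraction_lists_to_search=0.5)
```

Running a check like this before calling `bbq.vector_search` can surface a bad argument combination without issuing a query; the real function may raise different errors or enforce additional rules.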
- Returns:
A DataFrame containing the vector search results.
- Return type: