Use BigQuery DataFrames to run an Anthropic LLM at scale#
Run in Colab
Overview#
Anthropic Claude models are available as APIs on Vertex AI (docs).
To run the Claude models over large-scale data, we can use BigQuery
DataFrames remote functions (docs).
BigQuery DataFrames provides a simple Pythonic interface, remote_function, to
deploy user code as a BigQuery remote function and then invoke it at scale
by utilizing the parallel, distributed computing architecture of BigQuery and
Google Cloud Functions.
In this notebook we showcase one such example. For demonstration purposes we use a small amount of data, but the example generalizes to large data. Check out the various I/O APIs provided by BigQuery DataFrames here to see how you can create a DataFrame from your big data sitting in a BigQuery table or a GCS bucket.
Set Up#
Set up a Claude model in Vertex AI#
https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-claude#before_you_begin
Install Anthropic with Vertex if needed#
Uncomment and run the following cell to install the anthropic Python package with the vertex extension if you don’t already have it.
# !pip install anthropic[vertex] --quiet
Define project and location for GCP integration#
PROJECT = "bigframes-dev" # replace with your project
LOCATION = "us-east5"
Initialize BigQuery DataFrames#
BigQuery DataFrames is a set of open source Python libraries that let you take advantage of BigQuery data processing by using familiar Python APIs. For more details, see https://cloud.google.com/bigquery/docs/bigquery-dataframes-introduction.
# Import BigQuery DataFrames pandas module and initialize it with your project
# and location
import bigframes.pandas as bpd
bpd.options.bigquery.project = PROJECT
bpd.options.bigquery.location = LOCATION
Let’s use a DataFrame with a small amount of inline data for demo purposes.
You can also create a DataFrame from your own data. See APIs like read_gbq,
read_csv, read_json, etc. at https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.pandas.
df = bpd.DataFrame({"questions": [
"What is the capital of France?",
"Explain the concept of photosynthesis in simple terms.",
"Write a haiku about artificial intelligence."
]})
df
|   | questions |
|---|---|
| 0 | What is the capital of France? |
| 1 | Explain the concept of photosynthesis in simpl... |
| 2 | Write a haiku about artificial intelligence. |
3 rows × 1 columns
Use BigQuery DataFrames remote_function#
Let’s create a remote function from a custom Python function that takes a prompt
and returns the output of the Claude LLM running in Vertex AI. We will be using
max_batching_rows=1 to control parallelization. This ensures that a single
prompt is processed per batch in the underlying cloud function so that the batch
processing does not time out. An ideal value for max_batching_rows depends on
the complexity of the prompts in the real use case and should be discovered
through offline experimentation. Check out the API for other ways to control
parallelization https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.pandas#bigframes_pandas_remote_function.
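To build intuition for what `max_batching_rows` controls, here is a minimal pure-Python sketch (not bigframes code; the actual batching is handled by BigQuery and Cloud Functions) of how input rows are grouped into per-invocation batches:

```python
def batch_rows(rows, max_batching_rows):
    """Split rows into batches of at most max_batching_rows each."""
    return [rows[i:i + max_batching_rows]
            for i in range(0, len(rows), max_batching_rows)]

prompts = ["prompt A", "prompt B", "prompt C"]

# With max_batching_rows=1, each cloud function invocation receives
# exactly one prompt, so a slow LLM call for one row cannot time out
# a batch containing other rows.
print(batch_rows(prompts, 1))  # [['prompt A'], ['prompt B'], ['prompt C']]
print(batch_rows(prompts, 2))  # [['prompt A', 'prompt B'], ['prompt C']]
```

Larger batch sizes amortize invocation overhead but increase the risk of hitting the cloud function timeout when individual prompts are slow.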
@bpd.remote_function(packages=["anthropic[vertex]", "google-auth[requests]"],
max_batching_rows=1,
bigquery_connection="bigframes-dev.us-east5.bigframes-rf-conn", # replace with your connection
cloud_function_service_account="default",
)
def anthropic_transformer(message: str) -> str:
    from anthropic import AnthropicVertex
    client = AnthropicVertex(region=LOCATION, project_id=PROJECT)
    # Keep the API response separate from the input parameter.
    response = client.messages.create(
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": message,
            }
        ],
        model="claude-3-haiku@20240307",
    )
    content_text = response.content[0].text if response.content else ""
    return content_text
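The only response handling the function needs is extracting the text of the first content block. A local sanity check of that parsing logic, using a hypothetical stub in place of the `AnthropicVertex` client (so it runs without GCP credentials), might look like:

```python
from types import SimpleNamespace

# Hypothetical stub mimicking the shape of an Anthropic messages response.
def fake_create(**kwargs):
    return SimpleNamespace(
        content=[SimpleNamespace(text="The capital of France is Paris.")]
    )

message = fake_create(
    max_tokens=1024,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)

# Same extraction logic as in anthropic_transformer.
content_text = message.content[0].text if message.content else ""
print(content_text)  # The capital of France is Paris.

# An empty content list falls back to an empty string.
empty = SimpleNamespace(content=[])
assert (empty.content[0].text if empty.content else "") == ""
```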
# Print the BigQuery remote function created
anthropic_transformer.bigframes_remote_function
'bigframes-dev._e9a5162ae4daa9f50fda3f95febaa9781131f3b8.bigframes_sessionc10c73_49262141176cbf70037559ae84e834d3'
# Print the cloud function created
anthropic_transformer.bigframes_cloud_function
'projects/bigframes-dev/locations/us-east5/functions/bigframes-sessionc10c73-49262141176cbf70037559ae84e834d3'
# Apply the remote function on the user data
df["answers"] = df["questions"].apply(anthropic_transformer)
df
SQL
SELECT
`bfuid_col_3` AS `bfuid_col_3`,
`bfuid_col_4` AS `bfuid_col_4`,
`bfuid_col_5` AS `bfuid_col_5`
FROM
(SELECT
`t1`.`bfuid_col_3`,
`t1`.`bfuid_col_4`,
`t1`.`bfuid_col_5`,
`t1`.`bfuid_col_6` AS `bfuid_col_7`
FROM (
SELECT
`t0`.`level_0`,
`t0`.`column_0`,
`t0`.`bfuid_col_6`,
`t0`.`level_0` AS `bfuid_col_3`,
`t0`.`column_0` AS `bfuid_col_4`,
`bigframes-dev._e9a5162ae4daa9f50fda3f95febaa9781131f3b8.bigframes_sessionc10c73_49262141176cbf70037559ae84e834d3`(`t0`.`column_0`) AS `bfuid_col_5`
FROM (
SELECT
*
FROM UNNEST(ARRAY<STRUCT<`level_0` INT64, `column_0` STRING, `bfuid_col_6` INT64>>[STRUCT(0, 'What is the capital of France?', 0), STRUCT(1, 'Explain the concept of photosynthesis in simple terms.', 1), STRUCT(2, 'Write a haiku about artificial intelligence.', 2)]) AS `level_0`
) AS `t0`
) AS `t1`)
ORDER BY `bfuid_col_7` ASC NULLS LAST
LIMIT 10
|   | questions | answers |
|---|---|---|
| 0 | What is the capital of France? | The capital of France is Paris. |
| 1 | Explain the concept of photosynthesis in simpl... | Photosynthesis is the process by which plants ... |
| 2 | Write a haiku about artificial intelligence. | Here is a haiku about artificial intelligence:... |
3 rows × 2 columns
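Conceptually, `Series.apply` with a remote function behaves like a row-wise map, except the work runs inside BigQuery rather than locally. A plain-Python analogy, with a placeholder standing in for the deployed LLM call:

```python
questions = [
    "What is the capital of France?",
    "Explain the concept of photosynthesis in simple terms.",
    "Write a haiku about artificial intelligence.",
]

# Placeholder standing in for the remote LLM call; in the real pipeline
# BigQuery invokes the deployed remote function for each row in parallel.
def placeholder_transformer(prompt: str) -> str:
    return f"answer to: {prompt}"

answers = [placeholder_transformer(q) for q in questions]
print(answers[0])  # answer to: What is the capital of France?
```

Because the map is row-wise and stateless, BigQuery is free to distribute the rows across many cloud function invocations, which is what makes the pattern scale to large tables.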
Clean Up#
bpd.close_session()