Use BigQuery DataFrames to run an Anthropic LLM at scale#
Run in Colab
Overview#
Anthropic Claude models are available as APIs on Vertex AI (docs).
To run the Claude models over large-scale data, we can use BigQuery
DataFrames remote functions (docs).
BigQuery DataFrames provides a simple Pythonic interface, remote_function, to
deploy user code as a BigQuery remote function and then invoke it at scale
by utilizing the parallel, distributed computing architecture of BigQuery and
Google Cloud Functions.
In this notebook we showcase one such example. For demonstration purposes we use a small amount of data, but the example generalizes to large data. Check out the various I/O APIs provided by BigQuery DataFrames here to see how you can create a DataFrame from your big data sitting in a BigQuery table or a GCS bucket.
Set Up#
Set up a Claude model in Vertex AI#
https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-claude#before_you_begin
Install Anthropic with Vertex if needed#
Uncomment and run the following cell to install the anthropic Python package with the vertex extension if you don’t already have it.
# !pip install anthropic[vertex] --quiet
Define project and location for GCP integration#
PROJECT = "bigframes-dev" # replace with your project
LOCATION = "us-east5"
Initialize BigQuery DataFrames#
BigQuery DataFrames is a set of open source Python libraries that let you take advantage of BigQuery data processing by using familiar Python APIs. For more details, see https://cloud.google.com/bigquery/docs/bigquery-dataframes-introduction.
# Import BigQuery DataFrames pandas module and initialize it with your project
# and location
import bigframes.pandas as bpd
bpd.options.bigquery.project = PROJECT
bpd.options.bigquery.location = LOCATION
Let’s use a DataFrame with a small amount of inline data for demo purposes.
You can also create a DataFrame from your own data. See APIs like read_gbq,
read_csv, read_json, etc. at https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.pandas.
df = bpd.DataFrame({"questions": [
"What is the capital of France?",
"Explain the concept of photosynthesis in simple terms.",
"Write a haiku about artificial intelligence."
]})
df
|   | questions |
|---|---|
| 0 | What is the capital of France? |
| 1 | Explain the concept of photosynthesis in simpl... |
| 2 | Write a haiku about artificial intelligence. |
3 rows × 1 columns
Use BigQuery DataFrames remote_function#
Let’s create a remote function from a custom Python function that takes a prompt
and returns the output of the Claude LLM running in Vertex AI. We will be using
max_batching_rows=1 to control parallelization. This ensures that a single
prompt is processed per batch in the underlying cloud function so that the batch
processing does not time out. An ideal value for max_batching_rows depends on
the complexity of the prompts in the real use case and should be discovered
through offline experimentation. Check out the API for other ways to control
parallelization https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.pandas#bigframes_pandas_remote_function.
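To build intuition for what `max_batching_rows` controls, here is a minimal pure-Python sketch (not bigframes code; the actual batching is handled by BigQuery and Cloud Functions) of how input rows are grouped into per-invocation batches:

```python
def batch_rows(rows, max_batching_rows):
    """Split rows into batches of at most max_batching_rows each."""
    return [rows[i:i + max_batching_rows]
            for i in range(0, len(rows), max_batching_rows)]

prompts = ["prompt A", "prompt B", "prompt C"]

# With max_batching_rows=1, each cloud function invocation receives
# exactly one prompt, so a slow LLM call for one row cannot time out
# a batch containing other rows.
print(batch_rows(prompts, 1))  # [['prompt A'], ['prompt B'], ['prompt C']]
print(batch_rows(prompts, 2))  # [['prompt A', 'prompt B'], ['prompt C']]
```

Larger batch sizes amortize invocation overhead but increase the risk of hitting the cloud function timeout when individual prompts are slow.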
@bpd.remote_function(packages=["anthropic[vertex]", "google-auth[requests]"],
max_batching_rows=1,
bigquery_connection="bigframes-dev.us-east5.bigframes-rf-conn", # replace with your connection
cloud_function_service_account="default",
)
def anthropic_transformer(message: str) -> str:
    from anthropic import AnthropicVertex
    client = AnthropicVertex(region=LOCATION, project_id=PROJECT)
    # Keep the API response separate from the input parameter.
    response = client.messages.create(
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": message,
            }
        ],
        model="claude-3-haiku@20240307",
    )
    content_text = response.content[0].text if response.content else ""
    return content_text
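The only response handling the function needs is extracting the text of the first content block. A local sanity check of that parsing logic, using a hypothetical stub in place of the `AnthropicVertex` client (so it runs without GCP credentials), might look like:

```python
from types import SimpleNamespace

# Hypothetical stub mimicking the shape of an Anthropic messages response.
def fake_create(**kwargs):
    return SimpleNamespace(
        content=[SimpleNamespace(text="The capital of France is Paris.")]
    )

message = fake_create(
    max_tokens=1024,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)

# Same extraction logic as in anthropic_transformer.
content_text = message.content[0].text if message.content else ""
print(content_text)  # The capital of France is Paris.

# An empty content list falls back to an empty string.
empty = SimpleNamespace(content=[])
assert (empty.content[0].text if empty.content else "") == ""
```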
# Print the BigQuery remote function created
anthropic_transformer.bigframes_remote_function
'bigframes-dev._e9a5162ae4daa9f50fda3f95febaa9781131f3b8.bigframes_sessionc10c73_49262141176cbf70037559ae84e834d3'
# Print the cloud function created
anthropic_transformer.bigframes_cloud_function
'projects/bigframes-dev/locations/us-east5/functions/bigframes-sessionc10c73-49262141176cbf70037559ae84e834d3'
# Apply the remote function on the user data
df["answers"] = df["questions"].apply(anthropic_transformer)
df
SQL
SELECT
`bfuid_col_3` AS `bfuid_col_3`,
`bfuid_col_4` AS `bfuid_col_4`,
`bfuid_col_5` AS `bfuid_col_5`
FROM
(SELECT
`t1`.`bfuid_col_3`,
`t1`.`bfuid_col_4`,
`t1`.`bfuid_col_5`,
`t1`.`bfuid_col_6` AS `bfuid_col_7`
FROM (
SELECT
`t0`.`level_0`,
`t0`.`column_0`,
`t0`.`bfuid_col_6`,
`t0`.`level_0` AS `bfuid_col_3`,
`t0`.`column_0` AS `bfuid_col_4`,
`bigframes-dev._e9a5162ae4daa9f50fda3f95febaa9781131f3b8.bigframes_sessionc10c73_49262141176cbf70037559ae84e834d3`(`t0`.`column_0`) AS `bfuid_col_5`
FROM (
SELECT
*
FROM UNNEST(ARRAY<STRUCT<`level_0` INT64, `column_0` STRING, `bfuid_col_6` INT64>>[STRUCT(0, 'What is the capital of France?', 0), STRUCT(1, 'Explain the concept of photosynthesis in simple terms.', 1), STRUCT(2, 'Write a haiku about artificial intelligence.', 2)]) AS `level_0`
) AS `t0`
) AS `t1`)
ORDER BY `bfuid_col_7` ASC NULLS LAST
LIMIT 10
|   | questions | answers |
|---|---|---|
| 0 | What is the capital of France? | The capital of France is Paris. |
| 1 | Explain the concept of photosynthesis in simpl... | Photosynthesis is the process by which plants ... |
| 2 | Write a haiku about artificial intelligence. | Here is a haiku about artificial intelligence:... |
3 rows × 2 columns
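Conceptually, `Series.apply` with a remote function behaves like a row-wise map, except the work runs inside BigQuery rather than locally. A plain-Python analogy, with a placeholder standing in for the deployed LLM call:

```python
questions = [
    "What is the capital of France?",
    "Explain the concept of photosynthesis in simple terms.",
    "Write a haiku about artificial intelligence.",
]

# Placeholder standing in for the remote LLM call; in the real pipeline
# BigQuery invokes the deployed remote function for each row in parallel.
def placeholder_transformer(prompt: str) -> str:
    return f"answer to: {prompt}"

answers = [placeholder_transformer(q) for q in questions]
print(answers[0])  # answer to: What is the capital of France?
```

Because the map is row-wise and stateless, BigQuery is free to distribute the rows across many cloud function invocations, which is what makes the pattern scale to large tables.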
Clean Up#
bpd.close_session()