bigframes.pandas.udf#

bigframes.pandas.udf(*, input_types: None | type | Sequence[type] = None, output_type: type | None = None, dataset: str, bigquery_connection: str | None = None, name: str, packages: Sequence[str] | None = None, max_batching_rows: int | None = None, container_cpu: float | None = None, container_memory: str | None = None)[source]#

Decorator to turn a Python user defined function (udf) into a [BigQuery managed user-defined function](https://cloud.google.com/bigquery/docs/user-defined-functions-python).

Note

This feature is in preview. The code in the udf must be (1) self-contained, i.e. it must not reference any import or variable defined outside the function body, and (2) compatible with Python 3.11, as that is the environment in which the code is executed in the cloud.

Note

Ensure you have the BigQuery Data Editor (roles/bigquery.dataEditor) IAM role.

Examples:

>>> import datetime

Turning an arbitrary python function into a BigQuery managed python udf:

>>> bq_name = datetime.datetime.now().strftime("bigframes_%Y%m%d%H%M%S%f")
>>> @bpd.udf(dataset="bigframes_testing", name=bq_name)
... def minutes_to_hours(x: int) -> float:
...     return x/60
>>> minutes = bpd.Series([0, 30, 60, 90, 120])
>>> minutes
0      0
1     30
2     60
3     90
4    120
dtype: Int64
>>> hours = minutes.apply(minutes_to_hours)
>>> hours
0    0.0
1    0.5
2    1.0
3    1.5
4    2.0
dtype: Float64
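Because the decorated function's body is plain Python, the same conversion logic can be sanity-checked locally before deploying anything to BigQuery. This is a hypothetical local check (the helper name `minutes_to_hours_local` is not part of the API), not a substitute for running the udf in BigQuery:

```python
# Local sketch of the same conversion logic as minutes_to_hours above,
# verified without any BigQuery deployment.
def minutes_to_hours_local(x: int) -> float:
    return x / 60

# Mirrors the Series output shown above: 0 -> 0.0, 30 -> 0.5, ..., 120 -> 2.0
results = [minutes_to_hours_local(m) for m in [0, 30, 60, 90, 120]]
print(results)  # [0.0, 0.5, 1.0, 1.5, 2.0]
```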

To turn a user defined function with external package dependencies into a BigQuery managed python udf, provide the package names (optionally pinned to specific versions) via the packages parameter.

>>> bq_name = datetime.datetime.now().strftime("bigframes_%Y%m%d%H%M%S%f")
>>> @bpd.udf(
...     dataset="bigframes_testing",
...     name=bq_name,
...     packages=["cryptography"]
... )
... def get_hash(input: str) -> str:
...     from cryptography.fernet import Fernet
...
...     # handle missing value
...     if input is None:
...         input = ""
...
...     key = Fernet.generate_key()
...     f = Fernet(key)
...     return f.encrypt(input.encode()).decode()
>>> names = bpd.Series(["Alice", "Bob"])
>>> hashes = names.apply(get_hash)

You can clean up the BigQuery functions created above using the BigQuery client from the BigQuery DataFrames session:

>>> session = bpd.get_global_session()
>>> session.bqclient.delete_routine(minutes_to_hours.bigframes_bigquery_function)
>>> session.bqclient.delete_routine(get_hash.bigframes_bigquery_function)
Parameters:
  • input_types (type or sequence(type), Optional) – For a scalar user defined function, the input type or sequence of input types. The supported scalar input types are bool, bytes, float, int and str.

  • output_type (type, Optional) – Data type of the user defined function's output. If the function returns an array, specify list[type]. The supported output types are bool, bytes, float, int, str, list[bool], list[float], list[int] and list[str].

  • dataset (str) – Dataset in which to create a BigQuery managed function. It should be in <project_id>.<dataset_name> or <dataset_name> format.

  • bigquery_connection (str, Optional) – Name of the BigQuery connection. It provides an identity to the serverless instances running the user code and helps BigQuery manage and track the resources used by the udf. This connection is required for internet access and for interacting with other GCP services; to access those services, the appropriate IAM permissions must also be granted to the connection's Service Account. If left as None (the default), the udf is created without a connection and has no internet access and no access to other GCP services.

  • name (str) – Explicit name of the persisted BigQuery managed function. Use it with caution: multiple users working in the same project and dataset could overwrite each other's managed functions if they use the same persistent name. Note that session-specific clean up (bigframes.session.Session.close, bigframes.pandas.close_session, bigframes.pandas.reset_session, bigframes.pandas.clean_up_by_session_id) does not delete this function; it is left for the user to manage directly.

  • packages (Sequence[str], Optional) – Explicit names of the external package dependencies. Each dependency is added to the requirements.txt as is, and may take any form supported in https://pip.pypa.io/en/stable/reference/requirements-file-format/.

  • max_batching_rows (int, Optional) – The maximum number of rows in each batch. If you specify max_batching_rows, BigQuery determines the number of rows in a batch, up to the max_batching_rows limit. If max_batching_rows is not specified, the number of rows to batch is determined automatically.

  • container_cpu (float, Optional) – The CPU limits for containers that run Python UDFs. By default, the CPU allocated is 0.33 vCPU. See details at https://cloud.google.com/bigquery/docs/user-defined-functions-python#configure-container-limits.

  • container_memory (str, Optional) – The memory limits for containers that run Python UDFs. By default, the memory allocated to each container instance is 512 MiB. See details at https://cloud.google.com/bigquery/docs/user-defined-functions-python#configure-container-limits.
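Since each packages entry is copied into requirements.txt verbatim, any pip requirement specifier is accepted. For example (the version pins below are illustrative, not taken from this document):

```text
cryptography
cryptography==42.0.5
numpy>=1.26,<2.0
```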

Returns:

A managed function object pointing to the cloud assets created in the background to support remote execution. The cloud assets can be located through the following property set on the object:

bigframes_bigquery_function - The BigQuery managed function deployed for the user defined code.

Return type:

collections.abc.Callable