bigframes.pandas.remote_function#

bigframes.pandas.remote_function(input_types: None | type | Sequence[type] = None, output_type: type | None = None, dataset: str | None = None, *, bigquery_connection: str | None = None, reuse: bool = True, name: str | None = None, packages: Sequence[str] | None = None, cloud_function_service_account: str, cloud_function_kms_key_name: str | None = None, cloud_function_docker_repository: str | None = None, max_batching_rows: int | None = 1000, cloud_function_timeout: int | None = 600, cloud_function_max_instances: int | None = None, cloud_function_vpc_connector: str | None = None, cloud_function_vpc_connector_egress_settings: Literal['all', 'private-ranges-only', 'unspecified'] | None = None, cloud_function_memory_mib: int | None = 1024, cloud_function_ingress_settings: Literal['all', 'internal-only', 'internal-and-gclb'] = 'internal-only', cloud_build_service_account: str | None = None)[source]#

Decorator to turn a user defined function into a BigQuery remote function. Check out the code samples at: https://cloud.google.com/bigquery/docs/remote-functions#bigquery-dataframes.
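As a sketch of the decorator form described above: the bigframes-specific lines require a configured GCP project with the prerequisites below, so they are shown as comments, while the wrapped logic (which is plain Python) is live. The service account value and function name here are illustrative placeholders.

```python
# Hypothetical deployment sketch (requires a configured GCP project and the
# prerequisite APIs/IAM roles listed below), shown as comments:
#
#   import bigframes.pandas as bpd
#
#   @bpd.remote_function(
#       cloud_function_service_account="default",  # or an explicit user-managed SA
#   )
#   def add_one(x: int) -> int:
#       return x + 1
#
#   # The decorated function can then be applied to a bigframes Series:
#   # result = df["col"].apply(add_one)

# The user defined logic itself behaves the same locally:
def add_one(x: int) -> int:
    return x + 1

print(add_one(41))  # → 42
```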

Note

The input_types=Series scenario is in preview. It currently only supports DataFrames with column types Int64, Float64, boolean, string, and binary[pyarrow].
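As a sketch of this preview mode, the user defined function receives each row as a Series. The deployment-specific pieces are commented out (they assume a configured project), and the row logic, which is ordinary pandas code, is shown runnable; the column names are illustrative.

```python
import pandas as pd

# Hypothetical deployment sketch (requires a configured GCP project),
# shown as comments:
#
#   import bigframes.pandas as bpd
#
#   @bpd.remote_function(
#       input_types=bpd.Series,
#       output_type=float,
#       cloud_function_service_account="default",
#   )
#   def row_total(row) -> float:
#       return float(row["price"]) * int(row["quantity"])
#
#   # df["total"] = df.apply(row_total, axis=1)

# The per-row logic can be exercised locally on a pandas Series:
def row_total(row: pd.Series) -> float:
    return float(row["price"]) * int(row["quantity"])

print(row_total(pd.Series({"price": 2.5, "quantity": 4})))  # → 10.0
```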

Warning

To use remote functions with BigQuery DataFrames 2.0 and onwards, set cloud_function_service_account either to an explicit user-managed service account (preferred) or to "default" to use the Compute Engine default service account (discouraged).

See https://cloud.google.com/functions/docs/securing/function-identity.

Note

Please make sure the following is set up before using this API:

  1. Have the following APIs enabled for your project:

    • BigQuery Connection API

    • Cloud Functions API

    • Cloud Run API

    • Cloud Build API

    • Artifact Registry API

    • Cloud Resource Manager API

    This can be done from the cloud console (change PROJECT_ID to yours): https://console.cloud.google.com/apis/enableflow?apiid=bigqueryconnection.googleapis.com,cloudfunctions.googleapis.com,run.googleapis.com,cloudbuild.googleapis.com,artifactregistry.googleapis.com,cloudresourcemanager.googleapis.com&project=PROJECT_ID

    Or from the gcloud CLI:

    $ gcloud services enable bigqueryconnection.googleapis.com cloudfunctions.googleapis.com run.googleapis.com cloudbuild.googleapis.com artifactregistry.googleapis.com cloudresourcemanager.googleapis.com

  2. Have the following IAM roles granted to you:

    • BigQuery Data Editor (roles/bigquery.dataEditor)

    • BigQuery Connection Admin (roles/bigquery.connectionAdmin)

    • Cloud Functions Developer (roles/cloudfunctions.developer)

    • Service Account User (roles/iam.serviceAccountUser) on the service account PROJECT_NUMBER-compute@developer.gserviceaccount.com

    • Storage Object Viewer (roles/storage.objectViewer)

    • Project IAM Admin (roles/resourcemanager.projectIamAdmin) (Only required if the bigquery connection being used is not pre-created and is created dynamically with user credentials.)

  3. Either have the setIamPolicy privilege on the project, or use a pre-created BigQuery connection with the necessary IAM role set:

    1. To create a connection, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#create_a_connection

    2. To set up IAM, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#grant_permission_on_function

      Alternatively, the IAM binding can also be set up via the gcloud CLI:

      $ gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT_ID" --role="roles/run.invoker"

Parameters:
  • input_types (type or sequence(type), Optional) – For a scalar user defined function it should be the input type or a sequence of input types. The supported scalar input types are bool, bytes, float, int, str. For a row processing user defined function (i.e. a function that receives a single input representing a row in the form of a Series), the type Series should be specified.

  • output_type (type, Optional) – Data type of the output in the user defined function. If the user defined function returns an array, then list[type] should be specified. The supported output types are bool, bytes, float, int, str, list[bool], list[float], list[int] and list[str].

  • dataset (str, Optional) – Dataset in which to create a BigQuery remote function. It should be in <project_id>.<dataset_name> or <dataset_name> format. If this parameter is not provided then session dataset id is used.

  • bigquery_connection (str, Optional) – Name of the BigQuery connection. You should either have the connection already created in the location you have chosen, or you should have the Project IAM Admin role to enable the service to create the connection for you if you need it. If this parameter is not provided then the BigQuery connection from the session is used.

  • reuse (bool, Optional) – Reuse the remote function if one already exists. True by default, which results in reusing an existing remote function and the corresponding cloud function that was previously created (if any) for the same udf. Please note that for an unnamed (i.e. created without an explicit name argument) remote function, the BigQuery DataFrames session id is attached to the cloud artifact names, so for effective reuse across sessions it is recommended to create the remote function with an explicit name. Setting it to False forces creation of a unique remote function. If the required remote function does not exist then it is created irrespective of this parameter.

  • name (str, Optional) – Explicit name of the persisted BigQuery remote function. Use it with caution, because more than one user working in the same project and dataset could overwrite each other's remote functions if they use the same persistent name. When an explicit name is provided, any session specific clean up (bigframes.session.Session.close / bigframes.pandas.close_session / bigframes.pandas.reset_session / bigframes.pandas.clean_up_by_session_id) does not clean up the function, and it is left for the user to manage the function and the associated cloud function directly.

  • packages (str[], Optional) – Explicit name of the external package dependencies. Each dependency is added to the requirements.txt as is, and can be of the form supported in https://pip.pypa.io/en/stable/reference/requirements-file-format/.

  • cloud_function_service_account (str) – Service account to use for the cloud functions. If "default" is provided, the default service account is used. See https://cloud.google.com/functions/docs/securing/function-identity for more details. Please make sure the service account has the necessary IAM permissions configured as described in https://cloud.google.com/functions/docs/reference/iam/roles#additional-configuration.

  • cloud_function_kms_key_name (str, Optional) – Customer managed encryption key to protect cloud functions and related data at rest. This is of the format projects/PROJECT_ID/locations/LOCATION/keyRings/KEYRING/cryptoKeys/KEY. Read https://cloud.google.com/functions/docs/securing/cmek for more details including granting necessary service accounts access to the key.

  • cloud_function_docker_repository (str, Optional) – Docker repository created with the same encryption key as cloud_function_kms_key_name to store encrypted artifacts created to support the cloud function. This is of the format projects/PROJECT_ID/locations/LOCATION/repositories/REPOSITORY_NAME. For more details see https://cloud.google.com/functions/docs/securing/cmek#before_you_begin.

  • max_batching_rows (int, Optional) – The maximum number of rows to be batched for processing in the BQ remote function. Default value is 1000. A lower number can be passed to avoid timeouts in case the user code is too complex to process a large number of rows fast enough. A higher number can be used to increase throughput in case the user code is fast enough. None can be passed to let the BQ remote functions service apply default batching. See for more details https://cloud.google.com/bigquery/docs/remote-functions#limiting_number_of_rows_in_a_batch_request.

  • cloud_function_timeout (int, Optional) – The maximum amount of time (in seconds) BigQuery should wait for the cloud function to return a response. See for more details https://cloud.google.com/functions/docs/configuring/timeout. Please note that even though the cloud function (2nd gen) itself allows setting a timeout of up to 60 minutes, a BigQuery remote function can wait only up to 20 minutes, see for more details https://cloud.google.com/bigquery/quotas#remote_function_limits. By default BigQuery DataFrames uses a 10 minute timeout. None can be passed to let the cloud function's default timeout take effect.

  • cloud_function_max_instances (int, Optional) – The maximum instance count for the cloud function created. This can be used to control how many cloud function instances can be active at any given time. A lower setting can help control spikes in billing. A higher setting can help support processing larger scale data. When not specified, the cloud function's default setting applies. For more details see https://cloud.google.com/functions/docs/configuring/max-instances.

  • cloud_function_vpc_connector (str, Optional) – The VPC connector you would like to configure for your cloud function. This is useful if your code needs access to data or service(s) that are on a VPC network. See for more details https://cloud.google.com/functions/docs/networking/connecting-vpc.

  • cloud_function_vpc_connector_egress_settings (str, Optional) – Egress settings for the VPC connector, controlling what outbound traffic is routed through the VPC connector. Options are: all, private-ranges-only, or unspecified. If not specified, private-ranges-only is used by default. See for more details https://cloud.google.com/run/docs/configuring/vpc-connectors#egress-job.

  • cloud_function_memory_mib (int, Optional) – The amount of memory (in mebibytes) to allocate for the cloud function (2nd gen) created. This also dictates a corresponding amount of allocated CPU for the function. By default a memory of 1024 MiB is set for the cloud functions created to support BigQuery DataFrames remote function. If you want to let the default memory of cloud functions be allocated, pass None. See for more details https://cloud.google.com/functions/docs/configuring/memory.

  • cloud_function_ingress_settings (str, Optional) – Ingress settings controlling what traffic can reach the function. Options are: all, internal-only, or internal-and-gclb. If no setting is provided, internal-only is used by default. See for more details https://cloud.google.com/functions/docs/networking/network-settings#ingress_settings.

  • cloud_build_service_account (str, Optional) – Service account in the fully qualified format projects/PROJECT_ID/serviceAccounts/SERVICE_ACCOUNT_EMAIL, or just the SERVICE_ACCOUNT_EMAIL. The latter would be interpreted as belonging to the BigQuery DataFrames session project. This is to be used by Cloud Build to build the function source code into a deployable artifact. If not provided, the default Cloud Build service account is used. See https://cloud.google.com/build/docs/cloud-build-service-account for more details.

Returns:

A remote function object pointing to the cloud assets created in the background to support the remote execution. The cloud assets can be located through the following properties set in the object:

bigframes_cloud_function - The Google Cloud Function deployed for the user defined code.

bigframes_remote_function - The bigquery remote function capable of calling into bigframes_cloud_function.

Return type:

collections.abc.Callable
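As a sketch of locating the deployed assets through these properties: the deployment itself is shown as comments because it requires a configured GCP project, while the underlying Python callable, which the decorator leaves usable, is live. The function name here is illustrative.

```python
# Hypothetical sketch of inspecting the cloud assets behind a deployed
# remote function (requires an actual deployment), shown as comments:
#
#   import bigframes.pandas as bpd
#
#   @bpd.remote_function(cloud_function_service_account="default")
#   def to_upper(s: str) -> str:
#       return s.upper()
#
#   # Fully qualified name of the deployed Google Cloud Function:
#   print(to_upper.bigframes_cloud_function)
#   # Fully qualified name of the BigQuery remote function:
#   print(to_upper.bigframes_remote_function)

# Locally, the underlying Python callable is unchanged:
def to_upper(s: str) -> str:
    return s.upper()

print(to_upper("bigquery"))  # → BIGQUERY
```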