bigframes.pandas.DataFrame.apply

DataFrame.apply(func, *, axis=0, args: Tuple = (), **kwargs)

Apply a function along an axis of the DataFrame.

Objects passed to the function are Series objects whose index is the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). The final return type is inferred from the return type of the applied function.

Note

The axis=1 scenario is in preview.

Examples:

>>> import pandas as pd
>>> import bigframes.pandas as bpd
>>> df = bpd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4

[2 rows x 2 columns]
>>> def square(x):
...     return x * x
>>> df.apply(square)
   col1  col2
0     1     9
1     4    16

[2 rows x 2 columns]

You can apply a user-defined function to every row of the DataFrame by creating a remote function from it and passing that function with axis=1. Within the function, each row is passed as a pandas.Series. It is recommended to select only the necessary columns before calling apply(). Note: This feature is currently in preview.

>>> @bpd.remote_function(reuse=False, cloud_function_service_account="default")
... def foo(row: pd.Series) -> int:
...     result = 1
...     result += row["col1"]
...     result += row["col2"]*row["col2"]
...     return result
>>> df[["col1", "col2"]].apply(foo, axis=1)
0    11
1    19
dtype: Int64

The remote function can also return an array for every input row.

>>> @bpd.remote_function(reuse=False, cloud_function_service_account="default")
... def marks_analyzer(marks: pd.Series) -> list[float]:
...     import statistics
...     average = marks.mean()
...     median = marks.median()
...     geometric_mean = statistics.geometric_mean(marks.values)
...     harmonic_mean = statistics.harmonic_mean(marks.values)
...     return [
...         round(stat, 2) for stat in
...         (average, median, geometric_mean, harmonic_mean)
...     ]
>>> df = bpd.DataFrame({
...     "physics": [67, 80, 75],
...     "chemistry": [88, 56, 72],
...     "algebra": [78, 91, 79]
... }, index=["Alice", "Bob", "Charlie"])
>>> stats = df.apply(marks_analyzer, axis=1)
>>> stats
Alice      [77.67 78.   77.19 76.71]
Bob        [75.67 80.   74.15 72.56]
Charlie    [75.33 75.   75.28 75.22]
dtype: list<item: double>[pyarrow]
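
Before deploying a function like marks_analyzer as a remote function, it can help to check its arithmetic locally. The sketch below is a plain-Python stand-in and not part of bigframes: the hypothetical helper marks_stats takes a list instead of a pandas.Series and uses only the standard library. Under those assumptions it should reproduce the values shown for the Alice row above.

```python
import statistics


def marks_stats(marks):
    """Hypothetical local stand-in for marks_analyzer (takes a list, not a Series)."""
    average = statistics.mean(marks)
    median = statistics.median(marks)
    geometric_mean = statistics.geometric_mean(marks)
    harmonic_mean = statistics.harmonic_mean(marks)
    # Same rounding as the remote function: two decimal places per statistic.
    return [
        round(stat, 2)
        for stat in (average, median, geometric_mean, harmonic_mean)
    ]


# Alice's marks in physics, chemistry, and algebra.
alice = marks_stats([67, 88, 78])
print(alice)
```

Checking the function body this way is only a sanity check on the arithmetic. It does not exercise the remote function deployment or the list<item: double> return type.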

You can also apply a remote function that accepts multiple parameters to every row of a DataFrame by using it with axis=1, provided the DataFrame has the same number of columns as the function has parameters and the column data types match the parameter types. Note: This feature is currently in preview.

>>> df = bpd.DataFrame({
...     'col1': [1, 2],
...     'col2': [3, 4],
...     'col3': [5, 5]
... })
>>> df
   col1  col2  col3
0     1     3     5
1     2     4     5

[2 rows x 3 columns]
>>> @bpd.remote_function(reuse=False, cloud_function_service_account="default")
... def foo(x: int, y: int, z: int) -> float:
...     result = 1
...     result += x
...     result += y/z
...     return result
>>> df.apply(foo, axis=1)
0    2.6
1    3.8
dtype: Float64

Parameters:
  • func (function) –

    Function to apply to each column or row. To apply it to each row (i.e. when axis=1 is specified), the function can be one of two types:

    (1) It accepts a single input parameter of type Series, in which case each row is delivered to the function as a pandas Series.

    (2) It accepts one or more parameters, in which case the column values of each row are delivered to the function as separate arguments (mapping to those parameters). For this to work, the DataFrame must have the same number of columns as the function has parameters, with matching data types.

  • axis ({index (0), columns (1)}) – Axis along which the function is applied. Specify 0 or ‘index’ to apply the function to each column. Specify 1 or ‘columns’ to apply the function to each row.

  • args (tuple) – Positional arguments to pass to func in addition to the array/series.

  • **kwargs – Additional keyword arguments to pass to func.
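
The forwarding of args and **kwargs described above can be illustrated with plain pandas, which uses the same calling convention for the column-wise case. This is a local sketch only: it assumes pandas is installed, the function scale_and_shift is a hypothetical example, and it assumes bigframes forwards the extra arguments the same way for axis=0.

```python
import pandas as pd


def scale_and_shift(col, factor, offset=0):
    """Hypothetical column function: receives the Series first, then the extras."""
    return col * factor + offset


df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# `factor` arrives through args, `offset` through **kwargs.
result = df.apply(scale_and_shift, args=(10,), offset=1)
print(result)
```

Each column is passed as the first positional argument, followed by the contents of args, with the keyword arguments passed by name.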

Returns:

Result of applying func along the given axis of the DataFrame.

Return type:

bigframes.pandas.DataFrame or bigframes.pandas.Series

Raises:
  • ValueError – If a remote function is not provided when axis=1 is specified.

  • ValueError – If the number of input params in the remote function is not the same as the number of columns in the DataFrame.

  • ValueError – If the dtypes of the columns in the DataFrame are not compatible with the data types of the remote function input params.
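
The column-count condition above can be checked locally before a multi-parameter function is deployed. The sketch below is not part of bigframes: check_row_function_arity is a hypothetical helper built on the standard-library inspect module, and it assumes columns map to parameters by position. It does not check dtypes.

```python
import inspect


def check_row_function_arity(func, column_names):
    """Hypothetical pre-check: pair each parameter with a column, by position."""
    params = inspect.signature(func).parameters
    if len(params) != len(column_names):
        raise ValueError(
            f"{func.__name__} takes {len(params)} parameter(s) but the "
            f"DataFrame has {len(column_names)} column(s)"
        )
    # Parameter name -> column name, in declaration order.
    return dict(zip(params, column_names))


def foo(x: int, y: int, z: int) -> float:
    return 1 + x + y / z


mapping = check_row_function_arity(foo, ["col1", "col2", "col3"])
print(mapping)
```

Calling the helper with only two column names raises ValueError, which mirrors the mismatch case listed above without contacting BigQuery.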