bigframes.pandas.DataFrame.apply
- DataFrame.apply(func, *, axis=0, args: Tuple = (), **kwargs)
Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). The final return type is inferred from the return type of the applied function.
Note
The axis=1 scenario is in preview.
Examples:
>>> import bigframes.pandas as bpd
>>> import pandas as pd
>>> df = bpd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4

[2 rows x 2 columns]

>>> def square(x):
...     return x * x

>>> df.apply(square)
   col1  col2
0     1     9
1     4    16

[2 rows x 2 columns]
You can apply a user-defined function to every row of the DataFrame by creating a remote function out of it and using it with axis=1. Within the function, each row is passed as a pandas.Series. It is recommended to select only the necessary columns before calling apply(). Note: This feature is currently in preview.

>>> @bpd.remote_function(reuse=False, cloud_function_service_account="default")
... def foo(row: pd.Series) -> int:
...     result = 1
...     result += row["col1"]
...     result += row["col2"]*row["col2"]
...     return result

>>> df[["col1", "col2"]].apply(foo, axis=1)
0    11
1    19
dtype: Int64
You can also return an array output for every input row from the remote function.

>>> @bpd.remote_function(reuse=False, cloud_function_service_account="default")
... def marks_analyzer(marks: pd.Series) -> list[float]:
...     import statistics
...     average = marks.mean()
...     median = marks.median()
...     geometric_mean = statistics.geometric_mean(marks.values)
...     harmonic_mean = statistics.harmonic_mean(marks.values)
...     return [
...         round(stat, 2) for stat in
...         (average, median, geometric_mean, harmonic_mean)
...     ]

>>> df = bpd.DataFrame({
...     "physics": [67, 80, 75],
...     "chemistry": [88, 56, 72],
...     "algebra": [78, 91, 79]
... }, index=["Alice", "Bob", "Charlie"])
>>> stats = df.apply(marks_analyzer, axis=1)
>>> stats
Alice      [77.67 78. 77.19 76.71]
Bob        [75.67 80. 74.15 72.56]
Charlie    [75.33 75. 75.28 75.22]
dtype: list<item: double>[pyarrow]
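The statistics logic in the row function can be sanity-checked locally before it is deployed as a remote function. The following is a minimal sketch using only the standard library; `marks_stats` is a hypothetical local helper (not part of bigframes), and a plain list stands in for the `pandas.Series` row.

```python
import statistics


def marks_stats(marks):
    """Return [mean, median, geometric mean, harmonic mean], each rounded to 2 places."""
    stats = (
        statistics.mean(marks),
        statistics.median(marks),
        statistics.geometric_mean(marks),
        statistics.harmonic_mean(marks),
    )
    return [round(float(s), 2) for s in stats]


# Alice's marks from the example: physics, chemistry, algebra.
alice = marks_stats([67, 88, 78])
print(alice)  # [77.67, 78.0, 77.19, 76.71]
```

The printed values agree with the `Alice` row in the output shown above.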
You can also apply a remote function that accepts multiple parameters to every row of a DataFrame by using it with axis=1, provided the DataFrame has a matching number of columns and matching data types. Note: This feature is currently in preview.
>>> df = bpd.DataFrame({
...     'col1': [1, 2],
...     'col2': [3, 4],
...     'col3': [5, 5]
... })
>>> df
   col1  col2  col3
0     1     3     5
1     2     4     5

[2 rows x 3 columns]

>>> @bpd.remote_function(reuse=False, cloud_function_service_account="default")
... def foo(x: int, y: int, z: int) -> float:
...     result = 1
...     result += x
...     result += y/z
...     return result

>>> df.apply(foo, axis=1)
0    2.6
1    3.8
dtype: Float64
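The two row-wise calling styles shown above differ only in how each row reaches the function. The plain-Python model below makes that difference concrete. It is a sketch that runs locally without BigQuery, not how bigframes implements `apply`. The helper names `apply_rows_as_mapping` and `apply_rows_as_args` are hypothetical, and a `dict` stands in for the `pandas.Series` that a remote function would actually receive.

```python
def apply_rows_as_mapping(columns, func):
    """Style 1: pass each row to func as one mapping of column name -> value."""
    names = list(columns)
    n_rows = len(columns[names[0]])
    return [func({name: columns[name][i] for name in names}) for i in range(n_rows)]


def apply_rows_as_args(columns, func):
    """Style 2: pass each row's values to func as separate positional arguments."""
    return [func(*values) for values in zip(*columns.values())]


def foo_row(row):
    # Same arithmetic as the single-parameter example above.
    return 1 + row["col1"] + row["col2"] * row["col2"]


def foo_args(x, y, z):
    # Same arithmetic as the multi-parameter example above.
    return 1 + x + y / z


by_row = apply_rows_as_mapping({"col1": [1, 2], "col2": [3, 4]}, foo_row)
by_args = apply_rows_as_args(
    {"col1": [1, 2], "col2": [3, 4], "col3": [5, 5]}, foo_args
)
print(by_row)   # [11, 19]
print(by_args)  # approximately [2.6, 3.8]
```

In the second style the column order determines which value lands in which parameter, which is why the number of columns has to match the number of parameters.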
- Parameters:
func (function) –
Function to apply to each column or row. To apply to each row (i.e. when axis=1 is specified), the function can be one of two types:
- (1). It accepts a single input parameter of type Series, in which case each row is delivered to the function as a pandas Series.
- (2). It accepts one or more parameters, in which case column values are delivered to the function as separate arguments (mapping to those parameters) for each row. For this to work, the DataFrame must have the same number of columns as the function has parameters, with matching data types.
axis ({index (0), columns (1)}) – Axis along which the function is applied. Specify 0 or ‘index’ to apply function to each column. Specify 1 or ‘columns’ to apply function to each row.
args (tuple) – Positional arguments to pass to func in addition to the array/series.
**kwargs – Additional keyword arguments to pass to func.
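How `args` and `**kwargs` reach `func` can be modeled in a few lines of plain Python. This is a sketch of the forwarding described above for the column-wise case, not bigframes code: `apply_columns` and `scale` are hypothetical names, and a `dict` of lists stands in for the DataFrame.

```python
def apply_columns(columns, func, args=(), **kwargs):
    """Call func once per column, forwarding extra positional and keyword arguments."""
    return {name: func(values, *args, **kwargs) for name, values in columns.items()}


def scale(values, factor, offset=0):
    # `values` is the column; `factor` arrives via args, `offset` via kwargs.
    return [v * factor + offset for v in values]


result = apply_columns({"col1": [1, 2], "col2": [3, 4]}, scale, args=(10,), offset=1)
print(result)  # {'col1': [11, 21], 'col2': [31, 41]}
```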
- Returns:
Result of applying func along the given axis of the DataFrame.
- Return type:
- Raises:
ValueError – If a remote function is not provided when axis=1 is specified.
ValueError – If the number of input params in the remote function is not the same as the number of columns in the dataframe.
ValueError – If the dtypes of the columns in the dataframe are not compatible with the data types of the remote function input params.
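For the multi-parameter style, the parameter-count condition can be checked on the undecorated Python function before a remote function is created. The sketch below assumes such a pre-check is wanted. `check_param_count` is a hypothetical helper, not a bigframes API, and it does not reproduce the library's own validation or its dtype compatibility check.

```python
import inspect


def check_param_count(func, column_names):
    """Raise ValueError if func's parameter count differs from the column count."""
    n_params = len(inspect.signature(func).parameters)
    n_cols = len(column_names)
    if n_params != n_cols:
        raise ValueError(
            f"function takes {n_params} parameter(s) "
            f"but the DataFrame has {n_cols} column(s)"
        )


def foo(x: int, y: int, z: int) -> float:
    return 1 + x + y / z


check_param_count(foo, ["col1", "col2", "col3"])  # counts match, no error

try:
    check_param_count(foo, ["col1", "col2"])  # one column short
except ValueError as exc:
    mismatch_message = str(exc)
```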