bigframes.ml.model_selection.train_test_split

Splits dataframes or series into random train and test subsets.

Examples:

>>> import bigframes.pandas as bpd
>>> from bigframes.ml.model_selection import train_test_split
>>> X = bpd.DataFrame({"feat0": [0, 2, 4, 6, 8], "feat1": [1, 3, 5, 7, 9]})
>>> y = bpd.DataFrame({"label": [0, 1, 2, 3, 4]})
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
>>> X_train
    feat0  feat1
0      0      1
1      2      3
4      8      9

[3 rows x 2 columns]
>>> y_train
    label
0      0
1      1
4      4

[3 rows x 1 columns]
>>> X_test
    feat0  feat1
2      4      5
3      6      7

[2 rows x 2 columns]
>>> y_test
    label
2      2
3      3

[2 rows x 1 columns]

Parameters:

*arrays (bigframes.dataframe.DataFrame or bigframes.series.Series) – A sequence of BigQuery DataFrames or Series that can be joined on their indexes.
test_size (default None) – The proportion of the dataset to include in the test split. If None, it defaults to the complement of train_size. If both are None, it is set to 0.25.
train_size (default None) – The proportion of the dataset to include in the train split. If None, it defaults to the complement of test_size.
random_state (default None) – A seed to use for randomly choosing the rows of the split. If not set, a random split will be generated each time.
stratify (bigframes.series.Series or None, default None) – If not None, the data is split in a stratified fashion, using this Series as the class labels, so that each split has the same class-label distribution as the original dataset. Note: when stratify is set, memory consumption and the size of the generated SQL grow linearly with the number of unique values in the Series, and the call may return errors if that number is too large.
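
The defaulting rules for test_size and train_size, and the idea behind a stratified split, can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not the bigframes implementation: resolve_split_sizes and stratified_split are hypothetical helpers, the rounding of per-class test counts is an assumption, and the case where both sizes are passed explicitly is simply passed through because the text above does not describe it.

```python
import random
from collections import defaultdict


def resolve_split_sizes(test_size=None, train_size=None):
    """Apply the documented defaults (hypothetical helper, not a bigframes API)."""
    if test_size is None:
        # Complement of train_size, or 0.25 when neither size is given.
        test_size = 0.25 if train_size is None else 1.0 - train_size
    if train_size is None:
        train_size = 1.0 - test_size
    return test_size, train_size


def stratified_split(labels, test_size=None, train_size=None, random_state=None):
    """Split row positions so each class keeps roughly its original share."""
    test_size, _ = resolve_split_sizes(test_size, train_size)
    rng = random.Random(random_state)

    # One bucket per unique label: the work grows with the number of
    # unique values, which mirrors the cost note on the stratify parameter.
    buckets = defaultdict(list)
    for position, label in enumerate(labels):
        buckets[label].append(position)

    train, test = [], []
    for positions in buckets.values():
        rng.shuffle(positions)
        # Assumption: round the per-class test count to the nearest integer.
        n_test = round(len(positions) * test_size)
        test.extend(positions[:n_test])
        train.extend(positions[n_test:])
    return sorted(train), sorted(test)


labels = ["a"] * 8 + ["b"] * 4
train_idx, test_idx = stratified_split(labels, random_state=42)
```

With the default 0.25 test proportion, the 8 "a" rows and 4 "b" rows contribute 2 and 1 test rows respectively, so both subsets keep the original 2:1 class ratio. Passing the same random_state again reproduces the same split.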

Returns:

A list of BigQuery DataFrames or Series.

Return type:

List[Union[bigframes.dataframe.DataFrame, bigframes.series.Series]]
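
The example above unpacks the returned list as X_train, X_test, y_train, y_test, that is, a train/test pair for each input in the order the inputs were passed. The sketch below mimics that layout with plain Python lists for three inputs. It assumes the same interleaved ordering holds for more than two inputs (as it does in scikit-learn's function of the same name); interleave_splits is a hypothetical helper, not part of bigframes.

```python
def interleave_splits(arrays, test_positions):
    """Return [a_train, a_test, b_train, b_test, ...] for plain lists."""
    test_positions = set(test_positions)
    result = []
    for array in arrays:
        train = [v for i, v in enumerate(array) if i not in test_positions]
        test = [v for i, v in enumerate(array) if i in test_positions]
        # Each input contributes its train part first, then its test part.
        result.extend([train, test])
    return result


feat0 = [0, 2, 4, 6, 8]
feat1 = [1, 3, 5, 7, 9]
label = [0, 1, 2, 3, 4]

# Use the same test rows (positions 2 and 3) as the example output above.
f0_train, f0_test, f1_train, f1_test, y_train, y_test = interleave_splits(
    [feat0, feat1, label], test_positions=[2, 3]
)
```

Because every input is split on the same row positions, the train and test parts of different inputs stay aligned row for row.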
