bigframes.ml.model_selection.KFold

class bigframes.ml.model_selection.KFold(n_splits: int = 5, *, random_state: int | None = None)

K-Fold cross-validator.

Splits data into train/test sets by dividing the dataset into k consecutive folds.

Each fold is then used once as the validation set, while the remaining k - 1 folds form the training set.
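
The fold rotation described above can be illustrated with a minimal pure-Python sketch over row positions. The helper `k_fold_indices` is hypothetical and is not part of bigframes. The rule it uses for uneven sizes (the first `n_rows % n_splits` folds receive one extra row) is an assumption made for illustration, not a statement about how bigframes sizes its folds.

```python
def k_fold_indices(n_rows, n_splits):
    """Yield (train_idx, test_idx) lists for consecutive, near-equal folds.

    Hypothetical helper for illustration only; not the bigframes implementation.
    """
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    if n_splits > n_rows:
        raise ValueError("n_splits cannot exceed the number of rows")

    # Assumed sizing rule: spread the remainder over the first folds.
    base, extra = divmod(n_rows, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        # The current fold is held out; every other row is used for training.
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_rows))
        yield train_idx, test_idx
        start = stop


folds = list(k_fold_indices(n_rows=5, n_splits=3))
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"Fold {i}: train={train_idx} test={test_idx}")
# Fold 0: train=[2, 3, 4] test=[0, 1]
# Fold 1: train=[0, 1, 4] test=[2, 3]
# Fold 2: train=[0, 1, 2, 3] test=[4]
```

Every row appears in exactly one test fold, so each row is used for validation once and for training k - 1 times.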

Examples:

>>> import bigframes.pandas as bpd
>>> from bigframes.ml.model_selection import KFold
>>> X = bpd.DataFrame({"feat0": [1, 3, 5], "feat1": [2, 4, 6]})
>>> y = bpd.DataFrame({"label": [1, 2, 3]})
>>> kf = KFold(n_splits=3, random_state=42)
>>> for i, (X_train, X_test, y_train, y_test) in enumerate(kf.split(X, y)):
...     print(f"Fold {i}:")
...     print(f"  X_train: {X_train}")
...     print(f"  X_test: {X_test}")
...     print(f"  y_train: {y_train}")
...     print(f"  y_test: {y_test}")
...
Fold 0:
  X_train:    feat0  feat1
1      3      4
2      5      6

[2 rows x 2 columns]
  X_test:    feat0  feat1
0      1      2

[1 rows x 2 columns]
  y_train:    label
1      2
2      3

[2 rows x 1 columns]
  y_test:    label
0      1

[1 rows x 1 columns]
Fold 1:
  X_train:    feat0  feat1
0      1      2
2      5      6

[2 rows x 2 columns]
  X_test:    feat0  feat1
1      3      4

[1 rows x 2 columns]
  y_train:    label
0      1
2      3

[2 rows x 1 columns]
  y_test:    label
1      2

[1 rows x 1 columns]
Fold 2:
  X_train:    feat0  feat1
0      1      2
1      3      4

[2 rows x 2 columns]
  X_test:    feat0  feat1
2      5      6

[1 rows x 2 columns]
  y_train:    label
0      1
1      2

[2 rows x 1 columns]
  y_test:    label
2      3

[1 rows x 1 columns]
Parameters:
  • n_splits (int) – Number of folds. Must be at least 2. Defaults to 5.

  • random_state (Optional[int]) – A seed to use for randomly choosing the rows of the split. If not set, a different random split is generated on each call. Defaults to None.
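
The general effect of a seed can be sketched with Python's standard `random` module. This is only an analogy for what `random_state` controls: the helper `shuffled_rows` is hypothetical, and the sketch makes no claim about how bigframes performs its randomization internally.

```python
import random


def shuffled_rows(n_rows, random_state=None):
    """Return row positions in a random order.

    Hypothetical helper for illustration only; not the bigframes implementation.
    """
    # A fixed seed yields the same order on every call;
    # None seeds the generator from system entropy instead.
    rng = random.Random(random_state)
    order = list(range(n_rows))
    rng.shuffle(order)
    return order


a = shuffled_rows(10, random_state=42)
b = shuffled_rows(10, random_state=42)
print(a == b)  # True: the same seed reproduces the same order
```

Fixing the seed is useful when results from different runs, such as scores from two models, need to be compared on identical folds.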

Methods

__init__([n_splits, random_state])

get_n_splits()

Returns the number of splitting iterations in the cross-validator.

split(X[, y])

Generates the training and test splits of the data, one (X_train, X_test, y_train, y_test) tuple per fold.