bigframes.ml.linear_model.LogisticRegression

class bigframes.ml.linear_model.LogisticRegression(*, optimize_strategy: Literal['auto_strategy', 'batch_gradient_descent'] = 'auto_strategy', fit_intercept: bool = True, l1_reg: float | None = None, l2_reg: float = 0.0, max_iterations: int = 20, warm_start: bool = False, learning_rate: float | None = None, learning_rate_strategy: Literal['line_search', 'constant'] = 'line_search', tol: float = 0.01, ls_init_learning_rate: float | None = None, calculate_p_values: bool = False, enable_global_explain: bool = False, class_weight: Literal['balanced'] | Dict[str, float] | None = None)

Logistic Regression (aka logit, MaxEnt) classifier.

Examples:

>>> from bigframes.ml.linear_model import LogisticRegression
>>> import bigframes.pandas as bpd
>>> X = bpd.DataFrame({
...     "feature0": [20, 21, 19, 18],
...     "feature1": [0, 1, 1, 0],
...     "feature2": [0.2, 0.3, 0.4, 0.5]})
>>> y = bpd.DataFrame({"outcome": [0, 0, 1, 1]})
>>> # Create the LogisticRegression
>>> model = LogisticRegression()
>>> model.fit(X, y)
LogisticRegression()
>>> model.predict(X)
   predicted_outcome  predicted_outcome_probs                            feature0  feature1  feature2
0                  0  [{'label': 1, 'prob': 3.1895929877221615e-07} ...        20         0       0.2
1                  0  [{'label': 1, 'prob': 5.662891265051953e-06} ...         21         1       0.3
2                  1  [{'label': 1, 'prob': 0.9999917826885262} {'l...         19         1       0.4
3                  1  [{'label': 1, 'prob': 0.9999999993659574} {'l...         18         0       0.5

[4 rows x 5 columns in total]

>>> # Score the model
>>> score = model.score(X, y)
>>> score
   precision  recall  accuracy  f1_score  log_loss  roc_auc
0        1.0     1.0       1.0       1.0  0.000004      1.0

[1 rows x 6 columns in total]

Parameters:

optimize_strategy (str, default "auto_strategy") – The strategy used to train logistic regression models. Possible values are “auto_strategy” and “batch_gradient_descent”. The two are equivalent, since “auto_strategy” falls back to “batch_gradient_descent”; the parameter is kept for API consistency.
fit_intercept (bool, default True) – Specifies whether a constant (a.k.a. bias or intercept) should be added to the decision function.
class_weight (dict or 'balanced', default None) – Weights associated with classes in the form {class_label: weight}. If not given, all classes are assumed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). Dict isn’t supported.
l1_reg (float or None, default None) – The amount of L1 regularization applied. Can’t be set in “normal_equation” mode. If unset, the value 0 is used.
l2_reg (float, default 0.0) – The amount of L2 regularization applied.
max_iterations (int, default 20) – The maximum number of training iterations or steps.
warm_start (bool, default False) – Determines whether to train a model with new training data, new model options, or both. Unless you explicitly override them, the initial options used to train the model are used for the warm start run.
learning_rate (float or None, default None) – The learning rate for gradient descent when learning_rate_strategy='constant'. If unset, the value 0.1 is used. If set while learning_rate_strategy='line_search', an error is returned.
learning_rate_strategy (str, default "line_search") – The strategy for specifying the learning rate during training. Possible values are “line_search” and “constant”.
tol (float, default 0.01) – The minimum relative loss improvement that is necessary to continue training when EARLY_STOP is set to true. For example, a value of 0.01 specifies that each iteration must reduce the loss by 1% for training to continue.
ls_init_learning_rate (float or None, default None) – Sets the initial learning rate that learning_rate_strategy='line_search' uses. This option can only be used if line_search is specified. If unset, the value 0.1 is used.
calculate_p_values (bool, default False) – Specifies whether to compute p-values and standard errors during training.
enable_global_explain (bool, default False) – Whether to compute global explanations using explainable AI to evaluate global feature importance to the model.
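
Two of the parameters above describe small formulas: the “balanced” class weights, n_samples / (n_classes * np.bincount(y)), and the relative loss improvement that tol is compared against. The sketch below works through both in plain Python, outside of BigQuery. The helper names balanced_class_weights and should_continue are hypothetical and are not part of bigframes; the early-stopping check is an assumed reading of the tol description, not the service's actual implementation.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: n_samples / (n_classes * count(label))}.

    Hypothetical helper mirroring the "balanced" formula quoted above.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


def should_continue(previous_loss, current_loss, tol=0.01):
    """Assumed early-stop check: keep training only if the relative
    loss improvement of the last iteration is at least `tol`."""
    relative_improvement = (previous_loss - current_loss) / previous_loss
    return relative_improvement >= tol


# Three samples of class 0 and one of class 1: the rarer class
# receives the larger weight.
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)

# A 5% improvement clears tol=0.01; a 0.5% improvement does not.
print(should_continue(1.0, 0.95), should_continue(1.0, 0.995))
```

With four samples and two classes, class 1 (one sample) gets weight 4 / (2 * 1) = 2.0 and class 0 (three samples) gets 4 / (2 * 3), roughly 0.67.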

predict(X: DataFrame | Series | DataFrame | Series) → DataFrame

Predict class labels for samples in X.

Parameters:

X (bigframes.dataframe.DataFrame or bigframes.series.Series or pandas.core.frame.DataFrame or pandas.core.series.Series) – Series or DataFrame of shape (n_samples, n_features). The data matrix for which we want to get the predictions.

Returns:

DataFrame of shape (n_samples, n_input_columns + n_prediction_columns). Returns predicted values.

Return type:

bigframes.dataframe.DataFrame
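
In the class example at the top of this page, each predicted_outcome_probs cell is a list of {'label': ..., 'prob': ...} entries. The following sketch shows one way to read such a cell once it has been brought into local Python objects. The helper prob_for_label and the sample values are hypothetical and chosen for illustration; they are not output from a real model.

```python
def prob_for_label(probs, label):
    """Return the probability recorded for `label`, or None if absent.

    `probs` is assumed to be a list of {"label": ..., "prob": ...} dicts,
    the shape shown in the predicted_outcome_probs column above.
    """
    for entry in probs:
        if entry["label"] == label:
            return entry["prob"]
    return None


# Illustrative cell value (made-up numbers).
cell = [{"label": 1, "prob": 0.25}, {"label": 0, "prob": 0.75}]

p_positive = prob_for_label(cell, 1)
# The most probable label is the entry with the largest "prob".
most_likely = max(cell, key=lambda entry: entry["prob"])["label"]
print(p_positive, most_likely)
```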

predict_explain(X: DataFrame | Series | DataFrame | Series, *, top_k_features: int = 5) → DataFrame

Explain predictions for a logistic regression model.

Note

Output matches that of the BigQuery ML.EXPLAIN_PREDICT function. See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-explain-predict

Parameters:

X (bigframes.dataframe.DataFrame or bigframes.series.Series or pandas.core.frame.DataFrame or pandas.core.series.Series) – Series or DataFrame whose predictions are to be explained.
top_k_features (int, default 5) – An INT64 value that specifies how many top feature attribution pairs are generated for each row of input data. The features are ranked by the absolute values of their attributions. By default, top_k_features is set to 5. If its value is greater than the number of features in the training data, the attributions of all features are returned.

Returns:

The predicted DataFrame with explanation columns.

Return type:

bigframes.pandas.DataFrame
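
The ranking rule described for top_k_features (sort by absolute attribution, keep the first k, return everything when k exceeds the number of features) can be sketched in plain Python as follows. The function top_k_attributions and the attribution values are hypothetical; this illustrates the rule as documented, not the code BigQuery runs.

```python
def top_k_attributions(attributions, k=5):
    """Rank (feature, attribution) pairs by absolute attribution, largest first,
    and keep at most `k` of them."""
    ranked = sorted(
        attributions.items(),
        key=lambda item: abs(item[1]),
        reverse=True,
    )
    # Slicing past the end of the list simply returns every pair.
    return ranked[:k]


# Made-up attributions for the three features of the class example.
row = {"feature0": -0.8, "feature1": 0.1, "feature2": 0.5}

top_two = top_k_attributions(row, k=2)
all_pairs = top_k_attributions(row)  # default k=5 exceeds 3 features
print(top_two)
print(all_pairs)
```

Note that a large negative attribution outranks a smaller positive one, because the ordering uses absolute values.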

score(X, y) → DataFrame

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy, which is a harsh metric since you require that each label set be correctly predicted for each sample.

Note

Output matches that of the BigQuery ML.EVALUATE function. See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate#classification_models for the outputs relevant to this model type.

Parameters:

X (bigframes.dataframe.DataFrame or bigframes.series.Series) – DataFrame of shape (n_samples, n_features). Test samples.
y (bigframes.dataframe.DataFrame or bigframes.series.Series) – DataFrame of shape (n_samples,) or (n_samples, n_outputs). True labels for X.

Returns:

A DataFrame of the evaluation result.

Return type:

bigframes.dataframe.DataFrame
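
The score output in the class example reports precision, recall, accuracy, and f1_score alongside log_loss and roc_auc. As a rough guide to how the first four relate to each other, here is a minimal sketch that computes them for binary labels in plain Python. The function binary_metrics is hypothetical, and it uses the textbook definitions; how ML.EVALUATE actually computes the figures (thresholds, multiclass averaging) may differ.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, accuracy, and F1 for binary labels
    using the standard textbook definitions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": correct / len(pairs),
        "f1_score": f1,
    }


# Labels from the class example, predicted perfectly.
perfect = binary_metrics([0, 0, 1, 1], [0, 0, 1, 1])
# Same labels with one false positive.
one_error = binary_metrics([0, 0, 1, 1], [0, 1, 1, 1])
print(perfect)
print(one_error)
```

The perfect case yields 1.0 for all four metrics, consistent with the score table in the class example; the one-error case drops precision to 2/3 and accuracy to 0.75 while recall stays at 1.0.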

to_gbq(model_name: str, replace: bool = False) → LogisticRegression

Save the model to BigQuery.

Parameters:

model_name (str) – The name of the model.
replace (bool, default False) – Whether to replace the model if it already exists.

Returns:

Saved model.

Return type:

LogisticRegression
