bigframes.ml.decomposition.PCA
- class bigframes.ml.decomposition.PCA(n_components: int | float | None = None, *, svd_solver: Literal['full', 'randomized', 'auto'] = 'auto')
Principal component analysis (PCA).
Examples:
>>> import bigframes.pandas as bpd
>>> from bigframes.ml.decomposition import PCA
>>> X = bpd.DataFrame({"feat0": [-1, -2, -3, 1, 2, 3], "feat1": [-1, -1, -2, 1, 1, 2]})
>>> pca = PCA(n_components=2).fit(X)
>>> pca.predict(X)
   principal_component_1  principal_component_2
0              -0.755243               0.157628
1               -1.05405              -0.141179
2              -1.809292               0.016449
3               0.755243              -0.157628
4                1.05405               0.141179
5               1.809292              -0.016449

[6 rows x 2 columns]

>>> pca.explained_variance_ratio_
   principal_component_id  explained_variance_ratio
0                       1                   0.00901
1                       0                   0.99099

[2 rows x 2 columns]
- Parameters:
n_components (int, float or None, default None) – Number of components to keep. If n_components is not set, all components are kept: n_components = min(n_samples, n_features). If 0 < n_components < 1, the value is treated as a fraction of the total variance, and the number of components is chosen so that the variance they explain is greater than that fraction.
svd_solver ("full", "randomized" or "auto", default "auto") – The solver to use to calculate the principal components. Details: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-pca#pca_solver.
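The n_components rules above can be sketched in plain Python. This is only an illustration of the documented behaviour, not the BigQuery ML implementation: the helper name resolve_n_components and its variance_ratios argument are hypothetical, and the ratios are the ones printed in the example above.

```python
from itertools import accumulate


def resolve_n_components(n_components, n_samples, n_features, variance_ratios=None):
    """Hypothetical helper: return how many components would be kept."""
    max_components = min(n_samples, n_features)
    if n_components is None:
        # Not set: keep every component.
        return max_components
    if 0 < n_components < 1:
        # Fraction of variance: take the largest ratios first and stop at the
        # smallest count whose cumulative share exceeds the requested fraction.
        ordered = sorted(variance_ratios, reverse=True)
        for count, cumulative in enumerate(accumulate(ordered), start=1):
            if cumulative > n_components:
                return count
        return max_components
    # Otherwise the value is read as an explicit component count.
    return int(n_components)


# Ratios taken from the explained_variance_ratio_ output in the example.
ratios = [0.00901, 0.99099]
kept_default = resolve_n_components(None, n_samples=6, n_features=2)
kept_95 = resolve_n_components(0.95, 6, 2, ratios)
kept_995 = resolve_n_components(0.995, 6, 2, ratios)
```

With these ratios, a target of 0.95 is already met by the first component, while 0.995 needs both.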
- property components_: DataFrame
Principal axes in feature space, representing the directions of maximum variance in the data.
- Returns:
- DataFrame of principal components, containing the following columns:
principal_component_id: An integer that identifies the principal component.
feature: The column name that contains the feature.
numerical_value: If feature is numeric, the value of feature for the principal component that principal_component_id identifies. If feature isn’t numeric, the value is NULL.
- categorical_value: A list of mappings containing information about categorical features. Each mapping contains the following fields:
categorical_value.category: The name of each category.
categorical_value.value: The value of categorical_value.category for the principal component that principal_component_id identifies.
The output contains one row per feature per component.
- Return type:
bigframes.dataframe.DataFrame
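Because components_ is laid out with one row per feature per component, it can be handy to regroup the rows into one weight mapping per component. A minimal sketch in plain Python, assuming the rows have already been materialised locally; the numeric values below are hypothetical and only the column names come from the description above.

```python
from collections import defaultdict

# Hypothetical rows in the documented layout (numeric features only).
rows = [
    {"principal_component_id": 0, "feature": "feat0", "numerical_value": 0.7071},
    {"principal_component_id": 0, "feature": "feat1", "numerical_value": 0.7071},
    {"principal_component_id": 1, "feature": "feat0", "numerical_value": 0.7071},
    {"principal_component_id": 1, "feature": "feat1", "numerical_value": -0.7071},
]

# Group the long-format rows into {component_id: {feature: weight}}.
loadings = defaultdict(dict)
for row in rows:
    component = row["principal_component_id"]
    loadings[component][row["feature"]] = row["numerical_value"]
loadings = dict(loadings)
```

Two features and two components give four input rows and two grouped entries.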
- detect_anomalies(X: DataFrame | Series | DataFrame | Series, *, contamination: float = 0.1) → DataFrame
Detect anomalous data points in the input.
- Parameters:
X (bigframes.dataframe.DataFrame or bigframes.series.Series) – Series or DataFrame in which to detect anomalies.
contamination (float, default 0.1) – Identifies the proportion of anomalies in the training dataset that are used to create the model. The value must be in the range [0, 0.5].
- Returns:
DataFrame containing the anomaly detection results.
- Return type:
bigframes.dataframe.DataFrame
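One common way to read a contamination parameter is as the share of training points that should end up flagged, which fixes a cut-off on some per-sample anomaly score. The sketch below illustrates that reading in plain Python. It is an assumption for illustration, not a description of how BigQuery ML derives its threshold; the helper anomaly_threshold and the scores are hypothetical.

```python
import math


def anomaly_threshold(training_scores, contamination=0.1):
    """Hypothetical helper: score at or above which a point is flagged."""
    if not 0 <= contamination <= 0.5:
        raise ValueError("contamination must be in the range [0, 0.5]")
    ordered = sorted(training_scores, reverse=True)
    n_anomalies = math.floor(contamination * len(ordered))
    if n_anomalies == 0:
        return math.inf  # nothing is flagged
    # The n-th largest training score becomes the cut-off.
    return ordered[n_anomalies - 1]


# Hypothetical per-sample anomaly scores for ten training points.
scores = [0.02, 0.03, 0.01, 0.05, 0.04, 0.90, 0.02, 0.03, 0.06, 0.75]
threshold = anomaly_threshold(scores, contamination=0.2)
flags = [score >= threshold for score in scores]
```

With contamination=0.2 and ten scores, the two largest scores are flagged.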
- property explained_variance_: DataFrame
The amount of variance explained by each of the selected components.
- Returns:
- DataFrame containing the following columns:
principal_component_id: An integer that identifies the principal component.
explained_variance: The factor by which the eigenvector is scaled. Eigenvalue and explained variance are the same concept in PCA.
- Return type:
bigframes.dataframe.DataFrame
- property explained_variance_ratio_: DataFrame
Fraction of the total variance explained by each of the selected components.
- Returns:
- DataFrame containing the following columns:
principal_component_id: An integer that identifies the principal component.
explained_variance_ratio: The ratio between the variance (eigenvalue) of a principal component and the total variance, where the total variance is the sum of the variances (eigenvalues) of all individual principal components.
- Return type:
bigframes.dataframe.DataFrame
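The ratio definition above amounts to dividing each eigenvalue by the sum of all eigenvalues. A minimal sketch in plain Python; the eigenvalues are hypothetical, picked so that the resulting ratios match the ones printed in the class example.

```python
# Hypothetical explained variances (eigenvalues), keyed by principal_component_id.
explained_variance = {0: 1.98198, 1: 0.01802}

# Total variance is the sum of the eigenvalues of all components.
total_variance = sum(explained_variance.values())

# Each ratio is that component's eigenvalue divided by the total.
explained_variance_ratio = {
    component: variance / total_variance
    for component, variance in explained_variance.items()
}
```

The ratios always sum to 1 when every component is kept.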
- predict(X: DataFrame | Series | DataFrame | Series) → DataFrame
Project each sample in X onto the principal components.
- Parameters:
X (bigframes.dataframe.DataFrame or bigframes.series.Series or pandas.core.frame.DataFrame or pandas.core.series.Series) – Series or DataFrame to project.
- Returns:
DataFrame containing the principal component values for each sample.
- Return type:
bigframes.dataframe.DataFrame
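For intuition, the textbook PCA projection centres the data and takes the dot product of each sample with each principal axis. The sketch below shows that calculation in plain Python. The helper project and the axes are hypothetical, and the sketch ignores any feature scaling the service may apply, so its numbers are not expected to match the predict output in the class example.

```python
import math


def project(samples, components):
    """Hypothetical helper: centre the samples, then project onto each axis."""
    n_samples = len(samples)
    n_features = len(samples[0])
    means = [sum(row[j] for row in samples) / n_samples for j in range(n_features)]
    centered = [[row[j] - means[j] for j in range(n_features)] for row in samples]
    # One output column per component: the dot product of sample and axis.
    return [
        [sum(x * w for x, w in zip(row, axis)) for axis in components]
        for row in centered
    ]


# Same feature values as the class example, as plain lists.
samples = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]

# Hypothetical orthonormal axes; real axes would come from a fitted model.
s = 1 / math.sqrt(2)
components = [[s, s], [s, -s]]

projected = project(samples, components)
```

The result has one row per sample and one column per component, mirroring the shape of the predict output.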
- score(X=None, y=None) → DataFrame
Calculate evaluation metrics of the model.
Note
Output matches that of the BigQuery ML.EVALUATE function. See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate#pca_models for the outputs relevant to this model type.
- Parameters:
X (default None) – Ignored.
y (default None) – Ignored.
- Returns:
DataFrame that represents model metrics.
- Return type:
bigframes.dataframe.DataFrame