autoprognosis.studies.classifiers module

class ClassifierStudy(dataset: pandas.core.frame.DataFrame, target: str, num_iter: int = 20, num_study_iter: int = 5, num_ensemble_iter: int = 15, timeout: int = 360, metric: str = 'aucroc', study_name: Optional[str] = None, feature_scaling: List[str] = ['normal_transform', 'maxabs_scaler', 'feature_normalizer', 'minmax_scaler', 'nop', 'scaler', 'uniform_transform'], feature_selection: List[str] = ['nop', 'pca', 'fast_ica'], classifiers: List[str] = ['random_forest', 'xgboost', 'catboost', 'lgbm', 'logistic_regression'], imputers: List[str] = ['ice'], workspace: pathlib.Path = PosixPath('tmp'), hooks: autoprognosis.hooks.base.Hooks = <autoprognosis.hooks.default.DefaultHooks object>, score_threshold: float = 0.65, group_id: Optional[str] = None, nan_placeholder: Optional[Any] = None, random_state: int = 0, sample_for_search: bool = True, max_search_sample_size: int = 10000, ensemble_size: int = 3, n_folds_cv: int = 5)

Bases: autoprognosis.studies._base.Study

Core logic for classification studies.

A study automatically handles imputation, preprocessing and model selection for a given dataset. The output is an optimal model architecture, selected by the AutoML logic.

Parameters
  • dataset – DataFrame. The dataset to analyze.

  • target – str. The target column in the dataset.

  • num_iter – int. Maximum number of optimization trials. This is the trial limit for each base estimator in the “classifiers” list and is used in combination with the “timeout” parameter: for each estimator, the search ends after “num_iter” trials or “timeout” seconds, whichever comes first.

  • num_study_iter – int. The number of study iterations. This is the limit for the outer optimization loop. After each outer loop, an intermediary model is cached and can be used by another process, while the outer loop continues to improve the result.

  • timeout – int. Maximum wait time, in seconds, for each estimator's hyperparameter search. This timeout applies to each estimator in the “classifiers” list.

  • metric

    str. The metric to use for optimization. Available objective metrics:

    • "aucroc": the Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.

    • "aucprc": the average precision, which summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.

    • "accuracy": accuracy classification score.

    • "f1_score_micro": the F1 score is the harmonic mean of precision and recall. This version uses the "micro" average: calculate metrics globally by counting the total true positives, false negatives and false positives.

    • "f1_score_macro": the F1 score is the harmonic mean of precision and recall. This version uses the "macro" average: calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

    • "f1_score_weighted": the F1 score is the harmonic mean of precision and recall. This version uses the "weighted" average: calculate metrics for each label, and find their average weighted by support (the number of true instances for each label).

    • "mcc": the Matthews correlation coefficient is used in machine learning as a measure of the quality of binary and multiclass classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes.

    • "kappa", "kappa_quadratic": computes Cohen’s kappa, a score that expresses the level of agreement between two annotators on a classification problem.

  • study_name – str. The name of the study, to be used in the caches.

  • feature_scaling

    list. Plugin search pool to use in the pipeline for scaling. Defaults to ['maxabs_scaler', 'scaler', 'feature_normalizer', 'normal_transform', 'uniform_transform', 'nop', 'minmax_scaler']. Available plugins, retrieved using Preprocessors(category="feature_scaling").list_available():

    • 'maxabs_scaler'

    • 'scaler'

    • 'feature_normalizer'

    • 'normal_transform'

    • 'uniform_transform'

    • 'nop' (no operation)

    • 'minmax_scaler'

  • feature_selection

    list. Plugin search pool to use in the pipeline for feature selection. Defaults to ['nop', 'pca', 'fast_ica']. Available plugins, retrieved using Preprocessors(category="dimensionality_reduction").list_available():

    • 'feature_agglomeration'

    • 'fast_ica'

    • 'variance_threshold'

    • 'gauss_projection'

    • 'pca'

    • 'nop' (no operation)

  • classifiers

    list. Plugin search pool to use in the pipeline for prediction. Defaults to ['random_forest', 'xgboost', 'catboost', 'lgbm', 'logistic_regression']. Available plugins, retrieved using Classifiers().list_available():

    • 'adaboost'

    • 'bernoulli_naive_bayes'

    • 'neural_nets'

    • 'linear_svm'

    • 'qda'

    • 'decision_trees'

    • 'logistic_regression'

    • 'hist_gradient_boosting'

    • 'extra_tree_classifier'

    • 'bagging'

    • 'gradient_boosting'

    • 'ridge_classifier'

    • 'gaussian_process'

    • 'perceptron'

    • 'lgbm'

    • 'catboost'

    • 'random_forest'

    • 'tabnet'

    • 'multinomial_naive_bayes'

    • 'lda'

    • 'gaussian_naive_bayes'

    • 'knn'

    • 'xgboost'

  • imputers

    list. Plugin search pool to use in the pipeline for imputation. Defaults to ['ice']. Available plugins, retrieved using Imputers().list_available():

    • 'sinkhorn'

    • 'EM'

    • 'mice'

    • 'ice'

    • 'hyperimpute'

    • 'most_frequent'

    • 'median'

    • 'missforest'

    • 'softimpute'

    • 'nop'

    • 'mean'

    • 'gain'

  • hooks – Hooks. Custom callbacks to be notified about the search progress.

  • workspace – Path. Where to store the output model.

  • score_threshold – float. The minimum metric score for a candidate.

  • group_id – str. The group id column in the dataset.

  • random_state – int. Random seed.

  • sample_for_search – bool. Subsample the evaluation dataset in the search pipeline. Improves the speed of the search.

  • max_search_sample_size – int. Subsample size for the evaluation dataset, if sample_for_search is True.

  • n_folds_cv – int. Number of cross-validation folds to use for study evaluation.

  • ensemble_size – int. Maximum number of models to include in the ensemble.
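
The three F1 variants listed under the metric parameter differ only in how per-label results are combined. The sketch below recomputes them in plain Python on a small made-up label set. The helper names (label_counts, f1, f1_averages) are illustrative and not part of autoprognosis, which may compute these metrics differently internally.

```python
def label_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)}, counted one-vs-rest for each label."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        counts[label] = (tp, fp, fn)
    return counts


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_averages(y_true, y_pred):
    counts = label_counts(y_true, y_pred)
    # micro: pool the counts over all labels, then compute one F1
    micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
    # macro: unweighted mean of the per-label F1 scores
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # weighted: per-label F1 weighted by support (tp + fn = true instances)
    total_support = sum(tp + fn for tp, _, fn in counts.values())
    weighted = sum(f1(tp, fp, fn) * (tp + fn) for tp, fp, fn in counts.values()) / total_support
    return {"micro": micro, "macro": macro, "weighted": weighted}


# Toy, imbalanced example: four samples of label 0, two of label 1
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
scores = f1_averages(y_true, y_pred)
```

On this toy input the micro, macro and weighted scores are roughly 0.833, 0.778 and 0.815: the macro average is pulled down the most because the rarer label 1 counts as much as label 0.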

Example

>>> from sklearn.datasets import load_breast_cancer
>>>
>>> from autoprognosis.studies.classifiers import ClassifierStudy
>>> from autoprognosis.utils.serialization import load_model_from_file
>>> from autoprognosis.utils.tester import evaluate_estimator
>>>
>>> X, Y = load_breast_cancer(return_X_y=True, as_frame=True)
>>>
>>> df = X.copy()
>>> df["target"] = Y
>>>
>>> study_name = "example"
>>>
>>> study = ClassifierStudy(
...     study_name=study_name,
...     dataset=df,  # pandas DataFrame
...     target="target",  # the label column in the dataset
... )
>>> model = study.fit()
>>>
>>> # Predict the probabilities of each class using the model
>>> model.predict_proba(X)
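
The defaults can be narrowed to shorten a run. The fragment below is a hypothetical configuration, not a recommended one: every key is a constructor parameter from the signature above and every plugin name is taken from the lists above, but the specific values are chosen only for illustration.

```python
# Hypothetical narrowed configuration (values are illustrative assumptions).
study_kwargs = {
    "study_name": "example_narrow",
    "num_iter": 10,          # fewer trials per estimator
    "num_study_iter": 2,     # fewer outer-loop iterations
    "timeout": 60,           # seconds per estimator search
    "metric": "aucprc",
    "classifiers": ["logistic_regression", "lgbm"],
    "imputers": ["mean"],
    "feature_scaling": ["nop", "minmax_scaler"],
    "feature_selection": ["nop"],
    "n_folds_cv": 3,
    "ensemble_size": 2,
}

# With autoprognosis installed and `df` built as in the example above:
# study = ClassifierStudy(dataset=df, target="target", **study_kwargs)
# model = study.fit()
```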

fit() → Any

Run the study and train the model. The call returns the fitted model.
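
The search that a study performs is bounded, per estimator, by the num_iter and timeout parameters described above. The sketch below illustrates that stopping rule only. The function search_one_estimator and its evaluate_trial callback are hypothetical names, and the sketch assumes the time budget is checked between trials; the library's actual search loop is not shown in this reference and may behave differently.

```python
import time


def search_one_estimator(evaluate_trial, num_iter=20, timeout=360, clock=time.monotonic):
    """Run trials until num_iter trials have run or timeout seconds have elapsed.

    evaluate_trial(trial_index) is a hypothetical callback returning a score
    where higher is better. Returns (best_score, trials_run).
    """
    start = clock()
    best_score = None
    trials_run = 0
    for trial in range(num_iter):
        # Stop early once the time budget for this estimator is used up
        if clock() - start >= timeout:
            break
        score = evaluate_trial(trial)
        trials_run += 1
        if best_score is None or score > best_score:
            best_score = score
    return best_score, trials_run


# Trial limit reached first: 5 trials, generous timeout
best, ran = search_one_estimator(lambda t: t, num_iter=5, timeout=360)

# Timeout reached first: a fake clock that advances 10 "seconds" per reading
ticks = iter(range(0, 1000, 10))
best_t, ran_t = search_one_estimator(lambda t: t, num_iter=5, timeout=25, clock=lambda: next(ticks))
```

Passing the clock as an argument keeps the sketch testable without waiting on real time.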

run() → Any

Run the study. The call returns the optimal model architecture, which is not fitted.