autoprognosis.explorers.classifiers module

class ClassifierSeeker

Bases: object

AutoML core logic for classification tasks.

Parameters:
  • study_name – str. Study ID, used for caching.

  • num_iter – int. Maximum number of optimization trials for each base estimator in the “classifiers” list. Used in combination with the “timeout” parameter: for each estimator, the search ends after “num_iter” trials or “timeout” seconds, whichever comes first.

  • metric

    str. The metric to use for optimization. Available objective metrics:

    • ”aucroc” : the Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.

    • ”aucprc” : The average precision summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.

    • ”accuracy” : Accuracy classification score.

    • ”f1_score_micro”: F1 score is a harmonic mean of the precision and recall. This version uses the “micro” average: calculate metrics globally by counting the total true positives, false negatives and false positives.

    • ”f1_score_macro”: F1 score is a harmonic mean of the precision and recall. This version uses the “macro” average: calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

    • ”f1_score_weighted”: F1 score is a harmonic mean of the precision and recall. This version uses the “weighted” average: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label).

    • ”mcc”: The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary and multiclass classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes.

    • ”kappa”, “kappa_quadratic”: computes Cohen’s kappa, a score that expresses the level of agreement between two annotators on a classification problem.

  • n_folds_cv – int. Number of folds to use for evaluation.

  • top_k – int. Number of candidates to return.

  • timeout – int. Maximum wait time (in seconds) for each estimator's hyperparameter search. The timeout applies separately to each estimator in the “classifiers” list.

  • feature_scaling

    list. Plugin search pool to use in the pipeline for scaling. Defaults to [‘maxabs_scaler’, ‘scaler’, ‘feature_normalizer’, ‘normal_transform’, ‘uniform_transform’, ‘nop’, ‘minmax_scaler’]. Available plugins, retrieved using Preprocessors(category=”feature_scaling”).list_available():

    • ’maxabs_scaler’

    • ’scaler’

    • ’feature_normalizer’

    • ’normal_transform’

    • ’uniform_transform’

    • ’nop’ # empty operation

    • ’minmax_scaler’

  • feature_selection

    list. Plugin search pool to use in the pipeline for feature selection. Defaults to [“nop”, “variance_threshold”, “pca”, “fast_ica”]. Available plugins, retrieved using Preprocessors(category=”dimensionality_reduction”).list_available():

    • ’feature_agglomeration’

    • ’fast_ica’

    • ’variance_threshold’

    • ’gauss_projection’

    • ’pca’

    • ’nop’ # no operation

  • classifiers

    list. Plugin search pool to use in the pipeline for prediction. Defaults to [“random_forest”, “xgboost”, “logistic_regression”, “catboost”]. Available plugins, retrieved using Classifiers().list_available():

    • ’adaboost’

    • ’bernoulli_naive_bayes’

    • ’neural_nets’

    • ’linear_svm’

    • ’qda’

    • ’decision_trees’

    • ’logistic_regression’

    • ’hist_gradient_boosting’

    • ’extra_tree_classifier’

    • ’bagging’

    • ’gradient_boosting’

    • ’ridge_classifier’

    • ’gaussian_process’

    • ’perceptron’

    • ’lgbm’

    • ’catboost’

    • ’random_forest’

    • ’tabnet’

    • ’multinomial_naive_bayes’

    • ’lda’

    • ’gaussian_naive_bayes’

    • ’knn’

    • ’xgboost’

  • imputers

    list. Plugin search pool to use in the pipeline for imputation. Defaults to [“mean”, “ice”, “missforest”, “hyperimpute”]. Available plugins, retrieved using Imputers().list_available():

    • ’sinkhorn’

    • ’EM’

    • ’mice’

    • ’ice’

    • ’hyperimpute’

    • ’most_frequent’

    • ’median’

    • ’missforest’

    • ’softimpute’

    • ’nop’

    • ’mean’

    • ’gain’

  • hooks – Hooks. Custom callbacks to be notified about the search progress.

  • random_state – int. Random seed.
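The three F1 variants listed under “metric” differ only in how per-label counts are combined. The following is a minimal pure-Python sketch of those averaging modes, written for illustration; the helper name `f1_scores` is hypothetical and this is not the code the library uses to compute its metrics.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Illustrative micro, macro and weighted F1 for hard label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_label = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    support = Counter(y_true)  # number of true instances per label

    # micro: pool the counts over all labels, then compute one F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # macro: unweighted mean of the per-label F1 scores
    macro = sum(per_label.values()) / len(labels)
    # weighted: mean of the per-label F1 scores, weighted by support
    weighted = sum(per_label[label] * support[label] for label in labels) / len(y_true)
    return {"micro": micro, "macro": macro, "weighted": weighted}


# Imbalanced toy data: four samples of class 0, two of class 1.
scores = f1_scores([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 0])
print(scores)  # micro ≈ 0.833, macro ≈ 0.778, weighted ≈ 0.815
```

On this toy data the macro average is the lowest of the three because the minority class, which is predicted less well, counts as much as the majority class.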

search(X: DataFrame, Y: Series, group_ids: Series | None = None) → List

Search the optimal model for the task.

Parameters:
  • X – DataFrame. The covariates.

  • Y – Series. The labels.

  • group_ids – Optional Series. Group labels for the samples, used while splitting the dataset into train/test sets.
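A usage sketch, assuming the constructor accepts the documented parameters as keyword arguments and that “hooks” can be omitted (both are assumptions, as the constructor signature is not shown here). The values are illustrative rather than recommended settings, and `run_search` is a hypothetical helper, not part of the library. The import is deferred into the helper so the configuration can be inspected without the package installed.

```python
# Keyword arguments use only the parameter names documented for ClassifierSeeker.
# Values are illustrative assumptions, not recommended settings.
seeker_kwargs = {
    "study_name": "example_study",   # study ID, used for caching
    "num_iter": 10,                  # at most 10 trials per estimator
    "timeout": 60,                   # at most 60 seconds per estimator
    "metric": "aucroc",
    "n_folds_cv": 3,
    "top_k": 2,
    "feature_scaling": ["scaler", "minmax_scaler", "nop"],
    "feature_selection": ["nop", "pca"],
    "classifiers": ["random_forest", "logistic_regression"],
    "imputers": ["mean", "median"],
    "random_state": 0,
}


def run_search(X, Y, group_ids=None, **overrides):
    """Hypothetical helper: build a ClassifierSeeker and call search().

    X is expected to be a pandas DataFrame and Y a pandas Series,
    matching the documented signature. Requires autoprognosis.
    """
    from autoprognosis.explorers.classifiers import ClassifierSeeker

    seeker = ClassifierSeeker(**{**seeker_kwargs, **overrides})
    # Per the documented signature, search() returns a list.
    return seeker.search(X, Y, group_ids=group_ids)


# Example call (needs autoprognosis and real data):
# candidates = run_search(X, Y, group_ids=patient_ids, metric="f1_score_macro")
```

Passing `group_ids` only makes sense when several rows belong to the same entity (for example, repeated measurements of one patient) and those rows should be kept on the same side of a train/test split.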

search_best_args_for_estimator(estimator: Any, X: DataFrame, Y: Series, group_ids: Series | None = None) → Tuple[List[float], List[float]]
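This method is not described further here. The per-estimator budget documented for “num_iter” and “timeout” (stop after a fixed number of trials or a fixed number of seconds) can be pictured with the sketch below. It is an illustration of that stopping rule only, not the library's implementation; `run_trials`, `objective` and `clock` are hypothetical names.

```python
import itertools
import time


def run_trials(objective, num_iter, timeout, clock=time.monotonic):
    """Evaluate objective(trial) until num_iter trials or timeout seconds elapse."""
    start = clock()
    scores = []
    for trial in range(num_iter):
        # The time budget is checked before starting each new trial.
        if clock() - start >= timeout:
            break
        scores.append(objective(trial))
    return scores


# Trial limit reached first: all five trials run.
print(run_trials(lambda trial: trial * 0.1, num_iter=5, timeout=1000))

# Time limit reached first: a fake clock advances one "second" per reading,
# so only two trials fit into a three-second budget.
ticks = itertools.count()
print(len(run_trials(lambda trial: trial, num_iter=100, timeout=3,
                     clock=lambda: next(ticks))))  # prints 2
```

A running trial is never interrupted in this sketch, so the total time can exceed the timeout by up to the duration of one trial.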