autoprognosis.utils.tester module

class classifier_metrics(metric: Union[str, list] = ['aucroc', 'aucprc', 'accuracy', 'f1_score_micro', 'f1_score_macro', 'f1_score_weighted', 'kappa', 'kappa_quadratic', 'precision_micro', 'precision_macro', 'precision_weighted', 'recall_micro', 'recall_macro', 'recall_weighted', 'mcc'])

Bases: object

Helper class for evaluating the performance of the classifier.

Parameters

metric

list, default=["aucroc", "aucprc", "accuracy", "f1_score_micro", "f1_score_macro", "f1_score_weighted", "kappa", "kappa_quadratic", "precision_micro", "precision_macro", "precision_weighted", "recall_micro", "recall_macro", "recall_weighted", "mcc"]. The metric(s) to use for evaluation. Potential values:

  • "aucroc": The Area Under the Receiver Operating Characteristic Curve (ROC AUC), computed from prediction scores.

  • "aucprc": The average precision, which summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.

  • "accuracy": Accuracy classification score.

  • "f1_score_micro": The F1 score is the harmonic mean of precision and recall. This version uses the "micro" average: metrics are computed globally by counting the total true positives, false negatives and false positives.

  • "f1_score_macro": The F1 score is the harmonic mean of precision and recall. This version uses the "macro" average: metrics are computed for each label and their unweighted mean is taken. This does not take label imbalance into account.

  • "f1_score_weighted": The F1 score is the harmonic mean of precision and recall. This version uses the "weighted" average: metrics are computed for each label and averaged weighted by support (the number of true instances for each label).

  • "kappa", "kappa_quadratic": Cohen's kappa, a score that expresses the level of agreement between two annotators on a classification problem.

  • "precision_micro": Precision is defined as the number of true positives over the number of true positives plus the number of false positives. This version ("micro") computes metrics globally by counting the total true positives and false positives.

  • "precision_macro": Precision is defined as the number of true positives over the number of true positives plus the number of false positives. This version ("macro") computes metrics for each label and takes their unweighted mean.

  • "precision_weighted": Precision is defined as the number of true positives over the number of true positives plus the number of false positives. This version ("weighted") computes metrics for each label and averages them weighted by support.

  • "recall_micro": Recall is defined as the number of true positives over the number of true positives plus the number of false negatives. This version ("micro") computes metrics globally by counting the total true positives and false negatives.

  • "recall_macro": Recall is defined as the number of true positives over the number of true positives plus the number of false negatives. This version ("macro") computes metrics for each label and takes their unweighted mean.

  • "recall_weighted": Recall is defined as the number of true positives over the number of true positives plus the number of false negatives. This version ("weighted") computes metrics for each label and averages them weighted by support.

  • "mcc": The Matthews correlation coefficient, a measure of the quality of binary and multiclass classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes.

average_precision_score(y_test: numpy.ndarray, y_pred_proba: numpy.ndarray) → float
get_metric() → Union[str, list]
roc_auc_score(y_test: numpy.ndarray, y_pred_proba: numpy.ndarray) → float
score_proba(y_test: numpy.ndarray, y_pred_proba: numpy.ndarray) → Dict[str, float]
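
A minimal usage sketch. The labels, probabilities, and chosen metrics below are illustrative; per-class probability columns (as produced by predict_proba) are assumed to be the expected shape of y_pred_proba:

```python
import numpy as np
from autoprognosis.utils.tester import classifier_metrics

# Hypothetical binary labels and per-class predicted probabilities (columns: class 0, class 1).
y_test = np.array([0, 1, 1, 0, 1, 0])
y_pred_proba = np.array(
    [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.7, 0.3], [0.1, 0.9], [0.6, 0.4]]
)

evaluator = classifier_metrics(metric=["aucroc", "accuracy", "f1_score_macro"])
scores = evaluator.score_proba(y_test, y_pred_proba)  # Dict[str, float], keyed by metric name
print(scores)

# Individual helpers are also exposed:
print(evaluator.roc_auc_score(y_test, y_pred_proba))
print(evaluator.average_precision_score(y_test, y_pred_proba))
```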
evaluate_estimator(estimator: Any, X: Union[pandas.core.frame.DataFrame, numpy.ndarray], Y: Union[pandas.core.series.Series, numpy.ndarray, List], n_folds: int = 3, seed: int = 0, pretrained: bool = False, group_ids: Optional[pandas.core.series.Series] = None, *args: Any, **kwargs: Any) → Dict

Helper for evaluating classifiers.

Parameters
  • estimator – Baseline model to evaluate. If pretrained == False, it must not be fitted.

  • X – pd.DataFrame or np.ndarray: The covariates

  • Y – pd.Series or np.ndarray or list: The labels

  • n_folds – int Number of cross-validation folds

  • seed – int Random seed

  • pretrained – bool If the estimator was already trained or not.

  • group_ids – pd.Series The group_ids to use for stratified cross-validation

Returns

Dict containing "raw" and "str" nodes. The "str" node contains prettified metrics, while the "raw" node contains tuples of the form (mean, std) for each metric. Both the "raw" and "str" nodes contain the following metrics:

  • "aucroc": The Area Under the Receiver Operating Characteristic Curve (ROC AUC), computed from prediction scores.

  • "aucprc": The average precision, which summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.

  • "accuracy": Accuracy classification score.

  • "f1_score_micro": The F1 score is the harmonic mean of precision and recall. This version uses the "micro" average: metrics are computed globally by counting the total true positives, false negatives and false positives.

  • "f1_score_macro": The F1 score is the harmonic mean of precision and recall. This version uses the "macro" average: metrics are computed for each label and their unweighted mean is taken. This does not take label imbalance into account.

  • "f1_score_weighted": The F1 score is the harmonic mean of precision and recall. This version uses the "weighted" average: metrics are computed for each label and averaged weighted by support (the number of true instances for each label).

  • "kappa": Cohen's kappa, a score that expresses the level of agreement between two annotators on a classification problem.

  • "precision_micro": Precision is defined as the number of true positives over the number of true positives plus the number of false positives. This version ("micro") computes metrics globally by counting the total true positives and false positives.

  • "precision_macro": Precision is defined as the number of true positives over the number of true positives plus the number of false positives. This version ("macro") computes metrics for each label and takes their unweighted mean.

  • "precision_weighted": Precision is defined as the number of true positives over the number of true positives plus the number of false positives. This version ("weighted") computes metrics for each label and averages them weighted by support.

  • "recall_micro": Recall is defined as the number of true positives over the number of true positives plus the number of false negatives. This version ("micro") computes metrics globally by counting the total true positives and false negatives.

  • "recall_macro": Recall is defined as the number of true positives over the number of true positives plus the number of false negatives. This version ("macro") computes metrics for each label and takes their unweighted mean.

  • "recall_weighted": Recall is defined as the number of true positives over the number of true positives plus the number of false negatives. This version ("weighted") computes metrics for each label and averages them weighted by support.

  • "mcc": The Matthews correlation coefficient, a measure of the quality of binary and multiclass classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes.
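
A minimal calling sketch. A scikit-learn-style classifier (anything exposing fit/predict_proba) is assumed to be accepted as the baseline estimator; an AutoPrognosis classification plugin or pipeline would be passed the same way:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression  # stand-in baseline model

from autoprognosis.utils.tester import evaluate_estimator

X_raw, y = make_classification(n_samples=300, n_features=10, random_state=0)
X = pd.DataFrame(X_raw)
Y = pd.Series(y)

results = evaluate_estimator(LogisticRegression(max_iter=1000), X, Y, n_folds=3, seed=0)

print(results["str"]["aucroc"])  # prettified metric
print(results["raw"]["aucroc"])  # (mean, std) across the folds
```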

evaluate_estimator_multiple_seeds(estimator: Any, X: Union[pandas.core.frame.DataFrame, numpy.ndarray], Y: Union[pandas.core.series.Series, numpy.ndarray, List], n_folds: int = 3, seeds: List[int] = [0, 1, 2], pretrained: bool = False, group_ids: Optional[pandas.core.series.Series] = None) → Dict

Helper for evaluating classifiers with multiple seeds.

Parameters
  • estimator – Baseline model to evaluate. If pretrained == False, it must not be fitted.

  • X – pd.DataFrame or np.ndarray: The covariates

  • Y – pd.Series or np.ndarray or list: The labels

  • n_folds – int Number of cross-validation folds

  • seeds – List Random seeds

  • pretrained – bool If the estimator was already trained or not.

  • group_ids – pd.Series The group_ids to use for stratified cross-validation

Returns

Dict containing "seeds", "agg" and "str" nodes. The "str" node contains the aggregated prettified metrics, while the "agg" node contains tuples of the form (mean, std) for each metric. The "seeds" node contains the results for each random seed. Both the "agg" and "str" nodes contain the following metrics:

  • "aucroc": The Area Under the Receiver Operating Characteristic Curve (ROC AUC), computed from prediction scores.

  • "aucprc": The average precision, which summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.

  • "accuracy": Accuracy classification score.

  • "f1_score_micro": The F1 score is the harmonic mean of precision and recall. This version uses the "micro" average: metrics are computed globally by counting the total true positives, false negatives and false positives.

  • "f1_score_macro": The F1 score is the harmonic mean of precision and recall. This version uses the "macro" average: metrics are computed for each label and their unweighted mean is taken. This does not take label imbalance into account.

  • "f1_score_weighted": The F1 score is the harmonic mean of precision and recall. This version uses the "weighted" average: metrics are computed for each label and averaged weighted by support (the number of true instances for each label).

  • "kappa": Cohen's kappa, a score that expresses the level of agreement between two annotators on a classification problem.

  • "precision_micro": Precision is defined as the number of true positives over the number of true positives plus the number of false positives. This version ("micro") computes metrics globally by counting the total true positives and false positives.

  • "precision_macro": Precision is defined as the number of true positives over the number of true positives plus the number of false positives. This version ("macro") computes metrics for each label and takes their unweighted mean.

  • "precision_weighted": Precision is defined as the number of true positives over the number of true positives plus the number of false positives. This version ("weighted") computes metrics for each label and averages them weighted by support.

  • "recall_micro": Recall is defined as the number of true positives over the number of true positives plus the number of false negatives. This version ("micro") computes metrics globally by counting the total true positives and false negatives.

  • "recall_macro": Recall is defined as the number of true positives over the number of true positives plus the number of false negatives. This version ("macro") computes metrics for each label and takes their unweighted mean.

  • "recall_weighted": Recall is defined as the number of true positives over the number of true positives plus the number of false negatives. This version ("weighted") computes metrics for each label and averages them weighted by support.

  • "mcc": The Matthews correlation coefficient, a measure of the quality of binary and multiclass classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes.
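
The multiple-seed variant is called the same way, with a list of seeds instead of a single seed. A sketch under the same assumptions as above (scikit-learn-style classifier as a stand-in estimator):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

from autoprognosis.utils.tester import evaluate_estimator_multiple_seeds

X_raw, y = make_classification(n_samples=300, n_features=10, random_state=0)
X, Y = pd.DataFrame(X_raw), pd.Series(y)

results = evaluate_estimator_multiple_seeds(
    LogisticRegression(max_iter=1000), X, Y, n_folds=3, seeds=[0, 1, 2]
)

print(results["agg"]["aucroc"])  # (mean, std) aggregated over the seeds
print(results["seeds"])          # results for each individual seed
print(results["str"]["aucroc"])  # prettified aggregate
```

evaluate_regression_multiple_seeds and evaluate_survival_estimator_multiple_seeds (documented below) follow the same calling pattern, taking `seeds` in place of `seed`.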

evaluate_regression(estimator: Any, X: Union[pandas.core.frame.DataFrame, numpy.ndarray], Y: Union[pandas.core.series.Series, numpy.ndarray, List], n_folds: int = 3, seed: int = 0, pretrained: bool = False, group_ids: Optional[pandas.core.series.Series] = None, *args: Any, **kwargs: Any) → Dict

Helper for evaluating regression tasks.

Parameters
  • estimator – Baseline model to evaluate. If pretrained == False, it must not be fitted.

  • X – pd.DataFrame or np.ndarray covariates

  • Y – pd.Series or np.ndarray or list outcomes

  • n_folds – int Number of cross-validation folds

  • seed – int Random seed

  • pretrained – bool If the estimator was already trained or not.

  • group_ids – pd.Series Optional group_ids for stratified cross-validation

Returns

Dict containing "raw" and "str" nodes. The "str" node contains prettified metrics, while the "raw" node contains tuples of the form (mean, std) for each metric. Both the "raw" and "str" nodes contain the following metrics:

  • "r2": R^2 (coefficient of determination) regression score function.

  • "mse": Mean squared error regression loss.

  • "mae": Mean absolute error regression loss.
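
A minimal sketch, assuming a scikit-learn-style regressor (fit/predict) is accepted as the baseline estimator:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression  # stand-in baseline model

from autoprognosis.utils.tester import evaluate_regression

X_raw, y = make_regression(n_samples=300, n_features=10, random_state=0)
X, Y = pd.DataFrame(X_raw), pd.Series(y)

results = evaluate_regression(LinearRegression(), X, Y, n_folds=3, seed=0)

print(results["str"]["r2"])   # prettified metric
print(results["raw"]["mse"])  # (mean, std) across the folds
```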

evaluate_regression_multiple_seeds(estimator: Any, X: Union[pandas.core.frame.DataFrame, numpy.ndarray], Y: Union[pandas.core.series.Series, numpy.ndarray, List], n_folds: int = 3, pretrained: bool = False, group_ids: Optional[pandas.core.series.Series] = None, seeds: List[int] = [0, 1, 2]) → Dict

Helper for evaluating regression tasks with multiple seeds.

Parameters
  • estimator – Baseline model to evaluate. If pretrained == False, it must not be fitted.

  • X – pd.DataFrame or np.ndarray covariates

  • Y – pd.Series or np.ndarray or list outcomes

  • n_folds – int Number of cross-validation folds

  • seeds – list Random seeds

  • pretrained – bool If the estimator was already trained or not.

  • group_ids – pd.Series Optional group_ids for stratified cross-validation

Returns

Dict containing "seeds", "agg" and "str" nodes. The "str" node contains the aggregated prettified metrics, while the "agg" node contains tuples of the form (mean, std) for each metric. The "seeds" node contains the results for each random seed. Both the "agg" and "str" nodes contain the following metrics:

  • "r2": R^2 (coefficient of determination) regression score function.

  • "mse": Mean squared error regression loss.

  • "mae": Mean absolute error regression loss.

evaluate_survival_estimator(estimator: Any, X: Union[pandas.core.frame.DataFrame, numpy.ndarray], T: Union[pandas.core.series.Series, numpy.ndarray, List], Y: Union[pandas.core.series.Series, numpy.ndarray, List], time_horizons: Union[List[float], numpy.ndarray], n_folds: int = 3, seed: int = 0, pretrained: bool = False, risk_threshold: float = 0.5, group_ids: Optional[pandas.core.series.Series] = None) → Dict

Helper for evaluating survival analysis tasks.

Parameters
  • estimator – Baseline model to evaluate. If pretrained == False, it must not be fitted.

  • X – DataFrame or np.ndarray The covariates

  • T – Series or np.ndarray or list time to event/censoring values

  • Y – Series or np.ndarray or list event or censored

  • time_horizons – list or np.ndarray Horizons at which to evaluate the performance.

  • n_folds – int Number of folds for cross validation

  • seed – int Random seed

  • pretrained – bool If the estimator was trained or not

  • group_ids – Group labels for the samples used while splitting the dataset into train/test set.

Returns

Dict containing "raw", "str" and "horizons" nodes. The "str" node contains prettified metrics, while the "raw" node contains tuples of the form (mean, std) for each metric. The "horizons" node splits the metrics by horizon. Each node contains the following metrics:

  • "c_index": The concordance index (c-index) is a metric for evaluating the predictions made by a survival algorithm. It is defined as the proportion of concordant pairs divided by the total number of possible evaluation pairs.

  • "brier_score": The Brier score is a strictly proper scoring rule that measures the accuracy of probabilistic predictions.

  • "aucroc": The Area Under the Receiver Operating Characteristic Curve (ROC AUC), computed from prediction scores.

  • "sensitivity": Sensitivity (true positive rate) is the probability of a positive test result, conditioned on the individual truly being positive.

  • "specificity": Specificity (true negative rate) is the probability of a negative test result, conditioned on the individual truly being negative.

  • "PPV": The positive predictive value (PPV) is the probability that, following a positive test result, the individual truly has the disease.

  • "NPV": The negative predictive value (NPV) is the probability that, following a negative test result, the individual truly does not have the disease.
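
A minimal sketch on toy data. The estimator interface shown here (fit(X, T, Y) plus per-horizon risk predictions) is an assumption about what the helper expects, not a documented contract; in practice an AutoPrognosis risk-estimation plugin would be passed in:

```python
import numpy as np
import pandas as pd

from autoprognosis.utils.tester import evaluate_survival_estimator

# Toy survival data: covariates X, time-to-event T, event indicator Y (1 = event, 0 = censored).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"x{i}" for i in range(5)])
T = pd.Series(rng.exponential(scale=12.0, size=n))
Y = pd.Series(rng.integers(0, 2, size=n))
time_horizons = [float(T.quantile(q)) for q in (0.25, 0.50, 0.75)]


class DummyRiskModel:
    """Stand-in survival estimator; its interface is assumed, not part of this module."""

    def fit(self, X, T, Y):
        self.base_risk_ = float(np.mean(Y))
        return self

    def predict(self, X, horizons):
        # Constant predicted risk per horizon, just to produce well-formed output.
        return pd.DataFrame(
            np.full((len(X), len(horizons)), self.base_risk_), columns=horizons
        )


results = evaluate_survival_estimator(
    DummyRiskModel(), X, T, Y, time_horizons, n_folds=3, seed=0
)

print(results["str"]["c_index"])  # prettified metric
print(results["horizons"])        # metrics split by time horizon
```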

evaluate_survival_estimator_multiple_seeds(estimator: Any, X: Union[pandas.core.frame.DataFrame, numpy.ndarray], T: Union[pandas.core.series.Series, numpy.ndarray, List], Y: Union[pandas.core.series.Series, numpy.ndarray, List], time_horizons: Union[List[float], numpy.ndarray], n_folds: int = 3, pretrained: bool = False, risk_threshold: float = 0.5, group_ids: Optional[pandas.core.series.Series] = None, seeds: List[int] = [0, 1, 2]) → Dict

Helper for evaluating survival analysis tasks with multiple random seeds.

Parameters
  • estimator – Baseline model to evaluate. If pretrained == False, it must not be fitted.

  • X – DataFrame or np.ndarray The covariates

  • T – Series or np.ndarray or list time to event

  • Y – Series or np.ndarray or list event or censored

  • time_horizons – list or np.ndarray Horizons at which to evaluate the performance.

  • n_folds – int Number of folds for cross validation

  • seeds – List Random seeds

  • pretrained – bool If the estimator was trained or not

  • group_ids – Group labels for the samples used while splitting the dataset into train/test set.

Returns

Dict containing "seeds", "agg" and "str" nodes. The "str" node contains the aggregated prettified metrics, while the "agg" node contains tuples of the form (mean, std) for each metric. The "seeds" node contains the results for each random seed. Both the "agg" and "str" nodes contain the following metrics:

  • "c_index": The concordance index (c-index) is a metric for evaluating the predictions made by a survival algorithm. It is defined as the proportion of concordant pairs divided by the total number of possible evaluation pairs.

  • "brier_score": The Brier score is a strictly proper scoring rule that measures the accuracy of probabilistic predictions.

  • "aucroc": The Area Under the Receiver Operating Characteristic Curve (ROC AUC), computed from prediction scores.

  • "sensitivity": Sensitivity (true positive rate) is the probability of a positive test result, conditioned on the individual truly being positive.

  • "specificity": Specificity (true negative rate) is the probability of a negative test result, conditioned on the individual truly being negative.

  • "PPV": The positive predictive value (PPV) is the probability that, following a positive test result, the individual truly has the disease.

  • "NPV": The negative predictive value (NPV) is the probability that, following a negative test result, the individual truly does not have the disease.

score_classification_model(estimator: Any, X_train: pandas.core.frame.DataFrame, X_test: pandas.core.frame.DataFrame, y_train: pandas.core.frame.DataFrame, y_test: pandas.core.frame.DataFrame) → float
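
The signature suggests a single train/test split rather than cross-validation. A sketch, again assuming a scikit-learn-style classifier is accepted; what the returned float measures is not specified above:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from autoprognosis.utils.tester import score_classification_model

X_raw, y = make_classification(n_samples=300, n_features=10, random_state=0)
X = pd.DataFrame(X_raw)
Y = pd.DataFrame({"label": y})

X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)

score = score_classification_model(
    LogisticRegression(max_iter=1000), X_train, X_test, y_train, y_test
)
print(score)  # single float score
```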