AutoPrognosis documentation!

AutoPrognosis - A system for automating the design of predictive modeling pipelines tailored for clinical prognosis.


πŸ”‘ Features

  • πŸš€ Automatically learns ensembles of pipelines for classification, regression or survival analysis tasks.

  • πŸŒ€ Easily extensible, plugin-based architecture.

  • πŸ”₯ Interpretability and uncertainty quantification tools.

  • 🩹 Data imputation using HyperImpute.

  • ⚑ Build demonstrators using Streamlit.

  • πŸ““ Python and R tutorials available.

πŸš€ Installation

Using pip

The library can be installed from PyPI using

$ pip install autoprognosis

or from source, using

$ pip install .

Environment variables

The library can be configured using a set of environment variables:

  • N_OPT_JOBS: number of cores to use for the hyperparameter search. Default: 1.

  • N_LEARNER_JOBS: number of cores to use for each individual learner. Default: all CPUs.

  • REDIS_HOST: IP address of the Redis database. Default: 127.0.0.1.

  • REDIS_PORT: Redis port. Default: 6379.

Example: export N_OPT_JOBS=2 to use 2 cores for the hyperparameter search.
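The same configuration can also be set from Python before autoprognosis is used. This is an illustrative sketch only, not part of the official API, and it assumes the variables are read once the library is imported:

# Illustrative sketch (not from the original docs): configure AutoPrognosis from Python.
# Assumption: the environment variables are picked up when autoprognosis is first
# imported/used, so they must be set beforehand.
import os

os.environ["N_OPT_JOBS"] = "2"      # 2 cores for the hyperparameter search
os.environ["N_LEARNER_JOBS"] = "4"  # 4 cores per individual learner

import autoprognosis  # noqa: E402  (imported after the environment is configured)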

πŸ’₯ Sample Usage

Advanced Python tutorials can be found in the Python tutorials section.

R examples can be found in the R tutorials section.

List the available classifiers

from autoprognosis.plugins.prediction.classifiers import Classifiers
print(Classifiers().list_available())

Create a study for classifiers

from sklearn.datasets import load_breast_cancer

from autoprognosis.studies.classifiers import ClassifierStudy
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_estimator


X, Y = load_breast_cancer(return_X_y=True, as_frame=True)

df = X.copy()
df["target"] = Y

study_name = "example"

study = ClassifierStudy(
    study_name=study_name,
    dataset=df,  # pandas DataFrame
    target="target",  # the label column in the dataset
)
model = study.fit()

# Predict the probabilities of each class using the model
model.predict_proba(X)

(Advanced) Customize the study for classifiers

from pathlib import Path

from sklearn.datasets import load_breast_cancer

from autoprognosis.studies.classifiers import ClassifierStudy
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_estimator


X, Y = load_breast_cancer(return_X_y=True, as_frame=True)

df = X.copy()
df["target"] = Y

workspace = Path("workspace")
study_name = "example"

study = ClassifierStudy(
    study_name=study_name,
    dataset=df,  # pandas DataFrame
    target="target",  # the label column in the dataset
    num_iter=100,  # how many trials to do for each candidate
    timeout=60,  # seconds
    classifiers=["logistic_regression", "lda", "qda"],
    workspace=workspace,
)

study.run()

output = workspace / study_name / "model.p"
model = load_model_from_file(output)

# <model> contains the optimal architecture, but the model is not trained yet. You need to call fit() to use it.
# This way, we can further benchmark the selected model on the training set.
metrics = evaluate_estimator(model, X, Y)

print(f"model {model.name()} -> {metrics['clf']}")

# Train the model
model.fit(X, Y)

# Predict the probabilities of each class using the model
model.predict_proba(X)
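To keep the trained model for later use, it can be written back to the workspace. A minimal sketch, assuming save_model_to_file is the counterpart of the load_model_from_file helper used above:

# Sketch: persist the fitted model next to the searched architecture.
# Assumption: save_model_to_file(path, model) mirrors load_model_from_file(path).
from autoprognosis.utils.serialization import save_model_to_file

save_model_to_file(workspace / study_name / "trained_model.p", model)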

List the available regressors

from autoprognosis.plugins.prediction.regression import Regression
print(Regression().list_available())

Create a Regression study

# third party
import pandas as pd

# autoprognosis absolute
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_regression
from autoprognosis.studies.regression import RegressionStudy

# Load dataset
df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat",
    header=None,
    sep="\\t",
)
last_col = df.columns[-1]
y = df[last_col]
X = df.drop(columns=[last_col])

df = X.copy()
df["target"] = y

# Search the model
study_name="regression_example"
study = RegressionStudy(
    study_name=study_name,
    dataset=df,  # pandas DataFrame
    target="target",  # the label column in the dataset
)
model = study.fit()

# Predict using the model
model.predict(X)

(Advanced) Customize the Regression study

# stdlib
from pathlib import Path

# third party
import pandas as pd

# autoprognosis absolute
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_regression
from autoprognosis.studies.regression import RegressionStudy

# Load dataset
df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat",
    header=None,
    sep="\\t",
)
last_col = df.columns[-1]
y = df[last_col]
X = df.drop(columns=[last_col])

df = X.copy()
df["target"] = y

# Search the model
workspace = Path("workspace")
workspace.mkdir(parents=True, exist_ok=True)

study_name="regression_example"
study = RegressionStudy(
    study_name=study_name,
    dataset=df,  # pandas DataFrame
    target="target",  # the label column in the dataset
    num_iter=10,  # how many trials to do for each candidate. Default: 50
    num_study_iter=2,  # how many outer iterations to do. Default: 5
    timeout=50,  # timeout for the optimization of each candidate model. Default: 600 seconds
    regressors=["linear_regression", "xgboost_regressor"],
    workspace=workspace,
)

study.run()

# Test the model
output = workspace / study_name / "model.p"

model = load_model_from_file(output)
# <model> contains the optimal architecture, but the model is not trained yet. You need to call fit() to use it.
# This way, we can further benchmark the selected model on the training set.

metrics = evaluate_regression(model, X, y)

print(f"Model {model.name()} score: {metrics['str']}")

# Train the model
model.fit(X, y)


# Predict using the model
model.predict(X)

List available survival analysis estimators

from autoprognosis.plugins.prediction.risk_estimation import RiskEstimation
print(RiskEstimation().list_available())

Create a Survival analysis study

# third party
import numpy as np
from pycox import datasets

# autoprognosis absolute
from autoprognosis.studies.risk_estimation import RiskEstimationStudy
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_survival_estimator

df = datasets.gbsg.read_df()
df = df[df["duration"] > 0]

X = df.drop(columns=["duration"])
T = df["duration"]
Y = df["event"]

eval_time_horizons = np.linspace(T.min(), T.max(), 5)[1:-1]

study_name = "example_risks"

study = RiskEstimationStudy(
    study_name=study_name,
    dataset=df,
    target="event",
    time_to_event="duration",
    time_horizons=eval_time_horizons,
)

model = study.fit()

# Predict using the model
model.predict(X, eval_time_horizons)

(Advanced) Customize the Survival analysis study

# stdlib
from pathlib import Path

# third party
import numpy as np
from pycox import datasets

# autoprognosis absolute
from autoprognosis.studies.risk_estimation import RiskEstimationStudy
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_survival_estimator

df = datasets.gbsg.read_df()
df = df[df["duration"] > 0]

X = df.drop(columns=["duration"])
T = df["duration"]
Y = df["event"]

eval_time_horizons = np.linspace(T.min(), T.max(), 5)[1:-1]

workspace = Path("workspace")
study_name = "example_risks"

study = RiskEstimationStudy(
    study_name=study_name,
    dataset=df,
    target="event",
    time_to_event="duration",
    time_horizons=eval_time_horizons,
    num_iter=10,
    num_study_iter=1,
    timeout=10,
    risk_estimators=["cox_ph", "survival_xgboost"],
    score_threshold=0.5,
    workspace=workspace,
)

study.run()

output = workspace / study_name / "model.p"

model = load_model_from_file(output)
# <model> contains the optimal architecture, but the model is not trained yet. You need to call fit() to use it.
# This way, we can further benchmark the selected model on the training set.

metrics = evaluate_survival_estimator(model, X, T, Y, eval_time_horizons)

print(f"Model {model.name()} score: {metrics['clf']}")

# Train the model
model.fit(X, T, Y)

# Predict using the model
model.predict(X, eval_time_horizons)

⚑ Plugins

from autoprognosis.plugins.imputers import Imputers

imputer = Imputers().get(<NAME>)

from autoprognosis.plugins.preprocessors import Preprocessors

preprocessor = Preprocessors().get(<NAME>)

  • maxabs_scaler: Scale each feature by its maximum absolute value. MaxAbsScaler (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html)

  • scaler: Standardize features by removing the mean and scaling to unit variance. StandardScaler (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)

  • feature_normalizer: Normalize samples individually to unit norm. Normalizer (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer)

  • normal_transform: Transform features using quantile information. QuantileTransformer (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html#sklearn.preprocessing.QuantileTransformer)

  • uniform_transform: Transform features using quantile information. QuantileTransformer (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html#sklearn.preprocessing.QuantileTransformer)

  • minmax_scaler: Transform features by scaling each feature to a given range. MinMaxScaler (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler)
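As an illustration, the plugins above can be applied directly to a DataFrame. This is a hedged sketch: the "mean" imputer name and the fit_transform interface are assumptions based on the plugin pattern used throughout the docs.

# Minimal sketch: impute and scale a small DataFrame with individual plugins.
# Assumptions: a "mean" imputer plugin exists and plugins expose fit_transform.
import numpy as np
import pandas as pd

from autoprognosis.plugins.imputers import Imputers
from autoprognosis.plugins.preprocessors import Preprocessors

X = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [0.5, 2.0, np.nan]})

imputer = Imputers().get("mean")               # replace missing values with the column mean
X_imputed = imputer.fit_transform(X)

scaler = Preprocessors().get("minmax_scaler")  # scale each feature to a given range
X_scaled = scaler.fit_transform(X_imputed)

print(X_scaled)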

from autoprognosis.plugins.prediction.classifiers import Classifiers

classifier = Classifiers().get(<NAME>)

  • neural_nets: PyTorch based neural net classifier.

  • logistic_regression: LogisticRegression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

  • catboost: Gradient boosting on decision trees. CatBoost (https://catboost.ai/)

  • random_forest: A random forest classifier. RandomForestClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

  • tabnet: TabNet: Attentive Interpretable Tabular Learning (https://github.com/dreamquark-ai/tabnet)

  • xgboost: XGBoostClassifier (https://xgboost.readthedocs.io/en/stable/)
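Individual classifier plugins can also be fitted directly, outside of a study. A minimal sketch, assuming the plugins follow the sklearn-style fit / predict_proba interface used by the study models above:

# Minimal sketch: train a single classifier plugin on a toy dataset.
# Assumption: classifier plugins expose sklearn-style fit / predict / predict_proba.
from sklearn.datasets import load_breast_cancer

from autoprognosis.plugins.prediction.classifiers import Classifiers

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

model = Classifiers().get("logistic_regression")
model.fit(X, y)

print(model.predict_proba(X))  # class probabilities for the training data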

from autoprognosis.plugins.prediction.risk_estimation import RiskEstimation

predictor = RiskEstimation().get(<NAME>)

  • survival_xgboost: XGBoost Survival Embeddings (https://github.com/loft-br/xgboost-survival-embeddings)

  • loglogistic_aft: Log-Logistic AFT model (https://lifelines.readthedocs.io/en/latest/fitters/regression/LogLogisticAFTFitter.html)

  • deephit: DeepHit: A Deep Learning Approach to Survival Analysis with Competing Risks (https://github.com/chl8856/DeepHit)

  • cox_ph: Cox’s proportional hazard model (https://lifelines.readthedocs.io/en/latest/fitters/regression/CoxPHFitter.html)

  • weibull_aft: Weibull AFT model (https://lifelines.readthedocs.io/en/latest/fitters/regression/WeibullAFTFitter.html)

  • lognormal_aft: Log-Normal AFT model (https://lifelines.readthedocs.io/en/latest/fitters/regression/LogNormalAFTFitter.html)

  • coxnet: CoxNet is a Cox proportional hazards model also referred to as DeepSurv (https://github.com/havakv/pycox)

from autoprognosis.plugins.prediction.regression import Regression

regressor = Regression().get(<NAME>)

  • tabnet_regressor: TabNet: Attentive Interpretable Tabular Learning (https://github.com/dreamquark-ai/tabnet)

  • catboost_regressor: Gradient boosting on decision trees. CatBoost (https://catboost.ai/)

  • random_forest_regressor: RandomForestRegressor (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)

  • xgboost_regressor: XGBoostRegressor (https://xgboost.readthedocs.io/en/stable/)

  • neural_nets_regression: PyTorch-based neural net regressor.

  • linear_regression: LinearRegression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

from autoprognosis.plugins.explainers import Explainers

explainer = Explainers().get(<NAME>)

  • risk_effect_size: Feature importance using Cohen’s distance between probabilities.

  • lime: Lime: Explaining the predictions of any machine learning classifier (https://github.com/marcotcr/lime)

  • symbolic_pursuit: Symbolic Pursuit, from "Learning outside the black-box: at the pursuit of interpretable models".

  • shap_permutation_sampler: SHAP Permutation Sampler (https://shap.readthedocs.io/en/latest/generated/shap.explainers.Permutation.html)

  • kernel_shap: SHAP KernelExplainer (https://shap-lrjball.readthedocs.io/en/latest/generated/shap.KernelExplainer.html)

  • invase: INVASE: Instance-wise Variable Selection (https://github.com/vanderschaarlab/invase)

from autoprognosis.plugins.uncertainty import UncertaintyQuantification
model = UncertaintyQuantification().get(<NAME>)
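The available uncertainty quantification plugins can be listed the same way as the other plugin groups. A minimal sketch, assuming the collection exposes the same list_available() helper as the classifiers and regressors:

# Sketch: list the uncertainty quantification plugins shipped with the library.
from autoprognosis.plugins.uncertainty import UncertaintyQuantification

print(UncertaintyQuantification().list_available())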

πŸ”¨ Test

After installing the library, the tests can be executed using pytest

$ pip install .[testing]
$ pytest -vxs -m "not slow"

Citing

If you use this code, please cite the associated paper:

@misc{https://doi.org/10.48550/arxiv.2210.12090,
  doi = {10.48550/ARXIV.2210.12090},
  url = {https://arxiv.org/abs/2210.12090},
  author = {Imrie, Fergus and Cebere, Bogdan and McKinney, Eoin F. and van der Schaar, Mihaela},
  keywords = {Machine Learning (cs.LG), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {AutoPrognosis 2.0: Democratizing Diagnostic and Prognostic Modeling in Healthcare with Automated Machine Learning},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

References

  1. AutoPrognosis: Automated Clinical Prognostic Modeling via Bayesian Optimization with Structured Kernel Learning

  2. Prognostication and Risk Factors for Cystic Fibrosis via Automated Machine Learning

  3. Cardiovascular Disease Risk Prediction using Automated Machine Learning: A Prospective Study of 423,604 UK Biobank Participants

Examples

  • AutoML studies

  • Imputation plugins

  • Preprocessing plugins

  • Prediction plugins

  • Explainability plugins

Benchmarks