AutoPrognosis documentation!

AutoPrognosis - A system for automating the design of predictive modeling pipelines tailored for clinical prognosis.

🔑 Features

🚀 Automatically learns ensembles of pipelines for classification, regression or survival analysis tasks.
🌀 Easy-to-extend pluggable architecture.
🔥 Interpretability and uncertainty quantification tools.
🩹 Data imputation using HyperImpute.
⚡ Build demonstrators using Streamlit.
📓 Python and R tutorials available.

🚀 Installation

Using pip

The library can be installed from PyPI using

$ pip install autoprognosis

or from source, using

$ pip install .

Redis (Optional, but recommended)

AutoPrognosis can use Redis as a backend to improve the performance and quality of the searches.

For that, install the redis-server package following the steps described on the official site.
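
Once the server is installed, a quick reachability check can save a confusing search run later. The sketch below uses only the standard library; `redis_reachable` is a hypothetical helper, not part of AutoPrognosis, and it assumes an unauthenticated Redis server on the default address 127.0.0.1:6379.

```python
import socket


def redis_reachable(host: str = "127.0.0.1", port: int = 6379, timeout: float = 1.0) -> bool:
    """Return True if a server at host:port answers a Redis PING with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            conn.sendall(b"PING\r\n")  # inline form of the Redis PING command
            reply = conn.recv(64)
            return reply.startswith(b"+PONG")
    except OSError:
        # Connection refused, timeout, or any other socket-level failure
        return False


print(f"Redis reachable: {redis_reachable()}")
```

A server that requires a password replies with an error instead of `+PONG`, so this check reports `False` in that case.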

Environment variables

The library can be configured from a set of environment variables.

Variable	Description
`N_OPT_JOBS`	Number of cores to use for hyperparameter search. Default: 1
`N_LEARNER_JOBS`	Number of cores to use by individual learners. Default: all CPUs
`REDIS_HOST`	IP address for the Redis database. Default: 127.0.0.1
`REDIS_PORT`	Redis port. Default: 6379

Example: export N_OPT_JOBS=2 to use 2 cores for hyperparameter search.
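
The same variables can also be set from Python, for example in a notebook. The sketch below assumes the library reads them when it is imported, so they are set first; `documented_settings` is a hypothetical helper that only mirrors the defaults listed in the table and is not part of AutoPrognosis.

```python
import os

# Set the variables before importing autoprognosis
# (assumption: the library reads them at import time).
os.environ["N_OPT_JOBS"] = "2"


def documented_settings(environ=os.environ) -> dict:
    """Resolve the four variables using the defaults from the table above."""
    return {
        "N_OPT_JOBS": int(environ.get("N_OPT_JOBS", 1)),
        "N_LEARNER_JOBS": int(environ.get("N_LEARNER_JOBS", os.cpu_count() or 1)),
        "REDIS_HOST": environ.get("REDIS_HOST", "127.0.0.1"),
        "REDIS_PORT": int(environ.get("REDIS_PORT", 6379)),
    }


print(documented_settings())
```

Environment variables are always strings, which is why the numeric settings are converted with `int()`.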

💥 Sample Usage

Advanced Python tutorials can be found in the Python tutorials section.

R examples can be found in the R tutorials section.

List the available classifiers

from autoprognosis.plugins.prediction.classifiers import Classifiers
print(Classifiers().list_available())
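
The printed names are the identifiers a study accepts when its search is restricted to specific classifiers. A small check like the one below can catch a misspelled name before a long search starts. It is a sketch: `unknown_plugins` is a hypothetical helper, and the `available` list is a stand-in for the output of `Classifiers().list_available()`, which is assumed to be a list of strings and depends on the installed version.

```python
from typing import Iterable, List


def unknown_plugins(requested: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return the requested names that are missing from the available list, in order."""
    known = set(available)
    return [name for name in requested if name not in known]


# Stand-in for Classifiers().list_available(); illustrative values only.
available = ["logistic_regression", "lda", "qda", "random_forest"]

print(unknown_plugins(["logistic_regression", "ldaa"], available))
```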

Create a study for classifiers

from sklearn.datasets import load_breast_cancer

from autoprognosis.studies.classifiers import ClassifierStudy


X, Y = load_breast_cancer(return_X_y=True, as_frame=True)

df = X.copy()
df["target"] = Y

study_name = "example"

study = ClassifierStudy(
    study_name=study_name,
    dataset=df,  # pandas DataFrame
    target="target",  # the label column in the dataset
)
model = study.fit()

# Predict the probabilities of each class using the model
model.predict_proba(X)

(Advanced) Customize the study for classifiers

from pathlib import Path

from sklearn.datasets import load_breast_cancer

from autoprognosis.studies.classifiers import ClassifierStudy
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_estimator


X, Y = load_breast_cancer(return_X_y=True, as_frame=True)

df = X.copy()
df["target"] = Y

workspace = Path("workspace")
study_name = "example"

study = ClassifierStudy(
    study_name=study_name,
    dataset=df,  # pandas DataFrame
    target="target",  # the label column in the dataset
    num_iter=100,  # how many trials to do for each candidate
    timeout=60,  # seconds
    classifiers=["logistic_regression", "lda", "qda"],
    workspace=workspace,
)

study.run()

output = workspace / study_name / "model.p"
model = load_model_from_file(output)

# <model> contains the optimal architecture, but the model is not trained yet. You need to call fit() to use it.
# This way, we can further benchmark the selected model on the training set.
metrics = evaluate_estimator(model, X, Y)

print(f"model {model.name()} -> {metrics['clf']}")

# Train the model
model.fit(X, Y)

# Predict the probabilities of each class using the model
model.predict_proba(X)
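
The example above loads the selected model from `<workspace>/<study_name>/model.p`. A minimal sketch for checking that the file exists before loading it follows; `saved_model_path` is a hypothetical helper that only reproduces the path arithmetic used in the example.

```python
from pathlib import Path


def saved_model_path(workspace: Path, study_name: str) -> Path:
    """Build the path used in the example: <workspace>/<study_name>/model.p."""
    return Path(workspace) / study_name / "model.p"


output = saved_model_path(Path("workspace"), "example")
if output.is_file():
    print(f"Found saved model at {output}")
else:
    print(f"No model at {output}; has study.run() completed?")
```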

List the available regressors

from autoprognosis.plugins.prediction.regression import Regression
print(Regression().list_available())

Create a Regression study

# third party
import pandas as pd

# autoprognosis absolute
from autoprognosis.studies.regression import RegressionStudy

# Load dataset
df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat",
    header=None,
    sep="\t",
)
last_col = df.columns[-1]
y = df[last_col]
X = df.drop(columns=[last_col])

df = X.copy()
df["target"] = y

# Search the model
study_name="regression_example"
study = RegressionStudy(
    study_name=study_name,
    dataset=df,  # pandas DataFrame
    target="target",  # the label column in the dataset
)
model = study.fit()

# Predict using the model
model.predict(X)

(Advanced) Customize the Regression study

# stdlib
from pathlib import Path

# third party
import pandas as pd

# autoprognosis absolute
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_regression
from autoprognosis.studies.regression import RegressionStudy

# Load dataset
df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat",
    header=None,
    sep="\t",
)
last_col = df.columns[-1]
y = df[last_col]
X = df.drop(columns=[last_col])

df = X.copy()
df["target"] = y

# Search the model
workspace = Path("workspace")
workspace.mkdir(parents=True, exist_ok=True)

study_name="regression_example"
study = RegressionStudy(
    study_name=study_name,
    dataset=df,  # pandas DataFrame
    target="target",  # the label column in the dataset
    num_iter=10,  # how many trials to do for each candidate. Default: 50
    num_study_iter=2,  # how many outer iterations to do. Default: 5
    timeout=50,  # timeout for optimization for each regressor. Default: 600 seconds
    regressors=["linear_regression", "xgboost_regressor"],
    workspace=workspace,
)

study.run()

# Test the model
output = workspace / study_name / "model.p"

model = load_model_from_file(output)
# <model> contains the optimal architecture, but the model is not trained yet. You need to call fit() to use it.
# This way, we can further benchmark the selected model on the training set.

metrics = evaluate_regression(model, X, y)

print(f"Model {model.name()} score: {metrics['str']}")

# Train the model
model.fit(X, y)


# Predict using the model
model.predict(X)

List available survival analysis estimators

from autoprognosis.plugins.prediction.risk_estimation import RiskEstimation
print(RiskEstimation().list_available())

Create a Survival analysis study

# third party
import numpy as np
from pycox import datasets

# autoprognosis absolute
from autoprognosis.studies.risk_estimation import RiskEstimationStudy

df = datasets.gbsg.read_df()
df = df[df["duration"] > 0]

X = df.drop(columns=["duration", "event"])  # features only: drop both the time and the event label
T = df["duration"]
Y = df["event"]

eval_time_horizons = np.linspace(T.min(), T.max(), 5)[1:-1]

study_name = "example_risks"

study = RiskEstimationStudy(
    study_name=study_name,
    dataset=df,
    target="event",
    time_to_event="duration",
    time_horizons=eval_time_horizons,
)

model = study.fit()

# Predict using the model
model.predict(X, eval_time_horizons)
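
The expression `np.linspace(T.min(), T.max(), 5)[1:-1]` in the example splits the observed duration range into four equal parts and keeps the three interior cut points, so the horizons sit at 25%, 50% and 75% of the range. The pure-Python sketch below reproduces that arithmetic; `interior_horizons` is a hypothetical name used only for illustration.

```python
def interior_horizons(t_min: float, t_max: float, n_points: int = 5) -> list:
    """Evenly spaced points from t_min to t_max, with both endpoints dropped."""
    step = (t_max - t_min) / (n_points - 1)
    grid = [t_min + i * step for i in range(n_points)]
    return grid[1:-1]


# Durations ranging from 1 to 9 give horizons at 3, 5 and 7.
print(interior_horizons(1.0, 9.0))  # [3.0, 5.0, 7.0]
```

Dropping the endpoints avoids evaluating at the very first and very last observed times.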

(Advanced) Customize the Survival analysis study

# stdlib
from pathlib import Path

# third party
import numpy as np
from pycox import datasets

# autoprognosis absolute
from autoprognosis.studies.risk_estimation import RiskEstimationStudy
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_survival_estimator

df = datasets.gbsg.read_df()
df = df[df["duration"] > 0]

X = df.drop(columns=["duration", "event"])  # features only: drop both the time and the event label
T = df["duration"]
Y = df["event"]

eval_time_horizons = np.linspace(T.min(), T.max(), 5)[1:-1]

workspace = Path("workspace")
study_name = "example_risks"

study = RiskEstimationStudy(
    study_name=study_name,
    dataset=df,
    target="event",
    time_to_event="duration",
    time_horizons=eval_time_horizons,
    num_iter=10,
    num_study_iter=1,
    timeout=10,
    risk_estimators=["cox_ph", "survival_xgboost"],
    score_threshold=0.5,
    workspace=workspace,
)

study.run()

output = workspace / study_name / "model.p"

model = load_model_from_file(output)
# <model> contains the optimal architecture, but the model is not trained yet. You need to call fit() to use it.
# This way, we can further benchmark the selected model on the training set.

metrics = evaluate_survival_estimator(model, X, T, Y, eval_time_horizons)

print(f"Model {model.name()} score: {metrics['clf']}")

# Train the model
model.fit(X, T, Y)

# Predict using the model
model.predict(X, eval_time_horizons)

⚡ Plugins

from autoprognosis.plugins.imputers import  Imputers

imputer = Imputers().get(<NAME>)

from autoprognosis.plugins.preprocessors import Preprocessors

preprocessor = Preprocessors().get(<NAME>)

Name	Description
maxabs_scaler	Scale each feature by its maximum absolute value. See MaxAbsScaler (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html)
scaler	Standardize features by removing the mean and scaling to unit variance. See StandardScaler (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)
feature_normalizer	Normalize samples individually to unit norm. See Normalizer (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer)
normal_transform	Transform features using quantile information so that they follow a normal distribution. See QuantileTransformer (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html#sklearn.preprocessing.QuantileTransformer)
uniform_transform	Transform features using quantile information so that they follow a uniform distribution. See QuantileTransformer (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html#sklearn.preprocessing.QuantileTransformer)
minmax_scaler	Transform features by scaling each feature to a given range. See MinMaxScaler (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler)

from autoprognosis.plugins.prediction.classifiers import Classifiers

classifier = Classifiers().get(<NAME>)

Name	Description
neural_nets	PyTorch based neural net classifier.
logistic_regression	LogisticRegression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
catboost	Gradient boosting on decision trees. See CatBoost (https://catboost.ai/)
random_forest	A random forest classifier. See RandomForestClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
tabnet	TabNet: Attentive Interpretable Tabular Learning (https://github.com/dreamquark-ai/tabnet)
xgboost	XGBoostClassifier (https://xgboost.readthedocs.io/en/stable/)

from autoprognosis.plugins.prediction.risk_estimation import RiskEstimation

predictor = RiskEstimation().get(<NAME>)

Name	Description
survival_xgboost	XGBoost Survival Embeddings (https://github.com/loft-br/xgboost-survival-embeddings)
loglogistic_aft	Log-Logistic AFT model (https://lifelines.readthedocs.io/en/latest/fitters/regression/LogLogisticAFTFitter.html)
deephit	DeepHit: A Deep Learning Approach to Survival Analysis with Competing Risks (https://github.com/chl8856/DeepHit)
cox_ph	Cox’s proportional hazard model (https://lifelines.readthedocs.io/en/latest/fitters/regression/CoxPHFitter.html)
weibull_aft	Weibull AFT model (https://lifelines.readthedocs.io/en/latest/fitters/regression/WeibullAFTFitter.html)
lognormal_aft	Log-Normal AFT model (https://lifelines.readthedocs.io/en/latest/fitters/regression/LogNormalAFTFitter.html)
coxnet	CoxNet, a Cox proportional hazards model also referred to as DeepSurv (https://github.com/havakv/pycox)

from autoprognosis.plugins.prediction.regression import Regression

regressor = Regression().get(<NAME>)

Name	Description
tabnet_regressor	TabNet: Attentive Interpretable Tabular Learning (https://github.com/dreamquark-ai/tabnet)
catboost_regressor	Gradient boosting on decision trees. See CatBoost (https://catboost.ai/)
random_forest_regressor	RandomForestRegressor (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
xgboost_regressor	XGBoostRegressor (https://xgboost.readthedocs.io/en/stable/)
neural_nets_regression	PyTorch-based neural net regressor.
linear_regression	LinearRegression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

from autoprognosis.plugins.explainers import Explainers

explainer = Explainers().get(<NAME>)

Name	Description
risk_effect_size	Feature importance using Cohen’s distance between probabilities
lime	Lime: Explaining the predictions of any machine learning classifier (https://github.com/marcotcr/lime)
symbolic_pursuit	Symbolic Pursuit, from the paper "Learning outside the black-box: the pursuit of interpretable models"
shap_permutation_sampler	SHAP Permutation Sampler (https://shap.readthedocs.io/en/latest/generated/shap.explainers.Permutation.html)
kernel_shap	SHAP KernelExplainer (https://shap-lrjball.readthedocs.io/en/latest/generated/shap.KernelExplainer.html)
invase	INVASE: Instance-wise Variable Selection (https://github.com/vanderschaarlab/invase)

from autoprognosis.plugins.uncertainty import UncertaintyQuantification
model = UncertaintyQuantification().get(<NAME>)
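
Every plugin family above follows the same pattern: a loader object lists the available names and returns an instance for a given name. The toy registry below illustrates that lookup pattern in isolation. It is not AutoPrognosis code; `PluginRegistry`, `add`, and the two plugin names are hypothetical and exist only for this sketch.

```python
from typing import Callable, Dict, List


class PluginRegistry:
    """Toy name-to-factory registry; illustrative only, not the library's implementation."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[..., object]] = {}

    def add(self, name: str, factory: Callable[..., object]) -> None:
        # Register a factory under a unique name
        self._factories[name] = factory

    def list_available(self) -> List[str]:
        return sorted(self._factories)

    def get(self, name: str, **kwargs) -> object:
        # Fail early with the list of valid names when the lookup misses
        if name not in self._factories:
            raise ValueError(f"Unknown plugin {name!r}; available: {self.list_available()}")
        return self._factories[name](**kwargs)


registry = PluginRegistry()
registry.add("identity_scaler", lambda: (lambda x: x))  # hypothetical plugin
registry.add("doubling_scaler", lambda factor=2: (lambda x: factor * x))  # hypothetical plugin

print(registry.list_available())  # ['doubling_scaler', 'identity_scaler']
scale = registry.get("doubling_scaler", factor=3)
print(scale(4))  # 12
```

Keeping lookup by name separate from construction is what lets a study refer to estimators by plain strings such as "logistic_regression".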

🔨 Test

After installing the library, the tests can be executed using pytest

$ pip install .[testing]
$ pytest -vxs -m "not slow"

Citing

If you use this code, please cite the associated paper:

@misc{https://doi.org/10.48550/arxiv.2210.12090,
  doi = {10.48550/ARXIV.2210.12090},
  url = {https://arxiv.org/abs/2210.12090},
  author = {Imrie, Fergus and Cebere, Bogdan and McKinney, Eoin F. and van der Schaar, Mihaela},
  keywords = {Machine Learning (cs.LG), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
  title = {AutoPrognosis 2.0: Democratizing Diagnostic and Prognostic Modeling in Healthcare with Automated Machine Learning},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}