AutoPrognosis documentation!
AutoPrognosis - A system for automating the design of predictive modeling pipelines tailored for clinical prognosis.
Features
Automatically learns ensembles of pipelines for classification, regression or survival analysis tasks.
Easy to extend pluginable architecture.
Interpretability and uncertainty quantification tools.
Data imputation using HyperImpute.
Build demonstrators using Streamlit.
Python and R tutorials available.
Installation
Using pip
The library can be installed from PyPI using
$ pip install autoprognosis
or from source, using
$ pip install .
Redis (Optional, but recommended)
AutoPrognosis can use Redis as a backend to improve the performance and quality of the searches.
For that, install the redis-server package following the steps described on the official site.
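Before starting a long search, it can help to confirm that something is listening on the Redis address. The snippet below is a minimal sketch using only the Python standard library; `redis_reachable` is a hypothetical helper, not part of AutoPrognosis, and the host and port are the defaults quoted on this page. It only checks that a TCP connection succeeds, not that the service is actually Redis.

```python
import socket


def redis_reachable(host: str = "127.0.0.1", port: int = 6379, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (including timeouts) when nothing answers.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("Redis reachable:", redis_reachable())
```

If this prints `False`, the Redis server is probably not running or is bound to a different address.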
Environment variables
The library can be configured from a set of environment variables.
Variable | Description |
---|---|
N_OPT_JOBS | Number of cores to use for hyperparameter search. Default: 1 |
 | Number of cores to use by individual learners. Default: all CPUs |
 | IP address for the Redis database. Default: 127.0.0.1 |
 | Redis port. Default: 6379 |
Example: export N_OPT_JOBS=2
to use 2 cores for the hyperparameter search.
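The same setting can also be applied from Python, for instance in a notebook where exporting shell variables is inconvenient. This is a sketch under two assumptions: that the library reads its configuration when it is imported (so the variable is set first), and that `int_from_env` is a hypothetical helper written here only to show the "value or default" lookup.

```python
import os


def int_from_env(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to `default`."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


# Set the variable for this process. Assumption: do this before importing
# autoprognosis, in case the library reads its configuration at import time.
os.environ["N_OPT_JOBS"] = "2"

n_opt_jobs = int_from_env("N_OPT_JOBS", default=1)
print(n_opt_jobs)
```

Environment variables are always strings, which is why the value is written as `"2"` and converted on read.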
Sample Usage
Advanced Python tutorials can be found in the Python tutorials section.
R examples can be found in the R tutorials section.
List the available classifiers
from autoprognosis.plugins.prediction.classifiers import Classifiers
print(Classifiers().list_available())
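The returned list is useful for checking names before passing them to a study, as the advanced example later on this page does with its `classifiers` argument. The sketch below assumes `list_available()` returns a list of name strings; `check_plugin_names` is a hypothetical helper, and a hard-coded list stands in for the real call so that the snippet runs without the library installed.

```python
from typing import Iterable, List


def check_plugin_names(requested: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return `requested` as a list, raising ValueError if any name is not in `available`."""
    available_set = set(available)
    requested_list = list(requested)
    unknown = [name for name in requested_list if name not in available_set]
    if unknown:
        raise ValueError(f"Unknown plugin names: {unknown}")
    return requested_list


# In practice `available` would come from Classifiers().list_available();
# this illustrative list is an assumption, not the library's actual output.
available = ["logistic_regression", "lda", "qda", "random_forest"]
selected = check_plugin_names(["logistic_regression", "lda"], available)
print(selected)
```

Failing early on a misspelled name is cheaper than discovering it after a search has already started.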
Create a study for classifiers
from sklearn.datasets import load_breast_cancer
from autoprognosis.studies.classifiers import ClassifierStudy
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_estimator
X, Y = load_breast_cancer(return_X_y=True, as_frame=True)
df = X.copy()
df["target"] = Y
study_name = "example"
study = ClassifierStudy(
study_name=study_name,
dataset=df, # pandas DataFrame
target="target", # the label column in the dataset
)
model = study.fit()
# Predict the probabilities of each class using the model
model.predict_proba(X)
(Advanced) Customize the study for classifiers
from pathlib import Path
from sklearn.datasets import load_breast_cancer
from autoprognosis.studies.classifiers import ClassifierStudy
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_estimator
X, Y = load_breast_cancer(return_X_y=True, as_frame=True)
df = X.copy()
df["target"] = Y
workspace = Path("workspace")
study_name = "example"
study = ClassifierStudy(
study_name=study_name,
dataset=df, # pandas DataFrame
target="target", # the label column in the dataset
num_iter=100, # how many trials to do for each candidate
timeout=60, # seconds
classifiers=["logistic_regression", "lda", "qda"],
workspace=workspace,
)
study.run()
output = workspace / study_name / "model.p"
model = load_model_from_file(output)
# <model> contains the optimal architecture, but the model is not trained yet. You need to call fit() to use it.
# This way, we can further benchmark the selected model on the training set.
metrics = evaluate_estimator(model, X, Y)
print(f"model {model.name()} -> {metrics['clf']}")
# Train the model
model.fit(X, Y)
# Predict the probabilities of each class using the model
model.predict_proba(X)
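The example above loads the selected model from `workspace / study_name / "model.p"`. A small guard around that path avoids a confusing error when the study has not been run yet. This is a sketch: `saved_model_path` is a hypothetical helper that only mirrors the path layout used in the example, not a documented API.

```python
from pathlib import Path


def saved_model_path(workspace: Path, study_name: str) -> Path:
    """Build the path used in the example above: <workspace>/<study_name>/model.p."""
    return Path(workspace) / study_name / "model.p"


output = saved_model_path(Path("workspace"), "example")
if output.exists():
    print(f"Found saved model at {output}")
else:
    print(f"No saved model at {output}; run the study first")
```

When the file exists, it can be passed to `load_model_from_file(output)` as shown in the example.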
List the available regressors
from autoprognosis.plugins.prediction.regression import Regression
print(Regression().list_available())
Create a Regression study
# third party
import pandas as pd
# autoprognosis absolute
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_regression
from autoprognosis.studies.regression import RegressionStudy
# Load dataset
df = pd.read_csv(
"https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat",
header=None,
sep="\t",
)
last_col = df.columns[-1]
y = df[last_col]
X = df.drop(columns=[last_col])
df = X.copy()
df["target"] = y
# Search the model
study_name="regression_example"
study = RegressionStudy(
study_name=study_name,
dataset=df, # pandas DataFrame
target="target", # the label column in the dataset
)
model = study.fit()
# Predict using the model
model.predict(X)
(Advanced) Customize the Regression study
# stdlib
from pathlib import Path
# third party
import pandas as pd
# autoprognosis absolute
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_regression
from autoprognosis.studies.regression import RegressionStudy
# Load dataset
df = pd.read_csv(
"https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat",
header=None,
sep="\t",
)
last_col = df.columns[-1]
y = df[last_col]
X = df.drop(columns=[last_col])
df = X.copy()
df["target"] = y
# Search the model
workspace = Path("workspace")
workspace.mkdir(parents=True, exist_ok=True)
study_name="regression_example"
study = RegressionStudy(
study_name=study_name,
dataset=df, # pandas DataFrame
target="target", # the label column in the dataset
num_iter=10, # how many trials to do for each candidate. Default: 50
num_study_iter=2, # how many outer iterations to do. Default: 5
timeout=50, # timeout for the optimization of each regressor. Default: 600 seconds
regressors=["linear_regression", "xgboost_regressor"],
workspace=workspace,
)
study.run()
# Test the model
output = workspace / study_name / "model.p"
model = load_model_from_file(output)
# <model> contains the optimal architecture, but the model is not trained yet. You need to call fit() to use it.
# This way, we can further benchmark the selected model on the training set.
metrics = evaluate_regression(model, X, y)
print(f"Model {model.name()} score: {metrics['str']}")
# Train the model
model.fit(X, y)
# Predict using the model
model.predict(X)
List available survival analysis estimators
from autoprognosis.plugins.prediction.risk_estimation import RiskEstimation
print(RiskEstimation().list_available())
Create a Survival analysis study
# third party
import numpy as np
from pycox import datasets
# autoprognosis absolute
from autoprognosis.studies.risk_estimation import RiskEstimationStudy
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_survival_estimator
df = datasets.gbsg.read_df()
df = df[df["duration"] > 0]
X = df.drop(columns=["duration", "event"])  # keep only covariates; "event" is the label
T = df["duration"]
Y = df["event"]
eval_time_horizons = np.linspace(T.min(), T.max(), 5)[1:-1]
study_name = "example_risks"
study = RiskEstimationStudy(
study_name=study_name,
dataset=df,
target="event",
time_to_event="duration",
time_horizons=eval_time_horizons,
)
model = study.fit()
# Predict using the model
model.predict(X, eval_time_horizons)
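The expression `np.linspace(T.min(), T.max(), 5)[1:-1]` in the example places five evenly spaced points between the shortest and longest observed durations and then drops both endpoints, leaving three interior evaluation horizons. The pure-Python sketch below reproduces that arithmetic; `interior_horizons` is a hypothetical name used only for illustration.

```python
from typing import List


def interior_horizons(t_min: float, t_max: float, num_points: int = 5) -> List[float]:
    """Evenly space `num_points` values over [t_min, t_max] and drop both endpoints."""
    if num_points < 3:
        raise ValueError("num_points must be at least 3 to leave an interior point")
    step = (t_max - t_min) / (num_points - 1)
    grid = [t_min + i * step for i in range(num_points)]
    return grid[1:-1]


print(interior_horizons(0.0, 100.0))  # [25.0, 50.0, 75.0]
```

Dropping the endpoints avoids evaluating at the minimum duration, where almost no events have occurred, and at the maximum, where almost no subjects remain at risk.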
(Advanced) Customize the Survival analysis study
# stdlib
import os
from pathlib import Path
# third party
import numpy as np
from pycox import datasets
# autoprognosis absolute
from autoprognosis.studies.risk_estimation import RiskEstimationStudy
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_survival_estimator
df = datasets.gbsg.read_df()
df = df[df["duration"] > 0]
X = df.drop(columns=["duration", "event"])  # keep only covariates; "event" is the label
T = df["duration"]
Y = df["event"]
eval_time_horizons = np.linspace(T.min(), T.max(), 5)[1:-1]
workspace = Path("workspace")
study_name = "example_risks"
study = RiskEstimationStudy(
study_name=study_name,
dataset=df,
target="event",
time_to_event="duration",
time_horizons=eval_time_horizons,
num_iter=10,
num_study_iter=1,
timeout=10,
risk_estimators=["cox_ph", "survival_xgboost"],
score_threshold=0.5,
workspace=workspace,
)
study.run()
output = workspace / study_name / "model.p"
model = load_model_from_file(output)
# <model> contains the optimal architecture, but the model is not trained yet. You need to call fit() to use it.
# This way, we can further benchmark the selected model on the training set.
metrics = evaluate_survival_estimator(model, X, T, Y, eval_time_horizons)
print(f"Model {model.name()} score: {metrics['clf']}")
# Train the model
model.fit(X, T, Y)
# Predict using the model
model.predict(X, eval_time_horizons)
Plugins
from autoprognosis.plugins.imputers import Imputers
imputer = Imputers().get(<NAME>)
from autoprognosis.plugins.preprocessors import Preprocessors
preprocessor = Preprocessors().get(<NAME>)
from autoprognosis.plugins.prediction.classifiers import Classifiers
classifier = Classifiers().get(<NAME>)
Name | Description |
---|---|
neural_nets | PyTorch based neural net classifier. |
logistic_regression | `LogisticRegression <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html>`_ |
catboost | Gradient boosting on decision trees - `CatBoost <https://catboost.ai/>`_ |
random_forest | A random forest classifier. `RandomForestClassifier <https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html>`_ |
tabnet | `TabNet: Attentive Interpretable Tabular Learning <https://github.com/dreamquark-ai/tabnet>`_ |
xgboost | `XGBoostClassifier <https://xgboost.readthedocs.io/en/stable/>`_ |
from autoprognosis.plugins.prediction.risk_estimation import RiskEstimation
predictor = RiskEstimation().get(<NAME>)
Name | Description |
---|---|
survival_xgboost | `XGBoost Survival Embeddings <https://github.com/loft-br/xgboost-survival-embeddings>`_ |
loglogistic_aft | `Log-Logistic AFT model <https://lifelines.readthedocs.io/en/latest/fitters/regression/LogLogisticAFTFitter.html>`_ |
deephit | `DeepHit: A Deep Learning Approach to Survival Analysis with Competing Risks <https://github.com/chl8856/DeepHit>`_ |
cox_ph | `Cox's proportional hazard model <https://lifelines.readthedocs.io/en/latest/fitters/regression/CoxPHFitter.html>`_ |
weibull_aft | `Weibull AFT model <https://lifelines.readthedocs.io/en/latest/fitters/regression/WeibullAFTFitter.html>`_ |
lognormal_aft | `Log-Normal AFT model <https://lifelines.readthedocs.io/en/latest/fitters/regression/LogNormalAFTFitter.html>`_ |
coxnet | `CoxNet is a Cox proportional hazards model also referred to as DeepSurv <https://github.com/havakv/pycox>`_ |
from autoprognosis.plugins.prediction.regression import Regression
regressor = Regression().get(<NAME>)
Name | Description |
---|---|
tabnet_regressor | `TabNet: Attentive Interpretable Tabular Learning <https://github.com/dreamquark-ai/tabnet>`_ |
catboost_regressor | Gradient boosting on decision trees - `CatBoost <https://catboost.ai/>`_ |
random_forest_regressor | `RandomForestRegressor <https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html>`_ |
xgboost_regressor | `XGBoostRegressor <https://xgboost.readthedocs.io/en/stable/>`_ |
neural_nets_regression | PyTorch-based neural net regressor. |
linear_regression | `LinearRegression <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html>`_ |
from autoprognosis.plugins.explainers import Explainers
explainer = Explainers().get(<NAME>)
Name | Description |
---|---|
risk_effect_size | Feature importance using Cohen's distance between probabilities |
lime | `Lime: Explaining the predictions of any machine learning classifier <https://github.com/marcotcr/lime>`_ |
symbolic_pursuit | Symbolic Pursuit (Learning outside the black-box: at the pursuit of interpretable models) |
shap_permutation_sampler | `SHAP Permutation Sampler <https://shap.readthedocs.io/en/latest/generated/shap.explainers.Permutation.html>`_ |
kernel_shap | `SHAP KernelExplainer <https://shap-lrjball.readthedocs.io/en/latest/generated/shap.KernelExplainer.html>`_ |
invase | `INVASE: Instance-wise Variable Selection <https://github.com/vanderschaarlab/invase>`_ |
from autoprognosis.plugins.uncertainty import UncertaintyQuantification
model = UncertaintyQuantification().get(<NAME>)
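Every plugin family above exposes the same two entry points, `list_available()` and `get(<NAME>)`. The toy registry below sketches how such a name-to-plugin lookup can work in general. It is not AutoPrognosis's implementation: `PluginRegistry`, its `add` method, and `MeanImputer` are all hypothetical names invented for this illustration.

```python
from typing import Callable, Dict, List


class PluginRegistry:
    """Toy name-to-factory registry with list_available() and get()."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[..., object]] = {}

    def add(self, name: str, factory: Callable[..., object]) -> None:
        """Register a factory (for example a class) under `name`."""
        self._factories[name] = factory

    def list_available(self) -> List[str]:
        """Return the registered names in sorted order."""
        return sorted(self._factories)

    def get(self, name: str, **kwargs: object) -> object:
        """Instantiate the plugin registered under `name`."""
        if name not in self._factories:
            raise KeyError(f"Unknown plugin: {name!r}")
        return self._factories[name](**kwargs)


class MeanImputer:
    """Hypothetical plugin: replaces None with the mean of the observed values."""

    def transform(self, values):
        observed = [v for v in values if v is not None]
        mean = sum(observed) / len(observed)
        return [mean if v is None else v for v in values]


registry = PluginRegistry()
registry.add("mean", MeanImputer)
imputer = registry.get("mean")
print(registry.list_available())
print(imputer.transform([1.0, None, 3.0]))
```

Looking plugins up by string name is what allows a study to accept plain lists such as `classifiers=["logistic_regression", "lda", "qda"]`.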
Test
After installing the library, the tests can be executed using pytest
$ pip install .[testing]
$ pytest -vxs -m "not slow"
Citing
If you use this code, please cite the associated paper:
@misc{https://doi.org/10.48550/arxiv.2210.12090,
doi = {10.48550/ARXIV.2210.12090},
url = {https://arxiv.org/abs/2210.12090},
author = {Imrie, Fergus and Cebere, Bogdan and McKinney, Eoin F. and van der Schaar, Mihaela},
keywords = {Machine Learning (cs.LG), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {AutoPrognosis 2.0: Democratizing Diagnostic and Prognostic Modeling in Healthcare with Automated Machine Learning},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}