Tutorial: Survival Analysis AutoML with imputation

Welcome to the survival analysis AutoML tutorial!

This tutorial shows how to use AutoPrognosis to fit a survival model on a dataset with missing values. It covers two options: fixing the imputer in advance, or letting AutoPrognosis select the imputer as part of its pipeline search.
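To make the idea of an imputer concrete, here is a minimal sketch of mean and median imputation on a toy frame. It uses plain pandas and is only an illustration of the concept; it is not the AutoPrognosis plugin implementation, and the toy values and the names `toy`, `mean_filled`, and `median_filled` are invented for this example.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (not the tutorial dataset).
toy = pd.DataFrame(
    {
        "x3": [1.0, np.nan, 3.0, 100.0],
        "x4": [10.0, 20.0, np.nan, 30.0],
    }
)

# Mean imputation: each NaN is replaced by the mean of the observed
# values in the same column.
mean_filled = toy.fillna(toy.mean())

# Median imputation: same idea, using the column median, which is less
# sensitive to outliers such as the 100.0 in "x3".
median_filled = toy.fillna(toy.median())

print(mean_filled)
print(median_filled)
```

The two strategies can give quite different fills for the same column, which is one reason to let a search procedure compare imputers rather than fixing one by hand.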

[ ]:

# stdlib
import sys
import warnings

# third party
import numpy as np
import pandas as pd

warnings.filterwarnings("ignore")

# autoprognosis absolute
import autoprognosis.logger as log
from autoprognosis.studies.risk_estimation import RiskEstimationStudy

[ ]:

log.add(sink=sys.stderr, level="INFO")

Load dataset

[ ]:

# third party
from pycox import datasets

df = datasets.gbsg.read_df()
df = df[df["duration"] > 0]

X = df.drop(columns=["duration", "event"])
T = df["duration"]
Y = df["event"]

# Evaluate at a single horizon: the median duration among subjects
# with an observed event (Y == 1).
eval_time_horizons = [
    int(T[Y == 1].quantile(0.50)),
]
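The horizon in the cell above is the median duration among subjects with an observed event. The sketch below reproduces that computation on toy data and shows one possible way to derive several horizons from quartiles. The toy series and the names `T_toy`, `Y_toy`, and `quartile_horizons` are assumptions made for illustration; the tutorial itself uses a single horizon.

```python
import pandas as pd

# Toy durations and event indicators (1 = event observed, 0 = censored).
T_toy = pd.Series([10, 20, 30, 40, 50])
Y_toy = pd.Series([1, 0, 1, 1, 0])

# Durations of the subjects with an observed event: 10, 30, 40.
event_times = T_toy[Y_toy == 1]

# Single horizon at the median event time, mirroring the tutorial cell.
median_horizon = [int(event_times.quantile(0.50))]

# Hypothetical extension: three horizons at the quartiles of the event times.
quartile_horizons = [int(event_times.quantile(q)) for q in (0.25, 0.50, 0.75)]

print(median_horizon)     # [30]
print(quartile_horizons)  # [20, 30, 35]
```

Restricting the quantile to observed events keeps the horizons inside the range where events actually occur, rather than letting censored follow-up times shift them.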

[ ]:

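# stdlib
import random

random.seed(0)  # fixed seed so the injected missingness is reproducible

# Sample row labels from the index itself: after the duration filter above,
# the index is not guaranteed to be the contiguous range 0..len(X)-1.
for col in ["x3", "x4"]:
    indices = random.sample(list(X.index), 10)
    X.loc[indices, col] = np.nan

X.isnull().any()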
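`X.isnull().any()` only reports whether each column contains a missing value. The sketch below, on a toy frame with a non-contiguous index, shows how sampling labels from the index keeps `.loc` assignment safe after rows have been filtered out, and how `isnull().sum()` gives per-column counts. The toy data, the seed, and the names `toy`, `rng`, and `missing_counts` are assumptions for illustration.

```python
import random

import numpy as np
import pandas as pd

# Toy frame with a non-contiguous index, as a row filter would leave behind.
toy = pd.DataFrame(
    {"x3": [1.0, 2.0, 3.0, 4.0, 5.0], "x4": [5.0, 4.0, 3.0, 2.0, 1.0]},
    index=[0, 2, 3, 7, 9],
)

rng = random.Random(0)  # local generator with a fixed seed, for reproducibility

for col in ["x3", "x4"]:
    # Sample existing row labels, so .loc never sees a label that is absent.
    labels = rng.sample(list(toy.index), 3)
    toy.loc[labels, col] = np.nan

# Number of missing entries per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Because the labels are sampled without replacement, each column ends up with exactly three missing entries here.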

[ ]:

dataset = X.copy()
dataset["target"] = Y
dataset["time_to_event"] = T

Option 1: Predefined imputer

[ ]:

# stdlib
from pathlib import Path

workspace = Path("workspace")
workspace.mkdir(parents=True, exist_ok=True)  # create the output folder if it is missing
study_name = "test_risk_estimation_studies"

study = RiskEstimationStudy(
    study_name=study_name,
    dataset=dataset,
    target="target",
    time_to_event="time_to_event",
    time_horizons=eval_time_horizons,
    num_iter=2,  # DELETE THIS LINE FOR BETTER RESULTS.
    num_study_iter=1,  # DELETE THIS LINE FOR BETTER RESULTS.
    risk_estimators=[
        "cox_ph",
        "lognormal_aft",
        "survival_xgboost",
    ],  # DELETE THIS LINE FOR BETTER RESULTS.
    imputers=["mean"],
    feature_scaling=["minmax_scaler", "nop"],  # DELETE THIS LINE FOR BETTER RESULTS.
    score_threshold=0.4,
    workspace=workspace,
)

[ ]:

study.run()

[ ]:

# autoprognosis absolute
from autoprognosis.plugins.imputers import Imputers
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_survival_estimator

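# Load the model saved by the study
model_path = workspace / study_name / "model.p"

model = load_model_from_file(model_path)

# Impute the features with the same "mean" strategy used in the study
X_imp = Imputers().get("mean").fit_transform(X)

evaluate_survival_estimator(model, X_imp, T, Y, eval_time_horizons)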

Option 2: Let the optimizer find the best imputer

[ ]:

# stdlib
from pathlib import Path

workspace = Path("workspace")
workspace.mkdir(parents=True, exist_ok=True)

study_name = "test_risk_estimation_studies_v2"

study = RiskEstimationStudy(
    study_name=study_name,
    dataset=dataset,
    target="target",
    time_to_event="time_to_event",
    time_horizons=eval_time_horizons,
    num_iter=2,  # DELETE THIS LINE FOR BETTER RESULTS.
    num_study_iter=1,  # DELETE THIS LINE FOR BETTER RESULTS.
    risk_estimators=[
        "cox_ph",
        "lognormal_aft",
        "survival_xgboost",
    ],  # DELETE THIS LINE FOR BETTER RESULTS.
    imputers=["mean", "ice", "median"],  # DELETE THIS LINE FOR BETTER RESULTS.
    feature_scaling=["minmax_scaler", "nop"],  # DELETE THIS LINE FOR BETTER RESULTS.
    score_threshold=0.4,
    workspace=workspace,
)

[ ]:

study.run()

[ ]:

# autoprognosis absolute
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_survival_estimator

model_path = workspace / study_name / "model.p"

model = load_model_from_file(model_path)

evaluate_survival_estimator(model, X, T, Y, eval_time_horizons)

Congratulations!

Congratulations on completing this notebook tutorial! If you enjoyed it and would like to join the movement towards machine learning and AI for medicine, here is how you can help.

Star AutoPrognosis on GitHub

The easiest way to help our community is to star the repositories. This helps raise awareness of the tools we’re building.