autoprognosis.studies.risk_estimation module

class RiskEstimationStudy(dataset: pandas.core.frame.DataFrame, target: str, time_to_event: str, time_horizons: List[int], num_iter: int = 20, num_study_iter: int = 5, num_ensemble_iter: int = 15, timeout: int = 360, study_name: Optional[str] = None, workspace: pathlib.Path = PosixPath('tmp'), risk_estimators: List[str] = ['survival_xgboost', 'loglogistic_aft', 'deephit', 'cox_ph', 'weibull_aft', 'lognormal_aft', 'coxnet'], imputers: List[str] = ['ice'], feature_scaling: List[str] = ['normal_transform', 'maxabs_scaler', 'feature_normalizer', 'minmax_scaler', 'nop', 'scaler', 'uniform_transform'], feature_selection: List[str] = ['nop', 'pca', 'fast_ica'], hooks: autoprognosis.hooks.base.Hooks = <autoprognosis.hooks.default.DefaultHooks object>, score_threshold: float = 0.65, nan_placeholder: Optional[Any] = None, group_id: Optional[str] = None, random_state: int = 0, sample_for_search: bool = True, max_search_sample_size: int = 10000, ensemble_size: int = 3, n_folds_cv: int = 5)

Bases: autoprognosis.studies._base.Study

Core logic for risk estimation studies.

A study automatically handles imputation, preprocessing, and model selection for a given dataset. The output is an optimal model architecture, selected by the AutoML logic.

Parameters
  • dataset – DataFrame. The dataset to analyze.

  • target – str. The target column in the dataset.

  • time_to_event – str. The time_to_event column in the dataset.

  • time_horizons – list. The time horizons (in the same units as the time_to_event column) at which the survival models are evaluated.

  • num_iter – int. Maximum number of optimization trials. This is the limit of trials for each base estimator in the risk_estimators list, used in combination with the timeout parameter. For each estimator, the search ends after num_iter trials or timeout seconds, whichever comes first.

  • num_study_iter – int. The number of study iterations. This is the limit for the outer optimization loop. After each outer loop, an intermediary model is cached and can be used by another process, while the outer loop continues to improve the result.

  • timeout – int. Maximum wait time (in seconds) for each estimator's hyperparameter search. This timeout applies to each estimator in the risk_estimators list.

  • study_name – str. The name of the study, to be used in the caches.

  • feature_scaling

    list. Plugin search pool to use in the pipeline for scaling. Defaults to ['normal_transform', 'maxabs_scaler', 'feature_normalizer', 'minmax_scaler', 'nop', 'scaler', 'uniform_transform']. Available plugins, retrieved using Preprocessors(category="feature_scaling").list_available() (see the discovery snippet after this parameter list):

    • 'maxabs_scaler'

    • 'scaler'

    • 'feature_normalizer'

    • 'normal_transform'

    • 'uniform_transform'

    • 'nop' # no operation

    • 'minmax_scaler'

  • feature_selection

    list. Plugin search pool to use in the pipeline for feature selection. Defaults to ['nop', 'pca', 'fast_ica']. Available plugins, retrieved using Preprocessors(category="dimensionality_reduction").list_available():

    • 'feature_agglomeration'

    • 'fast_ica'

    • 'variance_threshold'

    • 'gauss_projection'

    • 'pca'

    • 'nop' # no operation

  • imputers

    list. Plugin search pool to use in the pipeline for imputation. Defaults to ['ice']. Available plugins, retrieved using Imputers().list_available():

    • 'sinkhorn'

    • 'EM'

    • 'mice'

    • 'ice'

    • 'hyperimpute'

    • 'most_frequent'

    • 'median'

    • 'missforest'

    • 'softimpute'

    • 'nop'

    • 'mean'

    • 'gain'

  • risk_estimators

    list. Plugin search pool to use in the pipeline for risk estimation. Defaults to ['survival_xgboost', 'loglogistic_aft', 'deephit', 'cox_ph', 'weibull_aft', 'lognormal_aft', 'coxnet']. Available plugins:

    • 'survival_xgboost'

    • 'loglogistic_aft'

    • 'deephit'

    • 'cox_ph'

    • 'weibull_aft'

    • 'lognormal_aft'

    • 'coxnet'

  • hooks – Hooks. Custom callbacks to be notified about the search progress.

  • workspace – Path. Where to store the output model.

  • score_threshold – float. The minimum metric score required for a candidate model to be considered.

  • random_state – int. Random seed.

  • sample_for_search – bool. Whether to subsample the evaluation dataset in the search pipeline. Improves the speed of the search.

  • max_search_sample_size – int. Subsample size for the evaluation dataset, used if sample_for_search is True.
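The plugin pools above can be listed programmatically, and the search budget can be tightened by shrinking the pools together with num_iter and timeout. A minimal sketch, assuming the Preprocessors and Imputers registries are importable from autoprognosis.plugins as in the retrieval calls quoted above, and reusing the df and eval_time_horizons built in the Example below:

>>> from autoprognosis.plugins.preprocessors import Preprocessors
>>> from autoprognosis.plugins.imputers import Imputers
>>>
>>> # List the available plugin names for each pipeline stage
>>> Preprocessors(category="feature_scaling").list_available()
>>> Preprocessors(category="dimensionality_reduction").list_available()
>>> Imputers().list_available()
>>>
>>> # Restricted search: smaller plugin pools and a tighter per-estimator budget
>>> study = RiskEstimationStudy(
>>>     dataset=df,
>>>     target="event",
>>>     time_to_event="duration",
>>>     time_horizons=eval_time_horizons,
>>>     risk_estimators=["cox_ph", "coxnet"],
>>>     imputers=["ice"],
>>>     feature_scaling=["nop", "minmax_scaler"],
>>>     feature_selection=["nop", "pca"],
>>>     num_iter=10,
>>>     timeout=120,
>>> )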

Example

>>> import numpy as np
>>> from pycox import datasets
>>> from autoprognosis.studies.risk_estimation import RiskEstimationStudy
>>> from autoprognosis.utils.serialization import load_model_from_file
>>> from autoprognosis.utils.tester import evaluate_survival_estimator
>>>
>>> df = datasets.gbsg.read_df()
>>> df = df[df["duration"] > 0]
>>>
>>> X = df.drop(columns=["duration"])
>>> T = df["duration"]
>>> Y = df["event"]
>>>
>>> eval_time_horizons = np.linspace(T.min(), T.max(), 5)[1:-1]
>>>
>>> study_name = "example_risks"
>>> study = RiskEstimationStudy(
>>>     study_name=study_name,
>>>     dataset=df,
>>>     target="event",
>>>     time_to_event="duration",
>>>     time_horizons=eval_time_horizons,
>>> )
>>>
>>> model = study.fit()
>>> # Predict using the model
>>> model.predict(X, eval_time_horizons)
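
The two helper imports at the top of the example can close the loop. A short sketch, assuming evaluate_survival_estimator takes the estimator followed by X, T, Y and the time horizons, and assuming the study caches its intermediary model under workspace/study_name (both assumptions about autoprognosis internals, not stated on this page):

>>> # Assumed tester signature: evaluate_survival_estimator(estimator, X, T, Y, time_horizons)
>>> metrics = evaluate_survival_estimator(model, X, T, Y, eval_time_horizons)
>>> print(metrics)
>>>
>>> # Assumed cache layout: the study stores its best model inside the workspace
>>> from pathlib import Path
>>> cached_model = load_model_from_file(Path("tmp") / study_name / "model.p")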
fit() → Any

Run the study and train the model. The call returns the fitted model.

run() → Any

Run the study. The call returns the optimal model architecture (not fitted).
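
To make the run()/fit() distinction concrete, a sketch assuming the returned architecture exposes the usual survival pipeline interface, fit(X, T, Y) and predict(X, time_horizons); this interface is inferred from the Example above rather than documented on this page:

>>> # run() returns the selected architecture without training it
>>> architecture = study.run()
>>> # Assumed pipeline interface: fit on covariates, durations and event indicators
>>> architecture.fit(X, T, Y)
>>> architecture.predict(X, eval_time_horizons)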