Clinical Prediction Modeling Shortcuts - R functions for automating clinical prediction model development and validation workflows.
This repository provides production-ready R functions that streamline common clinical prediction modeling tasks, with a focus on robust validation strategies for multi-center studies.
Internal-External Cross-Validation (IECV) implementation supporting multiple model types:
| Model | Engine | Description |
|---|---|---|
logistic |
glm | Standard logistic regression with interpretable coefficients |
xgboost |
xgboost | Gradient boosted trees |
lightgbm |
lightgbm | Fast gradient boosted trees |
# Install required packages
install.packages(c(
"tidyverse", "tidymodels", "furrr", "probably",
"dcurves", "bonsai", "shapviz", "cli", "gridExtra"
))
# For XGBoost and LightGBM
install.packages(c("xgboost", "lightgbm"))# Load the function
source("R/iecv_modelling.R")
# Load sample data
data <- read_csv("data/simulated_patient_data.csv")
# Run IECV with logistic regression
result_lr <- iecv_modelling(
data = data,
outcome = "outcome",
predictors = c("age", "sex", "biomarker", "comorbidity"),
cluster = "center",
model = "logistic"
)
# View results
print(result_lr)
summary(result_lr)
# Generate plots
plot(result_lr) # Forest plots of all metrics
plot(result_lr, type = "calibration") # Calibration curve
plot(result_lr, type = "dca") # Decision curve analysisIECV is a validation strategy specifically designed for prediction models developed using multi-center or multi-study data. Instead of random cross-validation splits, IECV:
- Trains the model on all centers except one
- Validates on the held-out center (treating it as "external")
- Repeats for each center, so every center serves as external validation once
This approach provides more realistic estimates of how well your model will perform when applied to new centers not used in model development.
Center A Center B Center C Center D Center E Center F
| | | | | |
v v v v v v
[TRAIN] [TRAIN] [TRAIN] [TRAIN] [TRAIN] [TEST] <- Fold 1
[TRAIN] [TRAIN] [TRAIN] [TRAIN] [TEST] [TRAIN] <- Fold 2
[TRAIN] [TRAIN] [TRAIN] [TEST] [TRAIN] [TRAIN] <- Fold 3
...
# Available metrics
metrics = c("auc", "brier", "cal_intercept", "cal_slope")
# Interpretation
# AUC > 0.7 Good discrimination
# Brier < 0.25 Good overall accuracy
# Cal Intercept ~ 0 No systematic bias
# Cal Slope ~ 1 No overfitting/underfitting# Forest plots showing per-center performance
plot(result)
plot(result, type = "auc")
# Calibration plot (pooled out-of-fold predictions)
plot(result, type = "calibration")
# Decision curve analysis for clinical utility
plot(result, type = "dca")
# SHAP plots for tree models
plot(result_xgb, type = "shap")# Logistic regression: odds ratios with CIs
variable_importance(result_lr)
# Tree models: SHAP-based importance (default)
variable_importance(result_xgb)
# Tree models: native importance (Gain)
variable_importance(result_xgb, type = "native")# Show how a predictor affects model predictions
plot_shap_dependence(result_xgb, feature = "age")iecv_modelling(
data, # Data frame with outcome, predictors, cluster
outcome, # Name of binary outcome variable (0/1)
predictors, # Character vector of predictor names
cluster, # Name of clustering variable (e.g., "center")
model, # "logistic", "xgboost", or "lightgbm"
metrics, # Which metrics to compute
n_boot = 50, # Bootstrap replicates for CIs
conf_level = 0.95,
n_cores = NULL, # Parallel cores (NULL = auto)
verbose = TRUE, # Show progress
seed = 123
)The function returns an iecv_result object containing:
cluster_results- Per-cluster metrics with bootstrap CIssummary- Pooled summary statisticspredictions- Out-of-fold predictionsfinal_model- Fitted workflow on all dataresamples- The rsample object
| Function | Description |
|---|---|
variable_importance() |
Extract variable importance |
tidy_final_model() |
Get model coefficients (logistic) |
get_shap() |
Get shapviz object for custom SHAP plots |
plot_shap_dependence() |
SHAP dependence plot for a feature |
dca_table() |
Decision curve analysis table |
get_dca() |
Get raw dcurves DCA object |
format_ci() |
Format estimate with confidence interval |
The included simulated_patient_data.csv contains 1,346 patients across 6 hospitals:
| Column | Description |
|---|---|
| patient_id | Unique identifier |
| center | Hospital (A-F) |
| age | Patient age |
| sex | Binary (0/1) |
| biomarker | Continuous value |
| comorbidity | Binary (0/1) |
| outcome | Binary outcome (0/1) |
# Run the test suite
testthat::test_file("tests/test-iecv_modelling.R")See demo/iecv_demo.qmd for an interactive tutorial with:
- Step-by-step IECV workflow
- Comparison of all three model types
- Publication-quality figures
- Interpretation guidance
MIT License
Contributions are welcome! Please open an issue or submit a pull request.