Yang Yang, Liao Che-Yi, Keyvanshokooh Esmaeil, Shao Hui, Weber Mary Beth, Pasquel Francisco J, Garcia Gian-Gabriel P
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, 765 Ferst Dr NW, Atlanta, GA, 30332-0001, United States, 1 404-385-3140.
Department of Information and Operations Management, Mays Business School, Texas A&M University, College Station, TX, United States.
JMIR Med Inform. 2025 Jun 27;13:e66200. doi: 10.2196/66200.
Building machine learning models that are interpretable, explainable, and fair is critical for their trustworthiness in clinical practice. Interpretability, which refers to how easily a human can comprehend the mechanism by which a model makes predictions, is often seen as a primary consideration when adopting a machine learning model in health care. However, interpretability alone does not necessarily guarantee explainability, which offers stakeholders insights into a model's predicted outputs. Moreover, many existing frameworks for model evaluation focus primarily on maximizing predictive accuracy, overlooking the broader need for interpretability, fairness, and explainability.
This study proposes a 3-stage machine learning framework for responsible model development through model assessment, selection, and explanation. We demonstrate the application of this framework for predicting cardiovascular disease (CVD) outcomes, specifically myocardial infarction (MI) and stroke, among people with type 2 diabetes (T2D).
We extracted data on participants with T2D from the ACCORD (Action to Control Cardiovascular Risk in Diabetes) dataset (N=9635), including demographic, clinical, and biomarker records. We then applied hold-out cross-validation to develop several interpretable machine learning models (linear, tree-based, and ensemble) to predict the risks of MI and stroke among patients with diabetes. Our 3-stage framework first assesses these models using predictive accuracy and fairness metrics. In the model selection stage, we quantify the trade-off between accuracy and fairness using the area under the curve (AUC) and Relative Parity of Performance Scores (RPPS), where RPPS measures the greatest deviation of any subpopulation's AUC from the population-wide AUC. Finally, we quantify the explainability of the selected models using SHAP (Shapley Additive Explanations) values and partial dependence plots to investigate the relationships between features and model outputs.
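The abstract does not give a closed-form definition of RPPS beyond the description above, so the following Python sketch is an assumption based on one plausible reading: 1 minus the largest relative deviation of any subgroup's AUC from the population-wide AUC, so that values near 1 indicate similar performance across subgroups. The function name and the handling of degenerate subgroups are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of an RPPS-style fairness metric (assumed formulation,
# not the authors' code): 1 minus the largest relative deviation of any
# subgroup AUC from the population-wide AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def relative_parity_of_performance(y_true, y_score, groups):
    """Return a score in (0, 1]; values near 1 mean subgroup AUCs
    stay close to the overall AUC."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    overall_auc = roc_auc_score(y_true, y_score)
    deviations = []
    for g in np.unique(groups):
        mask = groups == g
        # AUC is undefined for subgroups containing a single outcome class.
        if len(np.unique(y_true[mask])) < 2:
            continue
        group_auc = roc_auc_score(y_true[mask], y_score[mask])
        deviations.append(abs(overall_auc - group_auc) / overall_auc)
    return 1.0 - max(deviations) if deviations else 1.0
```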
Our proposed framework demonstrates that the GLMnet model offers the best balance between predictive performance and fairness for both MI and stroke. For MI, GLMnet achieves the highest RPPS (0.979 for gender and 0.967 for race), indicating minimal performance disparities across subgroups, while maintaining a high AUC of 0.705. For stroke, GLMnet attains a relatively high AUC of 0.705 and the second-highest RPPS (0.961 for gender and 0.979 for race), suggesting that it performs consistently across gender and race subgroups. Our model explanation methods further highlight that history of CVD and age are the key predictors of MI, while HbA1c and systolic blood pressure significantly influence stroke classification.
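As an illustration of how such feature-risk relationships can be surfaced, the sketch below applies SHAP values and a partial dependence plot to a fitted classifier. The synthetic data, the LogisticRegression model, and the column names are placeholders chosen for a self-contained demonstration; they do not reproduce the authors' pipeline.

```python
# Illustrative explanation step (not the authors' code): SHAP values and a
# partial dependence plot for a fitted classifier. Data, model choice, and
# column names below are placeholders for demonstration only.
import numpy as np
import pandas as pd
import shap
from sklearn.inspection import PartialDependenceDisplay
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(500, 4)),
                       columns=["age", "hba1c", "sbp", "prior_cvd"])
y_train = rng.integers(0, 2, size=500)

model = LogisticRegression().fit(X_train, y_train)

# Model-agnostic SHAP explainer over the predicted probabilities.
explainer = shap.Explainer(model.predict_proba, X_train)
shap_values = explainer(X_train)
shap.plots.beeswarm(shap_values[:, :, 1])  # contributions to the event class

# Partial dependence of predicted risk on two illustrative features.
PartialDependenceDisplay.from_estimator(model, X_train, features=["hba1c", "sbp"])
```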
This study establishes a responsible framework for assessing, selecting, and explaining machine learning models, emphasizing accuracy-fairness trade-offs in predictive modeling. Key insights include the following: (1) simple models can perform comparably to complex ensembles; (2) models with strong overall accuracy may still harbor substantial differences in accuracy across demographic groups; and (3) explanation methods reveal the relationships between features and the risks of MI and stroke. Our results underscore the need for holistic approaches that consider accuracy, fairness, and explainability in interpretable model design and selection, potentially enhancing the adoption of health care technology.