Information Laboratory (InfoLab), Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon 16419, South Korea.
Technology Management, Stony Brook University, New York 11794, USA.
Comput Methods Programs Biomed. 2023 Jun;234:107495. doi: 10.1016/j.cmpb.2023.107495. Epub 2023 Mar 23.
BACKGROUND AND OBJECTIVES: Parkinson's disease (PD) is a devastating chronic neurological condition. Machine learning (ML) techniques have been used for the early prediction of PD progression. Fusing heterogeneous data modalities has been shown to improve the performance of ML models, time series data fusion supports tracking the disease over time, and adding model explainability features improves the trustworthiness of the resulting models. The PD literature has not sufficiently explored these three points.

METHODS: We propose an ML pipeline for predicting PD progression that is both accurate and explainable. We explore the fusion of different combinations of five time series modalities from the real-world Parkinson's Progression Markers Initiative (PPMI) dataset: patient characteristics, biosamples, medication history, motor function, and non-motor function data. Each patient has six visits. The problem is formulated in two ways: (1) a three-class progression prediction with 953 patients in each time series modality, and (2) a four-class progression prediction with 1,060 patients in each time series modality. Statistical features of the six visits were calculated for each modality, and diverse feature selection methods were applied to select the most informative feature sets. The extracted features were used to train a set of well-known ML models: support vector machines (SVM), random forests (RF), extra trees classifier (ETC), light gradient boosting machines (LGBM), and stochastic gradient descent (SGD). We examined a number of data-balancing strategies in the pipeline with different combinations of modalities. The ML models were optimized using a Bayesian optimizer. We conducted a comprehensive evaluation of the ML methods and extended the best models to provide different explainability features.
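The visit-level feature extraction and modality fusion described in the methods can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which statistics were computed, so the mean, standard deviation, minimum, maximum, and least-squares slope used here are assumptions, as are the function names, the patient IDs, the `score_a` variable, and all toy values. Only `NP3BRADY` is a feature name taken from the abstract.

```python
from statistics import mean, pstdev


def visit_features(values):
    """Summarise one variable across a patient's visits (needs >= 2 visits)."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = mean(values)
    # Least-squares slope of the value against the visit index 0..n-1.
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return {
        "mean": v_mean,
        "std": pstdev(values),  # population standard deviation (assumed choice)
        "min": min(values),
        "max": max(values),
        "slope": num / den,
    }


def fuse_modalities(modalities):
    """Concatenate per-patient statistical features from several modalities.

    modalities: {modality_name: {patient_id: {variable: [visit values]}}}
    Only patients present in every modality are kept.
    """
    shared = set.intersection(*(set(m) for m in modalities.values()))
    fused = {}
    for pid in sorted(shared):
        row = {}
        for mod_name, patients in modalities.items():
            for var, values in patients[pid].items():
                for stat, x in visit_features(values).items():
                    row[f"{mod_name}.{var}.{stat}"] = x
        fused[pid] = row
    return fused


# Toy data with six visits per patient; the values are invented for illustration.
motor = {
    "P001": {"NP3BRADY": [0, 1, 1, 2, 2, 3]},
    "P002": {"NP3BRADY": [1, 1, 1, 1, 1, 1]},
}
non_motor = {
    "P001": {"score_a": [5, 5, 6, 6, 7, 7]},
    "P003": {"score_a": [2, 2, 2, 3, 3, 3]},
}

fused = fuse_modalities({"motor": motor, "non_motor": non_motor})
for pid, row in fused.items():
    print(pid, len(row), "features")
```

Each fused row is a flat feature vector per patient, which is the shape a feature selector or a classifier such as RF or LGBM would consume. Restricting the output to patients shared by all modalities is one possible design choice; the abstract does not state how patients missing from a modality were handled.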
RESULTS: We compared the performance of the ML models before and after optimization, and with and without feature selection. In the three-class experiment with various modality fusions, the LGBM model produced the most accurate results, with a 10-fold cross-validation (10-CV) accuracy of 90.73% using the non-motor function modality. In the four-class experiment with various modality fusions, RF produced the best results, with a 10-CV accuracy of 94.57% using the non-motor function modality. With the fused dataset of non-motor and motor function modalities, the LGBM model outperformed the other ML models in both the three-class and four-class experiments (10-CV accuracy of 94.89% and 93.73%, respectively). Using the SHapley Additive exPlanations (SHAP) framework, we employed global and instance-based explanations to explain the behavior of each ML classifier. We further extended the explainability by implementing the LIME and SHAPASH local explainers, and we explored the consistency of these explainers. The resulting classifiers were accurate and explainable, and thus more medically relevant and applicable.

CONCLUSIONS: The selected modalities and feature sets were confirmed by the literature and by medical experts. The various explainers suggest that the bradykinesia (NP3BRADY) feature was the most dominant and consistent. By providing thorough insights into the influence of multiple modalities on disease risk, the proposed approach is expected to help improve clinical knowledge of PD progression processes.
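One way to quantify the consistency between explainers mentioned in the results is to compare their top-ranked features. The sketch below uses the Jaccard overlap of the top-k features by absolute attribution. The abstract does not state which consistency measure the authors used, so this metric is an assumption, and the attribution values and the `feat_b`, `feat_c`, and `feat_d` names are invented placeholders rather than results from the paper.

```python
def top_k(attributions, k):
    """Return the k features with the largest absolute attribution."""
    ranked = sorted(attributions, key=lambda f: abs(attributions[f]), reverse=True)
    return ranked[:k]


def jaccard(a, b):
    """Jaccard overlap between two collections of feature names."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def pairwise_consistency(explanations, k):
    """Top-k Jaccard overlap for every pair of explainers."""
    names = sorted(explanations)
    scores = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            scores[(first, second)] = jaccard(
                top_k(explanations[first], k),
                top_k(explanations[second], k),
            )
    return scores


# Hypothetical per-instance attributions from three explainers (toy values).
explanations = {
    "SHAP": {"NP3BRADY": 0.40, "feat_b": 0.25, "feat_c": -0.10, "feat_d": 0.05},
    "LIME": {"NP3BRADY": 0.35, "feat_b": 0.05, "feat_c": -0.30, "feat_d": 0.10},
    "SHAPASH": {"NP3BRADY": 0.38, "feat_b": 0.22, "feat_c": -0.12, "feat_d": 0.02},
}

scores = pairwise_consistency(explanations, k=2)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

A score of 1.0 means two explainers agree on the top-k set for that instance, and lower values flag instances where the explanations diverge and may deserve a closer look.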