Department of Health Statistics, School of Public Health, Shanxi Medical University, No. 56 Xinjian South Road, Yingze District, Taiyuan 030001, China.
Department of Epidemiology and Biostatistics, Southeast University, 87 Ding Jiaqiao Road, Nanjing 210009, China.
Biomolecules. 2023 Dec 8;13(12):1761. doi: 10.3390/biom13121761.
Detecting Parkinson's disease (PD) in its early stages is important for treatment and management, but there is no consensus on what information is necessary or which models best predict PD risk. In this study, we first grouped PD-associated factors by cost and accessibility, then incorporated the groups step by step into risk prediction models. To allow a comprehensive assessment, the models were built with eight commonly used machine learning algorithms. Finally, the Shapley Additive Explanations (SHAP) method was used to investigate the contribution of each factor. Models built with demographic variables, hospital admission examinations, clinical assessments, and a polygenic risk score achieved the best prediction performance, and adding invasive biomarkers did not further improve accuracy. Among the eight machine learning models considered, penalized logistic regression and XGBoost were the most accurate for assessing PD risk, with penalized logistic regression achieving an area under the curve of 0.94 and a Brier score of 0.08. Olfactory function and the polygenic risk score were the most important predictors of PD risk. Our study offers a practical framework for PD risk assessment that highlights the necessary information and efficient machine learning tools.
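The two performance measures reported above, area under the curve and Brier score, can be illustrated with a minimal sketch. This is not the authors' code: the function names `brier_score` and `roc_auc` and the toy labels and probabilities are hypothetical, chosen only to show how each metric is defined.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and the 0/1 outcome.

    Lower is better; 0 means perfectly confident and correct predictions.
    """
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen case receives a higher
    predicted risk than a randomly chosen control; ties count as 0.5.
    """
    cases = [p for y, p in zip(y_true, y_prob) if y == 1]
    controls = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical toy data (1 = PD, 0 = control); not taken from the study.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]

print(f"Brier score: {brier_score(labels, probs):.3f}")
print(f"AUC: {roc_auc(labels, probs):.3f}")
```

The AUC measures discrimination (how well cases are ranked above controls), while the Brier score also reflects calibration, which is why the two are typically reported together. The pairwise AUC shown here is quadratic in the sample size and is meant for illustration only.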