Machine learning based risk prediction for Parkinson's disease with nationwide health screening data.

Affiliations

Department of Biostatistics, Yonsei University, Seoul, Korea.

Department of Rehabilitation Medicine, College of Medicine, Ewha Womans University, Seoul, Korea.

Publication information

Sci Rep. 2022 Nov 14;12(1):19499. doi: 10.1038/s41598-022-24105-9.

DOI:10.1038/s41598-022-24105-9

PMID:36376523

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9663430/

Abstract

Although many studies have been conducted on machine learning (ML) models for Parkinson's disease (PD) prediction using neuroimaging and movement analyses, studies with large population-based datasets are limited. We aimed to propose PD prediction models using ML algorithms based on the National Health Insurance Service-Health Screening datasets. We selected individuals who participated in national health-screening programs > 5 times between 2002 and 2015. PD was defined based on the ICD-code (G20), and a matched cohort of individuals without PD was selected using a 1:1 random sampling method. Various ML algorithms were applied for PD prediction, and the performance of the prediction models was compared. Neural networks, gradient boosting machines, and random forest algorithms exhibited the best average prediction accuracy (average area under the receiver operating characteristic curve (AUC): 0.779, 0.766, and 0.731, respectively) among the algorithms validated in this study. The overall model performance metrics were higher in men than in women (AUC: 0.742 and 0.729, respectively). The most important factor for predicting PD occurrence was body mass index, followed by total cholesterol, glucose, hemoglobin, and blood pressure levels. Smoking and alcohol consumption (in men) and socioeconomic status, physical activity, and diabetes mellitus (in women) were highly correlated with the occurrence of PD. The proposed health-screening dataset-based PD prediction model using ML algorithms is readily applicable, produces validated results, and could be a useful option for PD prediction models.
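As an illustration of two steps mentioned in the abstract, the sketch below draws a 1:1 random sample of controls and computes an AUC from predicted scores. It is a minimal sketch and not the authors' pipeline: the function names `sample_controls` and `auc`, the identifiers, and the toy scores are all hypothetical. The abstract does not state which variables the cohort was matched on, so the sampling here is plain random sampling without any matching criteria.

```python
import random


def sample_controls(cases, candidates, seed=0):
    """Draw one control per case (1:1) at random, without replacement.

    Assumption: no matching variables are applied; the abstract does not
    specify them, so this is plain random sampling from the candidate pool.
    """
    rng = random.Random(seed)
    return rng.sample(candidates, k=len(cases))


def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive scores higher than a randomly chosen negative
    (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three cases and a pool of ten candidate controls.
cases = ["case1", "case2", "case3"]
pool = [f"ctrl{i}" for i in range(10)]
controls = sample_controls(cases, pool)

# Hypothetical labels (1 = PD, 0 = no PD) and model scores.
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])  # 3 of 4 pairs ranked correctly -> 0.75
```

The pairwise definition of AUC is quadratic in the number of samples, which is fine for a toy example; for a cohort of realistic size a rank-based implementation from an established library would be the more practical choice.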
