Hageh Cynthia Al, Henschel Andreas, Zhou Hao, Zubelli Jorge, Nader Moni, Chacar Stephanie, Iakovidou Nantia, Hatzikirou Haralampos, Abchee Antoine, O'Sullivan Siobhán, Zalloua Pierre A
Department of Public Health & Epidemiology, Khalifa University, Abu Dhabi, United Arab Emirates.
Department of Computer Science, Khalifa University, Abu Dhabi, United Arab Emirates.
Comput Struct Biotechnol J. 2025 Jun 23;27:2772-2781. doi: 10.1016/j.csbj.2025.06.038. eCollection 2025.
This study aimed to evaluate whether integrating clinical and genomic data improves the performance of machine learning (ML) models for predicting Type 2 Diabetes (T2D) risk.
Six models (Random Forest, Support Vector Machine, Linear Discriminant Analysis, Logistic Regression, Gradient Boosting Machine, and Decision Tree) were trained and tested on a discovery dataset (N=3,546) and validated in the UK Biobank (N=31,620). Model performance was assessed using clinical data alone, using combined clinical and genomic data, and within age-specific groups (>55 and ≤55 years).
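As an illustration of the evaluation described here, the sketch below computes the area under the ROC curve (AUC) as the fraction of case/control pairs in which the case receives the higher predicted risk, and then repeats the calculation within the two age groups. It is a minimal sketch, not the study's pipeline: the function names `auc` and `auc_by_age` and the toy labels, scores, and ages are hypothetical, and the scores could come from any of the six models.

```python
def auc(labels, scores):
    """AUC as the share of (case, control) pairs ranked correctly.

    labels: 1 for T2D cases, 0 for controls.
    scores: predicted risk from any classifier (higher = more risk).
    Ties between a case and a control count as half a correct pair.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    correct = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                correct += 1.0
            elif c == k:
                correct += 0.5
    return correct / (len(cases) * len(controls))


def auc_by_age(labels, scores, ages, cutoff=55):
    """AUC computed separately for ages <= cutoff and > cutoff."""
    young = [(y, s) for y, s, a in zip(labels, scores, ages) if a <= cutoff]
    old = [(y, s) for y, s, a in zip(labels, scores, ages) if a > cutoff]
    return {
        "younger": auc([y for y, _ in young], [s for _, s in young]),
        "older": auc([y for y, _ in old], [s for _, s in old]),
    }


if __name__ == "__main__":
    # Toy data (hypothetical): two risk-score vectors for the same people.
    labels = [1, 0, 1, 0]
    clinical_only = [0.8, 0.8, 0.3, 0.1]
    clinical_plus_genomic = [0.9, 0.6, 0.7, 0.1]
    print("clinical only:", auc(labels, clinical_only))
    print("clinical + genomic:", auc(labels, clinical_plus_genomic))
```

Comparing the two printed values mirrors the clinical-only versus combined comparison, while `auc_by_age` mirrors the stratified analysis; the pairwise loop is quadratic and is only meant for small illustrative inputs.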
The inclusion of genomic data modestly improved model performance across all algorithms in the discovery dataset. Clinical features such as family history of T2D and hypertension consistently ranked as top features. When SNPs were added, T2D-associated variants, including rs2943641, rs7903146, and rs7756992, emerged among the most important features, particularly in younger individuals. These findings demonstrate the translational potential of incorporating genomics for early risk identification. In the UK Biobank, all models achieved AUCs exceeding 91% with combined clinical and genomic data. Performance was notably better among younger individuals (≤55 years), emphasizing the models' potential for early detection. Integration of a polygenic risk score (PRS) further supported risk prediction, particularly in younger individuals, though incremental gains were modest.
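For readers unfamiliar with polygenic risk scores, a PRS is commonly formed as a weighted sum of risk-allele counts across variants. The sketch below shows that general construction only; the abstract does not state how the study's PRS was built, so the function name `polygenic_risk_score`, the per-variant weights, and the genotypes are all hypothetical placeholders rather than estimated effect sizes.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages.

    dosages: dict mapping SNP id -> risk-allele count (0, 1, or 2).
    weights: dict mapping SNP id -> per-allele weight (e.g. a log odds ratio).
    SNPs missing from `dosages` are treated as 0 copies, which is a
    simplifying assumption; real pipelines usually impute missing genotypes.
    """
    return sum(w * dosages.get(snp, 0) for snp, w in weights.items())


if __name__ == "__main__":
    # Illustrative weights only (hypothetical, not taken from the study).
    weights = {"rs2943641": 0.25, "rs7903146": 0.5, "rs7756992": 0.25}
    person = {"rs2943641": 1, "rs7903146": 2, "rs7756992": 0}
    print(polygenic_risk_score(person, weights))
```

A score of this kind can then be supplied to a classifier as one additional feature alongside the clinical variables.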
While traditional clinical factors remained the strongest predictors of T2D risk, integration of genomic data produced a modest improvement in model performance, especially among younger adults. Validation across independent datasets confirmed the generalizability of these findings, underscoring the value of multi-dimensional risk-prediction models to refine T2D risk assessment.