Xu Yingke, Wu Qing
Nevada Institute of Personalized Medicine, College of Science, University of Nevada, Las Vegas, Nevada, United States of America.
Department of Epidemiology and Biostatistics, School of Public Health, the University of Nevada Las Vegas, Las Vegas, Nevada, United States of America.
PLOS Digit Health. 2025 Apr 9;4(4):e0000790. doi: 10.1371/journal.pdig.0000790. eCollection 2025 Apr.
Genetic factors account for 60-70% of the variability in rheumatoid arthritis (RA), yet few studies have used genetic variants to predict RA risk. This study aimed to improve RA risk prediction by applying machine-learning algorithms to single nucleotide polymorphisms (SNPs) in Women's Health Initiative data. We developed four predictive models: 1) common RA risk factors alone; 2) model 1 plus polygenic risk scores (PRS) with principal components; 3) model 1 plus SNPs after feature reduction; and 4) model 1 plus SNPs processed with kernel principal component analysis. Each model was assessed using logistic regression (LR), random forest (RF), eXtreme Gradient Boosting (XGBoost), and support vector machine (SVM). Performance metrics included the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive and negative predictive values (PPV and NPV), and F1-score. The fourth model, integrating SNPs with XGBoost, outperformed all other models. By combining genomic data with conventional phenotypic predictors, this XGBoost model significantly enhanced predictive accuracy, achieving the highest AUC (0.90) and an F1-score of 0.83. The DeLong test confirmed significant differences in AUC between this model and the others (p < 0.0001), highlighting its efficacy in using complex genetic information. These findings emphasize the advantage of combining in-depth genomic data with advanced machine learning for RA risk prediction. The robust performance of the XGBoost model that integrated conventional risk factors and individual SNPs demonstrates its potential as a tool in personalized medicine for complex diseases like RA. This approach offers a more nuanced and effective RA risk assessment strategy, and further studies are needed to extend it to broader applications.
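To make the performance metrics named in the abstract concrete, the following is a minimal sketch of how AUC, sensitivity, specificity, PPV, NPV, and F1-score can be computed from true labels and predicted risk scores. It is not the authors' code: the function names (`classification_metrics`, `auc`), the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and none of the numbers come from the study.

```python
from typing import Dict, Sequence


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Threshold-based metrics for binary labels (1 = case, 0 = control).

    The 0.5 threshold is an illustrative assumption, not a value from the study.
    """
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),  # true positive rate
        "specificity": _ratio(tn, tn + fp),  # true negative rate
        "ppv": _ratio(tp, tp + fp),          # positive predictive value
        "npv": _ratio(tn, tn + fn),          # negative predictive value
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the probability that a random case outscores a random control.

    Ties count as half a win (the Mann-Whitney formulation of the ROC area).
    This pairwise version is O(n_cases * n_controls), fine for small examples.
    """
    cases = [s for t, s in zip(y_true, y_score) if t == 1]
    controls = [s for t, s in zip(y_true, y_score) if t == 0]
    if not cases or not controls:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


if __name__ == "__main__":
    # Toy data invented for illustration only; not from the study.
    labels = [1, 1, 1, 0, 0, 0, 1, 0]
    scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
    print(classification_metrics(labels, scores))
    print("AUC:", auc(labels, scores))
```

Sensitivity, specificity, PPV, NPV, and F1 all depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason a study may report both kinds of metric.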