Benson Mikael, Smelik Martin, Li Xinxiu, Loscalzo Joseph, Sysoev Oleg, Mahmud Firoj, Aly Dina Mansour, Zhao Yelin
Karolinska Institute.
Brigham and Women's Hospital.
Res Sq. 2024 Mar 5:rs.3.rs-3921099. doi: 10.21203/rs.3.rs-3921099/v1.
Multiomics analyses have identified multiple potential biomarkers of the incidence and prevalence of complex diseases. However, it is not known which type of biomarker is optimal for clinical purposes. Here, we systematically compare 90 million genetic variants, 1,453 proteins, and 325 metabolites from 500,000 individuals with complex diseases in the UK Biobank. A machine learning pipeline comprising data cleaning, data imputation, feature selection, model training with cross-validation, and comparison of the results on holdout test sets showed that proteins were the most predictive, followed by metabolites and then genetic variants. With only five proteins per disease, the median (min-max) area under the receiver operating characteristic curve was 0.79 (0.65-0.86) for incidence and 0.84 (0.70-0.91) for prevalence. In summary, our work suggests the potential of predicting complex diseases based on a limited number of proteins. We provide an interactive atlas (macd.shinyapps.io/ShinyApp/) to find genomic, proteomic, or metabolomic biomarkers for different complex diseases.
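For readers unfamiliar with the reported metric, the following is a minimal sketch of how a per-disease area under the receiver operating characteristic curve (AUC) can be computed on a holdout set and then summarized as median (min-max) across diseases. It is not the authors' pipeline: the function names `roc_auc` and `summarize`, the disease labels, and all scores are hypothetical toy values chosen only for illustration.

```python
from statistics import median


def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs in which the case
    receives the higher score; tied pairs count as half a win."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def summarize(aucs):
    """Return (median, min, max) of a list of per-disease AUCs."""
    return median(aucs), min(aucs), max(aucs)


# Hypothetical holdout predictions: (true labels, predicted risk scores).
# These numbers are invented for the example and are not study data.
toy_holdout = {
    "disease_A": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "disease_B": ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7]),
    "disease_C": ([1, 0, 0, 1], [0.6, 0.6, 0.1, 0.9]),
}

per_disease_auc = {name: roc_auc(y, s) for name, (y, s) in toy_holdout.items()}
med, lo, hi = summarize(list(per_disease_auc.values()))
print(f"median (min-max) AUC: {med:.2f} ({lo:.2f}-{hi:.2f})")
```

The pairwise definition used here is quadratic in the number of samples, which is fine for a toy example; at biobank scale a rank-based computation would be the more practical choice.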