Medical Digital Twin Research Group, Department of Clinical Science, Intervention and Technology (CLINTEC), Karolinska Institute, Stockholm, Sweden.
Division of Cardiovascular Medicine, Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
Sci Rep. 2024 Jun 3;14(1):12710. doi: 10.1038/s41598-024-63399-9.
Multiomics analyses have identified multiple potential biomarkers of the incidence and prevalence of complex diseases. However, it is not known which type of biomarker is optimal for clinical purposes. Here, we systematically compare 90 million genetic variants, 1453 proteins, and 325 metabolites from 500,000 individuals with complex diseases from the UK Biobank. A machine learning pipeline consisting of data cleaning, data imputation, feature selection, and model training with cross-validation, with results compared on holdout test sets, showed that proteins were most predictive, followed by metabolites and then genetic variants. Only five proteins per disease yielded median (min-max) areas under the receiver operating characteristic curve (AUCs) of 0.79 (0.65-0.86) for incidence and 0.84 (0.70-0.91) for prevalence. In summary, our work suggests that complex diseases can potentially be predicted from a limited number of proteins. We provide an interactive atlas (macd.shinyapps.io/ShinyApp/) for finding genomic, proteomic, or metabolomic biomarkers for different complex diseases.
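To make the reported summary statistic concrete, the following minimal sketch shows one way a "median (min-max)" area under the receiver operating characteristic curve (AUC) can be computed from holdout predictions. It is an illustration only, not the authors' pipeline: the function names `roc_auc` and `summarize`, the disease labels, and all numbers in `toy_holdout` are hypothetical and invented for this example.

```python
from statistics import median


def roc_auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarize(aucs):
    """Return (median, min, max) over per-disease AUCs."""
    values = list(aucs)
    return median(values), min(values), max(values)


# Toy holdout labels and predicted risks, invented for illustration
# (assumption: these are NOT UK Biobank data or results from the paper).
toy_holdout = {
    "disease_a": ([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]),
    "disease_b": ([1, 0, 1, 0], [0.8, 0.3, 0.7, 0.2]),
    "disease_c": ([0, 1, 0, 1], [0.6, 0.6, 0.2, 0.9]),
}

per_disease_auc = {name: roc_auc(y, s) for name, (y, s) in toy_holdout.items()}
med, lo, hi = summarize(per_disease_auc.values())
print(f"median (min-max) AUC: {med:.2f} ({lo:.2f}-{hi:.2f})")
```

The pairwise definition used here is quadratic in sample size and is chosen for readability; on cohort-scale data a rank-based implementation from an established library would be the more practical choice.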