Multivariate genome-wide association study models to improve prediction of Crohn's disease risk and identification of potential novel variants.

Author information

Tecnologico de Monterrey, Escuela de Medicina, Cátedra de Bioinformática, Av. Morones Prieto No. 3000, Colonia Los Doctores, Monterrey Nuevo León, 64710, Mexico.

Graduate Professional Studies, Brandeis University, Waltham, 02453, MA, USA.

Publication information

Comput Biol Med. 2022 Jun;145:105398. doi: 10.1016/j.compbiomed.2022.105398. Epub 2022 Mar 12.

Abstract

BACKGROUND

Crohn's disease (CD) is a type of inflammatory bowel disease (IBD) that affects the gastrointestinal tract and presents with diverse symptoms. To date, genome-wide association studies (GWAS) have identified more than 140 genetic loci associated with CD across several datasets. The usual univariate GWAS methods have uncovered common variants with small effects, but they assume independence among variants and therefore miss subtle combinatorial signals. Multivariate approaches have improved risk prediction and complement univariate methods in elucidating the etiology of complex traits and suggesting potential novel associations. However, the current multivariate models for CD have been assessed on three datasets (published from 2006 to 2008) under unrelated methodological settings, and their reported performance varies widely. Notably, these multivariate studies did not analyze potential novel variants. Here, we aimed to perform a robust multivariate analysis of a CD dataset different from the one commonly used, and to determine whether the resulting models could provide additional information about potential novel CD variants.
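As a toy illustration of the combinatorial signals mentioned above, the following Python sketch builds a hypothetical cohort in which case status depends only on the interaction of two binary markers. The cohort, the XOR rule, and all names are assumptions made for illustration; real CD associations are not expected to be this clean.

```python
from itertools import product

# Hypothetical cohort: two binary markers (a, b), every combination equally common.
# Case status follows an XOR interaction, an assumption chosen purely for illustration.
cohort = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2) for _ in range(25)]


def case_rate(rows):
    """Fraction of rows that are cases (status is the third field)."""
    return sum(status for _, _, status in rows) / len(rows)


def marginal_effect(marker_index):
    """Difference in case rate between carriers and non-carriers of one marker."""
    carriers = [row for row in cohort if row[marker_index] == 1]
    non_carriers = [row for row in cohort if row[marker_index] == 0]
    return case_rate(carriers) - case_rate(non_carriers)


# One-marker-at-a-time view: what a univariate test would look at.
effect_a = marginal_effect(0)
effect_b = marginal_effect(1)

# Joint view: case rate within each (a, b) combination.
joint_case_rates = {
    (a, b): case_rate([row for row in cohort if row[:2] == (a, b)])
    for a, b in product((0, 1), repeat=2)
}

print("marginal effects:", effect_a, effect_b)
print("joint case rates:", joint_case_rates)
```

Each marker alone shows no difference in case rate between carriers and non-carriers, while the two markers together separate cases from controls completely. A one-variant-at-a-time test cannot see a pattern of this kind.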

METHODS

We compared several multivariate methods and models, namely LASSO (least absolute shrinkage and selection operator), XGBoost, random forest (RF), bootstrap stage-wise model selection (BSWiMS), and LDpred, using a strict random subsampling approach to predict CD risk. The analysis used a recent GWAS dataset from the United Kingdom IBD Genetics Consortium (UKIBDGC), made available in 2017, that had not previously been used in CD prediction studies. In addition, we assessed the effect of common strategies that increase or decrease the number of single-nucleotide polymorphism (SNP) markers (genotype imputation and linkage disequilibrium (LD) clumping).
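To make the LD-clumping step concrete, here is a minimal Python sketch of greedy clumping: SNPs are visited in order of association p-value, each retained SNP becomes an index SNP, and any remaining SNP in high LD with it is dropped. This is not the study's pipeline. The genotype vectors, p-values, SNP names, and the r² threshold of 0.5 are all hypothetical, and the sketch omits the physical-distance window that clumping tools normally apply.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (a simple LD proxy)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic marker carries no LD information
    return cov * cov / (var_x * var_y)


def ld_clump(genotypes, p_values, r2_threshold=0.5):
    """Greedy clumping: keep the most significant SNP, drop SNPs in LD with it, repeat."""
    remaining = sorted(genotypes, key=lambda snp: p_values[snp])
    kept = []
    while remaining:
        index_snp = remaining.pop(0)
        kept.append(index_snp)
        remaining = [
            snp for snp in remaining
            if r_squared(genotypes[index_snp], genotypes[snp]) < r2_threshold
        ]
    return kept


# Hypothetical allele counts (0, 1, 2) for eight individuals.
genotypes = {
    "snp1": [0, 0, 2, 2, 0, 0, 2, 2],
    "snp2": [0, 0, 2, 2, 0, 0, 2, 1],  # nearly identical to snp1 (high LD)
    "snp3": [0, 2, 0, 2, 0, 2, 0, 2],  # uncorrelated with snp1
}
# Hypothetical association p-values.
p_values = {"snp1": 1e-6, "snp2": 1e-4, "snp3": 1e-3}

clumped = ld_clump(genotypes, p_values)
print(clumped)
```

In this toy input, snp2 is removed because it is in high LD with the more significant snp1, while snp3 survives as an independent signal. The effect is a smaller marker set with less redundancy, the opposite direction to imputation, which adds markers.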

RESULTS

We found that the LDpred model without any imputation was the best of all the tested models for predicting CD risk in this dataset (area under the receiver operating characteristic curve (AUROC) = 0.667 ± 0.024). We validated the best models using a second dataset (National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) IBD Genetics Consortium, which was previously used in CD prediction studies), in which LDpred was also the best method, with a similar performance (AUROC = 0.634 ± 0.009). Based on the importance of the variants yielded by the multivariate models, we identified a previously unnoticed region on chromosome 6, tagged by SNP rs4945943 and located close to the gene MARCKS, that appeared to contribute to CD risk.
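A figure such as "AUROC = 0.667 ± 0.024" is the kind of summary a repeated random subsampling evaluation produces, assuming the ± reflects the spread across splits. The Python sketch below shows that evaluation loop on synthetic data. The toy additive scorer stands in for a real model such as LDpred, and the cohort, the 30% test fraction, the 20 repeats, and all function names are assumptions, not the study's settings.

```python
import random
from statistics import mean, stdev


def auroc(scores, labels):
    """Probability that a random case outscores a random control (ties count half)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("AUROC needs at least one case and one control")
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def fit_mean_difference(genotypes, labels):
    """Toy additive score: weight each marker by its case-minus-control mean genotype."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    weights = [mean(g[j] for g in cases) - mean(g[j] for g in controls)
               for j in range(len(genotypes[0]))]
    return lambda g: sum(w * x for w, x in zip(weights, g))


def random_subsampling(genotypes, labels, fit, n_repeats=20, test_fraction=0.3, seed=0):
    """Repeatedly split at random, fit on the training part, score the held-out part."""
    rng = random.Random(seed)
    indices = list(range(len(genotypes)))
    n_test = max(1, round(len(indices) * test_fraction))
    aurocs = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        model = fit([genotypes[i] for i in train_idx], [labels[i] for i in train_idx])
        scores = [model(genotypes[i]) for i in test_idx]
        aurocs.append(auroc(scores, [labels[i] for i in test_idx]))
    return mean(aurocs), stdev(aurocs)


# Synthetic cohort (assumption): 300 individuals, 3 markers, risk driven by marker 0.
data_rng = random.Random(42)
genotypes = [[data_rng.choice((0, 1, 2)) for _ in range(3)] for _ in range(300)]
labels = [1 if g[0] == 2 else 0 if g[0] == 0 else int(data_rng.random() < 0.5)
          for g in genotypes]

mean_auroc, sd_auroc = random_subsampling(genotypes, labels, fit_mean_difference)
print(f"AUROC = {mean_auroc:.3f} ± {sd_auroc:.3f}")
```

Because the model is refit on every training split and scored only on individuals it has not seen, the mean estimates out-of-sample performance and the standard deviation indicates how sensitive that estimate is to the particular split.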

CONCLUSIONS

This study is the first multivariate prediction analysis applied to the UKIBDGC dataset. Our robust multivariate analysis setting enabled us to identify a potential variant that contributes to CD risk. Multivariate methods are valuable tools for identifying genes that contribute to disease risk.
