Department of Genetics, Yale University, New Haven, Connecticut, 06510, USA.
Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, 10029, USA.
Genes Immun. 2019 Sep;20(7):577-588. doi: 10.1038/s41435-019-0059-y. Epub 2019 Jan 29.
Genome-wide association studies have identified ~170 loci associated with Crohn's disease (CD), and defining which genes drive these association signals is a major challenge. The primary aim of this study was to define which CD locus genes are most likely to be disease related. We developed a gene prioritization regression model (GPRM) by integrating complementary mRNA expression datasets, including bulk RNA-Seq from the terminal ileum of 302 newly diagnosed, untreated CD patients and controls, and from stimulated monocytes. Transcriptome-wide association and co-expression network analyses were performed on the ileal RNA-Seq datasets, identifying 40 genome-wide significant genes. Co-expression network analysis identified a single gene module that was substantially enriched for CD locus genes and most highly expressed in monocytes. By including expression-based and epigenetic information, we narrowed the likely CD genes from an average of 7.8 total genes per locus to 2.5 prioritized genes per locus. We validated the model structure by cross-validation and the prioritization results by protein-association network analyses, which demonstrated significantly higher CD gene interactions for prioritized compared with non-prioritized genes. Although individual datasets cannot convey all of the information relevant to a disease, combining data from multiple relevant expression-based datasets improves prediction of disease genes and furthers understanding of disease pathogenesis.