MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, TNLIST; Department of Automation, Tsinghua University, Beijing 100084, China
Department of Statistics, Stanford University, Stanford, CA 94305, USA
J Mol Cell Biol. 2015 Jun;7(3):214-30. doi: 10.1093/jmcb/mjv008. Epub 2015 Feb 13.
Uncovering causal genes for human inherited diseases, the primary step toward understanding the pathogenesis of these diseases, requires a combined analysis of genetic and genomic data. Although bioinformatics methods have been designed to prioritize candidate genes resulting from genetic linkage analysis or association studies, existing methods cover only a limited set of diseases and genes, which prevents a whole-genome scan for causal genes for a significant proportion of diseases. To overcome this limitation, we propose a method named pgWalk that prioritizes candidate genes by integrating multiple phenomic and genomic data. We derive three types of phenotype similarities among 7719 diseases and nine types of functional similarities among 20327 genes. For each pairing of one phenotype similarity with one gene similarity, we construct a disease-gene network and then simulate a random walker wandering on this heterogeneous network to quantify the strength of association between a candidate gene and a query disease. A weighted version of Fisher's method with correction for dependence is adopted to integrate the 27 scores obtained in this way, and a final q-value is calibrated for prioritizing candidate genes. A series of validation experiments is conducted to demonstrate the superior performance of this approach. We further show the effectiveness of this method in exome sequencing studies of autism and epileptic encephalopathies. An online service and the standalone software of pgWalk can be found at http://bioinfo.au.tsinghua.edu.cn/jianglab/pgwalk.
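The random-walk step described in the abstract can be illustrated with a generic random walk with restart on a small heterogeneous network. This is a minimal sketch under stated assumptions, not pgWalk's actual formulation: the abstract does not specify the transition rule, so the function names (`row_normalize`, `random_walk_scores`), the restart probability, the simple row normalization of the combined adjacency matrix, and the toy similarity values are all hypothetical choices made for illustration.

```python
import numpy as np


def row_normalize(m):
    """Scale each row to sum to 1; rows that sum to 0 stay all-zero."""
    m = np.asarray(m, dtype=float)
    sums = m.sum(axis=1, keepdims=True)
    return np.divide(m, sums, out=np.zeros_like(m), where=sums > 0)


def random_walk_scores(disease_sim, gene_sim, assoc, query,
                       restart=0.5, tol=1e-10, max_iter=1000):
    """Random walk with restart on a disease-gene network (illustrative).

    disease_sim : (n_d, n_d) phenotype similarity matrix
    gene_sim    : (n_g, n_g) gene functional similarity matrix
    assoc       : (n_d, n_g) known disease-gene associations (0/1)
    query       : index of the query disease, where the walker starts
    restart     : assumed probability of jumping back to the query node
    Returns the steady-state visiting probability of each gene.
    """
    disease_sim = np.asarray(disease_sim, dtype=float)
    gene_sim = np.asarray(gene_sim, dtype=float)
    assoc = np.asarray(assoc, dtype=float)
    n_d, n_g = assoc.shape

    # Combined adjacency: disease nodes first, then gene nodes.
    adj = np.block([[disease_sim, assoc],
                    [assoc.T, gene_sim]])
    trans = row_normalize(adj)  # trans[i, j] = P(step from i to j)

    p0 = np.zeros(n_d + n_g)
    p0[query] = 1.0
    p = p0.copy()
    for _ in range(max_iter):
        nxt = (1.0 - restart) * (trans.T @ p) + restart * p0
        converged = np.abs(nxt - p).sum() < tol
        p = nxt
        if converged:
            break
    return p[n_d:]


# Toy network: 2 diseases, 3 genes; disease 1 has a known link to gene 0.
disease_sim = [[0.0, 0.8], [0.8, 0.0]]
gene_sim = [[0.0, 0.9, 0.0], [0.9, 0.0, 0.1], [0.0, 0.1, 0.0]]
assoc = [[0, 0, 0], [1, 0, 0]]

scores = random_walk_scores(disease_sim, gene_sim, assoc, query=0)
ranking = np.argsort(-scores)  # candidate genes, best first
print(ranking, scores)
```

In this toy setting the query disease has no known gene, yet the walker reaches gene 0 through the phenotypically similar disease 1, and then reaches the remaining genes through gene-gene similarity. Repeating such a walk for each combination of a phenotype similarity and a gene similarity would give one score per combination, which is the kind of input the abstract's score-integration step combines.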