MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, China.
BMC Bioinformatics. 2011 Feb 15;12 Suppl 1(Suppl 1):S11. doi: 10.1186/1471-2105-12-S1-S11.
The identification of genes responsible for human inherited diseases is one of the most challenging tasks in human genetics. Recent studies based on phenotype similarity and gene proximity have demonstrated great success in prioritizing candidate genes for human diseases. However, most of these methods rely on a single protein-protein interaction (PPI) network to calculate similarities between genes, which greatly restricts their scope of application. Meanwhile, independently constructed and maintained PPI networks usually differ widely in coverage and quality, making the selection of a suitable PPI network unavoidable but difficult.
We adopt a linear model to explain similarities between disease phenotypes using gene proximities that are quantified by diffusion kernels of one or more PPI networks. We solve this model via a Bayesian approach and derive an analytic form for the Bayes factor, which naturally measures the strength of association between a query disease and a candidate gene and can therefore be used as a score to prioritize candidate genes. This method is intrinsically capable of integrating multiple PPI networks.
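As an illustration of the gene proximity measure mentioned above, the following is a minimal sketch of a diffusion kernel on a toy PPI network. It assumes the common definition K = exp(-beta * L), where L is the graph Laplacian of the network. The function name `diffusion_kernel`, the value `beta=0.5`, and the four-gene path network are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


def diffusion_kernel(adjacency, beta=0.5):
    """Return K = exp(-beta * L) for an undirected network.

    L = D - A is the graph Laplacian. Because L is symmetric, the
    matrix exponential is computed from its eigendecomposition.
    beta is an assumed diffusion rate, not a value from the paper.
    """
    a = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(a.sum(axis=1)) - a
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    # V * exp(-beta * lambda) scales each eigenvector column, then V^T.
    return (eigvecs * np.exp(-beta * eigvals)) @ eigvecs.T


# Toy network (hypothetical): four genes connected in a chain 0-1-2-3.
toy_adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
kernel = diffusion_kernel(toy_adjacency)
# Proximity of gene 0 to every gene, decreasing with network distance.
print(np.round(kernel[0], 3))
```

Each entry `kernel[i, j]` can be read as the proximity between genes i and j. With several PPI networks, one kernel per network would be computed and each would supply its own explanatory variables to the linear model.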
We show that gene proximities calculated from PPI networks imply phenotype similarities. We demonstrate the effectiveness of the Bayesian regression approach on five PPI networks via large-scale leave-one-out cross-validation experiments and summarize the results in terms of the mean rank ratio of known disease genes and the area under the receiver operating characteristic curve (AUC). We further show the capability of our approach in integrating multiple PPI networks.
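The two summary statistics named above can be sketched as follows. This is a simplified illustration, not the paper's exact evaluation protocol: it assumes each leave-one-out run ranks one held-out known disease gene among a set of candidates by score, and it averages a per-run AUC rather than pooling thresholds across runs. The helper names and the toy scores are hypothetical.

```python
def rank_of_true_gene(scores, true_index):
    """1-based rank of the held-out gene; higher score means better rank."""
    true_score = scores[true_index]
    return 1 + sum(1 for s in scores if s > true_score)


def rank_ratio(scores, true_index):
    """Rank of the held-out gene divided by the number of candidates."""
    return rank_of_true_gene(scores, true_index) / len(scores)


def single_positive_auc(scores, true_index):
    """AUC of one run: fraction of other candidates ranked below the true gene."""
    n = len(scores)
    return (n - rank_of_true_gene(scores, true_index)) / (n - 1)


# Two hypothetical validation runs: (candidate scores, index of true gene).
runs = [
    ([0.9, 0.2, 0.5, 0.1], 0),  # true gene ranked 1st of 4
    ([0.3, 0.8, 0.6, 0.1], 2),  # true gene ranked 2nd of 4
]
mean_rank_ratio = sum(rank_ratio(s, i) for s, i in runs) / len(runs)
mean_auc = sum(single_positive_auc(s, i) for s, i in runs) / len(runs)
print(mean_rank_ratio, mean_auc)
```

A lower mean rank ratio and a higher AUC both indicate that known disease genes tend to be ranked near the top of the candidate list.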
The Bayesian regression approach achieves much higher performance than the existing CIPHER approach and the ordinary linear regression method. The integration of multiple PPI networks can greatly broaden the scope of application of the proposed method in the inference of disease genes.