Jiang Rui, Gan Mingxin, He Peng
MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, China.
BMC Syst Biol. 2011;5 Suppl 2(Suppl 2):S2. doi: 10.1186/1752-0509-5-S2-S2. Epub 2011 Dec 14.
Inferring which genes are truly associated with inherited human diseases, from a set of candidates produced by genetic linkage studies, has been one of the most challenging tasks in human genetics. Although several computational approaches have been proposed to prioritize candidate genes using protein-protein interaction (PPI) networks, these methods usually cover fewer than half of known human genes.
We propose to rely on the biological process domain of the gene ontology to construct a gene semantic similarity network and then use the network to infer disease genes. We show that the constructed network covers about 50% more genes than a typical PPI network. By analyzing the gene semantic similarity network with the PPI network, we show that gene pairs tend to have higher semantic similarity scores if the corresponding proteins are closer to each other in the PPI network. By analyzing the gene semantic similarity network with a phenotype similarity network, we show that semantic similarity scores of genes associated with similar diseases are significantly different from those of genes selected at random, and that genes with higher semantic similarity scores tend to be associated with diseases with higher phenotype similarity scores. We further use the gene semantic similarity network with a random walk with restart model to infer disease genes. Through a series of large-scale leave-one-out cross-validation experiments, we show that the gene semantic similarity network can achieve not only higher coverage but also higher accuracy than the PPI network in the inference of disease genes.
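The random walk with restart step mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption: the function name `random_walk_with_restart`, the restart probability of 0.7, column normalization of the similarity matrix, and the toy 4-gene network are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def random_walk_with_restart(weights, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Score every node by its steady-state visiting probability.

    weights : square matrix of non-negative edge weights (for example,
              gene semantic similarity scores); weights[i, j] is the
              weight of the edge between node i and node j.
    seeds   : indices of the seed nodes (known disease genes).
    restart : probability of jumping back to a seed at each step
              (0.7 is an assumed value, not taken from the paper).
    """
    weights = np.asarray(weights, dtype=float)
    n = weights.shape[0]

    # Column-normalize so each column is a probability distribution
    # over the neighbors of that node; isolated nodes keep a zero column.
    col_sums = weights.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    transition = weights / col_sums

    # The walker starts from, and restarts at, the seeds with equal probability.
    p0 = np.zeros(n)
    p0[list(seeds)] = 1.0 / len(seeds)

    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * (transition @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy similarity network with 4 genes (made-up numbers).
toy_network = np.array([
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.2, 0.0],
    [0.1, 0.2, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.0],
])
scores = random_walk_with_restart(toy_network, seeds=[0])
ranking = np.argsort(-scores)  # candidate genes ordered by decreasing score
```

In this sketch, candidate genes would be ranked by their entry in `scores`. Genes that are strongly connected to the seed in the similarity network receive higher scores than genes reachable only through weak or indirect edges.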