Zhang Lingang, Pavlovic Vladimir, Cantor Charles R, Kasif Simon
Center for Advanced Biotechnology, Boston University, Boston, Massachusetts 02215, USA.
Genome Res. 2003 Jun;13(6A):1190-202. doi: 10.1101/gr.703903. Epub 2003 May 12.
The identification of genes in the human genome remains a challenge: the actual predictions appear to disagree substantially and vary dramatically with the specific gene-finding methodology used. Because the pattern of conservation in coding regions is expected to differ from that in intronic or intergenic regions, a comparative computational analysis against a reference, such as the mouse genome, can in principle improve the computational identification of genes in the human genome. However, this comparative methodology depends critically on three factors: (1) the selection of the most appropriate reference genome (in particular, it is not clear whether the mouse is at the correct evolutionary distance from the human to provide sufficiently distinctive conservation levels in different genomic regions); (2) the selection of comparative features that provide the most benefit to gene recognition; and (3) the selection of an evidence integration architecture that effectively interprets the comparative features. We address the first question with a novel evolutionary analysis that allows us to explicitly correlate the performance of the gene recognition system with the evolutionary distance (time) between the two genomes. Our simulation results indicate that reference genomes across a wide range of evolutionary time points appear to deliver reasonable comparative prediction of human genes. In particular, the evolutionary time between human and mouse generally falls in the region of good performance; however, better accuracy might be achieved with a reference genome more distant than mouse. To address the second question, we propose several natural comparative measures of conservation for identifying exons and exon boundaries. Finally, we experiment with Bayesian networks for the integration of comparative and compositional evidence.
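The idea of combining a comparative feature with a compositional one can be illustrated with a minimal sketch. This is not the authors' system: it uses a naive Bayes model (the simplest Bayesian network, with one class node and conditionally independent features), a plain percent-identity measure over a pre-aligned human/mouse window, and GC content as a stand-in compositional feature. All function names, thresholds, and likelihood values are hypothetical and chosen only for illustration.

```python
import math

def window_identity(human, mouse, start, size):
    """Fraction of columns in [start, start+size) where the aligned bases match.

    Assumes `human` and `mouse` are equal-length rows of a pairwise
    alignment, with '-' marking gaps. Gap columns count as mismatches.
    """
    h = human[start:start + size]
    m = mouse[start:start + size]
    matches = sum(1 for a, b in zip(h, m) if a == b and a != "-")
    return matches / max(len(h), 1)

def gc_content(seq):
    """Fraction of G/C among the unambiguous bases of `seq`."""
    bases = [c for c in seq if c in "ACGT"]
    return sum(c in "GC" for c in bases) / max(len(bases), 1)

def naive_bayes_posterior(features, likelihoods, prior_coding=0.5):
    """P(coding | features) under a naive Bayes model.

    features[name]    -> True if the feature is observed as "high"
    likelihoods[name] -> (P(high | coding), P(high | noncoding))
    The likelihood values are illustrative assumptions, not estimates
    taken from the paper.
    """
    log_c = math.log(prior_coding)
    log_n = math.log(1.0 - prior_coding)
    for name, high in features.items():
        p_c, p_n = likelihoods[name]
        log_c += math.log(p_c if high else 1.0 - p_c)
        log_n += math.log(p_n if high else 1.0 - p_n)
    # Normalize in log space for numerical stability.
    top = max(log_c, log_n)
    e_c, e_n = math.exp(log_c - top), math.exp(log_n - top)
    return e_c / (e_c + e_n)

# Hypothetical thresholds and likelihoods, for illustration only.
LIKELIHOODS = {"conserved": (0.8, 0.2), "gc_rich": (0.6, 0.4)}

def score_window(human, mouse, start, size):
    """Posterior probability that one alignment window is coding."""
    features = {
        "conserved": window_identity(human, mouse, start, size) >= 0.8,
        "gc_rich": gc_content(human[start:start + size]) >= 0.5,
    }
    return naive_bayes_posterior(features, LIKELIHOODS)
```

Calling `score_window(human_row, mouse_row, 0, 60)` on two aligned rows would return a value between 0 and 1. A richer Bayesian network could relax the independence assumption between features, which is one reason the choice of integration architecture matters.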