Mignone Flavio, Anselmo Anna, Donvito Giacinto, Maggi Giorgio P, Grillo Giorgio, Pesole Graziano
Department of Structural Chemistry and Inorganic Stereochemistry, School of Pharmacy, University of Milan, Italy.
BMC Genomics. 2008 Jun 11;9:277. doi: 10.1186/1471-2164-9-277.
Accurately detecting genes and identifying functional regions remains an open problem in the annotation of genomic sequences. The problem affects not only newly sequenced genomes but also those of very well-studied organisms such as human and mouse, where, despite great efforts, the inventory of genes and regulatory regions is far from complete. Comparative genomics is an effective approach to this problem. Unfortunately, it is limited by the computational cost of genome-wide comparisons and by the difficulty of discriminating between conserved coding and non-coding sequences. This discrimination is often based on, and therefore dependent on, the availability of annotated proteins.
In this paper we present the results of a comprehensive comparison of the human and mouse genomes, performed with a new high-throughput grid-based system that allows rapid detection of conserved sequences and accurate assessment of their coding potential. By detecting clusters of coding conserved sequences, the system can also accurately identify potential gene loci. Following this analysis, we created a collection of human-mouse conserved sequence tags and carefully compared our results with reliable annotations to benchmark the reliability of our classifications. Strikingly, we detected several potential gene loci that are supported by EST sequences but do not correspond to any gene annotated so far.
Here we present a new system that allows comprehensive comparison of genomes to detect conserved coding and non-coding sequences and to identify potential gene loci. Our system does not require any annotated sequence and is therefore suitable for the analysis of new or poorly annotated genomes.
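The idea of identifying potential gene loci from clusters of coding conserved sequences can be illustrated with a minimal sketch. The abstract does not specify the clustering procedure, so everything below is an assumption made for illustration: the `Tag` record, the `cluster_coding_tags` function, and the `max_gap` and `min_tags` thresholds are hypothetical and are not taken from the system described here. The sketch simply groups coding tags that lie close together on the same chromosome.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Tag:
    """Hypothetical record for one conserved sequence tag."""
    chrom: str
    start: int
    end: int
    coding: bool  # True if the tag was classified as coding


def cluster_coding_tags(tags: Iterable[Tag],
                        max_gap: int = 10_000,
                        min_tags: int = 2) -> List[List[Tag]]:
    """Group coding tags into candidate loci.

    Tags on the same chromosome are merged into one cluster while the gap
    between a tag's start and the cluster's current end is at most max_gap.
    Clusters with fewer than min_tags members are discarded.
    Both thresholds are illustrative assumptions.
    """
    coding = sorted((t for t in tags if t.coding),
                    key=lambda t: (t.chrom, t.start))
    clusters: List[List[Tag]] = []
    current: List[Tag] = []
    current_end = 0
    for tag in coding:
        same_chrom = bool(current) and tag.chrom == current[0].chrom
        if same_chrom and tag.start - current_end <= max_gap:
            current.append(tag)
            current_end = max(current_end, tag.end)
        else:
            # Close the previous cluster and start a new one.
            if len(current) >= min_tags:
                clusters.append(current)
            current = [tag]
            current_end = tag.end
    if len(current) >= min_tags:
        clusters.append(current)
    return clusters
```

In this sketch, non-coding tags are ignored and isolated coding tags are dropped, so only groups of nearby coding tags are reported as candidate loci. A real pipeline would likely also take strand, reading-frame consistency, and EST support into account.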