Kellis Manolis, Patterson Nick, Birren Bruce, Berger Bonnie, Lander Eric S
Whitehead Institute Center for Genome Research, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA.
J Comput Biol. 2004;11(2-3):319-55. doi: 10.1089/1066527041410319.
In Kellis et al. (2003), we reported the genome sequences of S. paradoxus, S. mikatae, and S. bayanus and compared these three yeast species to their close relative, S. cerevisiae. Genomewide comparative analysis allowed the identification of functionally important sequences, both coding and noncoding. In this companion paper we describe the mathematical and algorithmic results underpinning the analysis of these genomes. (1) We present methods for the automatic determination of genome correspondence. The algorithms enabled the automatic identification of orthologs for more than 90% of genes and intergenic regions across the four species despite the large number of duplicated genes in the yeast genome. The remaining ambiguities in the gene correspondence revealed recent gene family expansions in regions of rapid genomic change. (2) We present methods for the identification of protein-coding genes based on their patterns of nucleotide conservation across related species. We observed the pressure to conserve the reading frame of functional proteins and developed a test for gene identification with high sensitivity and specificity. We used this test to revisit the genome of S. cerevisiae, reducing the overall gene count by 500 genes (10% of previously annotated genes) and refining the gene structure of hundreds of genes. (3) We present novel methods for the systematic de novo identification of regulatory motifs. The methods do not rely on previous knowledge of gene function and in that way differ from the current literature on computational motif discovery. Based on genomewide conservation patterns of known motifs, we developed three conservation criteria that we used to discover novel motifs. We used an enumeration approach to select strongly conserved motif cores, which we extended and collapsed into a small number of candidate regulatory motifs. These include most previously known regulatory motifs as well as several noteworthy novel motifs. 
The majority of discovered motifs are enriched in functionally related genes, allowing us to infer a candidate function for novel motifs. Our results demonstrate the power of comparative genomics to further our understanding of any species. Our methods are validated by the extensive experimental knowledge in yeast and will be invaluable in the study of complex genomes like that of the human.
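To make the reading-frame idea in point (2) concrete, the following is a minimal sketch, not the authors' implementation. It assumes a pairwise nucleotide alignment of a candidate gene is already available as two equal-length strings with `-` for gaps. The function name `reading_frame_conservation` and the scoring details are illustrative assumptions: each aligned column is assigned to one of three frame offsets, and the score is the fraction of aligned columns that fall in the best-supported offset.

```python
def reading_frame_conservation(aligned_ref: str, aligned_other: str) -> float:
    """Fraction of ungapped alignment columns that share the dominant frame offset.

    Hypothetical helper for illustration only; the inputs are two rows of a
    pairwise alignment, with '-' marking gaps.
    """
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("aligned sequences must have equal length")

    ref_pos = 0      # nucleotides consumed so far in the reference
    other_pos = 0    # nucleotides consumed so far in the other species
    per_offset = [0, 0, 0]  # aligned columns observed at each frame offset
    aligned_cols = 0

    for a, b in zip(aligned_ref, aligned_other):
        a_gap = a == "-"
        b_gap = b == "-"
        if not a_gap and not b_gap:
            aligned_cols += 1
            # Offset between the codon positions of the two nucleotides.
            per_offset[(other_pos - ref_pos) % 3] += 1
        if not a_gap:
            ref_pos += 1
        if not b_gap:
            other_pos += 1

    if aligned_cols == 0:
        return 0.0
    # Choosing the best offset means the frame of the second sequence
    # does not need to be known in advance.
    return max(per_offset) / aligned_cols


if __name__ == "__main__":
    # A 3-base deletion keeps the frame; a 1-base deletion shifts it.
    print(reading_frame_conservation("ATGAAATTTGGG", "ATG---TTTGGG"))
    print(reading_frame_conservation("ATGAAATTTGGG", "ATG-AATTTGGG"))
```

In this sketch, gaps whose length is a multiple of three leave the score unchanged, whereas a frameshifting gap moves all downstream columns to a different offset and lowers the score. Any cutoff used to accept or reject a candidate gene would have to be chosen separately; none is implied here.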