Lin Michael F, Carlson Joseph W, Crosby Madeline A, Matthews Beverley B, Yu Charles, Park Soo, Wan Kenneth H, Schroeder Andrew J, Gramates L Sian, St Pierre Susan E, Roark Margaret, Wiley Kenneth L, Kulathinal Rob J, Zhang Peili, Myrick Kyl V, Antone Jerry V, Celniker Susan E, Gelbart William M, Kellis Manolis
Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02139, USA.
Genome Res. 2007 Dec;17(12):1823-36. doi: 10.1101/gr.6679507. Epub 2007 Nov 7.
The availability of sequenced genomes from 12 Drosophila species has enabled the use of comparative genomics for the systematic discovery of functional elements conserved within this genus. We have developed quantitative metrics for the evolutionary signatures specific to protein-coding regions and applied them genome-wide, resulting in 1193 candidate new protein-coding exons in the D. melanogaster genome. We have reviewed these predictions by manual curation and validated a subset by directed cDNA screening and sequencing, revealing both new genes and new alternative splice forms of known genes. We also used these evolutionary signatures to evaluate existing gene annotations, resulting in the validation of 87% of genes lacking descriptive names and identifying 414 poorly conserved genes that are likely to be spurious predictions, noncoding, or species-specific genes. Furthermore, our methods suggest a variety of refinements to hundreds of existing gene models, such as modifications to translation start codons and exon splice boundaries. Finally, we performed directed genome-wide searches for unusual protein-coding structures, discovering 149 possible examples of stop codon readthrough, 125 new candidate ORFs of polycistronic mRNAs, and several candidate translational frameshifts. These results affect >10% of annotated fly genes and demonstrate the power of comparative genomics to enhance our understanding of genome organization, even in a model organism as intensively studied as Drosophila melanogaster.