Bioinformatics Research Unit, Division of Genome and Biodiversity Research, National Institute of Agrobiological Sciences, 2-1-2 Kannondai, Tsukuba, Ibaraki 305-8602, Japan.
DNA Res. 2010 Oct;17(5):271-9. doi: 10.1093/dnares/dsq017. Epub 2010 Jul 28.
We present an annotation pipeline that accurately predicts exon-intron structures and protein-coding sequences (CDSs) on the basis of full-length cDNAs (FLcDNAs). This annotation pipeline was used to identify genes in 10 plant genomes. In particular, we show that interspecies mapping of FLcDNAs to genomes is of great value in fully utilizing FLcDNA resources, which are available for only a few species. Because low sequence conservation at the 5'- and 3'-ends of FLcDNAs between different species tends to result in truncated CDSs, we developed an improved algorithm that identifies complete CDSs by extending both ends of truncated CDSs. Interspecies mapping of 71 801 monocot FLcDNAs to the Oryza sativa genome led to the detection of 22 142 protein-coding regions. Moreover, in a comparison with two mapping programs and three ab initio prediction programs, our pipeline was better at identifying complete CDSs. Monocot interspecies mapping, in which nucleotide identity between FLcDNAs and the genome was ∼80%, showed that the resulting inferred CDSs were sufficiently accurate. Finally, we applied both inter- and intraspecies mapping to 10 monocot and dicot genomes and identified genes in 210 551 loci. Interspecies mapping of FLcDNAs is expected to effectively predict genes and CDSs in newly sequenced genomes.
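The abstract does not describe how the extension of truncated CDSs is implemented, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a forward-strand, intron-free genomic sequence, the standard genetic code, and a simple rule of choosing the most upstream in-frame ATG that is not separated from the truncated region by a stop codon. The function name `extend_cds` and its parameters are hypothetical.

```python
# Illustrative sketch only: extend a truncated, in-frame coding region in both
# directions. Assumptions (not from the source): forward strand, no introns,
# standard stop codons, and "longest ORF" choice of the start codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def extend_cds(genome: str, start: int, end: int) -> tuple[int, int]:
    """Return (new_start, new_end) for the half-open region [start, end)."""
    if not (0 <= start < end <= len(genome)) or (end - start) % 3 != 0:
        raise ValueError("region must lie in the sequence and be a multiple of 3")
    seq = genome.upper()

    # Upstream: step back one codon at a time. Remember each in-frame ATG and
    # stop as soon as an in-frame stop codon is reached.
    new_start = start
    pos = start - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon == START_CODON:
            new_start = pos
        pos -= 3

    # Downstream: if the region does not already end with a stop codon,
    # step forward to the first in-frame stop codon and include it.
    new_end = end
    if seq[end - 3:end] not in STOP_CODONS:
        pos = end
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                new_end = pos + 3
                break
            pos += 3

    return new_start, new_end


# Toy sequence: CC TAA ATG GCT GCA GGT CCC TGA GG
toy = "CCTAAATGGCTGCAGGTCCCTGAGG"
# The truncated region covers only "GCAGGT" (positions 11..17).
new_start, new_end = extend_cds(toy, 11, 17)
print(toy[new_start:new_end])
```

If no in-frame ATG or stop codon is found, the sketch leaves the corresponding end unchanged, which means the region stays truncated on that side. A real pipeline would also have to handle splice junctions and the reverse strand.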