Gotoh Osamu, Morita Mariko, Nelson David R
Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), Koto-ku, Tokyo 135-0064, Japan.
BMC Bioinformatics. 2014 Jun 14;15:189. doi: 10.1186/1471-2105-15-189.
Accurate computational identification of eukaryotic gene organization is a long-standing problem. Despite the fundamental importance of precise annotation of genes encoded in newly sequenced genomes, the accuracy of predicted gene structures has not been critically evaluated, mostly due to the scarcity of proper assessment methods.
We present a gene-structure-aware multiple sequence alignment method for gene prediction that uses amino acid sequences translated from homologous genes in many genomes. The approach provides rich information about the reliability of each predicted gene structure. We have also devised an iterative method that attempts to improve the structures of suspect predicted genes with a spliced alignment algorithm, using consensus sequences or reliable homologs as templates. Application of our methods to cytochrome P450 and ribosomal proteins from 47 plant genomes indicated that 50 to 60% of the annotated gene structures are likely to contain some defects. More than half of the defect-containing genes may be intrinsically broken: they are pseudogenes or gene fragments, lie in unfinished sequencing areas, or correspond to non-productive isoforms. The defects found in a majority of the remaining gene candidates, however, can be remedied by our iterative refinement method.
Refinement of eukaryotic gene structures mediated by gene-structure-aware multiple protein sequence alignment is a useful strategy to dramatically improve the overall prediction quality of a set of homologous genes. Our method will be applicable to various families of protein-coding genes if their domain structures are evolutionarily stable. It is also feasible to apply our method to gene families from all kingdoms of life, not just plants.
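As a rough illustration of how a gene-structure-aware alignment can expose suspect gene models, the following Python sketch maps each gene's intron positions onto the columns of a protein alignment and flags genes that lack a widely shared intron or carry one seen nowhere else. It is a toy under stated assumptions, not the authors' algorithm: the function names, the `min_support` threshold, the encoding of introns as counts of preceding residues, and the example data are all hypothetical.

```python
from collections import Counter


def intron_columns(aligned_seq, intron_positions):
    """Map intron positions to alignment columns.

    Assumption: each intron is given as the number of residues that
    precede it in the ungapped protein (1-based residue count).
    """
    wanted = set(intron_positions)
    columns = set()
    residues_seen = 0
    for col, ch in enumerate(aligned_seq):
        if ch != "-":
            residues_seen += 1
            if residues_seen in wanted:
                columns.add(col)
    return columns


def flag_suspect_genes(alignment, introns, min_support=0.5):
    """Report genes whose intron pattern disagrees with their homologs.

    'missing': columns where at least min_support of the genes have an
    intron but this gene does not.
    'unique':  columns where only this gene has an intron.
    Gaps are not treated specially in this simplified sketch.
    """
    cols = {
        name: intron_columns(seq, introns.get(name, ()))
        for name, seq in alignment.items()
    }
    counts = Counter(c for own in cols.values() for c in own)
    n_genes = len(alignment)
    conserved = {c for c, k in counts.items() if k / n_genes >= min_support}

    report = {}
    for name, own in cols.items():
        missing = sorted(conserved - own)
        unique = sorted(c for c in own if counts[c] == 1)
        if missing or unique:
            report[name] = {"missing": missing, "unique": unique}
    return report


# Hypothetical toy data: four aligned homologs and their intron positions.
alignment = {
    "geneA": "MKT-LLVAG",
    "geneB": "MKTALLVAG",
    "geneC": "MKT-LLVAG",
    "geneD": "MKT-LL-AG",
}
introns = {"geneA": [3], "geneB": [3], "geneC": [3], "geneD": [5]}

print(flag_suspect_genes(alignment, introns))
```

In this toy input only `geneD` is reported, because it lacks the intron shared by the other three genes and has one at a column no homolog supports. A flagged gene would be a candidate for re-prediction against a consensus or a reliable homolog, which this sketch does not attempt.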