Usuka J, Brendel V
Department of Chemistry, Stanford University, Stanford, CA, 94305, USA.
J Mol Biol. 2000 Apr 14;297(5):1075-85. doi: 10.1006/jmbi.2000.3641.
Gene identification in genomic DNA from eukaryotes is complicated by the vast combinatorial possibilities of potential exon assemblies. If the gene encodes a protein that is closely related to known proteins, gene identification is aided by matching similarity of potential translation products to those target proteins. The genomic DNA and protein sequences can be aligned directly by scoring the implied residues of in-frame nucleotide triplets against the protein residues in conventional ways, while allowing for long gaps in the alignment corresponding to introns in the genomic DNA. We describe a novel method for such spliced alignment. The method derives an optimal alignment based on scoring for both sequence similarity of the predicted gene product to the protein sequence and intrinsic splice site strength of the predicted introns. Application of the method to a representative set of 50 known genes from Arabidopsis thaliana showed significant improvement in prediction accuracy compared to previous spliced alignment methods. The method is also more accurate than ab initio gene prediction methods, provided sufficiently close target proteins are available. In view of the fast growth of public sequence repositories, we argue that close targets will be available for the majority of novel genes, making spliced alignment an excellent practical tool for high-throughput automated genome annotation.
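The spliced alignment idea described above can be illustrated with a small dynamic program. The following is a minimal sketch under strong simplifying assumptions and is not the authors' algorithm: the function name `spliced_align`, the scoring constants, and the minimum intron length are hypothetical; residues are scored by simple identity rather than a substitution matrix; introns are allowed only between complete codons; and splice site strength is reduced to requiring canonical GT...AG boundaries at a flat cost, whereas the method in the abstract scores intrinsic splice site strength.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Hypothetical toy parameters, not values from the paper.
MATCH, MISMATCH, GAP = 2, -1, -3
INTRON = -4      # flat cost of an intron with GT...AG boundaries
MIN_INTRON = 6   # shortest intron the toy model allows


def spliced_align(dna, protein):
    """Align all of `protein` to part of `dna`, allowing introns.

    Returns (score, exons), where exons are half-open (start, end)
    intervals on the DNA sequence.
    """
    n, m = len(dna), len(protein)
    # S[i][j]: best score with dna[:i] ending in an exon, protein[:j] used.
    # Column 0 stays at 0, so leading DNA flank is free.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(1, m + 1):
            # Protein residue left unaligned.
            cands = [(S[i][j - 1] + GAP, (i, j - 1, "protein_gap"))]
            if i >= 3:
                # Score the residue implied by the in-frame triplet.
                aa = CODON.get(dna[i - 3:i], "X")
                sub = MATCH if aa == protein[j - 1] else MISMATCH
                cands.append((S[i - 3][j - 1] + sub, (i - 3, j - 1, "codon")))
                # Extra codon in the DNA with no protein counterpart.
                cands.append((S[i - 3][j] + GAP, (i - 3, j, "dna_gap")))
            if i >= MIN_INTRON and dna[i - 2:i] == "AG":
                # Intron ending at i: try every upstream GT donor.
                for d in range(i - MIN_INTRON + 1):
                    if dna[d:d + 2] == "GT":
                        cands.append((S[d][j] + INTRON, (d, j, "intron")))
            S[i][j], back[i][j] = max(cands, key=lambda c: c[0])

    # Trailing DNA flank is free: take the best end position.
    end = max(range(n + 1), key=lambda i: S[i][m])
    score = S[end][m]

    # Trace back, cutting an exon each time an intron is crossed.
    exons, i, j, exon_end = [], end, m, end
    while back[i][j] is not None:
        pi, pj, op = back[i][j]
        if op == "intron":
            exons.append((i, exon_end))
            exon_end = pi
        i, j = pi, pj
    exons.append((i, exon_end))
    return score, exons[::-1]


# Toy gene: flank, exon 1 (M K), intron GT...AG, exon 2 (V L A), flank.
dna = "CC" + "ATGAAA" + "GTAAGTTTTCAG" + "GTTCTGGCT" + "TT"
score, exons = spliced_align(dna, "MKVLA")
print(score, exons)
```

In this toy case the two exons each match the target protein exactly, so the best alignment pays one intron cost to join them rather than accepting mismatches across the intron sequence. The inner donor loop makes the sketch quadratic in DNA length, which is acceptable for illustration but not for genome-scale input.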