Reannotation of protein-coding genes based on an improved graphical representation of DNA sequence.

Affiliation

State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China.

Publication information

J Comput Chem. 2010 Aug;31(11):2126-35. doi: 10.1002/jcc.21500.

DOI:10.1002/jcc.21500

PMID:20175214

Abstract

Over-annotation of protein-coding genes is a common phenomenon in microbial genomes. The genome of Amsacta moorei entomopoxvirus (AmEPV) is a typical case: more than 63% of its annotated ORFs are hypothetical. In this article, we propose an improved graphical representation, the I-TN (improved curve based on trinucleotides) curve, which allows direct inspection of the composition and distribution of codons and of asymmetric gene structure. This representation also provides convenient tools for genome analysis. From it, 18 variables are derived as numerical descriptors that quantitatively represent the specific features of protein-coding genes, and we use them to reannotate the protein-coding genes in several viral genomes. Using parameters trained on the experimentally validated genes, all 30 experimentally validated genes and all 63 putative genes in the AmEPV genome are correctly recognized as protein coding; the accuracy of the method is 100% in both self-test and cross-validation. Twenty-eight annotated hypothetical genes are predicted to be noncoding, so the number of protein-coding genes in AmEPV after reannotation should be 266 rather than the 294 reported in the original annotation. When the method trained on AmEPV is applied directly to other entomopoxvirus genomes, such as Melanoplus sanguinipes entomopoxvirus (MsEPV), all 123 annotated function-known and putative genes are correctly recognized as protein coding, and 17 hypothetical genes are recognized as noncoding. The method could also be extended to other genomes with high accuracy, with or without adapting the training set.
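
The abstract does not define the I-TN curve or its 18 descriptors, so the sketch below is not the authors' method. It only illustrates the general idea of describing an ORF by codon-aware composition: it counts trinucleotides in the reading frame and computes base frequencies at each of the three codon positions. The function names `trinucleotide_counts` and `codon_position_frequencies` are hypothetical, and the assumption that the ORF starts in frame at its first base is ours.

```python
from collections import Counter

BASES = "ACGT"


def trinucleotide_counts(orf: str) -> Counter:
    """Count in-frame trinucleotides (codons), ignoring a trailing partial codon."""
    seq = orf.upper()
    usable = len(seq) - len(seq) % 3
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def codon_position_frequencies(orf: str) -> dict[tuple[int, str], float]:
    """Frequency of each base at codon positions 1, 2 and 3.

    Illustrative descriptors only (12 values); the paper's 18 variables
    are not specified in the abstract and are not reproduced here.
    """
    seq = orf.upper()
    usable = len(seq) - len(seq) % 3
    counts: Counter = Counter()
    for i in range(usable):
        base = seq[i]
        if base in BASES:  # skip ambiguous symbols such as N
            counts[(i % 3 + 1, base)] += 1

    freqs: dict[tuple[int, str], float] = {}
    for pos in (1, 2, 3):
        total = sum(counts[(pos, b)] for b in BASES)
        for b in BASES:
            freqs[(pos, b)] = counts[(pos, b)] / total if total else 0.0
    return freqs


if __name__ == "__main__":
    toy_orf = "ATGAAATAA"  # made-up sequence, not taken from AmEPV
    print(trinucleotide_counts(toy_orf))
    print(codon_position_frequencies(toy_orf))
```

Descriptors of this kind could be fed to any classifier trained on experimentally validated genes; which classifier and parameters the authors used is not stated in the abstract.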
