Multivariate entropy distance method for prokaryotic gene identification.

Author information

Ouyang Zhengqing, Zhu Huaiqiu, Wang Jin, She Zhen-Su

Affiliation

State Key Lab for Turbulence and Complex Systems and Center for Theoretical Biology, Peking University, Beijing 100871, China.

Publication information

J Bioinform Comput Biol. 2004 Jun;2(2):353-73. doi: 10.1142/s0219720004000624.

DOI:10.1142/s0219720004000624

PMID:15297987

Abstract

A new, simple method is presented for efficient and accurate identification of coding sequences in prokaryotic genomes. The method employs a Shannon description of an artificial language for DNA sequences. It consists of translating a DNA sequence into a pseudo-amino acid sequence with 20 fundamental words according to the universal genetic code. With an entropy-density profile (EDP), the method maps a sequence of finite length to a vector and then analyzes its position in the 20-dimensional phase space, which depends on the nature of the sequence. It is found that the ratio of the relative distance to an averaged coding and non-coding EDP over a small number (up to one) of open reading frames (ORFs) can serve as a good coding potential. An iterative algorithm is designed for finding a set of "root" sequences using this coding potential. A multivariate entropy distance (MED) algorithm is then proposed for the identification of prokaryotic genes; it combines the use of a coding potential with an EDP-based sequence similarity analysis. The current version of MED is unsupervised, parameter-free and simple to implement. When tested against the RefSeq database of NCBI, it is shown to detect 95-99% of genes, along with 10-30% additional genes, and to detect 97.5-99.8% of confirmed genes with known functions. It is also shown to find a set of (functionally known) genes that are missed by other well-known gene-finding algorithms. All measurements show that the MED algorithm reaches a performance level similar to that of algorithms such as GeneMark and Glimmer for prokaryotic gene prediction.
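The pipeline described in the abstract (translate a DNA sequence into 20 pseudo-amino-acid words, map it to a 20-dimensional EDP vector, then compare distances to coding and non-coding reference profiles) can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so the following are assumptions made for illustration only: the EDP component is taken as s_i = -(p_i log p_i) / H with H = -Σ p_j log p_j, the distance is Euclidean, and the coding potential is the ratio of the distance to a coding reference over the distance to a non-coding reference. The function names `translate`, `edp`, and `coding_potential`, and the reference vectors passed to them, are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

# Standard (universal) genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 "fundamental words"


def translate(dna):
    """Translate frame 0 of a DNA string; stop and unknown codons are skipped."""
    dna = dna.upper()
    words = []
    for pos in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[pos:pos + 3])
        if aa is not None and aa != "*":
            words.append(aa)
    return "".join(words)


def edp(dna):
    """Assumed EDP: s_i = -(p_i * log p_i) / H, with H the Shannon entropy."""
    protein = translate(dna)
    if not protein:
        raise ValueError("sequence contains no translatable codon")
    counts = Counter(protein)
    total = len(protein)
    # Per-word entropy terms -p_i * log(p_i); absent words contribute 0.
    terms = []
    for aa in AMINO_ACIDS:
        p = counts.get(aa, 0) / total
        terms.append(-p * math.log(p) if p > 0 else 0.0)
    entropy = sum(terms)
    if entropy == 0.0:
        # Only one word type present: the profile is undefined, return zeros.
        return [0.0] * len(AMINO_ACIDS)
    return [t / entropy for t in terms]


def coding_potential(dna, coding_ref, noncoding_ref):
    """Assumed score: distance to coding reference over distance to
    non-coding reference. Values below 1 mean the sequence lies closer
    to the coding reference."""
    vec = edp(dna)
    d_coding = math.dist(vec, coding_ref)
    d_noncoding = math.dist(vec, noncoding_ref)
    return math.inf if d_noncoding == 0.0 else d_coding / d_noncoding
```

In this sketch the two reference vectors would be averages of EDP vectors over sets of known coding and non-coding ORFs; how those sets are chosen (the iterative "root" sequence search mentioned in the abstract) is not reproduced here.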
