School of Science, Hangzhou Dianzi University, Hangzhou 310018, People's Republic of China.
J Biomol Struct Dyn. 2011 Apr;28(5):833-43. doi: 10.1080/07391102.2011.10508611.
Sequence comparison is one of the major tasks in bioinformatics; it can provide evidence of structural and functional conservation, as well as of evolutionary relationships. Several similarity/dissimilarity measures exist for sequence comparison, but challenges remain. This paper presents a binomial model-based measure for analyzing biological sequences. With the help of a random indicator, the occurrence of a word at any position of a sequence can be regarded as a Bernoulli random variable, and the sum of these word occurrences is well known to follow a binomial distribution. Using a recursive formula, we computed the binomial probability of the word count and proposed a measure based on relative entropy. The proposed measure was tested in extensive experiments, including classification of HEV genotypes and phylogenetic analysis, and was further compared with alignment-based and alignment-free measures. The results demonstrate that the proposed measure based on the binomial model is more efficient.
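The pipeline outlined in the abstract (word counts, binomial probabilities via a recursion, and a relative-entropy comparison) can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact formulas, so everything specific here is an assumption made for illustration: the per-position word probability is taken as the product of single-letter frequencies (an i.i.d. base model), the recursion is the standard one for the binomial pmf, P(j+1) = P(j) · (n−j)/(j+1) · p/(1−p), and the distance is a symmetrised relative entropy over normalised profiles. The function names (`word_counts`, `binomial_pmf`, `binomial_profile`, `binomial_distance`) are hypothetical and not taken from the paper.

```python
from collections import Counter
from itertools import product
from math import exp, log


def word_counts(seq, k):
    """Count overlapping k-letter words in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def binomial_pmf(n, p, c):
    """P(X = c) for X ~ Binomial(n, p), built up recursively from P(X = 0).

    The recursion is carried out in log space to limit underflow.
    """
    if c < 0 or c > n:
        return 0.0
    if p <= 0.0:
        return 1.0 if c == 0 else 0.0
    if p >= 1.0:
        return 1.0 if c == n else 0.0
    log_prob = n * log(1.0 - p)            # log P(X = 0)
    log_ratio = log(p) - log(1.0 - p)
    for j in range(c):                     # step from P(j) to P(j + 1)
        log_prob += log(n - j) - log(j + 1) + log_ratio
    return exp(log_prob)


def binomial_profile(seq, k, alphabet="ACGT"):
    """Normalised vector of binomial probabilities of the observed word counts.

    Assumption: a word's per-position probability is the product of the
    single-letter frequencies observed in seq.
    """
    n = len(seq) - k + 1                   # number of word positions
    base_freq = {a: seq.count(a) / len(seq) for a in alphabet}
    counts = word_counts(seq, k)
    probs = []
    for letters in product(alphabet, repeat=k):
        p = 1.0
        for a in letters:
            p *= base_freq[a]
        probs.append(binomial_pmf(n, p, counts["".join(letters)]))
    total = sum(probs)
    return [x / total for x in probs]


def binomial_distance(seq_a, seq_b, k=2, eps=1e-12):
    """Symmetrised relative entropy between two binomial profiles."""
    prof_a = binomial_profile(seq_a, k)
    prof_b = binomial_profile(seq_b, k)
    d = 0.0
    for a, b in zip(prof_a, prof_b):
        a, b = max(a, eps), max(b, eps)    # guard against log(0)
        d += a * log(a / b) + b * log(b / a)
    return d
```

Calling `binomial_distance(s, t, k=2)` on two DNA strings returns a non-negative score that is zero when the two profiles coincide. A matrix of such pairwise scores could then be passed to a standard clustering or tree-building method, which is one plausible route to the genotype classification and phylogenetic analysis mentioned in the abstract.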