The Chinese University of Hong Kong, Hong Kong.
IEEE/ACM Trans Comput Biol Bioinform. 2012 Jan-Feb;9(1):79-87. doi: 10.1109/TCBB.2011.45. Epub 2011 Mar 3.
The composition vector (CV) method is an alignment-free method for sequence comparison. Because it is simpler than multiple sequence alignment methods, it has been widely discussed recently, and several formulas based on probabilistic models, such as Hao’s and Yu’s formulas, have been proposed. In this paper, we improve these formulas by using the entropy principle, which can quantify the nonrandom occurrence of patterns in the sequences. More precisely, each existing formula is used to generate a set of possible formulas, from which we choose the one that maximizes the entropy. We give the closed-form solution to the resulting optimization problem. Hence, from any given CV formula, we can find the corresponding formula that maximizes the entropy. In particular, we show that Hao’s formula already maximizes the entropy, and we derive a new entropy-maximizing formula from Yu’s formula. We illustrate the accuracy of our new formula using both simulated and experimental data sets. For the simulated data sets, our new formula gives the best consensus and significance values for three different kinds of evolution models. For the data set of tetrapod 18S rRNA sequences, our new formula correctly groups the bird and reptile clades together, whereas Hao’s and Yu’s formulas fail to do so. Using real data sets of different sizes, we show that our formula is more accurate than Hao’s and Yu’s formulas even for small data sets.
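To make the CV idea concrete, the following is a minimal Python sketch of a composition vector built with Hao’s formula as it is commonly stated in the CV literature: the expected frequency of a k-mer is estimated from its (k-1)-mer and (k-2)-mer frequencies, and each vector component is the relative deviation of the observed frequency from that estimate. The abstract does not give any formula explicitly, so the formula below, the cosine-based distance, and the names `kmer_freqs`, `composition_vector`, and `cv_distance` are assumptions made for illustration. The sketch does not implement the entropy-maximizing formula derived in the paper.

```python
from collections import Counter
from math import sqrt


def kmer_freqs(seq, k):
    """Relative frequencies of all overlapping k-mers in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {w: c / total for w, c in counts.items()}


def composition_vector(seq, k):
    """Composition vector of seq under Hao's formula (assumed form).

    Expected frequency: p0(a1..ak) = p(a1..a_{k-1}) * p(a2..ak) / p(a2..a_{k-1})
    Component:          a(w) = (p(w) - p0(w)) / p0(w)
    Only k-mers that occur in seq are stored; absent k-mers are skipped
    to keep the sketch short.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    p_k = kmer_freqs(seq, k)
    p_k1 = kmer_freqs(seq, k - 1)
    p_k2 = kmer_freqs(seq, k - 2)
    cv = {}
    for w, p in p_k.items():
        # Every prefix, suffix, and middle of an observed k-mer is also observed,
        # so these lookups cannot fail and p0 is strictly positive.
        p0 = p_k1[w[:-1]] * p_k1[w[1:]] / p_k2[w[1:-1]]
        cv[w] = (p - p0) / p0
    return cv


def cv_distance(cv_a, cv_b):
    """Distance (1 - cosine similarity) / 2 between two sparse vectors.

    Assumes both vectors have at least one nonzero component.
    """
    dot = sum(v * cv_b.get(w, 0.0) for w, v in cv_a.items())
    norm_a = sqrt(sum(v * v for v in cv_a.values()))
    norm_b = sqrt(sum(v * v for v in cv_b.values()))
    return (1.0 - dot / (norm_a * norm_b)) / 2.0


if __name__ == "__main__":
    s1 = "ACGTACGGTCAGTTACG"
    s2 = "ACGTTCGGTCAGATACG"
    print(cv_distance(composition_vector(s1, 3), composition_vector(s2, 3)))
```

A pairwise distance matrix computed this way could then be passed to a standard tree-building method such as neighbor joining. Replacing the expected-frequency estimate `p0` is the point at which a different CV formula, for example Yu’s or the entropy-maximizing one, would enter.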