Aita Takuyo, Husimi Yuzuru, Nishigaki Koichi
Graduate School of Science and Engineering, Saitama University, 255 Shimo-okubo, Saitama 338-8570, Japan.
Biosystems. 2011 Nov;106(2-3):67-75. doi: 10.1016/j.biosystems.2011.06.009. Epub 2011 Jul 1.
To measure the similarity or dissimilarity between two given biological sequences, several papers have proposed metrics based on the "word-composition vector". The essence of these metrics is as follows. First, the frequencies of all K-tuple words are counted throughout each of the two given sequences, which transforms each sequence into its word-composition vector. Next, a distance metric, for example the angle between the two vectors, is calculated. A significant issue is how to determine the optimal word size K. Using a mathematical model of the mutational events that occur in sequences (substitutions, insertions, deletions and duplications), we analyzed how the angle between the composition vectors depends on these events. We also examined the optimal word size (i.e., the resolution) using our own approach. Our results were verified by computational experiments on artificially generated sequences, amino acid sequences of hemoglobin and nucleotide sequences of 16S ribosomal RNA.
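The counting-and-angle procedure described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `composition_vector` and `angle_between` are hypothetical, words are counted with overlapping windows, and raw counts are used without any background correction. Because the angle is unchanged by positive rescaling of either vector, raw counts and relative frequencies give the same result.

```python
import math
from collections import Counter


def composition_vector(seq, k):
    """Count every overlapping K-tuple word in seq (assumption: overlapping windows)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def angle_between(seq_a, seq_b, k):
    """Angle in radians between the K-tuple composition vectors of two sequences."""
    va = composition_vector(seq_a, k)
    vb = composition_vector(seq_b, k)
    # Words absent from one sequence contribute zero to the dot product.
    dot = sum(count * vb[word] for word, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("each sequence must contain at least one K-tuple")
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)
```

Calling `angle_between(seq_a, seq_b, k)` for a range of `k` values on the same pair of sequences gives a simple way to see how the angle responds to the choice of word size, which is the question the abstract raises.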