Local decoding of sequences and alignment-free comparison.

Author information

Didier Gilles, Laprevotte Ivan, Pupin Maude, Hénaut Alain

Affiliation

Institut de Mathématiques de Luminy, UMR 6206, Campus de Luminy, Case 907, 13288 Marseille, France.

Publication information

J Comput Biol. 2006 Oct;13(8):1465-76. doi: 10.1089/cmb.2006.13.1465.

DOI:10.1089/cmb.2006.13.1465

PMID:17061922

Abstract

Subword composition plays an important role in many sequence analyses. Here we define and study the "local decoding of order N of sequences," an alternative that avoids some drawbacks of "subwords of length N" approaches while keeping information about environments of length N in the sequences ("decoding" is taken here in the sense of hidden Markov modeling, i.e., associating a state with every position of the sequence). We present an algorithm for computing the local decoding of order N of a given set of sequences. Its complexity is linear in the total length of the set, in both time and memory, whatever the order N. To illustrate a use of local decoding, we propose a very basic dissimilarity measure between sequences that can be computed both from the local decoding of order N and from the composition in subwords of length N. The accuracy of these two dissimilarities is evaluated over several datasets by computing their linear correlation with a reference alignment-based distance. These accuracies are also compared to that obtained from another recent alignment-free comparison method.
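To make the subword side of the comparison concrete, here is a minimal Python sketch. The abstract does not give the formula of the dissimilarity, so the Euclidean distance between relative subword frequencies used below is an assumption, not the authors' measure. The function names (`subword_frequencies`, `composition_dissimilarity`, `pearson`) are hypothetical, and the local decoding algorithm itself is not reproduced. The `pearson` helper only illustrates the kind of linear correlation used to score a dissimilarity against a reference distance.

```python
from collections import Counter
from math import sqrt


def subword_frequencies(seq: str, n: int) -> dict[str, float]:
    """Relative frequency of every subword of length n occurring in seq."""
    total = len(seq) - n + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + n] for i in range(total))
    return {word: count / total for word, count in counts.items()}


def composition_dissimilarity(a: str, b: str, n: int) -> float:
    """Euclidean distance between the subword frequency vectors of a and b.

    Assumption: the source abstract does not specify its measure; this is
    one simple composition-based choice used purely for illustration.
    """
    fa = subword_frequencies(a, n)
    fb = subword_frequencies(b, n)
    words = fa.keys() | fb.keys()
    return sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson linear correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / sqrt(vx * vy)
```

In use, one would compute `composition_dissimilarity` for every pair of sequences in a dataset, collect the alignment-based distances for the same pairs from an external tool, and pass the two lists to `pearson`.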
