Georg-August-Universität, Institut für Mikrobiologie und Genetik, Göttingen, Germany.
BMC Bioinformatics. 2010 Jul 30;11:406. doi: 10.1186/1471-2105-11-406.
While multiple alignment is the first step of usual classification schemes for biological sequences, alignment-free methods are being increasingly used as alternatives when multiple alignments fail. Subword-based combinatorial methods are popular for their low algorithmic complexity (suffix trees ...) or exhaustivity (motif search), in general with fixed length word and/or number of mismatches. We developed previously a method to detect local similarities (the N-local decoding) based on the occurrences of repeated subwords of fixed length, which does not impose a fixed number of mismatches. The resulting similarities are, for some "good" values of N, sufficiently relevant to form the basis of a reliable alignment-free classification. The aim of this paper is to develop a method that uses the similarities detected by N-local decoding while not imposing a fixed value of N. We present a procedure that selects for every position in the sequences an adaptive value of N, and we implement it as the MS4 classification tool.
Among the equivalence classes produced by the N-local decodings for all N, we select a (relatively) small number of "relevant" classes corresponding to variable length subwords that carry enough information to perform the classification. The parameter N, for which correct values are data-dependent and thus hard to guess, is here replaced by the average repetitivity kappa of the sequences. We show that our approach yields classifications of several sets of HIV/SIV sequences that agree with the accepted taxonomy, even on usually discarded repetitive regions (like the non-coding part of LTR).
The method MS4 satisfactorily classifies a set of sequences that are notoriously hard to align. This suggests that our approach forms the basis of a reliable alignment-free classification tool. The only parameter kappa of MS4 seems to give reasonable results even for its default value, which can be a great advantage for sequence sets for which little information is available.
虽然多序列比对是生物序列常用分类方案的第一步,但当多序列比对失败时,人们越来越倾向于使用无比对方法作为替代方法。基于子词的组合方法因其算法复杂度低(后缀树等)或详尽性(基序搜索)而受到广泛关注,通常使用固定长度的单词和/或错配数量。我们之前开发了一种基于固定长度重复子词出现次数来检测局部相似性的方法(N-局部解码),该方法不强制设定固定数量的错配。对于某些“良好”的 N 值,由此产生的相似性足以构成可靠的无比对分类的基础。本文的目的是开发一种使用 N-局部解码检测到的相似性而不强制设定 N 值的方法。我们提出了一种为序列中的每个位置选择自适应 N 值的程序,并将其实现为 MS4 分类工具。
在针对所有 N 的 N-局部解码产生的等价类中,我们选择了相对较少的“相关”类,这些类对应于具有足够信息进行分类的可变长度子词。参数 N 是数据相关的,因此难以猜测,在此用序列的平均重复率 kappa 代替。我们表明,我们的方法对几类 HIV/SIV 序列进行了分类,这些分类与公认的分类法一致,甚至在通常被丢弃的重复区域(如 LTR 的非编码部分)也是如此。
MS4 方法令人满意地对一组难以对齐的序列进行了分类。这表明我们的方法为可靠的无比对分类工具奠定了基础。MS4 的唯一参数 kappa 甚至在其默认值下似乎也能得到合理的结果,这对于可用信息很少的序列集来说是一个巨大的优势。