Department of Software Engineering, College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350108, China.
Department of Computer Science & Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA.
Bioinformatics. 2018 May 15;34(10):1682-1689. doi: 10.1093/bioinformatics/btx809.
Alignment-free sequence comparison methods can compute the pairwise similarity between a huge number of sequences much faster than sequence-alignment-based methods.
We propose a new non-parametric alignment-free sequence comparison method, called K2, based on the Kendall statistics. Compared with other state-of-the-art alignment-free comparison methods, K2 demonstrates competitive performance in generating phylogenetic trees, in evaluating functionally related regulatory sequences, and in computing the edit distance (similarity/dissimilarity) between sequences. Furthermore, the K2 approach is much faster than the other methods. An improved method, K2*, is also proposed, which determines the appropriate algorithmic parameter (length) automatically, without first trying different values. Comparative analysis with state-of-the-art alignment-free sequence similarity methods demonstrates the superiority of the proposed approaches, especially with increasing sequence length or increasing dataset size.
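The abstract does not give the K2 formula, so the following is only a minimal sketch of the general idea of applying a Kendall rank statistic to alignment-free word counts. It assumes that sequences are summarized by overlapping k-mer counts and compared with Kendall's tau-b; the function names (`kmer_counts`, `kendall_tau_b`, `kendall_dissimilarity`), the DNA alphabet, and the mapping from tau to a dissimilarity are all hypothetical choices made for illustration, not the authors' definition of K2 or K2*.

```python
from itertools import combinations, product
from math import sqrt


def kmer_counts(seq, k, alphabet="ACGT"):
    """Count overlapping k-mers, reported in a fixed lexicographic order."""
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words containing symbols outside the alphabet
            counts[word] += 1
    return list(counts.values())


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length vectors (simple O(n^2) version)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both vectors: excluded from both factors
        elif dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0


def kendall_dissimilarity(seq_a, seq_b, k):
    """Assumed mapping of tau in [-1, 1] to a dissimilarity in [0, 1]."""
    tau = kendall_tau_b(kmer_counts(seq_a, k), kmer_counts(seq_b, k))
    return (1.0 - tau) / 2.0
```

Because a rank statistic of this kind depends only on the ordering of the word counts, it needs no distributional model of the sequences, which is one way to read "non-parametric" here. The word length `k` in this sketch plays the role of a length parameter that would have to be chosen by hand.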
The K2 and K2* approaches are implemented in the R language as a package, which is freely available (http://community.wvu.edu/daadjeroh/projects/K2/K2_1.0.tar.gz).
Supplementary data are available at Bioinformatics online.