Leimeister Chris-André, Sohrabi-Jahromi Salma, Morgenstern Burkhard
Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077?Göttingen, Germany.
Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077 Göttingen, Germany.
Bioinformatics. 2017 Apr 1;33(7):971-979. doi: 10.1093/bioinformatics/btw776.
Word-based or 'alignment-free' algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods.
We propose Filtered Spaced Word Matches (FSWM) , a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don't-care positions, FSWM rapidly identifies spaced word-matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don't-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don't-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes.
The program source code for FSWM including a documentation, as well as the software that we used to generate artificial genome sequences are freely available at http://fswm.gobics.de/.
chris.leimeister@stud.uni-goettingen.de.
Supplementary data are available at Bioinformatics online.
基于单词或“无比对”的算法越来越多地用于系统发育重建和基因组比较,因为它们比基于全序列比对的传统方法快得多。然而,现有的无比对程序不如基于比对的方法准确。
我们提出了过滤间隔单词匹配(FSWM),这是一种快速的无比对方法,用于估计大型基因组序列之间的系统发育距离。对于预定义的匹配和无关位置的二元模式,FSWM能快速识别输入序列之间的间隔单词匹配,即在匹配位置具有匹配核苷酸且在无关位置允许错配的无间隙局部比对。然后,我们通过考虑在已识别的间隔单词匹配的无关位置对齐的核苷酸来估计每个位点的核苷酸替换数。为了减少虚假随机匹配产生的噪声,我们使用一种过滤程序,即丢弃所有比对片段之间总体相似度低于阈值的间隔单词匹配。我们表明,即使对于现有无比对方法无法分析的远缘相关序列,我们的方法也能准确估计替换频率;用FSWM距离构建的系统发育树质量很高。在一对大小均为几百兆字节的真核基因组上运行该程序只需几分钟。
FSWM的程序源代码(包括文档)以及我们用于生成人工基因组序列的软件可在http://fswm.gobics.de/上免费获取。
chris.leimeister@stud.uni-goettingen.de。
补充数据可在《生物信息学》在线版获取。