The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances.

Affiliations

University of Göttingen, Department of Bioinformatics, Göttingen, Germany.

IEETA, University of Aveiro, Aveiro, Portugal.

Publication information

PLoS One. 2020 Feb 10;15(2):e0228070. doi: 10.1371/journal.pone.0228070. eCollection 2020.

Abstract

We study the number N_k of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences (i.e. the number of substitutions per site that have occurred since they evolved from their last common ancestor) can be estimated from the slope of a function F that depends on N_k and that is affine-linear within a certain range of k. Integers k_min and k_max can be calculated from the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(k_min) and F(k_max). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances of up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results even if sequences share only local homologies.
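
The slope-based estimate described in the abstract can be illustrated with a minimal Python sketch for contiguous words. The abstract does not state the form of F, so the sketch makes an assumption: F(k) is the natural logarithm of the k-mer match count after subtracting the expected number of random background matches, with a background match probability q = 1/4 per position. Under that assumption the slope of F approximates ln(p), where p is the per-site match probability, and p is converted to a distance with the standard Jukes-Cantor formula. All function names are hypothetical and are not taken from the Slope-SpaM code. The values of kmin and kmax are left to the caller, because the abstract does not give the rule for deriving them from the sequence lengths.

```python
import math
from collections import Counter


def count_kmer_matches(seq1, seq2, k):
    """Count position pairs (i, j) whose length-k words in seq1 and seq2 are identical."""
    counts1 = Counter(seq1[i:i + k] for i in range(len(seq1) - k + 1))
    counts2 = Counter(seq2[j:j + k] for j in range(len(seq2) - k + 1))
    # Each word occurring n1 times in seq1 and n2 times in seq2 contributes n1 * n2 pairs.
    return sum(n * counts2[word] for word, n in counts1.items())


def f_value(n_k, len1, len2, k, q=0.25):
    """ASSUMED form of F: log of the match count minus expected random matches."""
    background = (len1 - k + 1) * (len2 - k + 1) * q ** k
    if n_k <= background:
        raise ValueError("match count does not exceed the expected background")
    return math.log(n_k - background)


def slope_estimate(f_kmin, f_kmax, kmin, kmax):
    """Slope of F from its two end points F(kmin) and F(kmax)."""
    return (f_kmax - f_kmin) / (kmax - kmin)


def jukes_cantor_distance(match_prob):
    """Standard Jukes-Cantor correction: substitutions per site from match probability."""
    arg = 1.0 - (4.0 / 3.0) * (1.0 - match_prob)
    if arg <= 0.0:
        raise ValueError("match probability too low for the Jukes-Cantor model")
    return -0.75 * math.log(arg)


def estimate_distance(seq1, seq2, kmin, kmax):
    """Estimate the distance from the slope of F between kmin and kmax (caller-chosen)."""
    f_min = f_value(count_kmer_matches(seq1, seq2, kmin), len(seq1), len(seq2), kmin)
    f_max = f_value(count_kmer_matches(seq1, seq2, kmax), len(seq1), len(seq2), kmax)
    match_prob = math.exp(slope_estimate(f_min, f_max, kmin, kmax))
    return jukes_cantor_distance(match_prob)
```

Because only the slope of F is used, factors that do not depend on k (such as the length of the homologous region) cancel out in this sketch, which is in line with the abstract's remark that the approach tolerates sequences sharing only local homologies.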

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/16b8/7010260/13d1e913e166/pone.0228070.g001.jpg
