Frith Martin C, Noé Laurent, Kucherov Gregory
Artificial Intelligence Research Center, AIST, Tokyo, Japan.
Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan.
Bioinformatics. 2021 Apr 1;36(22-23):5344-5350. doi: 10.1093/bioinformatics/btaa1054.
Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via 'seeds': simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence.
Here, we study a simple sparse-seeding method: using seeds at positions of certain 'words' (e.g. ac, at, gc or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed 'minimizer' sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.
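As an illustration of the idea above, the following minimal Python sketch picks seed positions wherever a word from a chosen set occurs, and counts how often words in the set can overlap one another. It is not the noverlap software: the function names, the assumption that all words have equal length, and the particular way of counting overlaps (a proper suffix of one word equal to a prefix of another) are choices made here for illustration only.

```python
def word_positions(seq, words):
    """Return the start positions in seq where any word from the set occurs.

    Assumes (for this sketch) that all words have the same length.
    """
    k = len(next(iter(words)))
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in words]


def count_overlaps(words):
    """Count triples (u, v, shift) where word v can start inside word u.

    That is, the suffix of u beginning at offset `shift` equals the prefix
    of v of the same length. A set with a count of zero can never have two
    word occurrences that overlap each other in a sequence.
    """
    total = 0
    for u in words:
        for v in words:
            for shift in range(1, len(u)):
                if u[shift:] == v[:len(u) - shift]:
                    total += 1
    return total


# The example word set from the abstract: no word can overlap another.
sparse_words = {"ac", "at", "gc", "gt"}
# A contrasting set of the same size (hypothetical, for comparison only).
clumpy_words = {"aa", "ac", "ag", "at"}

seed_positions = word_positions("gattacagtc", sparse_words)
sparse_overlaps = count_overlaps(sparse_words)
clumpy_overlaps = count_overlaps(clumpy_words)
```

Both word sets contain 4 of the 16 possible dinucleotides, so in a uniformly random sequence each selects about one position in four. The first set has no overlaps, so its occurrences can never be adjacent, whereas in the second set any word can immediately follow "aa", which allows selected positions to clump.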
Software to design and test minimally overlapping words is freely available at https://gitlab.com/mcfrith/noverlap.
Supplementary data are available at Bioinformatics online.