Minimally overlapping words for sequence similarity search.

Author information

Frith Martin C, Noé Laurent, Kucherov Gregory

Affiliations

Artificial Intelligence Research Center, AIST, Tokyo, Japan.

Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan.

Publication information

Bioinformatics. 2021 Apr 1;36(22-23):5344-5350. doi: 10.1093/bioinformatics/btaa1054.

MOTIVATION

Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via 'seeds': simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence.

RESULTS

Here, we study a simple sparse-seeding method: using seeds at positions of certain 'words' (e.g. ac, at, gc or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed 'minimizer' sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.
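The word-based seeding idea can be illustrated with a minimal Python sketch. It is not the authors' noverlap software. The function names `word_positions` and `count_overlaps` are hypothetical, and the sketch assumes that two words "overlap" when a proper suffix of one equals a prefix of the other, which the abstract does not define explicitly.

```python
from itertools import product


def word_positions(seq, words):
    """Return start positions in seq where any word from `words` occurs.

    These are the positions at which a sparse-seeding scheme of this kind
    would consider seeds (illustrative helper, not from the paper).
    """
    return [i for i in range(len(seq))
            if any(seq.startswith(w, i) for w in words)]


def count_overlaps(words):
    """Count ordered pairs (u, v) and lengths k such that the last k letters
    of u equal the first k letters of v, with 1 <= k < min(len(u), len(v)).

    Assumed definition of overlap; u and v may be the same word.
    """
    total = 0
    for u, v in product(words, repeat=2):
        for k in range(1, min(len(u), len(v))):
            if u[-k:] == v[:k]:
                total += 1
    return total


example_words = ["ac", "at", "gc", "gt"]   # word set named in the abstract
clumpy_words = ["aa", "ac", "ag", "at"]    # same size, chosen here for contrast

positions = word_positions("acgtgcat", example_words)
overlaps_example = count_overlaps(example_words)
overlaps_clumpy = count_overlaps(clumpy_words)
```

Under this assumed definition, the set ac, at, gc, gt has no overlaps at all: every word ends in c or t and none starts with those letters, so two selected positions can never be adjacent. The contrast set covers the same fraction of dinucleotides but allows runs such as aaa to be selected at consecutive positions, which is the kind of clumping the abstract says minimally overlapping words avoid.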

AVAILABILITY AND IMPLEMENTATION

Software to design and test minimally overlapping words is freely available at https://gitlab.com/mcfrith/noverlap.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
