Entropy predicts sensitivity of pseudorandom seeds.

Affiliation

Department of Mathematics, Stockholm University, 106 91 Stockholm, Sweden.

Publication information

Genome Res. 2023 Jul;33(7):1162-1174. doi: 10.1101/gr.277645.123. Epub 2023 May 22.

DOI:10.1101/gr.277645.123
PMID:37217253
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10538493/

Abstract

Seed design is important for sequence similarity search applications such as read mapping and average nucleotide identity (ANI) estimation. Although k-mers and spaced k-mers are likely the most well-known and used seeds, sensitivity suffers at high error rates, particularly when indels are present. Recently, we developed a pseudorandom seeding construct, strobemers, which was empirically shown to have high sensitivity also at high indel rates. However, the study lacked a deeper understanding of why. In this study, we propose a model to estimate the entropy of a seed and find that seeds with high entropy, according to our model, in most cases have high match sensitivity. Our discovered seed randomness-sensitivity relationship explains why some seeds perform better than others, and the relationship provides a framework for designing even more sensitive seeds. We also present three new strobemer seed constructs: mixedstrobes, altstrobes, and multistrobes. We use both simulated and biological data to show that our new seed constructs improve sequence-matching sensitivity to other strobemers. We show that the three new seed constructs are useful for read mapping and ANI estimation. For read mapping, we implement strobemers into minimap2 and observe 30% faster alignment time and 0.2% higher accuracy than using k-mers when mapping reads at high error rates. As for ANI estimation, we find that higher entropy seeds have a higher rank correlation between estimated and true ANI.
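
To make the contrast between contiguous seeds and pseudorandom seeds concrete, the following is a minimal Python sketch, not the authors' implementation. It extracts ordinary k-mers and a simplified order-2 randstrobe-style seed, in which a second substring is picked pseudorandomly from a downstream window. The function names, the CRC32-based hash, and the rule of minimizing the hash of the two concatenated strobes are all assumptions made for illustration. The `offset_entropy` helper is only a crude proxy for "seed randomness" (the Shannon entropy of the observed strobe offsets) and is not the entropy model proposed in the paper.

```python
import math
import zlib
from collections import Counter


def kmers(seq, k):
    """Return all contiguous substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def _hash(s):
    # Deterministic stand-in hash (assumption: any well-mixing hash would do).
    return zlib.crc32(s.encode("ascii"))


def randstrobes(seq, k, w_min, w_max):
    """Order-2 randstrobe-style seeds as (start, offset, seed) tuples.

    The first strobe is seq[i:i+k]. The second strobe starts `offset`
    positions later, with offset in [w_min, w_max], chosen to minimize a
    hash of the two strobes combined (illustrative selection rule).
    """
    seeds = []
    last_start = len(seq) - k - w_max  # keep the whole window inside seq
    for i in range(last_start + 1):
        first = seq[i:i + k]
        offset = min(
            range(w_min, w_max + 1),
            key=lambda d: _hash(first + seq[i + d:i + d + k]),
        )
        seeds.append((i, offset, first + seq[i + offset:i + offset + k]))
    return seeds


def offset_entropy(offsets):
    """Shannon entropy (bits) of the empirical offset distribution.

    Illustrative proxy only; not the model from the paper.
    """
    counts = Counter(offsets)
    n = len(offsets)
    return sum(c / n * math.log2(n / c) for c in counts.values())


if __name__ == "__main__":
    seq = "ACGTTGCAAGCTTAGGCTAACGTTAGCATCGA"
    strobes = randstrobes(seq, k=3, w_min=4, w_max=8)
    print(len(kmers(seq, 6)), "k-mers;", len(strobes), "randstrobe-style seeds")
    print("offset entropy (bits):", offset_entropy([d for _, d, _ in strobes]))
```

Under this proxy, a contiguous k-mer of length 2k corresponds to a fixed offset of k between its two halves, so its offset entropy is zero, whereas the window-based selection spreads offsets over several values. This is only meant to convey the intuition behind a randomness-sensitivity relationship; the quantitative claims in the abstract rest on the authors' own model and experiments.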

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/af0d/10538493/73001e080a30/1162f08.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/af0d/10538493/d4fd019e28dc/1162f01.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/af0d/10538493/c7740f424c37/1162f02.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/af0d/10538493/fe249de176d1/1162f03.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/af0d/10538493/d12deca7b7a1/1162f04.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/af0d/10538493/3b823b0b735e/1162f05.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/af0d/10538493/0abb3ad25f2d/1162f06.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/af0d/10538493/2fee07e6d6e8/1162f07.jpg

Similar articles

1. Entropy predicts sensitivity of pseudorandom seeds. Genome Res. 2023 Jul;33(7):1162-1174. doi: 10.1101/gr.277645.123. Epub 2023 May 22.
2. Effective sequence similarity detection with strobemers. Genome Res. 2021 Nov;31(11):2080-2094. doi: 10.1101/gr.275648.121. Epub 2021 Oct 19.
3. Strobealign: flexible seed size enables ultra-fast and accurate read alignment. Genome Biol. 2022 Dec 15;23(1):260. doi: 10.1186/s13059-022-02831-7.
4. Designing efficient randstrobes for sequence similarity analyses. Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae187.
5. Indel seeds for homology search. Bioinformatics. 2006 Jul 15;22(14):e341-9. doi: 10.1093/bioinformatics/btl263.
6. Effects of spaced k-mers on alignment-free genotyping. Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i213-i221. doi: 10.1093/bioinformatics/btad202.
7. PerFSeeB: designing long high-weight single spaced seeds for full sensitivity alignment with a given number of mismatches. BMC Bioinformatics. 2023 Oct 24;24(1):396. doi: 10.1186/s12859-023-05517-4.
8. Efficient Seeding for Error-Prone Sequences with SubseqHash2. bioRxiv. 2024 Jun 3:2024.05.30.596711. doi: 10.1101/2024.05.30.596711.
9. SAKE: Strobemer-assisted k-mer extraction. PLoS One. 2023 Nov 29;18(11):e0294415. doi: 10.1371/journal.pone.0294415. eCollection 2023.
10. Designing multiple simultaneous seeds for DNA similarity search. J Comput Biol. 2005 Jul-Aug;12(6):847-61. doi: 10.1089/cmb.2005.12.847.

Cited by

1. Efficient seeding for error-prone sequences with SubseqHash2. Bioinformatics. 2025 Aug 2;41(8). doi: 10.1093/bioinformatics/btaf418.
2. Designing efficient randstrobes for sequence similarity analyses. Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae187.
3. LexicHash: sequence similarity estimation via lexicographic comparison of hashes.
