Exact Sketch-Based Read Mapping

Authors

Tizian Schulz, Paul Medvedev

Affiliations

Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Germany.

Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University, Germany.

Publication information

Leibniz Int Proc Inform. 2023 Sep;273. doi: 10.4230/LIPIcs.WABI.2023.14. Epub 2023 Aug 29.

DOI: 10.4230/LIPIcs.WABI.2023.14
PMID: 38831964
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11146199/
Abstract

Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a "similar sequence". Traditionally, "similar sequence" was defined as having a high alignment score, and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has been no problem formulation that captures what an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in $O(|t| + |p| + \ell^2)$ time and $\Theta(\ell^2)$ space, where $|t|$ is the number of $k$-mers inside the sketch of the reference, $|p|$ is the number of $k$-mers inside the read's sketch, and $\ell$ is the number of times that $k$-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm's performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. At a level of precision equivalent to that of minimap2, the recall of our algorithm is 0.88, compared to only 0.76 for minimap2.
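
To make the sketch-level formulation concrete, the following is a minimal brute-force illustration, not the dynamic programming algorithm from the paper. It assumes a lexicographic minimizer sketch and a multiset Jaccard similarity; both are stand-ins chosen for illustration, and the paper's own sketching scheme and score may differ. The function names (`minimizer_sketch`, `multiset_jaccard`, `all_hits`) and the toy data are hypothetical.

```python
from collections import Counter


def minimizer_sketch(seq, k, w):
    """Ordered list of (position, k-mer): the lexicographically smallest
    k-mer (leftmost on ties) of every window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch = []
    for start in range(len(kmers) - w + 1):
        # Leftmost minimum inside the current window.
        pos = min(range(start, start + w), key=lambda i: kmers[i])
        # Consecutive windows often pick the same k-mer; record it once.
        if not sketch or sketch[-1][0] != pos:
            sketch.append((pos, kmers[pos]))
    return sketch


def multiset_jaccard(a, b):
    """Assumed similarity: |A ∩ B| / |A ∪ B| on k-mer multisets."""
    ca, cb = Counter(a), Counter(b)
    union = sum((ca | cb).values())
    return sum((ca & cb).values()) / union if union else 0.0


def all_hits(text_sketch, pattern_sketch, threshold):
    """Brute force: score every contiguous stretch of the text sketch
    against the pattern sketch and report all stretches whose score
    reaches the threshold as (start position, end position, score)."""
    pattern = [kmer for _, kmer in pattern_sketch]
    hits = []
    for i in range(len(text_sketch)):
        for j in range(i, len(text_sketch)):
            window = [kmer for _, kmer in text_sketch[i:j + 1]]
            score = multiset_jaccard(window, pattern)
            if score >= threshold:
                hits.append((text_sketch[i][0], text_sketch[j][0], score))
    return hits


# Toy sketches (made-up k-mers and positions): the pattern's k-mers occur
# in two separate regions of the text, as in a duplicated region.
text_sketch = [(0, "AAC"), (5, "CGT"), (9, "GGA"), (20, "TTT"),
               (31, "AAC"), (36, "CGT"), (40, "GGA")]
pattern_sketch = [(0, "AAC"), (4, "CGT"), (8, "GGA")]
hits = all_hits(text_sketch, pattern_sketch, threshold=1.0)  # one hit per copy
```

Reporting every stretch above the threshold, rather than only the best one, is what lets both copies in the toy text be found. The brute-force loop examines every pair of sketch positions in the text, which is why an algorithm whose cost depends on the number of shared $k$-mer occurrences is preferable in practice.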

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be2f/11146199/eecc8daca8b4/nihms-1985085-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be2f/11146199/8ce637fb6930/nihms-1985085-f0003.jpg
