• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

通过最小化器分桶为大量短读段构建编辑距离图。

Construction of edit-distance graphs for large sets of short reads through minimizer-bucketing.

作者信息

Ping Pengyao, Li Jinyan

机构信息

School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW 2007, Australia.

School of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, Guangdong 518000, China.

出版信息

Bioinform Adv. 2025 Apr 10;5(1):vbaf081. doi: 10.1093/bioadv/vbaf081. eCollection 2025.

DOI:10.1093/bioadv/vbaf081
PMID:40303904
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC12040381/
Abstract

MOTIVATION

Pairs of short reads with small edit distances, along with their unique molecular identifier tags, have been exploited to correct sequencing errors in both reads and tags. However, brute-force identification of these pairs is impractical for large datasets containing ten million or more reads due to its quadratic complexity. Minimizer-bucketing and locality-sensitive hashing have been used to partition read sets into buckets of similar reads, allowing edit-distance calculations only within each bucket. However, challenges like minimizing missing pairs, optimizing bucketing parameters, and exploring combination bucketing to improve pair detection remain.

RESULTS

We define an edit-distance graph for a set of short reads, where nodes represent reads, and edges connect reads with small edit distances, and present a heuristic method, reads2graph, for high completeness of edge detection. Reads2graph uses three techniques: minimizer-bucketing, an improved Order-Min-Hash technique to divide large bins, and a novel graph neighbourhood multi-hop traversal within large bins to detect more edges. We then establish optimal bucketing settings to maximize ground truth edge coverage per bin. Extensive testing demonstrates that read2graph can achieve 97%-100% completeness in most cases, outperforming brute-force identification in speed while providing a superior speed-completeness balance compared to using a single bucketing method like Miniception or Order-Min-Hash.

AVAILABILITY AND IMPLEMENTATION

reads2graph is publicly available at https://github.com/JappyPing/reads2graph.

摘要

动机

具有小编辑距离的短读段对,连同其独特的分子标识符标签,已被用于校正读段和标签中的测序错误。然而,由于其二次复杂度,对于包含一千万或更多读段的大型数据集,暴力识别这些读段对是不切实际的。最小化器分桶和局部敏感哈希已被用于将读段集划分为相似读段的桶,仅允许在每个桶内计算编辑距离。然而,诸如最小化缺失对、优化分桶参数以及探索组合分桶以改善对检测等挑战仍然存在。

结果

我们为一组短读段定义了一个编辑距离图,其中节点表示读段,边连接具有小编辑距离的读段,并提出了一种启发式方法reads2graph,以实现高边缘检测完整性。reads2graph使用三种技术:最小化器分桶、一种改进的顺序最小哈希技术来划分大桶,以及在大桶内进行新颖的图邻域多跳遍历以检测更多边。然后,我们建立最佳分桶设置,以最大化每个桶的真实边缘覆盖率。广泛的测试表明,在大多数情况下,reads2graph可以实现97%-100%的完整性,在速度上优于暴力识别,同时与使用Miniception或顺序最小哈希等单一分桶方法相比,提供了更好的速度-完整性平衡。

可用性和实现

reads2graph可在https://github.com/JappyPing/reads2graph上公开获取。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bdb/12040381/2db77f865deb/vbaf081f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bdb/12040381/3965d68c3930/vbaf081f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bdb/12040381/d1b265d23d9d/vbaf081f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bdb/12040381/6a170fddd69d/vbaf081f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bdb/12040381/2db77f865deb/vbaf081f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bdb/12040381/3965d68c3930/vbaf081f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bdb/12040381/d1b265d23d9d/vbaf081f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bdb/12040381/6a170fddd69d/vbaf081f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bdb/12040381/2db77f865deb/vbaf081f4.jpg

相似文献

1
Construction of edit-distance graphs for large sets of short reads through minimizer-bucketing.通过最小化器分桶为大量短读段构建编辑距离图。
Bioinform Adv. 2025 Apr 10;5(1):vbaf081. doi: 10.1093/bioadv/vbaf081. eCollection 2025.
2
Locality-sensitive bucketing functions for the edit distance.用于编辑距离的局部敏感桶函数。
Algorithms Mol Biol. 2023 Jul 24;18(1):7. doi: 10.1186/s13015-023-00234-2.
3
Learning locality-sensitive bucketing functions.学习位置敏感的分桶函数。
Bioinformatics. 2024 Jun 28;40(Suppl 1):i318-i327. doi: 10.1093/bioinformatics/btae228.
4
MBG: Minimizer-based sparse de Bruijn Graph construction.MBG:基于最小化器的稀疏德布鲁因图构建。
Bioinformatics. 2021 Aug 25;37(16):2476-2478. doi: 10.1093/bioinformatics/btab004.
5
Approximate Graph Edit Distance in Quadratic Time.二次时间内的近似图编辑距离。
IEEE/ACM Trans Comput Biol Bioinform. 2020 Mar-Apr;17(2):483-494. doi: 10.1109/TCBB.2015.2478463. Epub 2015 Sep 14.
6
Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer.最小化空间 de Bruijn 图:在个人计算机上数分钟内完成长读段的全基因组组装。
Cell Syst. 2021 Oct 20;12(10):958-968.e6. doi: 10.1016/j.cels.2021.08.009. Epub 2021 Sep 14.
7
Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression.通过 (w, k)-最小化子索引后缀-前缀重叠来生成用于读取压缩的长连续体。
Bioinformatics. 2019 Jun 1;35(12):2066-2074. doi: 10.1093/bioinformatics/bty936.
8
On the Maximal Independent Sets of -mers with the Edit Distance.关于具有编辑距离的 - 聚体的最大独立集
ACM BCB. 2023 Sep;2023. doi: 10.1145/3584371.3612982. Epub 2023 Oct 4.
9
The effect of genome graph expressiveness on the discrepancy between genome graph distance and string set distance.基因组图谱表达能力对基因组图谱距离与字符串集距离差异的影响。
Bioinformatics. 2022 Jun 24;38(Suppl 1):i404-i412. doi: 10.1093/bioinformatics/btac264.
10
Compacting de Bruijn graphs from sequencing data quickly and in low memory.从测序数据中快速且低内存地压缩德布鲁因图。
Bioinformatics. 2016 Jun 15;32(12):i201-i208. doi: 10.1093/bioinformatics/btw279.

本文引用的文献

1
Correcting PCR amplification errors in unique molecular identifiers to generate accurate numbers of sequencing molecules.校正独特分子标识符中的PCR扩增错误以生成准确的测序分子数量。
Nat Methods. 2024 Mar;21(3):401-405. doi: 10.1038/s41592-024-02168-y. Epub 2024 Feb 5.
2
Creating and Using Minimizer Sketches in Computational Genomics.在计算基因组学中创建和使用最小草图。
J Comput Biol. 2023 Dec;30(12):1251-1276. doi: 10.1089/cmb.2023.0094. Epub 2023 Aug 30.
3
Locality-sensitive bucketing functions for the edit distance.用于编辑距离的局部敏感桶函数。
Algorithms Mol Biol. 2023 Jul 24;18(1):7. doi: 10.1186/s13015-023-00234-2.
4
Locality-preserving minimal perfect hashing of k-mers.保局最小完美哈希的 k- -mer。
Bioinformatics. 2023 Jun 30;39(Suppl 1):i534-i543. doi: 10.1093/bioinformatics/btad219.
5
Long-read mapping to repetitive reference sequences using Winnowmap2.使用Winnowmap2将长读段映射到重复参考序列。
Nat Methods. 2022 Jun;19(6):705-710. doi: 10.1038/s41592-022-01457-8. Epub 2022 Apr 1.
6
High-throughput and high-dimensional single-cell analysis of antigen-specific CD8 T cells.高通量和高维的抗原特异性 CD8 T 细胞单细胞分析。
Nat Immunol. 2021 Dec;22(12):1590-1598. doi: 10.1038/s41590-021-01073-2. Epub 2021 Nov 22.
7
lra: A long read aligner for sequences and contigs.lra:一种用于序列和重叠群的长读比对工具。
PLoS Comput Biol. 2021 Jun 21;17(6):e1009078. doi: 10.1371/journal.pcbi.1009078. eCollection 2021 Jun.
8
UMIc: A Preprocessing Method for UMI Deduplication and Reads Correction.UMIc:一种用于UMI去重和读段校正的预处理方法。
Front Genet. 2021 May 28;12:660366. doi: 10.3389/fgene.2021.660366. eCollection 2021.
9
Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers.Minirmd:一种通过多个 minimizers 进行短读段准确快速去重的工具。
Bioinformatics. 2021 Jul 12;37(11):1604-1606. doi: 10.1093/bioinformatics/btaa915.
10
Improved design and analysis of practical minimizers.实用极小化器的改进设计与分析。
Bioinformatics. 2020 Jul 1;36(Suppl_1):i119-i127. doi: 10.1093/bioinformatics/btaa472.