Construction of edit-distance graphs for large sets of short reads through minimizer-bucketing.

Authors

Ping Pengyao, Li Jinyan

Affiliations

School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW 2007, Australia.

School of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, Guangdong 518000, China.

Publication information

Bioinform Adv. 2025 Apr 10;5(1):vbaf081. doi: 10.1093/bioadv/vbaf081. eCollection 2025.

DOI:10.1093/bioadv/vbaf081

PMID:40303904

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12040381/

Abstract

MOTIVATION

Pairs of short reads with small edit distances, along with their unique molecular identifier tags, have been exploited to correct sequencing errors in both reads and tags. However, brute-force identification of these pairs is impractical for large datasets containing ten million or more reads due to its quadratic complexity. Minimizer-bucketing and locality-sensitive hashing have been used to partition read sets into buckets of similar reads, allowing edit-distance calculations only within each bucket. However, challenges like minimizing missing pairs, optimizing bucketing parameters, and exploring combination bucketing to improve pair detection remain.

RESULTS

We define an edit-distance graph for a set of short reads, where nodes represent reads, and edges connect reads with small edit distances, and present a heuristic method, reads2graph, for high completeness of edge detection. Reads2graph uses three techniques: minimizer-bucketing, an improved Order-Min-Hash technique to divide large bins, and a novel graph neighbourhood multi-hop traversal within large bins to detect more edges. We then establish optimal bucketing settings to maximize ground truth edge coverage per bin. Extensive testing demonstrates that reads2graph can achieve 97%-100% completeness in most cases, outperforming brute-force identification in speed while providing a superior speed-completeness balance compared to using a single bucketing method like Miniception or Order-Min-Hash.
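
To make the idea of minimizer-bucketing concrete, the following is a minimal Python sketch, not the reads2graph implementation. It assumes plain lexicographic ordering of k-mers, window minimizers with illustrative parameters `k` and `w`, and a textbook Levenshtein routine; the Order-Min-Hash splitting of large bins and the multi-hop traversal described above are omitted. All function names are hypothetical. The brute-force function serves as ground truth, so the ratio of the two edge counts gives the completeness of the bucketed search.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def window_minimizers(read: str, k: int, w: int) -> set:
    """Lexicographically smallest k-mer in each window of w consecutive k-mers.

    Lexicographic order is an assumption made for illustration only.
    """
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) <= w:
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def brute_force_edges(reads, max_dist):
    """Ground truth: compare every pair of reads (quadratic in the number of reads)."""
    return {(i, j) for i, j in combinations(range(len(reads)), 2)
            if edit_distance(reads[i], reads[j]) <= max_dist}


def bucketed_edges(reads, k, w, max_dist):
    """Compare only reads that share at least one minimizer (same bucket)."""
    buckets = defaultdict(list)
    for idx, read in enumerate(reads):
        for m in window_minimizers(read, k, w):
            buckets[m].append(idx)
    edges = set()
    for members in buckets.values():
        for i, j in combinations(members, 2):  # members are in increasing index order
            if (i, j) not in edges and edit_distance(reads[i], reads[j]) <= max_dist:
                edges.add((i, j))
    return edges


reads = ["ACGTACGTAC", "ACGTACGTAT", "ACGTTCGTAC", "TTTTGGGGCC"]
truth = brute_force_edges(reads, max_dist=2)
found = bucketed_edges(reads, k=4, w=3, max_dist=2)
completeness = len(found) / len(truth) if truth else 1.0
```

Because the bucketed search only ever tests pairs that the brute-force search also tests, its edge set is always a subset of the ground truth; pairs that share no minimizer are the missing edges that a combined bucketing strategy aims to recover.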

AVAILABILITY AND IMPLEMENTATION

reads2graph is publicly available at https://github.com/JappyPing/reads2graph.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bdb/12040381/3965d68c3930/vbaf081f1.jpg
