Ping Pengyao, Li Jinyan
School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW 2007, Australia.
School of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, Guangdong 518000, China.
Bioinform Adv. 2025 Apr 10;5(1):vbaf081. doi: 10.1093/bioadv/vbaf081. eCollection 2025.
Pairs of short reads with small edit distances, along with their unique molecular identifier tags, have been exploited to correct sequencing errors in both reads and tags. However, brute-force identification of these pairs is impractical for large datasets containing ten million or more reads due to its quadratic complexity. Minimizer-bucketing and locality-sensitive hashing have been used to partition read sets into buckets of similar reads, allowing edit-distance calculations only within each bucket. However, challenges like minimizing missing pairs, optimizing bucketing parameters, and exploring combination bucketing to improve pair detection remain.
We define an edit-distance graph for a set of short reads, where nodes represent reads, and edges connect reads with small edit distances, and present a heuristic method, reads2graph, for high completeness of edge detection. Reads2graph uses three techniques: minimizer-bucketing, an improved Order-Min-Hash technique to divide large bins, and a novel graph neighbourhood multi-hop traversal within large bins to detect more edges. We then establish optimal bucketing settings to maximize ground truth edge coverage per bin. Extensive testing demonstrates that read2graph can achieve 97%-100% completeness in most cases, outperforming brute-force identification in speed while providing a superior speed-completeness balance compared to using a single bucketing method like Miniception or Order-Min-Hash.
reads2graph is publicly available at https://github.com/JappyPing/reads2graph.
具有小编辑距离的短读段对,连同其独特的分子标识符标签,已被用于校正读段和标签中的测序错误。然而,由于其二次复杂度,对于包含一千万或更多读段的大型数据集,暴力识别这些读段对是不切实际的。最小化器分桶和局部敏感哈希已被用于将读段集划分为相似读段的桶,仅允许在每个桶内计算编辑距离。然而,诸如最小化缺失对、优化分桶参数以及探索组合分桶以改善对检测等挑战仍然存在。
我们为一组短读段定义了一个编辑距离图,其中节点表示读段,边连接具有小编辑距离的读段,并提出了一种启发式方法reads2graph,以实现高边缘检测完整性。reads2graph使用三种技术:最小化器分桶、一种改进的顺序最小哈希技术来划分大桶,以及在大桶内进行新颖的图邻域多跳遍历以检测更多边。然后,我们建立最佳分桶设置,以最大化每个桶的真实边缘覆盖率。广泛的测试表明,在大多数情况下,reads2graph可以实现97%-100%的完整性,在速度上优于暴力识别,同时与使用Miniception或顺序最小哈希等单一分桶方法相比,提供了更好的速度-完整性平衡。
reads2graph可在https://github.com/JappyPing/reads2graph上公开获取。