College of Information Science and Engineering, Hunan University, Changsha, Hunan 410012, China.
Advanced Analytics Institute, University of Technology Sydney, Broadway, NSW 2007, Australia.
Bioinformatics. 2021 Jul 12;37(11):1604-1606. doi: 10.1093/bioinformatics/btaa915.
Removing duplicate and near-duplicate reads generated by high-throughput sequencing technologies can reduce the computational resources required by downstream applications. Here we develop minirmd, a de novo tool that removes duplicate reads through multiple rounds of clustering, using minimizers of a different length in each round. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on the reverse-complementary strand.
https://github.com/yuansliu/minirmd.
Supplementary data are available at Bioinformatics online.
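
The idea of multi-round minimizer clustering can be illustrated with a minimal Python sketch. This is not the minirmd implementation. It assumes equal-length reads, defines a near-duplicate as a read within a fixed Hamming distance of an already kept read on either strand, and keeps the first read seen in each cluster. All names and parameters here (`remove_near_duplicates`, `ks`, `max_mismatches`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_minimizer(seq, k):
    """Smallest k-mer (lexicographically) over the read and its reverse
    complement, so both strands of a read land in the same bucket."""
    strands = (seq, revcomp(seq))
    return min(s[i:i + k] for s in strands for i in range(len(s) - k + 1))


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def is_near_duplicate(a, b, max_mismatches):
    """True if a matches b, or the reverse complement of b, within the
    allowed number of mismatches (illustrative definition)."""
    if len(a) != len(b):
        return False
    return (hamming(a, b) <= max_mismatches
            or hamming(a, revcomp(b)) <= max_mismatches)


def remove_near_duplicates(reads, ks=(8, 6, 4), max_mismatches=1):
    """One clustering round per minimizer length in ks (hypothetical
    defaults). Reads are only compared within the same minimizer bucket."""
    kept = list(reads)
    for k in ks:
        buckets = defaultdict(list)
        survivors = []
        for read in kept:
            if len(read) < k:
                survivors.append(read)  # too short to have a k-mer
                continue
            bucket = buckets[canonical_minimizer(read, k)]
            if any(is_near_duplicate(read, rep, max_mismatches)
                   for rep in bucket):
                continue  # near-duplicate of a kept read: drop it
            bucket.append(read)
            survivors.append(read)
        kept = survivors
    return kept
```

In this sketch, a mismatch that falls inside the minimizer can send two near-duplicate reads to different buckets. Repeating the clustering with another minimizer length gives such pairs a further chance to meet: `ACGTTGCAAGGC` and `CCGTTGCAAGGC` have different minimizers at k = 8 and k = 6 but share the minimizer `AACG` at k = 4, so the second read is removed only in the third round.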