Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression.

Author information

Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia.

Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan, China.

Publication information

Bioinformatics. 2019 Jun 1;35(12):2066-2074. doi: 10.1093/bioinformatics/bty936.

PMID:30407482

Abstract

MOTIVATION

Advanced high-throughput sequencing technologies have produced massive amounts of read data, and algorithms have been specially designed to reduce the size of these datasets for efficient storage and transmission. Reordering reads according to their positions in de novo assembled contigs or in explicit reference sequences has proven to be one of the most effective read compression approaches. As there is usually no good prior knowledge about the reference sequence, the current focus is on new ways of constructing de novo assembled contigs.

RESULTS

We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and groups those that share the same minimizer into subgroups. Within each subgroup, a contig is constructed. Then some pairs of the contigs derived from the subgroups are merged into longer contigs according to a (w, k)-minimizer-indexed suffix-prefix overlap similarity between two contigs. This merging process is repeated after the longer contigs are formed until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size in 22 of 34 cases, with significant improvement. In the compression of paired-end reads, minicom achieved a 20-80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the reads-order preserving mode. This performance is mainly attributed to exploiting the redundancy of the repetitive substrings in the long contigs.
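The grouping and merging steps described above can be illustrated with a minimal Python sketch. It is not the minicom implementation. The function names are hypothetical, k-mers are ordered lexicographically (real tools often order by hash), reverse complements are ignored, and the overlap check requires an exact match, whereas the abstract refers to an overlap similarity whose definition is not given here.

```python
from collections import defaultdict


def kmer_minimizer(read, k):
    """Smallest k-mer of a read (lexicographic order is an assumption)."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def group_reads_by_minimizer(reads, k):
    """Put reads that share the same minimizer into one subgroup."""
    groups = defaultdict(list)
    for read in reads:
        groups[kmer_minimizer(read, k)].append(read)
    return dict(groups)


def wk_minimizers(seq, w, k):
    """(w, k)-minimizers: the smallest k-mer in every window of w
    consecutive k-mers, returned as sorted (position, k-mer) pairs.
    Ties are broken by the leftmost position."""
    picked = set()
    n_kmers = len(seq) - k + 1
    for start in range(max(n_kmers - w + 1, 0)):
        kmer, pos = min((seq[i:i + k], i) for i in range(start, start + w))
        picked.add((pos, kmer))
    return sorted(picked)


def merge_by_overlap(a, b, w, k, min_overlap):
    """Merge contigs a and b if a suffix of a equals a prefix of b.

    Shared (w, k)-minimizers propose candidate shifts of b relative to a;
    each candidate is then verified by exact string comparison (a
    simplifying assumption). Returns the merged contig, or None.
    """
    index = defaultdict(list)
    for pos_a, kmer in wk_minimizers(a, w, k):
        index[kmer].append(pos_a)
    shifts = set()
    for pos_b, kmer in wk_minimizers(b, w, k):
        for pos_a in index.get(kmer, []):
            shifts.add(pos_a - pos_b)
    # Smaller shifts mean longer overlaps, so try them first.
    for shift in sorted(s for s in shifts if s >= 0):
        overlap = len(a) - shift
        if min_overlap <= overlap <= len(b) and a[shift:] == b[:overlap]:
            return a + b[overlap:]
    return None


groups = group_reads_by_minimizer(["ACGTAC", "CGTACG", "TTTTGG"], k=3)
merged = merge_by_overlap("ACGTACGGTT", "CGGTTAAC", w=2, k=3, min_overlap=4)
```

In this toy run the first two reads share the minimizer `ACG` and land in the same subgroup, and the two contigs are merged through their 5-base overlap `CGGTT`. Indexing only minimizers rather than every k-mer keeps the candidate lookup small, which is the general motivation for minimizer-based indexes.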

AVAILABILITY AND IMPLEMENTATION

https://github.com/yuansliu/minicom.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
