Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia.
School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210003, Jiangsu, China.
Brief Funct Genomics. 2022 Sep 16;21(5):387-398. doi: 10.1093/bfgp/elac016.
Next-generation sequencing (NGS) has produced enormous amounts of short-read sequence data for de novo genome assembly over the last decades. For efficient transmission of these huge datasets, high-performance compression algorithms have been intensively studied. As both de novo assembly and error correction methods exploit the overlaps between reads, a natural concern is whether the sequencing errors that degrade genome assemblies also hamper the compression of NGS data. This work addresses two questions: how current error correction algorithms can enable compression algorithms to make the sequence data much more compact, and whether the reads modified by error-correction algorithms lead to better de novo contig assembly. As a single biomedical project often produces multiple sets of short reads in practice, we propose a graph-based method that reorders the files in such a collection and then compresses them simultaneously, yielding a further compression improvement after error correction. We use examples to illustrate that accurate error correction algorithms can significantly reduce the number of mismatched nucleotides in reference-free compression and hence greatly improve compression performance. Extensive tests on practical collections of multiple short-read sets confirm that the compression performance on the error-corrected data (whose size is unchanged) significantly outperforms that on the original data, and that the file-reordering idea contributes further gains. Error correction of the original reads has also improved the quality of the genome assemblies, sometimes remarkably. However, how to combine an appropriate error correction method with an assembly algorithm so that assembly performance is always significantly improved remains an open question.
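The abstract does not specify the graph-based reordering method. A minimal sketch of the general idea, under the assumptions that pairwise file similarity is estimated from shared k-mers (Jaccard similarity) and that the ordering is built by a greedy nearest-neighbor walk over the complete similarity graph; the function names and the choice of k are illustrative, not the authors' implementation:

```python
import itertools

def kmers(seq, k=4):
    """Return the set of k-mers occurring in a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def file_profile(reads, k=4):
    """Summarize one read file as the union of its k-mer sets."""
    profile = set()
    for read in reads:
        profile |= kmers(read, k)
    return profile

def jaccard(a, b):
    """Jaccard similarity of two k-mer sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def greedy_order(files, k=4):
    """Order read files so adjacent files share many k-mers,
    letting a downstream compressor exploit cross-file redundancy."""
    profiles = [file_profile(f, k) for f in files]
    n = len(files)
    # Edge weights of the complete similarity graph (i < j).
    sim = {(i, j): jaccard(profiles[i], profiles[j])
           for i, j in itertools.combinations(range(n), 2)}
    edge = lambda i, j: sim[(min(i, j), max(i, j))]
    # Start from the file with the largest total similarity,
    # then repeatedly append the nearest unvisited neighbor.
    order = [max(range(n),
                 key=lambda i: sum(edge(i, j) for j in range(n) if j != i))]
    remaining = set(range(n)) - set(order)
    while remaining:
        nxt = max(remaining, key=lambda j: edge(order[-1], j))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Given three toy read sets where the first two overlap heavily and the third is unrelated, the walk places the two similar files next to each other, which is the property the joint compression relies on.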