Department of Computer Science, Xiamen University, Xiamen, 316005, China.
Aginome Scientific, Xiamen, 316005, China.
BMC Bioinformatics. 2020 Jul 20;21(1):321. doi: 10.1186/s12859-020-03658-4.
Recent advancements in high-throughput sequencing technologies have generated an unprecedented amount of genomic data that must be stored, processed, and transmitted over the network for sharing. Lossy genomic data compression, especially of the base quality values of sequencing data, is emerging as an efficient way to handle this challenge due to its superior compression performance compared to lossless compression methods. Many lossy compression algorithms have been developed for and evaluated using DNA sequencing data. However, whether these algorithms can be used on RNA sequencing (RNA-seq) data remains unclear.
In this study, we evaluated the impact of lossy quality value compression on common RNA-seq data analysis pipelines, including expression quantification, transcriptome assembly, and short variant detection, using RNA-seq data from different species and sequencing platforms. Our study shows that lossy quality value compression can effectively improve RNA-seq data compression. In some cases, lossy algorithms achieved a further 1.2- to 3-fold reduction in overall RNA-seq data size compared to existing lossless algorithms. However, lossy quality value compression can affect the results of some RNA-seq data processing pipelines, and hence its impact on RNA-seq studies cannot always be ignored. Pipelines using HISAT2 for alignment were the most significantly affected by lossy quality value compression, whereas no effects were observed on pipelines that do not depend on quality values, e.g., STAR-based expression quantification and transcriptome assembly pipelines. Moreover, regardless of whether STAR or HISAT2 was used as the aligner, variant detection results were affected by lossy quality value compression, albeit to a lesser extent when a STAR-based pipeline was used. Our results also show that the impact of lossy quality value compression depends on the compression algorithm used and, for algorithms that support multiple compression levels, on the level selected.
Lossy quality value compression can be incorporated into existing RNA-seq analysis pipelines to alleviate data storage and transmission burdens. However, compression tools and levels should be selected carefully, based on the requirements of the downstream analysis pipelines, to avoid introducing adverse effects on the analysis results.
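As a rough illustration of what lossy quality value compression at different levels means, the sketch below quantizes the Phred scores of a FASTQ quality string into uniform bins and measures how well the result compresses with a general-purpose codec. This is not any of the algorithms evaluated in the study: the uniform binning scheme, the function names `bin_qualities` and `compressed_size`, the Phred+33 encoding, and the use of zlib are all assumptions made for illustration only.

```python
import zlib

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) FASTQ encoding


def bin_qualities(qual: str, bin_width: int) -> str:
    """Quantize each Phred score down to the lowest score of its bin.

    A larger bin_width discards more information (a coarser, more lossy
    "compression level"); bin_width == 1 leaves the string unchanged.
    """
    if bin_width < 1:
        raise ValueError("bin_width must be a positive integer")
    binned = []
    for ch in qual:
        score = ord(ch) - PHRED_OFFSET
        binned.append(chr((score // bin_width) * bin_width + PHRED_OFFSET))
    return "".join(binned)


def compressed_size(text: str) -> int:
    """Size in bytes of the text after zlib compression (a stand-in codec)."""
    return len(zlib.compress(text.encode("ascii"), 9))


if __name__ == "__main__":
    # Hypothetical quality string, repeated to mimic many reads.
    quality = "IIIHHGGFFEEDDCCBBA@@??>>==<<;;::99" * 200
    for width in (1, 2, 5, 10):
        lossy = bin_qualities(quality, width)
        print(f"bin_width={width:2d} -> {compressed_size(lossy)} bytes")
```

Coarser bins typically shrink the compressed quality stream, but they also alter the values seen by quality-aware tools, which is the trade-off the study reports for pipelines such as HISAT2-based alignment and variant detection.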