Perdomo Jonathan Elliot, Ahsan Mian Umair, Liu Qian, Fang Li, Wang Kai
bioRxiv. 2024 Aug 7:2024.08.05.606643. doi: 10.1101/2024.08.05.606643.
While several well-established quality control (QC) tools are available for short-read sequencing data, there is a general paucity of computational tools that provide long-read metrics in a fast and comprehensive manner across all major sequencing platforms (such as PacBio, Oxford Nanopore, and Illumina Complete Long Read) and data formats (such as ONT POD5, FAST5, and basecall summary files, and PacBio unaligned BAM). Additionally, none of the current tools supports summarizing Oxford Nanopore basecall signal or comprehensive base modification (methylation) information from genomic data. Furthermore, a single PromethION flowcell on the Oxford Nanopore platform can now generate terabytes of signal data, which cannot be handled by existing tools designed for small-scale flowcells. To address these challenges, we present LongReadSum, a multi-threaded C++ tool that provides fast and comprehensive QC reports on all major aspects of sequencing data (such as read, base, base quality, alignment, and base modification metrics) and produces basecalling signal intensity information from the Oxford Nanopore platform. We demonstrate use cases analyzing cDNA sequencing, direct mRNA sequencing, reduced representation methylation sequencing (RRMS) through adaptive sequencing, and whole genome sequencing (WGS) data using diverse long-read platforms.