Formenti Giulio, Koo Bonhwang, Sollitto Marco, Balacco Jennifer, Brajuka Nadolina, Burhans Richard, Duarte Erick, Giani Alice M, McCaffrey Kirsty, Medico Jack A, Myers Eugene W, Smeds Patrik, Nekrutenko Anton, Jarvis Erich D
The Vertebrate Genome Laboratory, The Rockefeller University, New York, United States.
Department of Biology, University of Florence, Sesto Fiorentino, FI 50019, Italy.
Bioinformatics. 2025 Jul 22. doi: 10.1093/bioinformatics/btaf416.
Large sequencing data sets are being produced and deposited into public archives at unprecedented rates. The availability of tools that can reliably and efficiently generate and store sequencing read summary statistics has become critical.
As part of the effort by the Vertebrate Genomes Project (VGP) to generate high-quality reference genomes at scale, we sought to address the community's need for efficient sequence data evaluation by developing rdeval, a standalone tool to quickly compute and interactively display sequencing read metrics. Rdeval can either run on the fly or store key sequence data metrics in tiny read 'snapshot' files. Statistics can then be efficiently recalled from snapshots for additional processing. Rdeval can convert fa*[.gz] files to and from other popular formats, including BAM and CRAM, for better compression. Overall, while CRAM achieves the best compression, its gain over BAM is marginal, and BAM offers the best compromise between data compression and access speed. Rdeval also generates a detailed visual report with multiple data analytics that can be exported in various formats. We showcase rdeval's functionalities using long-read data from different sequencing platforms and species, including human. For PacBio long-read sequencing, our analysis shows dramatic improvements in both read length and quality over time, as well as the benefit of increased coverage for genome assembly, though the magnitude of that benefit varies across taxa.
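To illustrate the kind of read summary statistics discussed above, the following is a minimal Python sketch that computes read count, total bases, mean read length, and read N50 from FASTQ text. It is not rdeval's implementation or interface: the function names (`fastq_read_lengths`, `n50`, `summarize`) are hypothetical, and the parser assumes a simple uncompressed FASTQ stream with exactly four lines per record.

```python
import io


def fastq_read_lengths(handle):
    """Return the sequence length of each record in a 4-line-per-record FASTQ stream."""
    lengths = []
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record holds the bases
            lengths.append(len(line.strip()))
    return lengths


def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def summarize(lengths):
    """Collect basic read metrics into a dictionary (hypothetical layout)."""
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "total_bases": total,
        "mean_length": total / len(lengths) if lengths else 0.0,
        "n50": n50(lengths),
    }


# Toy input: three reads of length 6, 4, and 2.
example = io.StringIO(
    "@r1\nACGTAC\n+\nIIIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nAC\n+\nII\n"
)
stats = summarize(fastq_read_lengths(example))
print(stats)
```

Once the list of read lengths is available, every metric here follows from it, which is why a compact per-file summary can stand in for the raw reads when only statistics are needed.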
Rdeval is implemented in C++ for data processing and in R for data visualization. Precompiled releases (Linux, MacOS, Windows) and commented source code for rdeval are available under the MIT license at https://github.com/vgl-hub/rdeval. Documentation is available on ReadTheDocs (https://rdeval-documentation.readthedocs.io). Rdeval is also available in Bioconda and in Galaxy (https://usegalaxy.org). An automated test workflow ensures the consistency of software updates.
Supplementary data are available at Bioinformatics online.