Liberati Franco, Pose Marino Taiel Maximiliano, Bottoni Paolo, Canestrelli Daniele, Castrignanò Tiziana
Department of Ecological and Biological Sciences, University of Tuscia, Viale dell'Università s.n.c., 01100, Viterbo, Italy.
Department of Computer Science, Sapienza University of Rome, Viale Regina Elena 295, 00166, Rome, Italy.
BMC Bioinformatics. 2025 Apr 28;26(1):113. doi: 10.1186/s12859-025-06121-4.
Recent years have seen a substantial increase in RNA-seq data production, and the technique has become the primary approach for gene expression studies across a wide range of non-model organisms. Most of these organisms lack a well-annotated reference genome on which to base the study of differentially expressed genes (DEGs). As a cost-effective alternative to a reference genome, raw RNA-seq reads are assembled into a 'de novo transcriptome', which serves as the reference for subsequent DEG analysis. In conventional DEG analysis pipelines for non-model organisms, this assembly step is computationally expensive. Furthermore, the complexity of de novo transcriptome assembly workflows makes it difficult for researchers to apply best-practice techniques and the most recent software versions, particularly when working with several organisms of interest.
To address computational challenges in transcriptomic analyses of non-model organisms, we present HPC-T-Assembly, a tool for de novo transcriptome assembly from RNA-seq data on high-performance computing (HPC) infrastructures. It is designed for straightforward setup via a Web-oriented interface, which allows an analysis to be configured for several species. Once the configuration data is provided, the entire parallel computing software for assembly is generated automatically and can be launched on a supercomputer with a simple command line. Beyond the assembly itself, the pipeline runs additional post-processing steps, such as assembly quality control, ORF prediction, and transcript count matrix construction, and provides both intermediate and final outputs.
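The idea of turning a per-species configuration into launchable job scripts can be illustrated with a minimal Python sketch. This is not the HPC-T-Assembly implementation: the abstract does not name a scheduler or assembler, so the sketch assumes a SLURM cluster, and the names `SpeciesRun`, `render_job_script`, `render_all`, and the `my_assembler` command template are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpeciesRun:
    """Hypothetical per-species configuration entry."""
    name: str          # species label, used for job and output names
    reads_left: str    # path to the first paired-end FASTQ file
    reads_right: str   # path to the second paired-end FASTQ file
    cpus: int = 16     # CPU cores requested for the job
    mem_gb: int = 64   # memory requested for the job, in GB


def render_job_script(run: SpeciesRun, assemble_cmd: str) -> str:
    """Return the text of a SLURM batch script for one species.

    `assemble_cmd` is a user-supplied command template with the
    placeholders {left}, {right}, {cpus} and {outdir}; the sketch makes
    no assumption about which assembler is actually called.
    """
    command = assemble_cmd.format(
        left=run.reads_left,
        right=run.reads_right,
        cpus=run.cpus,
        outdir=f"{run.name}_assembly",
    )
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name=asm_{run.name}",
        f"#SBATCH --cpus-per-task={run.cpus}",
        f"#SBATCH --mem={run.mem_gb}G",
        f"#SBATCH --output={run.name}_%j.log",
        "set -euo pipefail",
        command,
    ]
    return "\n".join(lines) + "\n"


def render_all(runs, assemble_cmd):
    """Map a script file name to its contents, one script per species."""
    return {f"{r.name}.sbatch": render_job_script(r, assemble_cmd) for r in runs}


if __name__ == "__main__":
    # Hypothetical command template; replace with a real assembler call.
    template = "my_assembler --left {left} --right {right} --threads {cpus} --out {outdir}"
    runs = [
        SpeciesRun("species_a", "a_R1.fastq.gz", "a_R2.fastq.gz"),
        SpeciesRun("species_b", "b_R1.fastq.gz", "b_R2.fastq.gz", cpus=32, mem_gb=128),
    ]
    for filename, text in render_all(runs, template).items():
        print(f"--- {filename} ---")
        print(text)
```

Under these assumptions, each generated file could be submitted independently with `sbatch`, so the assemblies for different species would run in parallel on separate allocations.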
Through a user-friendly Web-oriented interface, HPC-T-Assembly allows users to configure a run that assembles RNA-seq data from multiple species simultaneously. The parallel pipeline, launched on HPC infrastructures, significantly reduces computational load and execution times, enabling large-scale transcriptomic and metatranscriptomic analysis projects.