Mpangase Phelelani T, Frost Jacqueline, Tikly Mohammed, Ramsay Michèle, Hazelhurst Scott
Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand, Johannesburg.
Division of Human Genetics, National Health Laboratory Service and School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg.
S Afr Comput J. 2021 Dec;33(2). doi: 10.18489/sacj.v33i2.830. Epub 2021 Dec 20.
The rate of raw sequence production through Next-Generation Sequencing (NGS) has been growing exponentially due to improved technology and reduced costs. This has enabled researchers to answer many biological questions through "multi-omics" data analyses. Although such data promise new insights into how biological systems function and into disease mechanisms, computational analyses of such large datasets come with their own challenges and potential pitfalls. The aim of this study was to develop a robust, portable, and reproducible bioinformatic pipeline for automating RNA sequencing (RNA-seq) data analyses. Using Nextflow as the workflow management system and Singularity for application containerisation, the nf-rnaSeqCount pipeline was developed for mapping raw RNA-seq reads to a reference genome and quantifying the abundance of identified genomic features for differential gene expression analyses. The pipeline provides a quick and efficient way to obtain a matrix of read counts that can be used with tools such as DESeq2 and edgeR for differential expression analysis. Robust and flexible bioinformatic and computational pipelines for RNA-seq data analysis, from QC to sequence alignment and comparative analyses, will reduce analysis time and increase the accuracy and reproducibility of findings, thereby promoting transcriptome research.
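To illustrate what a "matrix of read counts" looks like, the following minimal Python sketch merges per-sample feature counts into a single genes-by-samples table of the kind that tools such as DESeq2 and edgeR take as input. It is not taken from nf-rnaSeqCount; the function names (`build_count_matrix`, `to_tsv`), the sample and gene identifiers, and the counts themselves are hypothetical and chosen only for illustration.

```python
def build_count_matrix(sample_counts):
    """Merge per-sample {gene: count} dicts into a genes-by-samples matrix.

    Hypothetical helper: genes missing from a sample are given a count of 0.
    Returns (genes, samples, rows), where rows[i][j] is the count of
    genes[i] in samples[j].
    """
    samples = sorted(sample_counts)
    genes = sorted({g for counts in sample_counts.values() for g in counts})
    rows = [[sample_counts[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, rows


def to_tsv(genes, samples, rows):
    """Render the matrix as tab-separated text with a header row."""
    lines = ["gene_id\t" + "\t".join(samples)]
    for gene, row in zip(genes, rows):
        lines.append(gene + "\t" + "\t".join(str(c) for c in row))
    return "\n".join(lines)


# Made-up counts for two illustrative samples (not real data).
example = {
    "sample_A": {"GENE1": 10, "GENE2": 0},
    "sample_B": {"GENE1": 7, "GENE3": 4},
}

genes, samples, rows = build_count_matrix(example)
print(to_tsv(genes, samples, rows))
```

Each row is a genomic feature and each column a sample, with raw (unnormalised) integer counts in the cells; normalisation is left to the downstream differential expression tool.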