Peri Sateesh, Roberts Sarah, Kreko Isabella R, McHan Lauren B, Naron Alexandra, Ram Archana, Murphy Rebecca L, Lyons Eric, Gregory Brian D, Devisetty Upendra K, Nelson Andrew D L
Genetics Graduate Interdisciplinary Group, University of Arizona, Tucson, AZ, United States.
CyVerse, University of Arizona, Tucson, AZ, United States.
Front Genet. 2020 Jan 24;10:1361. doi: 10.3389/fgene.2019.01361. eCollection 2019.
Next-generation RNA-sequencing (RNA-seq) is a powerful means of generating a snapshot of the transcriptomic state within a cell, tissue, or whole organism. As the questions addressed by RNA-seq become both more complex and more numerous, there is a need to simplify RNA-seq processing workflows, to make them more efficient and interoperable, and to make them capable of handling both large and small datasets. This is especially important for researchers who need to process hundreds to tens of thousands of RNA-seq datasets. To address these needs, we have developed a scalable, user-friendly, and easily deployable analysis suite called RMTA (Read Mapping, Transcript Assembly). RMTA can easily process thousands of RNA-seq datasets, with features that include automated read quality analysis, filters for lowly expressed transcripts, and read counting for differential expression analysis. RMTA is containerized using Docker for easy deployment within any compute environment [cloud, local, or high-performance computing (HPC)] and is available as two apps in CyVerse's Discovery Environment, one for normal use and one specifically designed for introducing undergraduates and high school students to RNA-seq analysis. For extremely large datasets (tens of thousands of FASTQ files), we developed a high-throughput, scalable, and parallelized version of RMTA optimized for launching on the Open Science Grid (OSG) from within the Discovery Environment. OSG-RMTA allows users to use the Discovery Environment for data management, parallelization, and job submission to the OSG, and then to employ the OSG for distributed, high-throughput computing. Alternatively, OSG-RMTA can be run directly on the OSG through the command line. RMTA is designed to be useful for data scientists of any skill level who are interested in rapidly and reproducibly analyzing their large RNA-seq datasets.