Department of Ecology and Evolutionary Biology, Brown University, Providence, Rhode Island, USA.
BMC Bioinformatics. 2013 Nov 19;14:330. doi: 10.1186/1471-2105-14-330.
In the past decade, transcriptome data have become an important component of many phylogenetic studies. They are a cost-effective source of protein-coding gene sequences, and have helped projects grow from a few genes to hundreds or thousands of genes. Phylogenetic studies now regularly include genes from newly sequenced transcriptomes, as well as publicly available transcriptomes and genomes. Implementing such a phylogenomic study, however, is computationally intensive, requires the coordinated use of many complex software tools, and includes multiple steps for which no published tools exist. Phylogenomic studies have therefore been manual or semiautomated. Besides taking considerable user time, this makes phylogenomic analyses difficult to reproduce, compare, and extend. Moreover, methodological improvements made in the context of one study often cannot be easily applied and evaluated in other studies.
We present Agalma, an automated tool that constructs matrices for phylogenomic analyses. The user provides raw Illumina transcriptome data, and Agalma produces annotated assemblies, aligned gene sequence matrices, a preliminary phylogeny, and detailed diagnostics that allow the investigator to make extensive assessments of intermediate analysis steps and the final results. Sequences from other sources, such as externally assembled genomes and transcriptomes, can also be incorporated in the analyses. Agalma is built on the BioLite bioinformatics framework, which tracks provenance, profiles processor and memory use, records diagnostics, manages metadata, installs dependencies, logs version numbers and calls to external programs, and enables rich HTML reports for all stages of the analysis. Agalma includes a small test data set and a built-in test analysis of these data. In addition to describing Agalma, we here present a sample analysis of a larger seven-taxon data set. Agalma is available for download at https://bitbucket.org/caseywdunn/agalma.
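To make the idea of logging calls to external programs concrete, here is a minimal Python sketch of provenance recording. It is not BioLite's or Agalma's actual API: the function name `run_logged`, the record fields, and the in-memory list used as a log are all assumptions chosen for illustration.

```python
import json
import subprocess
import sys
import time
from datetime import datetime, timezone


def run_logged(args, log):
    """Run an external program and append a provenance record to `log`.

    Hypothetical helper for illustration; not part of BioLite or Agalma.
    """
    started = datetime.now(timezone.utc).isoformat()
    t0 = time.perf_counter()
    # Capture output so the caller can inspect it; do not raise on failure,
    # because a failed call is also worth recording.
    result = subprocess.run(args, capture_output=True, text=True)
    record = {
        "args": list(args),                       # exact command line used
        "started_utc": started,                   # when the call began
        "elapsed_s": round(time.perf_counter() - t0, 3),
        "returncode": result.returncode,          # 0 means success
    }
    log.append(record)
    return result


if __name__ == "__main__":
    provenance = []
    # Use the current Python interpreter as a stand-in external program.
    out = run_logged([sys.executable, "-c", "print('hello')"], provenance)
    print(out.stdout.strip())
    print(json.dumps(provenance, indent=2))
```

A real framework would presumably persist such records to a database and also capture tool versions and resource use; the sketch only shows the basic pattern of wrapping every external call so that the exact command line is recorded alongside its outcome.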
Agalma allows complex phylogenomic analyses to be implemented and described unambiguously as a series of high-level commands. This will enable phylogenomic studies to be readily reproduced, modified, and extended. Agalma also facilitates methods development by providing a complete modular workflow, bundled with test data, that will allow further optimization of each step in the context of a full phylogenomic analysis.