Fundación para el Fomento de la Investigación Sanitaria y Biomédica de la Comunidad Valenciana (FISABIO)-Salud Pública, Avenida de Cataluña 21, 46020 Valencia, Spain.
BMC Genomics. 2014 Jan 18;15:37. doi: 10.1186/1471-2164-15-37.
The main limitations in the analysis of viral metagenomes are perhaps the high genetic variability and the lack of information in extant databases. To address these issues, several bioinformatic tools have been specifically designed or adapted for metagenomics by improving read assembly and creating more sensitive methods for homology detection. This study compares the performance of different available assemblers and taxonomic annotation software using simulated viral-metagenomic data.
We simulated two 454 viral metagenomes using genomes from NCBI's RefSeq database, based on the list of actual viruses found in previously published metagenomes. Three different assembly strategies, spanning six assemblers, were tested for performance: the overlap-layout-consensus algorithms Newbler, Celera and Minimo; the de Bruijn graph algorithms Velvet and MetaVelvet; and the probabilistic read model Genovo. The performance of the assemblies was measured by the length of the resulting contigs (using N50), the percentage of reads assembled and the overall accuracy when compared against the corresponding reference genomes. Additionally, the number of chimeras per contig and the lowest common ancestor were estimated in order to assess the effect of assembly on taxonomic and functional annotation. The functional classification of the reads was evaluated by counting the reads that correctly matched the functional data previously reported for the original genomes and by calculating the number of over-represented functional categories in chimeric contigs. The sensitivity and specificity of tBLASTx, PhymmBL and k-mer frequencies were measured as the proportion of accurate predictions when comparing simulated reads against the NCBI Virus genomes RefSeq database.
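As an illustration of the contig-length metric mentioned above, the following is a minimal sketch of the standard N50 definition: the length of the contig at which the cumulative sum of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The function name `n50` and the example lengths are hypothetical and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if it is empty)."""
    # Sort from longest to shortest so the largest contigs are counted first.
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that brings the running sum to half the
        # total assembly length defines the N50.
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths (in bp), for illustration only.
example_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example_lengths))
```

A higher N50 indicates that more of the assembly is contained in long contigs, but it says nothing about correctness, which is why the abstract pairs it with accuracy against the reference genomes and with chimera counts.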
Assembling improves functional annotation by increasing accurate assignments and decreasing ambiguous hits between viruses and bacteria. However, its success is limited by the chimeric contigs that occur at all taxonomic levels. The assembler and its parameters should be selected based on the focus of each study. Minimo's non-chimeric contigs and Genovo's long contigs excelled in taxonomic assignment and functional annotation, respectively. tBLASTx stood out as the best approach for taxonomic annotation for virus identification. PhymmBL proved useful in datasets in which no related sequences are present, as it uses genomic features that may help identify distant taxa. The k-mer frequencies underperformed in all viral datasets.
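To make concrete why chimeric contigs limit taxonomic assignment, the sketch below shows one simple way a lowest common ancestor can be computed from the lineages of the reads in a contig. It is an assumption-laden illustration, not the procedure used in the study: the function name `lowest_common_ancestor`, the representation of a lineage as a root-to-leaf list, and the placeholder taxon names are all hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages.

    Each lineage is a list of taxon names ordered from root to leaf.
    Returns None if there are no lineages or nothing is shared.
    """
    if not lineages:
        return None
    shared = None
    # Walk the ranks in parallel; zip stops at the shortest lineage.
    for ranks in zip(*lineages):
        if all(taxon == ranks[0] for taxon in ranks):
            shared = ranks[0]
        else:
            break
    return shared


# Placeholder lineages (hypothetical names, not real taxa).
same_order = [
    ["root", "order_A", "family_A1"],
    ["root", "order_A", "family_A2"],
]
different_orders = [
    ["root", "order_A", "family_A1"],
    ["root", "order_B", "family_B1"],
]
print(lowest_common_ancestor(same_order))
print(lowest_common_ancestor(different_orders))
```

In this sketch, a contig that mixes reads from two families of the same order can still be assigned at the order level, whereas a contig that mixes reads from different orders collapses to the root, which carries no useful taxonomic information.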