Evolutionary Bioinformatics, Institute for Evolution and Biodiversity, Westfaelische-Wilhelms-University, Muenster, Germany.
PLoS One. 2012;7(2):e31410. doi: 10.1371/journal.pone.0031410. Epub 2012 Feb 27.
The quantity of transcriptome data is rapidly increasing for non-model organisms. As sequencing technology advances, focus shifts towards solving bioinformatic challenges, of which sequence read assembly is the first task. Recent studies have compared the performance of different software to establish a best practice for transcriptome assembly. Here, we adapted a simulation approach to evaluate specific features of assembly programs on 454 data. The novelty of our study is that the simulation allows us to calculate a model assembly as a reference point for comparison.
The simulation approach allows us to compare basic metrics of assemblies computed by different software applications (CAP3, MIRA, Newbler, and Oases) to a known optimal solution. We found that MIRA and CAP3 were conservative in merging reads, which resulted in a comparatively high number of short contigs. In contrast, Newbler more readily merged reads into longer contigs, while Oases produced the shortest assembly overall. Because the reads were simulated, each could be traced back to its correct placement within the transcriptome. Combining this information with a mapping of the reads onto the assembled contigs, we were able to evaluate ambiguity in the assemblies. This analysis further supported the conservative nature of MIRA and CAP3, which resulted in low proportions of chimeric contigs but high redundancy. Newbler produced less redundancy, but its proportion of chimeric contigs was higher.
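The ambiguity evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not state the exact definitions used. The sketch assumes that a contig counts as chimeric when reads from more than one source transcript map to it, and that a transcript counts as redundantly assembled when its reads are spread over more than one contig. The function name `evaluate_ambiguity` and the two input dictionaries are hypothetical.

```python
from collections import defaultdict


def evaluate_ambiguity(read_origin, read_contig):
    """Estimate chimerism and redundancy of an assembly of simulated reads.

    read_origin: read id -> source transcript id (known from the simulation)
    read_contig: read id -> contig id (from mapping reads onto the contigs)

    Assumed definitions (not taken from the paper):
      chimeric contig      = contig holding reads from more than one transcript
      redundant transcript = transcript whose reads fall into more than one contig
    """
    transcripts_per_contig = defaultdict(set)
    contigs_per_transcript = defaultdict(set)

    # Unmapped reads are simply absent from read_contig and are ignored here.
    for read_id, contig_id in read_contig.items():
        transcript_id = read_origin[read_id]
        transcripts_per_contig[contig_id].add(transcript_id)
        contigs_per_transcript[transcript_id].add(contig_id)

    if not transcripts_per_contig:
        return {"chimeric_fraction": 0.0, "redundant_fraction": 0.0}

    chimeric = [c for c, ts in transcripts_per_contig.items() if len(ts) > 1]
    redundant = [t for t, cs in contigs_per_transcript.items() if len(cs) > 1]

    return {
        "chimeric_fraction": len(chimeric) / len(transcripts_per_contig),
        "redundant_fraction": len(redundant) / len(contigs_per_transcript),
    }


# Toy data: transcript tA is split over contigs c1 and c2 (redundancy),
# and contig c2 mixes reads from tA and tB (chimerism).
read_origin = {"r1": "tA", "r2": "tA", "r3": "tB", "r4": "tB", "r5": "tC"}
read_contig = {"r1": "c1", "r2": "c2", "r3": "c2", "r4": "c2", "r5": "c3"}
result = evaluate_ambiguity(read_origin, read_contig)
```

Under these assumed definitions, a conservative assembler would tend towards a low chimeric fraction and a high redundant fraction, and a liberal assembler towards the opposite.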
Our evaluation of four assemblers suggested that MIRA and Newbler slightly outperformed the other programs while showing contrasting characteristics. Oases did not perform well on the 454 reads. The software was either conservative (MIRA) or liberal (Newbler) about merging reads into contigs. This suggests that, when choosing an assembly program, researchers should carefully consider their follow-up analyses and the consequences of the chosen assembly approach.