School of Biological Sciences, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA.
Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA.
BMC Bioinformatics. 2021 Oct 21;22(1):513. doi: 10.1186/s12859-021-04434-8.
Systems-level analyses, such as differential gene expression analysis, co-expression analysis, and metabolic pathway reconstruction, depend on the accuracy of the transcriptome. Multiple tools exist to perform transcriptome assembly from RNAseq data. However, assembling high-quality transcriptomes is still not a trivial problem. This is especially the case for non-model organisms, for which adequate reference genomes are often not available. Different methods produce different transcriptome models, and there is no easy way to determine which are more accurate. Furthermore, alternative-splicing events make an already difficult assembly problem harder. While benchmarking transcriptome assemblies is critical, it is also not trivial because true reference transcriptomes are generally lacking.
In this study, we first provide a pipeline to generate a set of simulated benchmark transcriptomes and the corresponding RNAseq data. Using the simulated benchmarking datasets, we compared the performance of various transcriptome assembly approaches, including both de novo and genome-guided methods. The results showed that assembly performance deteriorates significantly when alternative transcripts (isoforms) exist or, for genome-guided methods, when the reference is not available from the same genome. To improve transcriptome assembly performance by leveraging the overlapping predictions between different assemblies, we present a new consensus-based ensemble transcriptome assembly approach, ConSemble.
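The consensus idea can be illustrated with a minimal sketch: keep only the contigs that several independent assemblies agree on. This is not the ConSemble script itself. The function name `consensus_contigs`, the `min_support` threshold, the use of exact sequence identity as the matching criterion, and the toy sequences are all assumptions made for illustration; the abstract does not state how overlap between assemblies is measured.

```python
from collections import Counter


def consensus_contigs(assemblies, min_support=3):
    """Return the sequences reported by at least `min_support` assemblies.

    `assemblies` is a list of contig collections, one per assembler.
    Matching is by exact sequence identity (an assumption of this sketch).
    """
    votes = Counter()
    for contigs in assemblies:
        # set() ensures each assembler votes at most once per sequence
        votes.update(set(contigs))
    return {seq for seq, n in votes.items() if n >= min_support}


# Toy input: four hypothetical assemblies of the same reads
assemblies = [
    ["ATGGCC", "ATGAAA", "ATGTTT"],
    ["ATGGCC", "ATGAAA"],
    ["ATGGCC", "ATGCCC"],
    ["ATGGCC", "ATGAAA", "ATGCCC"],
]

kept = consensus_contigs(assemblies, min_support=3)
print(sorted(kept))  # ['ATGAAA', 'ATGGCC']
```

Raising `min_support` trades recall for precision: a contig found by only one assembler is dropped even if it happens to be correct.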
Without using a reference genome, ConSemble using four de novo assemblers achieved an accuracy up to twice as high as that of any individual de novo assembler we compared. When a reference genome is available, ConSemble using four genome-guided assemblies removed many incorrectly assembled contigs with minimal impact on correctly assembled contigs, achieving higher precision and accuracy than individual genome-guided methods. Furthermore, ConSemble using de novo assemblers matched or exceeded the best-performing genome-guided assemblers even when the transcriptomes included isoforms. We thus demonstrated that the ConSemble consensus strategy can improve transcriptome assembly for both de novo and genome-guided assemblers. The RNAseq simulation pipeline, the benchmark transcriptome datasets, and the script to perform the ConSemble assembly are all freely available from: http://bioinfolab.unl.edu/emlab/consemble/ .
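To make the benchmarking comparison concrete, the sketch below scores an assembly against a known reference set, as a simulated benchmark allows. It is an illustration under stated assumptions: the function name `assembly_metrics` and the toy identifiers are hypothetical, contigs are matched by exact identity, and precision, recall, and F1 are used as stand-ins because the abstract does not give its exact definition of accuracy. The numbers are toy values, not results from the study.

```python
def assembly_metrics(predicted, reference):
    """Score predicted contigs against a known reference transcript set."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # correctly assembled
    fp = len(predicted - reference)   # incorrectly assembled
    fn = len(reference - predicted)   # reference transcripts missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data: one hypothetical assembler versus a consensus of several
reference = {"A", "B", "C", "D"}
single = {"A", "B", "C", "X", "Y", "Z"}   # three correct, three spurious
consensus = {"A", "B", "C"}               # spurious contigs filtered out

print(assembly_metrics(single, reference))
print(assembly_metrics(consensus, reference))
```

In this toy case, filtering raises precision from 0.5 to 1.0 while recall stays at 0.75, which mirrors the pattern the abstract describes for the genome-guided consensus.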