Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, Campus de Teatinos s/n, Málaga, 29071, Spain.
Plant Reproductive Biology Laboratory, Department of Biochemistry, Cell and Molecular Biology of Plants, Estación Experimental del Zaidín, CSIC, Prof. Albareda, 1, Granada, 18160, Spain.
BMC Bioinformatics. 2018 Nov 20;19(Suppl 14):416. doi: 10.1186/s12859-018-2384-y.
Advances in high-throughput sequencing technologies are enabling the de novo assembly of transcriptomes from a growing number of new organisms. Some degree of automation and evaluation is required to ensure reproducibility, repeatability and the selection of the best possible transcriptome. Workflows and pipelines are becoming an absolute requirement for this purpose, but the problem of evaluating de novo transcriptome assemblies in organisms lacking a sequenced genome remains unsolved. An automated, reproducible and flexible framework called TransFlow that accomplishes this task is described.
TransFlow, with its five independent modules, was designed to build different workflows depending on the nature of the original reads. This architecture supports different combinations of Illumina and Roche/454 sequencing data, and can be extended to other sequencing platforms. Its capabilities are illustrated by the selection of reliable plant reference transcriptomes and the assembly of six transcriptomes (three case studies for grapevine leaf, olive tree pollen and chestnut stem, and three more for the haustorium, the epiphytic structures and their combination in the phytopathogenic fungus Podosphaera xanthii). The Arabidopsis and poplar transcriptomes proved to be the best references. A common finding across the de novo assemblies is that Illumina paired-end reads of 100 nt assembled with OASES can provide reliable transcriptomes, while the contribution of longer reads is noticeable only when they complement a set of short, single-end reads.
TransFlow can handle up to 181 different assembly strategies. Evaluation based on principal component analysis lets it adapt to different sets of reads and provide a suitable transcriptome for each combination of reads and assemblers. As a result, each case study shows its own behaviour and prioritises its own evaluation parameters, and the framework gives an objective, automated way of detecting the best transcriptome within a pool of candidates. The most influential factors are sequencing data type and quantity (preferably several hundred million reads of 2×100 nt or longer), assemblers (OASES for Illumina; MIRA4 and EULER-SR reconciled with CAP3 for Roche/454) and strategy (preferably scaffolding with OASES, and probably merging with Roche/454 data when available).
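The idea of ranking candidate assemblies by principal component analysis can be illustrated with a minimal sketch. It is not TransFlow's implementation: the three metrics (contig count, N50, percentage of contigs with an ortholog), all values, the names `standardize`, `first_component` and `closest_to_reference`, and the rule of picking the assembly nearest to a reference transcriptome on the first component are assumptions made for illustration only.

```python
import math

def standardize(rows):
    """Scale every metric (column) to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def first_component(z, iterations=200):
    """Approximate the first principal axis by power iteration."""
    n = len(z[0])
    # Correlation matrix of the standardized metrics.
    cov = [[sum(r[i] * r[j] for r in z) / len(z) for j in range(n)]
           for i in range(n)]
    # Asymmetric start vector, to avoid starting orthogonal to the axis.
    vec = [1.0 / (i + 1) for i in range(n)]
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vec = [v / norm for v in vec]
    return vec

def closest_to_reference(names, rows, reference_index=0):
    """Return the candidate whose PC1 score is nearest to the reference."""
    z = standardize(rows)
    axis = first_component(z)
    scores = [sum(a * b for a, b in zip(row, axis)) for row in z]
    ref = scores[reference_index]
    candidates = [(abs(s - ref), name)
                  for i, (s, name) in enumerate(zip(scores, names))
                  if i != reference_index]
    return min(candidates)[1]

# Hypothetical metrics: contig count, N50 (nt), % contigs with an ortholog.
names = ["reference", "assembly_A", "assembly_B", "assembly_C"]
metrics = [
    [30000, 1500, 60],  # made-up reference transcriptome
    [31000, 1450, 58],  # made-up candidate assemblies
    [90000, 400, 20],
    [60000, 800, 35],
]
best = closest_to_reference(names, metrics)
print(best)
```

Standardizing first keeps a large-valued metric such as contig count from dominating the component; a real evaluation would likely use more metrics and more than one component.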