Challenges and advances for transcriptome assembly in non-model species.

Author information

Ungaro Arnaud, Pech Nicolas, Martin Jean-François, McCairns R J Scott, Mévy Jean-Philippe, Chappaz Rémi, Gilles André

Affiliations

UMR 7263, Équipe Évolution Génome Environnement, Aix Marseille Université, CNRS, IRD, IMBE, Marseille, France.

UMR Centre de Biologie pour la Gestion des Populations, Montpellier SupAgro, Montferrier-sur-Lez, France.

Publication information

PLoS One. 2017 Sep 20;12(9):e0185020. doi: 10.1371/journal.pone.0185020. eCollection 2017.

Abstract

Analyses of high-throughput transcriptome sequences of non-model organisms are based on two main approaches: de novo assembly and genome-guided assembly using mapping to assign reads prior to assembly. Given the limits of mapping reads to a highly divergent reference, as is frequently the case for non-model species, we evaluate whether blastn outperforms mapping methods for read assignment in such situations (>15% divergence). We demonstrate the high performance of blastn using simulated reads whose lengths correspond to those generated by the most common sequencing platforms, over a realistic range of genetic divergence (0% to 30%). Here we focus on gene identification and not on resolving the whole set of transcripts (i.e. the complete transcriptome). For simulated datasets, the transcriptome-guided assembly based on blastn recovers 94.8% of genes irrespective of read length at 0% divergence; however, the read assignment rate falls as divergence increases and as read length decreases. Nevertheless, 92.6% of genes are still recovered at 30% divergence, irrespective of read length. This analysis also produces a categorization of genes relative to their assignment, and suggests guidelines for data processing prior to analyses of comparative transcriptomics and gene expression to minimize potential inferential bias associated with incorrect transcript assignment. We also compare the performance of de novo assembly alone with that of de novo assembly combined with a transcriptome-guided assembly based on blastn, both via simulation and empirically, using data from a cyprinid fish species and from an oak species. For any simulated scenario, the transcriptome-guided assembly using blastn outperforms the de novo approach alone, including when the divergence level is beyond the reach of traditional mapping methods.
Combining de novo assembly and a related reference transcriptome for read assignment also addresses the bias/error in contigs caused by the dependence on a related reference alone. Empirical data from two non-model organisms, Parachondrostoma toxostoma (fish) and Quercus pubescens (plant), corroborate these findings. For the fish, of the 31,944 genes known from Danio rerio, the guided and de novo assemblies recover 20,605 and 20,032 genes, respectively, but the guided assembly performs much better on both the contiguity and completeness metrics. For the oak, of the 29,971 genes known from Vitis vinifera, the transcriptome-guided and de novo assemblies display similar performance, but the new guided approach detects 16,326 genes whereas the de novo assembly detects only 9,385.
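
The read-assignment step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes blastn was run against a related reference transcriptome with tabular output (`-outfmt 6`, the standard 12-column layout), and the function names, the e-value cutoff, and the demo hits are all hypothetical choices made for illustration. Each read is assigned to its best-scoring reference sequence, reads are binned per reference, and the fraction of reference genes that received at least one read is reported.

```python
import csv
import io
from collections import defaultdict

# Standard 12-column BLAST tabular layout (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def assign_reads(blast_tab, max_evalue=1e-5):
    """Assign each read to the reference sequence of its best-scoring hit.

    blast_tab is any iterable of tab-separated lines. max_evalue is an
    illustrative cutoff (an assumption, not a value from the paper).
    """
    best = {}  # read id -> (bitscore, reference id)
    for row in csv.reader(blast_tab, delimiter="\t"):
        if len(row) < len(FIELDS) or row[0].startswith("#"):
            continue  # skip blank, truncated, or comment lines
        hit = dict(zip(FIELDS, row))
        if float(hit["evalue"]) > max_evalue:
            continue  # hit too weak to support an assignment
        score = float(hit["bitscore"])
        read = hit["qseqid"]
        if read not in best or score > best[read][0]:
            best[read] = (score, hit["sseqid"])
    return {read: ref for read, (_, ref) in best.items()}


def reads_per_reference(assignment):
    """Group assigned reads by reference sequence (one bin per gene)."""
    bins = defaultdict(list)
    for read, ref in assignment.items():
        bins[ref].append(read)
    return dict(bins)


def recovery_rate(bins, n_reference_genes):
    """Fraction of reference genes that received at least one read."""
    return len(bins) / n_reference_genes


# Made-up hits for three reads against three hypothetical genes.
demo = io.StringIO(
    "read1\tgeneA\t92.0\t100\t8\t0\t1\t100\t11\t110\t1e-30\t130\n"
    "read1\tgeneB\t85.0\t100\t15\t0\t1\t100\t5\t104\t1e-20\t95\n"
    "read2\tgeneB\t88.0\t100\t12\t0\t1\t100\t40\t139\t1e-25\t110\n"
    "read3\tgeneC\t75.0\t40\t10\t0\t1\t40\t7\t46\t0.5\t30\n"
)
assignment = assign_reads(demo)
bins = reads_per_reference(assignment)
print(assignment)
print(recovery_rate(bins, n_reference_genes=3))
```

In this sketch each bin of reads would then be passed to an assembler, and reads left unassigned (such as `read3` in the demo) could go to de novo assembly, in the spirit of the combined approach. The same ratio applied to the reported counts gives roughly 64.5% (20,605 of 31,944) versus 62.7% (20,032 of 31,944) for the fish, and roughly 54.5% (16,326 of 29,971) versus 31.3% (9,385 of 29,971) for the oak.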

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6d4e/5607178/302040d80529/pone.0185020.g001.jpg
