Comparison across non-model animals suggests an optimal sequencing depth for de novo transcriptome assembly

BACKGROUND

The lack of genomic resources can present challenges for studies of non-model organisms. Transcriptome sequencing offers an attractive method to gather information about genes and gene expression without the need for a reference genome. However, it is unclear what sequencing depth is adequate to assemble the transcriptome de novo for these purposes.

RESULTS

We assembled transcriptomes of animals from six phyla (Annelids, Arthropods, Chordates, Cnidarians, Ctenophores, and Molluscs) at regular increments of read count using Velvet/Oases and Trinity to determine how read count affects the assembly. This included an assembly of mouse heart reads, which could be compared against the available reference genome. We found qualitative differences between assemblies of whole animals and assemblies of tissues. With increasing reads, whole-animal assemblies show a rapid increase in transcripts and in the discovery of conserved genes, while single-tissue assemblies show slower discovery of conserved genes, though the assembled transcripts were often longer. A deeper examination of the mouse assemblies shows that assembly errors become more frequent with more reads, but such errors can be mitigated with more stringent assembly parameters.
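The idea of assembling at regular increments of reads can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not say how the read subsets were drawn, so taking nested prefixes of the read set is an assumption here, and the names `parse_fastq`, `increment_sizes`, and `nested_subsets` are hypothetical helpers introduced only for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: header, sequence, separator, quality string.
FastqRecord = Tuple[str, ...]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records from an iterable of text lines."""
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(islice(stripped, 4))
        if len(record) < 4:  # end of input (or a truncated final record)
            return
        yield record


def increment_sizes(total_reads: int, step: int) -> List[int]:
    """Read counts at regular increments: step, 2*step, ... up to total_reads."""
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(step, total_reads + 1, step))


def nested_subsets(
    records: List[FastqRecord], step: int
) -> Iterator[Tuple[int, List[FastqRecord]]]:
    """Yield (read_count, subset) pairs; each subset is a prefix of the reads.

    Assumption: subsets are nested prefixes, so every larger subset
    contains all reads of the smaller ones.
    """
    for n in increment_sizes(len(records), step):
        yield n, records[:n]
```

Each yielded subset would then be written out and passed to the assembler separately, so that assembly metrics can be compared across read counts.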

CONCLUSIONS

These assembly trends suggest that representative assemblies are generated with as few as 20 million reads for tissue samples and 30 million reads for whole animals, for RNA-level coverage. These depths provide a good balance between coverage and noise. Beyond 60 million reads, few new genes are discovered and sequencing errors in highly expressed genes are likely to accumulate. Finally, siphonophores (polymorphic Cnidarians) are an exception and may require alternative assembly strategies.
