Department of Zoology and Animal Biology, University of Geneva, 1211 Geneva 4, Switzerland.
Genome Res. 2010 Oct;20(10):1432-40. doi: 10.1101/gr.103846.109. Epub 2010 Aug 6.
Transcriptome analysis has important applications in many biological fields. However, assembling a transcriptome without a known reference remains a challenging task that requires algorithmic improvements. We present two methods that substantially improve de novo transcriptome assembly. The first method rests on the observation that the single k-mer length used by current de novo assemblers is suboptimal for assembling transcriptomes in which the sequence coverage of transcripts is highly heterogeneous. We present the Multiple-k method, in which various k-mer lengths are used for de novo transcriptome assembly. We demonstrate its good performance by assembling de novo a published next-generation transcriptome sequence data set of Aedes aegypti, using the existing genome to check the accuracy of our method. The second method uses a reference proteome to improve the de novo assembly. We developed the Scaffolding using Translation Mapping (STM) method, which maps contigs against the closest available reference proteome and scaffolds together the contigs that map onto the same protein. In a controlled experiment using simulated data, we show that the STM method considerably improves the assembly, with few errors. We applied these two methods to assemble the transcriptome of the non-model catfish Loricaria gr. cataphracta. With the Multiple-k and STM methods, the assembly gains in contiguity and in gene identification, showing that our methods clearly improve assembly quality and can be widely used. The new methods successfully assembled the transcripts of the core set of genes regulating tooth development in vertebrates, whereas classic de novo assembly failed.
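The abstract does not describe how assemblies obtained with different k-mer lengths are combined. As a minimal sketch of one possible merging step, assuming the contigs from each k are simply pooled and any contig contained in a longer one (on either strand) is discarded, the idea could look as follows. The function names and the toy contigs are hypothetical and are not taken from the paper.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merge_assemblies(assemblies):
    """Pool contigs from several single-k assemblies and drop redundant ones.

    assemblies: dict mapping a k-mer length to the list of contigs that a
    (hypothetical) assembler run with that k produced.
    A contig is considered redundant if it, or its reverse complement, is a
    substring of a contig that was already kept. This containment rule is an
    assumption made for illustration only.
    """
    # Deduplicate, then visit the longest contigs first so that fragments
    # are always compared against the contigs that could contain them.
    pooled = sorted(
        {c for contigs in assemblies.values() for c in contigs},
        key=lambda c: (-len(c), c),
    )
    kept = []
    for contig in pooled:
        rc = reverse_complement(contig)
        if any(contig in longer or rc in longer for longer in kept):
            continue
        kept.append(contig)
    return kept


# Toy input: the small k recovers a weakly covered transcript in one piece but
# fragments the other one; the large k does the opposite.
toy_assemblies = {
    19: ["ATGGCCATTGTAATGGGCCGCTGA", "GACGTTTACC"],
    31: ["TTGACCGGTAAACGTCCAGGATCC", "ATGGCCATTG", "GGGCCGCTGA"],
}
merged = merge_assemblies(toy_assemblies)
```

In this toy case only the two full-length contigs survive, one contributed by each k, which is the effect the Multiple-k approach is meant to exploit when coverage differs strongly between transcripts.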
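The scaffolding step of STM can be illustrated with a small sketch under simplifying assumptions: each contig has already been mapped to a single best reference protein by some translated search, all contigs are in the sense orientation, and contigs hitting the same protein are joined in the order of their positions on that protein, separated by runs of N. The hit format, the gap rule, and all names below are hypothetical choices for illustration, not the published implementation.

```python
from collections import defaultdict


def scaffold_by_protein(contigs, hits):
    """Join contigs that map onto the same reference protein.

    contigs: dict mapping contig id -> nucleotide sequence.
    hits: list of (contig_id, protein_id, aa_start, aa_end) tuples, with
          1-based inclusive amino acid coordinates on the reference protein.
    Returns a dict mapping protein id -> scaffold sequence.
    """
    by_protein = defaultdict(list)
    for contig_id, protein_id, aa_start, aa_end in hits:
        by_protein[protein_id].append((aa_start, aa_end, contig_id))

    scaffolds = {}
    for protein_id, placed in by_protein.items():
        placed.sort()  # order contigs along the protein
        parts = []
        prev_end = None
        for aa_start, aa_end, contig_id in placed:
            if prev_end is not None:
                # Assumed gap rule: three N per uncovered amino acid,
                # and at least one N when the hits touch or overlap.
                missing_aa = aa_start - prev_end - 1
                parts.append("N" * max(1, 3 * missing_aa))
            parts.append(contigs[contig_id])
            prev_end = aa_end
        scaffolds[protein_id] = "".join(parts)
    return scaffolds


toy_contigs = {"c1": "ATGGCC", "c2": "GGGTGA", "c3": "ATGAAA"}
toy_hits = [
    ("c2", "P1", 5, 6),
    ("c1", "P1", 1, 2),
    ("c3", "P2", 1, 2),
]
toy_scaffolds = scaffold_by_protein(toy_contigs, toy_hits)
```

Here c1 and c2 both map onto protein P1 and are joined into one scaffold with a gap sized from the two uncovered amino acids, while c3 remains on its own. A real pipeline would also need to handle strand, reading frame, and overlapping or conflicting hits.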