CNRS UMR 5554, Institut des Sciences de l'Evolution de Montpellier, Université Montpellier 2, Place E. Bataillon, 34095 Montpellier, France.
Mol Ecol Resour. 2012 Sep;12(5):834-45. doi: 10.1111/j.1755-0998.2012.03148.x. Epub 2012 Apr 30.
Next-generation sequencing (NGS) technologies offer the opportunity for population genomic studies of non-model organisms sampled in the wild. The transcriptome is a convenient and popular target for such purposes. However, designing genetic markers from NGS transcriptome data requires assembling gene-coding sequences out of short reads. This is a complex task owing to gene duplications, genetic polymorphism, alternative splicing and transcription noise. Typical assembly programs return thousands of predicted contigs, whose connection to the species' true gene content is unclear, and from which SNPs are difficult to define. Here, the transcriptomes of five diverse non-model animal species (hare, turtle, ant, oyster and tunicate) were assembled from newly generated 454 and Illumina sequence reads. In two species for which a reference genome is available, a new procedure was introduced to annotate each predicted contig as either a full-length cDNA, fragment, chimera, allele, paralogue, genomic sequence or other, based on the number of, and overlap between, BLAST hits to the appropriate reference. Analyses showed that (i) the highest quality assemblies are obtained when 454 and Illumina data are combined, (ii) typical de novo assemblies include a majority of irrelevant cDNA predictions and (iii) assemblies can be appropriately cleaned by filtering contigs based on length and coverage. We conclude that robust, reference-free assembly of thousands of genes from transcriptomic NGS data is possible, opening promising perspectives for transcriptome-based population genomics in animals. A Galaxy pipeline implementing our best-performing assembly strategy is provided.
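The cleaning step in finding (iii), filtering contigs on length and coverage, can be illustrated with a minimal sketch. This is not the authors' Galaxy pipeline: the `Contig` record, the `filter_contigs` function and both default thresholds are hypothetical and chosen only for illustration, since the abstract does not state the cut-offs that were used.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Contig:
    """A predicted contig (hypothetical record, for illustration only)."""
    name: str
    sequence: str
    mean_coverage: float  # average read depth across the contig


def filter_contigs(
    contigs: Iterable[Contig],
    min_length: int = 200,        # assumed threshold, not taken from the paper
    min_coverage: float = 2.5,    # assumed threshold, not taken from the paper
) -> List[Contig]:
    """Keep only contigs that meet both the length and the coverage cut-off."""
    return [
        c
        for c in contigs
        if len(c.sequence) >= min_length and c.mean_coverage >= min_coverage
    ]
```

Calling `filter_contigs(assembly, min_length=500, min_coverage=5.0)` on a list of `Contig` records would return the subset that passes both criteria. In practice, suitable thresholds would presumably need to be tuned per data set.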