Orthology Guided Assembly in highly heterozygous crops: creating a reference transcriptome to uncover genetic diversity in Lolium perenne.
Affiliation
Plant Sciences Unit--Growth and Development, Institute for Agricultural and Fisheries Research-ILVO, Melle, Belgium.
Publication information
Plant Biotechnol J. 2013 Jun;11(5):605-17. doi: 10.1111/pbi.12051. Epub 2013 Feb 21.
Despite current advances in next-generation sequencing data analysis procedures, de novo assembly of a reference sequence required for SNP discovery and expression analysis is still a major challenge in genetically uncharacterized, highly heterozygous species. High levels of polymorphism inherent to outbreeding crop species hamper De Bruijn Graph-based de novo assembly algorithms, causing transcript fragmentation and the redundant assembly of allelic contigs. If multiple genotypes are sequenced to study genetic diversity, primary de novo assembly is best performed per genotype to limit the level of polymorphism and avoid transcript fragmentation. Here, we propose an Orthology Guided Assembly procedure that first uses sequence similarity (tBLASTn) to proteins of a model species to select allelic and fragmented contigs from all genotypes and then performs CAP3 clustering on a gene-by-gene basis. Thus, we simultaneously annotate putative orthologues for each protein of the model species, resolve allelic redundancy and fragmentation and create a de novo transcript sequence representing the consensus of all alleles present in the sequenced genotypes. We demonstrate the procedure using RNA-seq data from 14 genotypes of Lolium perenne to generate a reference transcriptome for gene discovery and translational research, to reveal the transcriptome-wide distribution and density of SNPs in an outbreeding crop and to illustrate the effect of polymorphisms on the assembly procedure. The results presented here illustrate that constructing a non-redundant reference sequence is essential for comparative genomics, orthology-based annotation and candidate gene selection but also for read mapping and subsequent polymorphism discovery and/or read count-based gene expression analysis.
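The selection step described above (grouping contigs from all genotypes by the model-species protein they resemble, before gene-by-gene CAP3 clustering) can be illustrated with a minimal sketch. It assumes tBLASTn results in the standard 12-column tabular format, with the protein as query and the contig as subject. The best-hit rule, the E-value cutoff of 1e-5, the function name and the contig identifiers are illustrative assumptions, not the criteria used in the paper.

```python
import csv
import io
from collections import defaultdict


def group_contigs_by_protein(blast_tab, max_evalue=1e-5):
    """Group contigs under the protein that gives each contig its best hit.

    blast_tab: iterable of lines in 12-column tabular BLAST format
    (query = protein, subject = contig). The cutoff is an assumed value.
    """
    best = {}  # contig id -> (bitscore, protein id)
    for row in csv.reader(blast_tab, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        protein, contig = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # discard weak similarity
        if contig not in best or bitscore > best[contig][0]:
            best[contig] = (bitscore, protein)

    groups = defaultdict(list)
    for contig, (_, protein) in best.items():
        groups[protein].append(contig)
    # Each group would then be passed to an assembler such as CAP3.
    return {protein: sorted(contigs) for protein, contigs in groups.items()}


# Hypothetical hits: contigs from two genotypes (g1, g2) against proteins P1-P3.
example = io.StringIO(
    "P1\tg1_c1\t90.0\t100\t10\t0\t1\t100\t1\t300\t1e-50\t200\n"
    "P1\tg2_c7\t88.0\t100\t12\t0\t1\t100\t1\t300\t1e-40\t180\n"
    "P2\tg2_c7\t40.0\t80\t48\t0\t1\t80\t1\t240\t1e-10\t60\n"
    "P2\tg1_c9\t85.0\t90\t13\t0\t1\t90\t1\t270\t1e-30\t150\n"
    "P3\tg1_c4\t30.0\t40\t28\t0\t1\t40\t1\t120\t0.5\t20\n"
)
groups = group_contigs_by_protein(example)
```

Here the allelic contigs `g1_c1` and `g2_c7` from different genotypes end up in the same group for `P1`, which is the situation in which per-gene clustering can merge them into one consensus transcript.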