Contreras-Moreira Bruno, Cantalapiedra Carlos P, García-Pereira María J, Gordon Sean P, Vogel John P, Igartua Ernesto, Casas Ana M, Vinuesa Pablo
Estación Experimental de Aula Dei - Consejo Superior de Investigaciones Científicas, Zaragoza, Spain; Fundación ARAID, Zaragoza, Spain.
Estación Experimental de Aula Dei - Consejo Superior de Investigaciones Científicas, Zaragoza, Spain.
Front Plant Sci. 2017 Feb 14;8:184. doi: 10.3389/fpls.2017.00184. eCollection 2017.
The pan-genome of a species is defined as the union of all the genes and non-coding sequences found in all its individuals. However, constructing a pan-genome for plants with large genomes is daunting both in sequencing cost and the scale of the required computational analysis. A more affordable alternative is to focus on the genic repertoire by using transcriptomic data. Here, the software GET_HOMOLOGUES-EST was benchmarked with genomic and RNA-seq data of 19 Arabidopsis thaliana ecotypes and then applied to the analysis of transcripts from 16 Hordeum vulgare (barley) genotypes. The goal was to sample their pan-genomes and classify sequences as core, if detected in all accessions, or accessory, when absent in some of them. The resulting sequence clusters were used to simulate pan-genome growth, and to compile Average Nucleotide Identity matrices that summarize intra-species variation. Although transcripts were found to under-estimate pan-genome size by at least 10%, we concluded that clusters of expressed sequences can recapitulate phylogeny and reproduce two properties observed in gene models: accessory loci show lower expression and higher non-synonymous substitution rates than core genes. Finally, accessory sequences were observed to preferentially encode transposon components in both species, plus disease resistance genes in cultivated barleys, and a variety of protein domains from other families that appear frequently associated with presence/absence variation in the literature. These results demonstrate that pan-genome analyses are useful to explore germplasm diversity.
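The core/accessory classification and the pan-genome growth simulation described above can be illustrated with a minimal Python sketch. This is not the GET_HOMOLOGUES-EST implementation: the function names (`classify_clusters`, `pan_genome_growth`), the cluster identifiers, and the toy presence/absence data are all hypothetical, and each cluster is assumed to be already reduced to the set of accessions in which it was detected.

```python
import random


def classify_clusters(clusters, accessions):
    """Split clusters into core (detected in all accessions) and accessory."""
    all_accessions = set(accessions)
    core = {cid for cid, members in clusters.items() if all_accessions <= members}
    accessory = set(clusters) - core
    return core, accessory


def pan_genome_growth(clusters, accessions, n_permutations=10, seed=0):
    """Count clusters seen as accessions are added in random order.

    Returns one growth curve (list of cumulative cluster counts)
    per random permutation of the accession order.
    """
    rng = random.Random(seed)
    curves = []
    for _ in range(n_permutations):
        order = list(accessions)
        rng.shuffle(order)
        sampled = set()
        curve = []
        for accession in order:
            sampled.add(accession)
            # A cluster counts once it is found in any sampled accession.
            curve.append(sum(1 for members in clusters.values() if members & sampled))
        curves.append(curve)
    return curves


# Hypothetical toy data: cluster id -> accessions where it was detected.
clusters = {
    "c1": {"A", "B", "C"},
    "c2": {"A", "B"},
    "c3": {"C"},
    "c4": {"A", "B", "C"},
}
accessions = ["A", "B", "C"]

core, accessory = classify_clusters(clusters, accessions)
curves = pan_genome_growth(clusters, accessions, n_permutations=5)
```

Averaging the curves across permutations gives a smoothed estimate of how the number of clusters grows as accessions are sampled; with real data, the shape of that curve indicates whether further sampling is still uncovering new accessory sequences.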