Guan Xueying, Nah Gyoungju, Song Qingxin, Udall Joshua A, Stelly David M, Chen Z Jeffrey
Institute for Cellular and Molecular Biology and Center for Computational Biology and Bioinformatics, The University of Texas at Austin, Austin, Texas 78712, USA.
BMC Res Notes. 2014 Aug 6;7:493. doi: 10.1186/1756-0500-7-493.
The most widely cultivated cotton (Gossypium hirsutum L., AD-genome) is derived from tetraploidization between A- and D-genome species. G. arboreum L. (A-genome) and G. raimondii Ulbr. (D-genome) are two closely related extant progenitors. Gene expression studies in allotetraploid cotton are complicated by homoeologous loci of A- and D-genome origin. To develop genomic resources for gene expression studies and cotton breeding, we sequenced and assembled expressed sequence tags (ESTs) derived from G. arboreum and G. raimondii.
Roche/454 FLX sequencing technology was used to sequence normalized cDNA libraries prepared from leaves, roots, bolls, ovules, and fibers of G. arboreum and of G. raimondii. Sequencing reads from two independent libraries in each species were combined to assemble high-quality EST contigs. The combined reads comprised 1,699,776 from the A-genome and 1,464,815 from the D-genome, which were clustered into 89,588 contigs in the A-genome and 65,542 contigs in the D-genome. These contigs represented ~80% of the EST collections in Cotton Gene Index 11 (CGI11, March 2011). Compared to the D-genome transcript database, 27,537 and 10,452 contigs were unique transcripts in the A and D genomes, respectively. Further analysis using self-blastn reduced the number of unigene contigs by 52% in the A-genome and 57% in the D-genome, suggesting that 50% or more of the contigs are paralogs or isoforms within each species. The majority of EST contigs (73-81%) were conserved between the A- and D-genomes, whereas 27% and 19% of contigs were specific to the A- and D-genomes, respectively. Using these ESTs, we identified a total of 75,754 genome-specific single nucleotide polymorphisms (gSNPs or GNPs) or homoeologous-specific SNPs (hSNPs) in 10,885 contigs or genes between the A and D genomes, indicating that it may be possible to separate allelic expression for those genes in allotetraploid cotton.
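The figures above imply a few quantities that the text does not state directly. The following is a minimal Python sketch of that arithmetic, using only numbers quoted in the abstract. The variable and function names are illustrative, and because the reported reduction percentages are rounded, the derived contig counts are approximations rather than values reported by the authors.

```python
# Figures quoted in the abstract.
contigs = {"A": 89_588, "D": 65_542}
self_blastn_reduction = {"A": 0.52, "D": 0.57}  # fraction of contigs removed
total_gsnps = 75_754
contigs_with_gsnps = 10_885


def remaining_after_reduction(n_contigs: int, fraction_removed: float) -> int:
    """Approximate number of contigs left after removing a given fraction."""
    return round(n_contigs * (1 - fraction_removed))


# Approximate non-redundant contig counts after self-blastn (derived, not reported).
remaining = {
    genome: remaining_after_reduction(contigs[genome], self_blastn_reduction[genome])
    for genome in contigs
}

# Average number of genome-specific SNPs per SNP-containing contig.
mean_snps_per_contig = total_gsnps / contigs_with_gsnps

for genome, count in remaining.items():
    print(f"{genome}-genome: about {count:,} contigs remain after self-blastn")
print(f"Mean gSNPs per contig: {mean_snps_per_contig:.2f}")
```

Under these assumptions, roughly 43,000 A-genome and 28,000 D-genome contigs remain after the self-blastn step, and each SNP-containing contig carries about seven genome-specific SNPs on average.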
Expressed genes are highly redundant within each diploid progenitor and between A and D progenitor species, suggesting that diploid progenitors in cotton are likely ancient tetraploids. This large set of A- and D-genome ESTs and GNPs will be valuable resources for genome annotation, gene expression, and crop improvement in allotetraploid cotton.