Department of Biology and the Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA.
BMC Evol Biol. 2010 Feb 24;10:61. doi: 10.1186/1471-2148-10-61.
Although the overwhelming majority of genes found in angiosperms are members of gene families, and both gene and genome duplication are pervasive forces in plant genomes, some genes are sufficiently distinct from all other genes in a genome that they can be operationally defined as 'single copy'. Using the gene clustering algorithm MCL-tribe, we have identified a set of 959 genes that are single copy in each of the genomes of Arabidopsis thaliana, Populus trichocarpa, Vitis vinifera and Oryza sativa. To characterize these genes, we examined GO annotations, coding sequence length, number of exons, number of domains, and presence in distant lineages such as Selaginella and Physcomitrella, and we used phylogenetic analysis to estimate copy number in other seed plants and to demonstrate the phylogenetic utility of these genes. We then provide examples of how these genes may be used in phylogenetic analyses to reconstruct organismal history, both by using existing coverage in EST databases for seed plants and by de novo amplification via RT-PCR in the family Brassicaceae.
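The selection criterion described above (one gene per genome in a cluster) can be illustrated with a minimal sketch. It assumes the clustering step has already been run and that each cluster is available as a list of (species, gene) pairs; the species codes, gene names, and the function `shared_single_copy` are hypothetical and are not taken from the study or from the MCL software.

```python
from collections import Counter

# Hypothetical species codes for the four reference genomes.
SPECIES = ("Ath", "Ptr", "Vvi", "Osa")

def shared_single_copy(clusters, species=SPECIES):
    """Keep clusters that hold exactly one gene from every listed species.

    `clusters` is a list of clusters; each cluster is a list of
    (species_code, gene_id) pairs. Genes from unlisted species are ignored.
    """
    selected = []
    for cluster in clusters:
        counts = Counter(sp for sp, _gene in cluster if sp in species)
        if all(counts[sp] == 1 for sp in species):
            selected.append(cluster)
    return selected

# Toy input with made-up gene identifiers.
clusters = [
    # one gene per genome: kept
    [("Ath", "g1"), ("Ptr", "g2"), ("Vvi", "g3"), ("Osa", "g4")],
    # two Arabidopsis genes: rejected
    [("Ath", "g5"), ("Ath", "g6"), ("Ptr", "g7"), ("Vvi", "g8"), ("Osa", "g9")],
    # no rice gene: rejected
    [("Ath", "g10"), ("Ptr", "g11"), ("Vvi", "g12")],
]

single_copy = shared_single_copy(clusters)
print(len(single_copy), "shared single copy cluster(s)")
```

The sketch only covers the filtering of finished clusters; how sequence similarity is scored and clustered is left to the clustering tool.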
There are 959 single copy nuclear genes shared in Arabidopsis, Populus, Vitis and Oryza ["APVO SSC genes"]. The majority of these genes are also present in the Selaginella and Physcomitrella genomes. Public EST sets for 197 species suggest that most of these genes are present across a diverse collection of seed plants and appear to exist as single or very low copy genes, though exceptions are seen in recently polyploid taxa and in lineages where there is significant evidence for a shared large-scale duplication event. Genes encoding proteins localized in organelles are more commonly single copy than expected by chance, but the evolutionary forces responsible for this bias are unknown. Regardless of the evolutionary mechanisms responsible for the large number of shared single copy genes in diverse flowering plant lineages, these genes are valuable for phylogenetic and comparative analyses. Eighteen of the APVO SSC genes were amplified in the Brassicaceae using RT-PCR and directly sequenced. Alignments of these sequences provide improved resolution of Brassicaceae phylogeny compared to recent studies using plastid and ITS sequences. An analysis of sequences from 13 APVO SSC genes from 69 species of seed plants, derived mainly from public EST databases, yielded a phylogeny that was largely congruent with prior hypotheses based on multiple plastid sequences. Whereas single gene phylogenies that rely on EST sequences have limited bootstrap support because of limited sequence information, concatenated alignments result in phylogenetic trees with strong bootstrap support for already established relationships. Overall, these single copy nuclear genes are promising markers for phylogenetics, and contain a greater proportion of phylogenetically informative sites than commonly used protein-coding sequences from the plastid or mitochondrial genomes.
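To make the ideas of a concatenated alignment and of phylogenetically informative sites concrete, here is a small sketch. It assumes each gene alignment is a mapping from taxon name to an aligned sequence, pads taxa that lack a gene with `?`, and counts sites using the standard parsimony-informative definition (at least two character states, each present in at least two taxa). The abstract does not say how informative sites were counted in the study, so that definition, the function names, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter

def concatenate(alignments, missing="?"):
    """Join per-gene alignments (dict: taxon -> aligned sequence) end to end.

    Taxa absent from a gene are padded with `missing` so that every
    concatenated sequence has the same length.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        if not aln:
            continue  # nothing to add for an empty alignment
        length = len(next(iter(aln.values())))
        if any(len(seq) != length for seq in aln.values()):
            raise ValueError("sequences within one alignment must have equal length")
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, missing * length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}

def parsimony_informative_sites(aln, ignore="-?N"):
    """Count columns with at least two states that each occur in >= 2 taxa."""
    informative = 0
    for column in zip(*aln.values()):
        counts = Counter(c.upper() for c in column if c.upper() not in ignore)
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return informative

# Toy data: taxon D has no sequence for gene_b, as is common with EST sets.
gene_a = {"A": "ACGT", "B": "ACGA", "C": "TCGA", "D": "TCGT"}
gene_b = {"A": "GG", "B": "GG", "C": "CC"}

combined = concatenate([gene_a, gene_b])
total_sites = len(combined["A"])
n_informative = parsimony_informative_sites(combined)
print(f"{n_informative} of {total_sites} sites are parsimony-informative")
```

Padding with missing data is one common way to combine genes with uneven taxon coverage; it keeps every taxon in the matrix at the cost of adding uninformative characters for the taxa that lack a gene.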
Putatively orthologous, shared single copy nuclear genes provide a vast source of new evidence for plant phylogenetics, genome mapping, and other applications, as well as a substantial class of genes for which functional characterization is needed. Preliminary evidence indicates that many of the shared single copy nuclear genes identified in this study may be well suited as markers for addressing phylogenetic hypotheses at a variety of taxonomic levels.