Lin Haining, Ouyang Shu, Simons Rain, Nobuta Kan, Haas Brian J, Zhu Wei, Gu Xun, Silva Joana C, Meyers Blake C, Buell C Robin
The Institute for Genomic Research, 9712 Medical Center Dr., Rockville, MD 20850, USA.
BMC Plant Biol. 2008 Feb 19;8:18. doi: 10.1186/1471-2229-8-18.
High gene numbers in plant genomes reflect polyploidy and major gene duplication events. Oryza sativa, cultivated rice, is a diploid monocotyledonous species with a ~390 Mb genome that has undergone segmental duplication of a substantial portion of its genome. This, coupled with other genetic events such as tandem duplications, has resulted in a substantial number of its genes, and resulting proteins, occurring in paralogous families.
Using a computational pipeline that draws on Pfam and novel protein domains, we characterized paralogous families in rice and compared them with paralogous families in the model dicotyledonous diploid species Arabidopsis thaliana. Arabidopsis, which has also undergone genome duplication, has a substantially smaller genome (~120 Mb) and gene complement than rice. Overall, 53% and 68% of the non-transposable element-related rice and Arabidopsis proteins, respectively, could be classified into paralogous protein families. Singleton and paralogous family genes differed substantially in their likelihood of encoding a protein of known or putative function: 26% and 66% of singleton genes, compared to 73% and 96% of paralogous family genes, encode a protein of known or putative function in rice and Arabidopsis, respectively. Furthermore, a major skew in the distribution of specific gene functions was observed: a total of 17 Gene Ontology categories in both rice and Arabidopsis differed significantly in their distribution between paralogous family and singleton proteins. In contrast to mammalian organisms, we found that duplicated genes in rice and Arabidopsis tend to have more alternative splice forms. Using data from Massively Parallel Signature Sequencing, we show that a significant portion of the duplicated genes in rice show divergent expression, although a correlation between sequence divergence and correlation of expression could be seen in very young genes.
Collectively, these data suggest that while co-regulation and conserved function are present in some paralogous protein family members, evolutionary pressures have resulted in functional divergence with differential expression patterns.