Chen Chun-Long, Zhou Hui, Liao Jian-You, Qu Liang-Hu, Amar Laurence
Institut de Biologie Animale Intégrative et Cellulaire, Université Paris Sud, Orsay, France
RNA. 2009 Apr;15(4):503-14. doi: 10.1261/rna.1306009. Epub 2009 Feb 13.
The compact genome of the unicellular eukaryote Paramecium tetraurelia contains noncoding DNA (ncDNA) distributed into >39,000 intergenic sequences and >90,000 introns, averaging 390 base pairs (bp) and 25 bp, respectively. Here we analyzed the molecular features of the ncRNA genes, introns, and intergenic sequences of this genome. We relied mainly on computational programs and on comparative genomics, which was possible because the P. tetraurelia genome was shaped by successive whole-genome duplications (WGDs). We characterized 417 putative 5S rRNA, snRNA, snoRNA, SRP RNA, and tRNA genes, 415 of which map within intergenic sequences and two within introns. The evolution of these ncRNA genes appears to have mainly involved purifying selection and gene deletion. We then compared the introns that interrupt the protein-coding gene duplicates that arose from the recent WGD and identified a population of a few thousand introns that have evolved under the most stringent constraints (>95% identity). We also showed that low nucleotide substitution levels characterize the 50 bp flanking the stop codons and the 80-115 bp flanking the start codons of the protein-coding genes. Even lower substitution levels mark the base pairs flanking highly transcribed genes, or flanking the start codons of genes belonging to sets with a high number of WGD-related sequences. Finally, adjacent to protein-coding genes, we characterized 32 DNA motifs that can encode stable and evolutionarily conserved RNA secondary structures and that define putative expression-controlling elements. Fourteen DNA motifs with similar properties map far from protein-coding genes and may encode regulatory ncRNAs.
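The >95% identity criterion used for intron pairs from WGD-derived gene duplicates can be illustrated with a minimal Python sketch. The abstract does not state how identity was computed, so the definition below (matching positions divided by aligned positions, with gap columns ignored) is an assumption, as are the function names and the example sequences, which are hypothetical and not taken from the P. tetraurelia genome.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Assumption: columns containing a gap in either sequence are skipped,
    which is only one of several common definitions of identity.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # ignore gap columns
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped aligned positions to compare")
    return 100.0 * matches / compared


def is_highly_conserved(aligned_a: str, aligned_b: str,
                        threshold: float = 95.0) -> bool:
    """True if the pair exceeds the identity threshold (strictly greater)."""
    return percent_identity(aligned_a, aligned_b) > threshold


if __name__ == "__main__":
    # Hypothetical 25-nt intron pair differing at a single position.
    intron_1 = "GTATTTAAATTTAATTAAATTTTAG"
    intron_2 = "GTATTTAAACTTAATTAAATTTTAG"
    print(percent_identity(intron_1, intron_2))
    print(is_highly_conserved(intron_1, intron_2))
```

For an intron of the average length reported here (25 bp), a single substitution already lowers identity to 96%, and a second one drops the pair below the 95% cutoff, so under this definition the threshold tolerates at most one difference in a typical intron.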