Echols Nathaniel, Harrison Paul, Balasubramanian Suganthi, Luscombe Nicholas M, Bertone Paul, Zhang Zhaolei, Gerstein Mark
Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, Box 208114, New Haven, CT 06520-8114, USA.
Nucleic Acids Res. 2002 Jun 1;30(11):2515-23. doi: 10.1093/nar/30.11.2515.
Based on searches for disabled homologs of known proteins, we have identified a large population of pseudogenes in four sequenced eukaryotic genomes: the worm, yeast, fly and human (chromosomes 21 and 22 only). Each of our nearly 2500 pseudogenes is characterized by one or more disablements mid-domain, such as premature stops and frameshifts. Here, we perform a comprehensive survey of the amino acid and nucleotide composition of these pseudogenes in comparison to that of functional genes and intergenic DNA. We show that pseudogenes invariably have an amino acid composition intermediate between that of genes and that of translated intergenic DNA. Although the degree of intermediacy varies among the four organisms, in all cases it is most evident for the amino acid types that differ most in occurrence between genes and intergenic regions. The same intermediacy also applies to codon frequencies, especially in the worm and human. Moreover, the intermediate composition of pseudogenes holds even though the composition of the genes in the four organisms is markedly different, showing a strong correlation with the overall A/T content of the genomic sequence. Pseudogenes can be divided into 'ancient' and 'modern' subsets, based on the level of sequence identity with their closest matching homolog (within the same genome). Modern pseudogenes usually have a sequence composition much closer to that of genes than ancient pseudogenes do. Collectively, our results indicate that the composition of pseudogenes that are under no selective constraints progressively drifts from that of coding DNA towards that of non-coding DNA. Therefore, we propose that the degree to which pseudogenes approach a random sequence composition may be useful in dating different sets of pseudogenes, as well as in assessing the rate at which intergenic DNA accumulates mutations. Our compositional analyses, together with an interactive viewer, are available on the web at http://genecensus.org/pseudogene.
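The notion of an "intermediate" composition can be made concrete with a small sketch. The Python below is an illustration only, not the procedure used in the paper: the function names `composition` and `intermediacy`, the 0-to-1 scoring convention (0 meaning gene-like, 1 meaning intergenic-like) and the toy sequences are all assumptions introduced here. It computes amino acid frequencies for three sets of sequences and then, for each residue type whose frequency differs between genes and translated intergenic DNA, reports where the pseudogene frequency falls between the two.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seqs):
    """Return the fraction of each standard amino acid over all sequences."""
    counts = Counter()
    for s in seqs:
        # Ignore stops, gaps and ambiguity codes such as '*', '-' or 'X'.
        counts.update(c for c in s.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def intermediacy(gene, pseudo, intergenic, min_gap=0.0):
    """Hypothetical score: position of the pseudogene frequency between
    the gene frequency (0.0) and the intergenic frequency (1.0).

    Residue types whose gene/intergenic difference is not larger than
    min_gap are skipped, since the ratio is undefined or unstable there.
    """
    scores = {}
    for aa in AMINO_ACIDS:
        gap = intergenic[aa] - gene[aa]
        if abs(gap) > min_gap:
            scores[aa] = (pseudo[aa] - gene[aa]) / gap
    return scores


# Toy sequences, invented purely for illustration (not real data).
genes = composition(["AAAL"])
pseudogenes = composition(["AALL"])
intergenic = composition(["ALLL"])

print(intermediacy(genes, pseudogenes, intergenic))
```

A score strictly between 0 and 1 for a residue type corresponds to the intermediacy described above. Under the drift hypothesis, one would expect such scores to be closer to 0 for 'modern' pseudogenes and closer to 1 for 'ancient' ones, although this sketch makes no attempt to reproduce the paper's statistics.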