Zhang Zhaolei, Harrison Paul M, Liu Yin, Gerstein Mark
Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA.
Genome Res. 2003 Dec;13(12):2541-58. doi: 10.1101/gr.1429003.
Processed pseudogenes were created by reverse transcription of mRNAs; they provide snapshots of ancient genes that existed in the genome millions of years ago. To find them in the present-day human genome, we developed a pipeline that uses features such as intron absence, frame disruption, polyadenylation, and truncation. This enabled us to identify approximately 8000 processed pseudogenes in recent genome drafts (distributed from http://pseudogene.org). Overall, processed pseudogenes are very similar to their closest corresponding human gene: they are 94% complete in coding regions, with sequence similarity of 75% for amino acids and 86% for nucleotides. Their chromosomal distribution appears random and dispersed, with the number on each chromosome proportional to its length, suggesting sustained "bombardment" over evolution. The distribution does, however, vary with GC content: processed pseudogenes occur mostly in regions of intermediate GC content. This is similar to Alus but contrasts with functional genes and L1 repeats. Pseudogenes, moreover, have age profiles similar to those of Alus. The number of pseudogenes associated with a given gene follows a power-law relationship, with a few genes giving rise to many pseudogenes and most giving rise to few. The prevalence of processed pseudogenes agrees well with germ-line gene expression. Highly expressed ribosomal proteins account for approximately 20% of the total. Other notable sources include cyclophilin-A, keratin, GAPDH, and cytochrome c.
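The power-law relationship described above can be illustrated with a rough check. The sketch below assumes a list with one entry per pseudogene naming its closest parent gene, tallies how many genes have exactly k pseudogenes, and estimates the slope of log(frequency) against log(k) by ordinary least squares. The function names and the toy gene labels are hypothetical, and the log-log regression is a generic diagnostic, not necessarily the fitting procedure used in the study.

```python
import math
from collections import Counter


def pseudogene_count_distribution(parent_genes):
    """Map k -> number of parent genes that have exactly k pseudogenes.

    parent_genes holds one entry per pseudogene: the name of its parent gene.
    """
    per_gene = Counter(parent_genes)       # parent gene -> pseudogene count
    return Counter(per_gene.values())      # pseudogene count -> number of genes


def loglog_slope(distribution):
    """Least-squares slope of log(number of genes) against log(k).

    A roughly straight line with a negative slope on log-log axes is
    consistent with (but does not prove) a power-law relationship.
    """
    points = [(math.log(k), math.log(n)) for k, n in distribution.items()]
    if len(points) < 2:
        raise ValueError("need at least two distinct counts to fit a slope")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Toy, made-up assignments (not data from the study): one gene with four
# pseudogenes, two genes with two each, and four genes with one each.
toy_parents = (
    ["geneA"] * 4
    + ["geneB"] * 2
    + ["geneC"] * 2
    + ["geneD", "geneE", "geneF", "geneG"]
)
toy_distribution = pseudogene_count_distribution(toy_parents)
toy_slope = loglog_slope(toy_distribution)
```

On real assignments the high-count tail is sparse and noisy, so a plain least-squares fit on raw counts is only a first look; binning the counts or fitting by maximum likelihood would be more robust choices.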