Zheng Deyou, Frankish Adam, Baertsch Robert, Kapranov Philipp, Reymond Alexandre, Choo Siew Woh, Lu Yontao, Denoeud France, Antonarakis Stylianos E, Snyder Michael, Ruan Yijun, Wei Chia-Lin, Gingeras Thomas R, Guigó Roderic, Harrow Jennifer, Gerstein Mark B
Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA.
Genome Res. 2007 Jun;17(6):839-51. doi: 10.1101/gr.5586307.
Arising from either retrotransposition or genomic duplication of functional genes, pseudogenes are "genomic fossils" valuable for exploring the dynamics and evolution of genes and genomes. Pseudogene identification is an important problem in computational genomics, and is also critical for obtaining an accurate picture of a genome's structure and function. However, no consensus computational scheme for defining and detecting pseudogenes has been developed thus far. As part of the ENCyclopedia Of DNA Elements (ENCODE) project, we have compared several distinct pseudogene annotation strategies and found that different approaches and parameters often resulted in rather distinct sets of pseudogenes. We subsequently developed a consensus approach for annotating pseudogenes (derived from protein-coding genes) in the ENCODE regions, resulting in 201 pseudogenes, two-thirds of which originated from retrotransposition. A survey of orthologs for these pseudogenes in 28 vertebrate genomes showed that a significant fraction (approximately 80%) of the processed pseudogenes are primate-specific sequences, highlighting the increasing retrotransposition activity in primates. Analysis of sequence conservation and variation also demonstrated that most pseudogenes evolve neutrally, and processed pseudogenes appear to have lost their coding potential immediately or soon after their emergence. In order to explore the functional implication of pseudogene prevalence, we have extensively examined the transcriptional activity of the ENCODE pseudogenes. We performed a systematic series of pseudogene-specific RACE analyses. These, together with complementary evidence derived from tiling microarrays and high-throughput sequencing, demonstrated that at least a fifth of the 201 pseudogenes are transcribed in one or more cell lines or tissues.