Zheng Deyou, Gerstein Mark B
Department of Molecular Biophysics and Biochemistry, Yale University, Whitney Avenue, New Haven, CT 06520, USA.
Genome Biol. 2006;7 Suppl 1(Suppl 1):S13.1-10. doi: 10.1186/gb-2006-7-s1-s13. Epub 2006 Aug 7.
Pseudogenes are heritable genetic elements that show sequence similarity to functional genes but carry deleterious mutations. We describe a computational pipeline for identifying them that, in contrast to previous work, explicitly uses the intron-exon structure of parent genes to classify pseudogenes. We require alignments between duplicated pseudogenes and their parents to span intron-exon junctions; this requirement distinguishes true duplicated pseudogenes from processed pseudogenes that contain insertions.
Applying our approach to the ENCODE regions, we identify about 160 pseudogenes, 10% of which have clear 'intron-exon' structure and are thus likely generated from recent duplications.
Detailed examination of our results and comparison of our annotation with the GENCODE reference annotation demonstrate that our computational pipeline provides a good balance between identifying all pseudogenes and delineating the precise structure of duplicated genes.
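The junction-based classification described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Block` type, the `classify` function, the `min_intron` and `tol` thresholds, and the "unclassified" label are all hypothetical, and the sketch assumes the pseudogene lies on the forward strand so that genomic and parent coordinates increase together. A gap on the genome counts as intron-like only when it is long enough and falls at a parent exon-exon junction; any other gap is treated as an insertion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One ungapped alignment block (0-based, half-open coordinates)."""
    parent_start: int  # position in the parent's spliced coding sequence
    parent_end: int
    genome_start: int  # position on the pseudogene's genomic locus
    genome_end: int


def classify(blocks: List[Block], junctions: List[int],
             min_intron: int = 30, tol: int = 3) -> str:
    """Label a pseudogene from its alignment to the parent gene.

    junctions: exon-exon junction positions in parent coding coordinates.
    min_intron, tol: illustrative thresholds, not values from the paper.
    """
    if not blocks:
        return "unclassified"
    blocks = sorted(blocks, key=lambda b: b.parent_start)
    lo, hi = blocks[0].parent_start, blocks[-1].parent_end

    # Only junctions strictly inside the aligned parent range are informative.
    spanned = [j for j in junctions if lo < j < hi]
    if not spanned:
        return "unclassified"

    intron_like = 0
    for left, right in zip(blocks, blocks[1:]):
        gap = right.genome_start - left.genome_end
        at_junction = any(
            abs(left.parent_end - j) <= tol and abs(right.parent_start - j) <= tol
            for j in spanned
        )
        # A long genomic gap located at a parent junction looks like a retained intron.
        if gap >= min_intron and at_junction:
            intron_like += 1

    # Gaps away from junctions are insertions, so the locus still counts as processed.
    return "duplicated" if intron_like > 0 else "processed"
```

For example, with a single parent junction at position 100, two blocks separated by a 500-base genomic gap at that junction are labelled "duplicated", whereas one contiguous block covering the junction, or two blocks separated by a gap at parent position 50, are labelled "processed".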