Department of Ecology and Evolution, The University of Chicago, 1101 E 57th Street, Chicago, IL 60637, USA.
Bioinformatics. 2011 Jul 1;27(13):1749-53. doi: 10.1093/bioinformatics/btr280. Epub 2011 May 5.
Retrocopies are important genes in the genomes of almost all higher eukaryotes. However, the annotation of such genes is a non-trivial task. Intronless genes have often been considered to be retroposed copies of intron-containing paralogs. Such categorization relies on the implicit premise that alignable regions of the duplicates should be long enough to cover exon-exon junctions of the intron-containing genes, so that intron loss events can be inferred. Here, we examined the alternative possibility that intronless genes in the fruit fly genome could be generated by partial DNA-based duplication of intron-containing genes.
By building pairwise protein-, transcript- and genome-level DNA alignments between intronless genes and their corresponding intron-containing paralogs, we found that alignments do not cover exon-exon junctions in 40% of cases and thus no intron loss could be inferred. For these cases, the candidate parental proteins tend to be partially duplicated, and intergenic sequences or neighboring genes are included in the intronless paralog. Moreover, we observed that these paralogs are significantly less likely to show inter-chromosomal duplication and testis-dominant transcription than the remaining 60% of cases, which show clear evidence of intron loss (retrogenes). These lines of analysis reveal that DNA-based duplication contributes significantly to the 40% of cases of single-exon gene duplication. Finally, we performed an analogous survey in the human genome and obtained a similar result: in 34% of cases, the alignments do not cover exon-exon junctions. Thus, genome annotation for retrogene identification should discard candidates without clear evidence of intron loss.
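The junction-coverage criterion described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the half-open transcript coordinates, and the `min_flank` parameter (how many aligned bases must lie on each side of a junction) are assumptions introduced here for illustration only.

```python
from typing import List


def junction_positions(exon_lengths: List[int]) -> List[int]:
    """Return exon-exon junctions in parental transcript coordinates.

    A junction is reported as the 0-based position where the next exon begins.
    """
    positions = []
    total = 0
    for length in exon_lengths[:-1]:  # the last exon has no downstream junction
        total += length
        positions.append(total)
    return positions


def covered_junctions(exon_lengths: List[int], aln_start: int, aln_end: int,
                      min_flank: int = 10) -> List[int]:
    """List junctions spanned by the alignment [aln_start, aln_end).

    min_flank is a hypothetical threshold: a junction counts as covered only
    if at least that many aligned bases lie on each side of it.
    """
    return [j for j in junction_positions(exon_lengths)
            if aln_start + min_flank <= j <= aln_end - min_flank]


def classify(exon_lengths: List[int], aln_start: int, aln_end: int,
             min_flank: int = 10) -> str:
    """Label a paralog pair by whether intron loss can be inferred at all."""
    if covered_junctions(exon_lengths, aln_start, aln_end, min_flank):
        return "retrogene candidate"
    return "no intron loss inferable"


if __name__ == "__main__":
    exons = [100, 200, 150]             # hypothetical parental exon lengths
    print(classify(exons, 120, 280))    # alignment lies within the second exon
    print(classify(exons, 50, 350))     # alignment spans both junctions
```

A pair labeled "retrogene candidate" here would still need the aligned intronless copy checked for actual absence of the intron; the sketch only captures the precondition that the alignment reaches a junction in the first place.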