Virag Sharma, Anas Elghafari, Michael Hiller
Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstr. 108, 01307 Dresden, Germany; Max Planck Institute for the Physics of Complex Systems, Nöthnitzer Str. 38, 01187 Dresden, Germany.
Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstr. 108, 01307 Dresden, Germany; Max Planck Institute for the Physics of Complex Systems, Nöthnitzer Str. 38, 01187 Dresden, Germany; Technical University, 01069 Dresden, Germany.
Nucleic Acids Res. 2016 Jun 20;44(11):e103. doi: 10.1093/nar/gkw210. Epub 2016 Mar 25.
Identifying coding genes is an essential step in genome annotation. Here, we utilize existing whole-genome alignments to detect conserved coding exons and then map gene annotations from one genome to many aligned genomes. We show that genome alignments contain thousands of spurious frameshifts and splice site mutations in exons that are truly conserved. To overcome these limitations, we have developed CESAR (Coding Exon-Structure Aware Realigner) that realigns coding exons, while considering reading frame and splice sites of each exon. CESAR effectively avoids spurious frameshifts in conserved genes and detects 91% of shifted splice sites. This results in the identification of thousands of additional conserved exons, and 99% of the exons that lack inactivating mutations match real exons. Finally, to demonstrate the potential of using CESAR for comparative gene annotation, we applied it to 188 788 exons of 19 865 human genes to annotate human genes in 99 other vertebrates. These comparative gene annotations are available as a resource (http://bds.mpi-cbg.de/hillerlab/CESAR/). CESAR (https://github.com/hillerlab/CESAR/) can readily be applied to other alignments to accurately annotate coding genes in many other vertebrate and invertebrate genomes.
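To make concrete what a frameshift or a splice site mutation looks like in an aligned exon, the following is a minimal Python sketch. It is not CESAR's algorithm and does not use its code: it only flags the two kinds of apparent inactivating mutations that the abstract says a reading-frame-aware realignment should re-examine. The function names, the input layout (two gapped strings of equal length plus the two intronic bases flanking the query exon) and the restriction to canonical AG/GT splice sites are all assumptions made for illustration.

```python
# Illustrative sketch only; NOT the CESAR implementation.
# Assumption: an exon alignment is given as two gapped strings of equal length.

CANONICAL_ACCEPTOR = "AG"  # last two intronic bases upstream of an exon
CANONICAL_DONOR = "GT"     # first two intronic bases downstream of an exon


def gap_runs(aligned_seq):
    """Return the length of each maximal run of '-' in an aligned sequence."""
    runs, current = [], 0
    for ch in aligned_seq:
        if ch == "-":
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def frameshift_indels(ref_aln, query_aln):
    """Return lengths of indels whose size is not a multiple of three.

    Gaps in the reference row are insertions in the query; gaps in the
    query row are deletions. Simplification: each indel is judged on its
    own, so two nearby indels that restore the frame are both reported.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    return [n for n in gap_runs(ref_aln) + gap_runs(query_aln) if n % 3 != 0]


def check_exon(ref_aln, query_aln, upstream2, downstream2):
    """Summarise apparent inactivating mutations for one internal exon.

    upstream2 / downstream2 are the two query bases immediately before and
    after the aligned exon (hypothetical inputs; first and last coding
    exons would need different handling).
    """
    return {
        "frameshifts": frameshift_indels(ref_aln, query_aln),
        "acceptor_ok": upstream2.upper() == CANONICAL_ACCEPTOR,
        "donor_ok": downstream2.upper() == CANONICAL_DONOR,
    }


if __name__ == "__main__":
    # A 1-bp deletion in the query shifts the reading frame.
    print(check_exon("ATGGCCAAGTTT", "ATGGC-AAGTTT", "AG", "GT"))
    # A 3-bp deletion removes one codon but preserves the frame.
    print(check_exon("ATGGCCAAGTTT", "ATG---AAGTTT", "AG", "GC"))
```

An exon flagged by a check like this is not necessarily lost: as the abstract notes, many such frameshifts and splice site mutations in genome alignments are spurious, which is the motivation for realigning the exon with its reading frame and splice sites taken into account.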