Wilming Laurens G, Boychenko Veronika, Harrow Jennifer L
HAVANA Group, Informatics Department, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK
Database (Oxford). 2015 Sep 27;2015. doi: 10.1093/database/bav091. Print 2015.
Homeobox genes encode transcription factors that contain a DNA-binding helix-turn-helix structure called a homeodomain and that play a crucial role in pattern formation during embryogenesis. Many homeobox genes are located in clusters, and some of these, most notably the HOX genes, are known to have antisense or opposite-strand long non-coding RNA (lncRNA) genes that play a regulatory role. Because automated annotation of both gene clusters and non-coding genes is fraught with difficulty (over-prediction, under-prediction, inaccurate transcript structures), we set out to manually annotate all homeobox genes in the mouse and human genomes. This annotation includes all supported splice variants, pseudogenes, and both antisense and flanking lncRNAs. Duplicated gene clusters are one area where manual annotation has a significant advantage. After comprehensive annotation of all homeobox genes and their antisense genes in human and mouse, we found some discrepancies with the current RefSeq gene set regarding exact gene structures and the biotype assigned to a locus (coding versus pseudogene). We also identified previously unannotated pseudogenes in the DUX, Rhox and Obox gene clusters, which helped us re-evaluate and update the gene nomenclature in these regions. We found that human homeobox genes are enriched in antisense lncRNA loci compared with their mouse orthologues; some of these loci are known to play a role in gene or gene cluster regulation. Based on publicly available transcriptional evidence, 98 of the 241 annotated human protein-coding homeobox genes have an antisense locus (41%), whereas only 62 of the 277 orthologous mouse protein-coding genes have one (22%).
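The percentages quoted in the abstract can be reproduced from the reported counts. The following is a minimal sketch for checking that arithmetic; the variable and function names are illustrative and not taken from the paper, and the fold difference is a simple ratio of the two fractions rather than a statistic reported by the authors.

```python
# Counts are taken from the abstract; all names here are illustrative.
counts = {
    "human": {"with_antisense": 98, "protein_coding": 241},
    "mouse": {"with_antisense": 62, "protein_coding": 277},
}


def antisense_fraction(with_antisense: int, protein_coding: int) -> float:
    """Return the fraction of protein-coding genes that have an antisense locus."""
    if protein_coding <= 0:
        raise ValueError("protein_coding must be positive")
    return with_antisense / protein_coding


# Fraction of homeobox genes with an antisense locus, per species.
fractions = {species: antisense_fraction(**c) for species, c in counts.items()}

for species, fraction in fractions.items():
    print(f"{species}: {fraction:.1%} of protein-coding homeobox genes have an antisense locus")

# Ratio of the human fraction to the mouse fraction (not a figure from the paper).
fold_difference = fractions["human"] / fractions["mouse"]
print(f"human/mouse ratio: {fold_difference:.2f}")
```

Rounded to whole percentages, the two fractions match the 41% and 22% given in the abstract.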