Denoeud France, Kapranov Philipp, Ucla Catherine, Frankish Adam, Castelo Robert, Drenkow Jorg, Lagarde Julien, Alioto Tyler, Manzano Caroline, Chrast Jacqueline, Dike Sujit, Wyss Carine, Henrichsen Charlotte N, Holroyd Nancy, Dickson Mark C, Taylor Ruth, Hance Zahra, Foissac Sylvain, Myers Richard M, Rogers Jane, Hubbard Tim, Harrow Jennifer, Guigó Roderic, Gingeras Thomas R, Antonarakis Stylianos E, Reymond Alexandre
Grup de Recerca en Informática Biomèdica, Institut Municipal d'Investigació Mèdica/Universitat Pompeu Fabra, 08003 Barcelona, Catalonia, Spain.
Genome Res. 2007 Jun;17(6):746-59. doi: 10.1101/gr.5660607.
This report presents a systematic empirical annotation of transcript products from 399 annotated protein-coding loci across the 1% of the human genome targeted by the Encyclopedia of DNA Elements (ENCODE) pilot project, using a combination of 5' rapid amplification of cDNA ends (RACE) and high-density resolution tiling arrays. We identified previously unannotated and often tissue- or cell-line-specific transcribed fragments (RACEfrags), both 5' distal to the annotated 5' terminus and internal to the annotated gene bounds, for the vast majority (81.5%) of the tested genes. Half of the distal RACEfrags span large segments of genomic sequence away from the main portion of the coding transcript and often overlap with the annotated upstream gene(s). Notably, at least 20% of the resultant novel transcripts have changes in their open reading frames (ORFs), most of them fusing ORFs of adjacent transcripts. A significant fraction of distal RACEfrags show expression levels comparable to those of known exons of the same locus, suggesting that they do not belong to rare splice forms. These results have significant implications for (1) our current understanding of the architecture of protein-coding genes; (2) our views on the locations of regulatory regions in the genome; and (3) the interpretation of sequence polymorphisms mapping to regions hitherto considered to be "noncoding," ultimately relating to the identification of disease-related sequence alterations.