Kaur Gazaldeep, Perteghella Tamara, Carbonell-Sala Sílvia, Gonzalez-Martinez Jose, Hunt Toby, Mądry Tomasz, Jungreis Irwin, Arnan Carme, Lagarde Julien, Borsari Beatrice, Sisu Cristina, Jiang Yunzhe, Bennett Ruth, Berry Andrew, Cerdán-Vélez Daniel, Cochran Kelly, Vara Covadonga, Davidson Claire, Donaldson Sarah, Dursun Cagatay, González-López Silvia, Gopal Das Sasti, Hardy Matthew, Hollis Zoe, Kay Mike, Montañés José Carlos, Ni Pengyu, Nurtdinov Ramil, Palumbo Emilio, Pulido-Quetglas Carlos, Suner Marie-Marthe, Yu Xuezhu, Zhang Dingyao, Loveland Jane E, Albà M Mar, Diekhans Mark, Tanzer Andrea, Mudge Jonathan M, Flicek Paul, Martin Fergal J, Gerstein Mark, Kellis Manolis, Kundaje Anshul, Paten Benedict, Tress Michael L, Johnson Rory, Uszczynska-Ratajczak Barbara, Frankish Adam, Guigó Roderic
Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Catalonia, Spain.
Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra (UPF).
bioRxiv. 2024 Oct 31:2024.10.29.620654. doi: 10.1101/2024.10.29.620654.
Accurate and complete gene annotations are indispensable for understanding how genome sequences encode biological functions. For twenty years, the GENCODE consortium has developed reference annotations for the human and mouse genomes, which have become a foundation for biomedical and genomics communities worldwide. Nevertheless, collections of important yet poorly understood gene classes such as long non-coding RNAs (lncRNAs) remain incomplete and scattered across multiple, uncoordinated catalogs, slowing progress in the field. To address these issues, GENCODE has undertaken the most comprehensive lncRNA annotation effort to date. This effort is founded on the manual annotation of full-length targeted long-read sequencing, on matched embryonic and adult tissues, of orthologous regions in human and mouse. Altogether, 17,931 novel human genes (140,268 novel transcripts) and 22,784 novel mouse genes (136,169 novel transcripts) have been added to the GENCODE catalog, representing a 2-fold and 6-fold increase in transcripts, respectively, the greatest increase since the sequencing of the human genome. Novel gene annotations display evolutionary constraints, have well-formed promoter regions, and link to phenotype-associated genetic variants. They greatly enhance the functional interpretability of the human genome, as they help explain millions of previously mapped "orphan" omics measurements corresponding to transcription start sites, chromatin modifications and transcription factor binding sites. Crucially, our targeted design assigned human-mouse orthologs at a rate beyond previous studies, tripling the number of human disease-associated lncRNAs with mouse orthologs. The expanded and enhanced GENCODE lncRNA annotations mark a critical step towards deciphering the human and mouse genomes.