More than 2,500 coding genes in the human reference gene set still have unsettled status.

Author information

Maquedano Miguel, Cerdán-Vélez Daniel, Tress Michael L

Affiliations

Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO).

Publication information

bioRxiv. 2024 Dec 9:2024.12.05.626965. doi: 10.1101/2024.12.05.626965.

Abstract

In 2018 we analysed the three main repositories for the human proteome, Ensembl/GENCODE, RefSeq and UniProtKB. They disagreed on the coding status of one of every eight annotated coding genes. The analysis inspired bilateral collaborations between annotation groups. Here we have repeated our analysis with updated versions of the three reference coding gene sets. Superficially, little appears to have changed. Although there are slightly fewer genes predicted as coding overall, the three groups still disagree on the status of 2,606 annotated genes. However, a comparison without read-through genes and immunoglobulin fragments shows that the three reference sets have merged or reclassified more than 700 genes since the last analysis and that just 0.6% of Ensembl/GENCODE coding genes are not also annotated by the other two reference sets. We used eight features indicative of non-coding genes to examine the 21,873 coding genes annotated across the three reference sets. We found that more than 2,000 had one or more potential non-coding features. While some of these genes will be protein coding, we believe that most are likely to be non-coding genes or pseudogenes. Our results suggest that annotators still vastly overestimate the number of true coding genes.
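
The comparison described in the abstract amounts to asking which genes are annotated as coding by all three reference sets and which by only some of them. The following is a minimal sketch of that idea, not the authors' pipeline: the gene identifiers are hypothetical toy values, the function name `disagreement` is invented for illustration, and it assumes genes have already been mapped to a shared identifier across Ensembl/GENCODE, RefSeq and UniProtKB. The final line assumes that the 2,606 disputed genes can be read as a share of the 21,873 genes annotated across the three sets.

```python
def disagreement(sets):
    """Split gene IDs into those every reference set calls coding and the rest.

    `sets` maps a reference set name to the set of gene IDs it annotates
    as coding. Returns (union, unanimous, disputed).
    """
    union = set().union(*sets.values())
    unanimous = set.intersection(*sets.values())
    disputed = union - unanimous
    return union, unanimous, disputed


# Hypothetical toy identifiers, not real annotations.
reference_sets = {
    "Ensembl/GENCODE": {"geneA", "geneB", "geneC", "geneD"},
    "RefSeq": {"geneA", "geneB", "geneC", "geneE"},
    "UniProtKB": {"geneA", "geneB", "geneD", "geneF"},
}

union, unanimous, disputed = disagreement(reference_sets)
toy_share = len(disputed) / len(union)

# Figures quoted in the abstract; treating 21,873 as the denominator
# for the 2,606 disputed genes is an assumption.
reported_share = 2606 / 21873
```

Under that assumption, `reported_share` comes to roughly 0.119, which is in line with the "one of every eight" disagreement rate the abstract reports for the 2018 analysis.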

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/acf6/11661123/067c8118deb9/nihpp-2024.12.05.626965v1-f0001.jpg
