Maquedano Miguel, Cerdán-Vélez Daniel, Tress Michael L
Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Calle Melchor Fernandez Almagro, 3, 28029 Madrid, Spain.
Database (Oxford). 2025 Jan 18;2025. doi: 10.1093/database/baaf045.
In 2018, we analysed the three main repositories for the human proteome: Ensembl/GENCODE, RefSeq, and UniProtKB. At that time the three gene sets disagreed on the coding status of one in every eight annotated coding genes, and our results suggested that as many as 4234 of these genes might not be correctly classified. Here, we have repeated the analysis with updated versions of the three reference gene sets. Superficially, little appears to have changed. The three sets annotate 21 871 coding genes, slightly fewer than previously, and still disagree on the status of 2603 annotated genes, almost one in eight. However, we show that collaborations between the three reference gene sets have led to greater consensus. The reference catalogues have agreed on the coding status of another 249 genes since the last analysis, while at least 700 genes have been reclassified. We still find >2000 annotated coding genes with at least one potential non-coding feature, which suggests that they may not be coding genes. These include a large majority of the 2603 genes for which annotators do not agree on coding status. In total, we believe that as many as 3000 genes may be misclassified as coding and could instead be annotated as non-coding genes, pseudogenes, or cancer antigens.
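
As a quick check on the proportions quoted in the abstract, the following minimal sketch recomputes them from the stated counts. Only the three numbers (21 871, 2603, and 3000) come from the text; the function name `share_of_genes` is a hypothetical helper introduced here for illustration and is not part of the authors' analysis.

```python
# Counts taken directly from the abstract.
TOTAL_CODING_GENES = 21871    # coding genes annotated across the three sets
DISPUTED_GENES = 2603         # genes whose coding status the sets disagree on
POSSIBLY_MISCLASSIFIED = 3000 # upper estimate of genes misclassified as coding


def share_of_genes(subset: int, total: int) -> tuple[float, float]:
    """Return (fraction of total, N such that the subset is 'one in N')."""
    return subset / total, total / subset


# Hypothetical helper applied to the two headline figures.
disputed_share, disputed_one_in = share_of_genes(DISPUTED_GENES, TOTAL_CODING_GENES)
misclass_share, misclass_one_in = share_of_genes(POSSIBLY_MISCLASSIFIED, TOTAL_CODING_GENES)

print(f"Disputed: {disputed_share:.1%} (about one in {disputed_one_in:.1f})")
print(f"Possibly misclassified: {misclass_share:.1%} (about one in {misclass_one_in:.1f})")
# Disputed: 11.9% (about one in 8.4)
# Possibly misclassified: 13.7% (about one in 7.3)
```

The disputed share of 11.9% sits just under the 12.5% that an exact one-in-eight ratio would give, which is consistent with the abstract's wording of "almost one in eight".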