Southan Christopher
IUPHAR/BPS Guide to Pharmacology, Centre for Integrative Physiology, University of Edinburgh, Edinburgh, EH8 9XD, UK.
F1000Res. 2017 Apr 7;6:448. doi: 10.12688/f1000research.11119.1. eCollection 2017.
In 2004, when the protein estimate from the finished human genome was only 24,000, the surprise was compounded as reviewed estimates fell to 19,000 by 2014. However, variability in the total canonical protein counts (i.e. excluding alternative splice forms) of open reading frames (ORFs) in different annotation portals persists. This work assesses these differences and their possible causes. A 16-year analysis of Ensembl and UniProtKB/Swiss-Prot shows convergence to a protein number of ~20,000. The former had shown some yo-yoing, but both have now plateaued. Nine major annotation portals, reviewed at the beginning of 2017, gave a spread of counts from 21,819 down to 18,891. The 4-way cross-reference concordance (within UniProt) between Ensembl, Swiss-Prot, Entrez Gene and the Human Gene Nomenclature Committee (HGNC) drops to 18,690, indicating methodological differences in protein definitions and experimental existence support between sources. The Swiss-Prot and neXtProt evidence criteria include mass spectrometry peptide verification and also cross-references for antibody detection from the Human Protein Atlas. Notwithstanding, hundreds of Swiss-Prot entries are classified as non-coding biotypes by HGNC. The only inference that protein numbers might still rise comes from numerous reports of small ORF (smORF) discovery. However, while there have been recent cases of protein verification arising from previous mis-annotation as non-coding RNA, very few have passed the Swiss-Prot curation and genome annotation thresholds. The post-genomic era has seen both advances in data generation and improvements in the human reference assembly. Notwithstanding, current numbers, while persistently discordant, show that the earlier yo-yoing has largely ceased. Given the importance to biology and biomedicine of defining the canonical human proteome, the task will need more collaborative inter-source curation combined with broader and deeper experimental confirmation of predicted proteins. The eventual closure could well be below ~19,000.
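The 4-way cross-reference concordance mentioned above can be read as a set intersection: count only those Swiss-Prot entries that carry a cross-reference to every one of the other sources. The following is a minimal sketch of that idea, not the method used in the article. The function name, the accessions (`ACC1` to `ACC4`) and the toy records are all hypothetical placeholders, and no real UniProt data or API is involved.

```python
def cross_reference_concordance(entries, required_sources):
    """Return the accessions whose cross-references cover every required source.

    entries: dict mapping an accession to the set of source names it cross-references.
    required_sources: iterable of source names that must all be present.
    """
    required = set(required_sources)
    return {acc for acc, xrefs in entries.items() if required <= set(xrefs)}


# Toy records (hypothetical): each key stands for one Swiss-Prot entry,
# each value lists the external sources it is cross-referenced to.
toy_entries = {
    "ACC1": {"Ensembl", "Entrez Gene", "HGNC"},
    "ACC2": {"Ensembl", "HGNC"},
    "ACC3": {"Entrez Gene"},
    "ACC4": {"Ensembl", "Entrez Gene", "HGNC"},
}

# Swiss-Prot is the fourth source, implied by the entries themselves.
required = ["Ensembl", "Entrez Gene", "HGNC"]
concordant = cross_reference_concordance(toy_entries, required)
print(len(concordant), sorted(concordant))
```

In this toy case only two of the four entries are concordant, which illustrates why a concordance count is always at or below the smallest per-source count.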