Rodriguez Jose Manuel, Maquedano Miguel, Cerdan-Velez Daniel, Calvo Enrique, Vazquez Jesús, Tress Michael L
Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), 28029 Madrid, Spain.
CIBER de Enfermedades Cardiovasculares (CIBERCV), 28029 Madrid, Spain.
bioRxiv. 2024 Nov 15:2024.11.14.623419. doi: 10.1101/2024.11.14.623419.
The human genome has been the subject of intense scrutiny by experimental and manual curation projects for more than two decades. Novel coding genes have been proposed from large-scale RNA-seq, ribosome profiling and proteomics experiments. Here we carry out an in-depth analysis of an entire proteomics database. We analysed the proteins, peptides and spectra housed in the human build of the PeptideAtlas proteomics database to identify coding regions that are not yet annotated in the GENCODE reference gene set. We find support for hundreds of missing alternative protein isoforms and unannotated upstream translations, as well as evidence of cross-contamination from other species. There is reliable peptide evidence for 34 novel unannotated open reading frames (ORFs) in PeptideAtlas. Almost half of these ORFs belong to coding genes that are missing from GENCODE and other reference sets. Most of the remaining ORFs, however, are not conserved beyond human, and their peptide confirmation is restricted to cancer cell lines. We show that this is strong evidence for aberrant translation, which raises important questions about the extent of aberrant translation and how these ORFs should be annotated in reference genomes.