Castellana Natalie E, Payne Samuel H, Shen Zhouxin, Stanke Mario, Bafna Vineet, Briggs Steven P
Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA.
Proc Natl Acad Sci U S A. 2008 Dec 30;105(52):21034-8. doi: 10.1073/pnas.0811066106. Epub 2008 Dec 19.
Gene annotation underpins genome science. Most often protein coding sequence is inferred from the genome based on transcript evidence and computational predictions. While generally correct, gene models suffer from errors in reading frame, exon border definition, and exon identification. To ascertain the error rate of Arabidopsis thaliana gene models, we isolated proteins from a sample of Arabidopsis tissues and determined the amino acid sequences of 144,079 distinct peptides by tandem mass spectrometry. The peptides corresponded to 1 or more of 3 different translations of the genome: a 6-frame translation, an exon splice-graph, and the currently annotated proteome. The majority of the peptides (126,055) resided in existing gene models (12,769 confirmed proteins), comprising 40% of annotated genes. Surprisingly, 18,024 novel peptides were found that do not correspond to annotated genes. Using the gene finding program AUGUSTUS and 5,426 novel peptides that occurred in clusters, we discovered 778 new protein-coding genes and refined the annotation of an additional 695 gene models. The remaining 13,449 novel peptides provide high quality annotation (>99% correct) for thousands of additional genes. Our observation that 18,024 of 144,079 peptides did not match current gene models suggests that 13% of the Arabidopsis proteome was incomplete due to approximately equal numbers of missing and incorrect gene models.
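As an illustration of the first of the three translations named above, the following is a minimal Python sketch of a 6-frame translation and a peptide lookup against it. It is not the authors' pipeline: the function names are hypothetical, the standard genetic code is assumed, and exact substring matching stands in for the tandem mass spectrometry database search described in the abstract.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame names '+1'..'+3' and '-1'..'-3' to their translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def frames_containing(peptide, dna):
    """Hypothetical helper: list the frames whose translation contains the peptide."""
    return [
        name
        for name, protein in six_frame_translation(dna).items()
        if peptide in protein
    ]
```

For example, `frames_containing("MAK", "ATGGCCAAGTAA")` reports only the first forward frame. Because a 6-frame translation ignores splicing, a peptide that spans an exon junction would not be found this way, which is presumably why the exon splice-graph is searched as well.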