Discovery and revision of Arabidopsis genes by proteogenomics.

Authors

Castellana Natalie E, Payne Samuel H, Shen Zhouxin, Stanke Mario, Bafna Vineet, Briggs Steven P

Affiliation

Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA.

Publication

Proc Natl Acad Sci U S A. 2008 Dec 30;105(52):21034-8. doi: 10.1073/pnas.0811066106. Epub 2008 Dec 19.

Abstract

Gene annotation underpins genome science. Most often protein coding sequence is inferred from the genome based on transcript evidence and computational predictions. While generally correct, gene models suffer from errors in reading frame, exon border definition, and exon identification. To ascertain the error rate of Arabidopsis thaliana gene models, we isolated proteins from a sample of Arabidopsis tissues and determined the amino acid sequences of 144,079 distinct peptides by tandem mass spectrometry. The peptides corresponded to 1 or more of 3 different translations of the genome: a 6-frame translation, an exon splice-graph, and the currently annotated proteome. The majority of the peptides (126,055) resided in existing gene models (12,769 confirmed proteins), comprising 40% of annotated genes. Surprisingly, 18,024 novel peptides were found that do not correspond to annotated genes. Using the gene finding program AUGUSTUS and 5,426 novel peptides that occurred in clusters, we discovered 778 new protein-coding genes and refined the annotation of an additional 695 gene models. The remaining 13,449 novel peptides provide high quality annotation (>99% correct) for thousands of additional genes. Our observation that 18,024 of 144,079 peptides did not match current gene models suggests that 13% of the Arabidopsis proteome was incomplete due to approximately equal numbers of missing and incorrect gene models.
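The 6-frame translation mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names (`six_frame`, `find_peptide`) and the toy DNA sequence are hypothetical, the standard genetic code is assumed, and peptides are matched by exact substring search rather than by scoring tandem mass spectra.

```python
# Minimal sketch: translate a DNA sequence in all six reading frames and
# report the frames in which a peptide occurs. All names are illustrative.

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(seq: str) -> dict:
    """Map (strand, offset) to the translation of that reading frame."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = translate(strand_seq[offset:])
    return frames


def find_peptide(peptide: str, seq: str) -> list:
    """List the reading frames whose translation contains the peptide."""
    return [frame for frame, protein in six_frame(seq).items() if peptide in protein]


if __name__ == "__main__":
    toy_dna = "ATGGCCAAGTAA"  # hypothetical sequence, not from the study
    print(six_frame(toy_dna))
    print(find_peptide("MAK", toy_dna))
```

A peptide that turns up in one of these frames but in no annotated protein is the kind of "novel peptide" the abstract counts. A real search would also need to handle peptides that span splice junctions, which is the role the abstract assigns to the exon splice-graph.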
