Discovery and revision of Arabidopsis genes by proteogenomics.

Authors

Castellana Natalie E, Payne Samuel H, Shen Zhouxin, Stanke Mario, Bafna Vineet, Briggs Steven P

Affiliation

Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA.

Publication details

Proc Natl Acad Sci U S A. 2008 Dec 30;105(52):21034-8. doi: 10.1073/pnas.0811066106. Epub 2008 Dec 19.

DOI:10.1073/pnas.0811066106

PMID:19098097

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2605632/

Abstract

Gene annotation underpins genome science. Most often protein coding sequence is inferred from the genome based on transcript evidence and computational predictions. While generally correct, gene models suffer from errors in reading frame, exon border definition, and exon identification. To ascertain the error rate of Arabidopsis thaliana gene models, we isolated proteins from a sample of Arabidopsis tissues and determined the amino acid sequences of 144,079 distinct peptides by tandem mass spectrometry. The peptides corresponded to 1 or more of 3 different translations of the genome: a 6-frame translation, an exon splice-graph, and the currently annotated proteome. The majority of the peptides (126,055) resided in existing gene models (12,769 confirmed proteins), comprising 40% of annotated genes. Surprisingly, 18,024 novel peptides were found that do not correspond to annotated genes. Using the gene finding program AUGUSTUS and 5,426 novel peptides that occurred in clusters, we discovered 778 new protein-coding genes and refined the annotation of an additional 695 gene models. The remaining 13,449 novel peptides provide high quality annotation (>99% correct) for thousands of additional genes. Our observation that 18,024 of 144,079 peptides did not match current gene models suggests that 13% of the Arabidopsis proteome was incomplete due to approximately equal numbers of missing and incorrect gene models.
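To make the "6-frame translation" mentioned in the abstract concrete, the following is a minimal sketch in Python of translating a DNA sequence in all six reading frames and reporting which frames contain a given peptide. It is an illustration only, not the authors' pipeline: the function names (`six_frame_translation`, `frames_containing`) and the toy sequence are hypothetical, and the study identified peptides from tandem mass spectra rather than by plain substring search.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Map frame names '+1'..'+3' and '-1'..'-3' to translated sequences.

    Stop codons appear as '*' in the returned strings.
    """
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def frames_containing(peptide: str, dna: str) -> list:
    """List the frames whose translation contains the peptide as a substring."""
    return [
        name
        for name, protein in six_frame_translation(dna).items()
        if peptide in protein
    ]


# Toy example (hypothetical sequence, not from the Arabidopsis genome).
toy_dna = "ATGGCCAAGTAA"
forward_hits = frames_containing("MAK", toy_dna)
reverse_hits = frames_containing("LLGH", toy_dna)
```

A peptide that is found in a six-frame translation but not in the annotated proteome would be a "novel" peptide in the sense used in the abstract. A substring search like this cannot find peptides that span an intron, which is presumably why the abstract also lists an exon splice-graph as a separate search target.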
