Addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies.

Affiliation

Faculty of Life Sciences, The University of Manchester, Manchester M13 9PT, UK.

Publication information

J Proteome Res. 2012 Nov 2;11(11):5221-34. doi: 10.1021/pr300411q. Epub 2012 Oct 15.

DOI:10.1021/pr300411q
PMID:23025403
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3703792/
Abstract

Proteogenomics has the potential to advance genome annotation through high quality peptide identifications derived from mass spectrometry experiments, which demonstrate a given gene or isoform is expressed and translated at the protein level. This can advance our understanding of genome function, discovering novel genes and gene structure that have not yet been identified or validated. Because of the high-throughput shotgun nature of most proteomics experiments, it is essential to carefully control for false positives and prevent any potential misannotation. A number of statistical procedures to deal with this are in wide use in proteomics, calculating false discovery rate (FDR) and posterior error probability (PEP) values for groups and individual peptide spectrum matches (PSMs). These methods control for multiple testing and exploit decoy databases to estimate statistical significance. Here, we show that database choice has a major effect on these confidence estimates leading to significant differences in the number of PSMs reported. We note that standard target:decoy approaches using six-frame translations of nucleotide sequences, such as assembled transcriptome data, apparently underestimate the confidence assigned to the PSMs. The source of this error stems from the inflated and unusual nature of the six-frame database, where for every target sequence there exist five "incorrect" targets that are unlikely to code for protein. The attendant FDR and PEP estimates lead to fewer accepted PSMs at fixed thresholds, and we show that this effect is a product of the database and statistical modeling and not the search engine. A variety of approaches to limit database size and remove noncoding target sequences are examined and discussed in terms of the altered statistical estimates generated and PSMs reported. These results are of importance to groups carrying out proteogenomics, aiming to maximize the validation and discovery of gene structure in sequenced genomes, while still controlling for false positives.
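
The two ideas the abstract relies on, six-frame translation and target:decoy FDR estimation, can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names `six_frame_translate` and `target_decoy_q_values` are hypothetical, and the simple decoys/targets estimator is an assumption, since the abstract does not say which estimator was used. The sketch shows why one nucleotide sequence yields six candidate protein sequences, of which at most one is the true coding frame, and how q-values are commonly derived from decoy counts.

```python
from typing import List, Tuple

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translate(dna: str) -> List[str]:
    """Translate DNA in three forward and three reverse-complement frames.

    '*' marks a stop codon; codons with unknown bases become 'X'.
    """
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = []
    for strand in (forward, reverse):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            frames.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return frames


def target_decoy_q_values(
    psms: List[Tuple[float, bool]]
) -> List[Tuple[float, bool, float]]:
    """Assign q-values to PSMs given as (score, is_decoy); higher score is better.

    Assumption: FDR at a score threshold is estimated as decoys / targets
    among the PSMs at or above that threshold.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # q-value: the smallest FDR at this threshold or any more permissive one.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()
    return [(s, d, q) for (s, d), q in zip(ranked, q_values)]


if __name__ == "__main__":
    print(six_frame_translate("ATGGCCTAA"))
    toy_psms = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, True)]
    for score, is_decoy, q in target_decoy_q_values(toy_psms):
        print(score, "decoy" if is_decoy else "target", round(q, 3))
```

The toy PSM list is illustrative only and is not data from the paper. In a six-frame database every frame becomes a target entry and receives a matching decoy, so the search space is several times larger than a curated protein set, which is the inflation the abstract identifies as the source of the conservative estimates.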
