Benchmarking mass spectrometry based proteomics algorithms using a simulated database.

Authors

Awan Muaaz Gul, Awan Abdullah Gul, Saeed Fahad

Affiliations

Lawrence Berkeley National Laboratory, Berkeley, CA, USA.

Al-Khwarizmi Institute of Computer Science (KICS), University of Engineering & Technology (UET), Lahore, Pakistan.

Publication information

Netw Model Anal Health Inform Bioinform. 2021;10. doi: 10.1007/s13721-021-00298-3. Epub 2021 Mar 26.

DOI:10.1007/s13721-021-00298-3

PMID:34012763

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8130870/

Abstract

Protein sequencing algorithms process data from a variety of instruments, generated under diverse experimental conditions. Currently there is no way to predict the accuracy of an algorithm for a given data set. Most published algorithms and their associated software have been evaluated on a limited number of experimental data sets. However, these performance evaluations do not cover the complete search space that the algorithm and the software might encounter in real-world use. To this end, we present a database of simulated spectra that can be used to benchmark any spectra-to-peptide search engine. We demonstrate the usability of this database by benchmarking two popular peptide sequencing engines. We show wide variation in the accuracy of peptide deductions, and that a complete quality profile of a given algorithm can be useful for practitioners and algorithm developers. All benchmarking data are available at https://users.cs.fiu.edu/~fsaeed/Benchmark.html.
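
Because simulated spectra come with a known generating peptide, a search engine's accuracy can be scored directly against ground truth and broken down by simulation condition. The following is a minimal sketch of that idea, not the authors' evaluation code. The function names, the dictionary-based inputs, and the condition labels (`low_noise`, `high_noise`) are all hypothetical, and the file formats of the actual benchmark database are not assumed here. Treating isoleucine and leucine as equivalent is a common convention, since the two residues have identical mass, but it is a choice made for this sketch.

```python
from collections import defaultdict


def normalize(peptide):
    """Map I to L: the two residues have identical mass (a convention chosen for this sketch)."""
    return peptide.upper().replace("I", "L")


def accuracy_by_condition(truth, predictions, condition):
    """Return the fraction of correctly identified spectra per condition label.

    truth:       spectrum id -> peptide used to simulate the spectrum
    predictions: spectrum id -> peptide reported by the engine (absent = no identification)
    condition:   spectrum id -> hypothetical label for the simulation setting
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for spectrum_id, true_peptide in truth.items():
        label = condition[spectrum_id]
        totals[label] += 1
        predicted = predictions.get(spectrum_id)
        if predicted is not None and normalize(predicted) == normalize(true_peptide):
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Toy data, invented purely for illustration.
truth = {"s1": "PEPTIDEK", "s2": "LGSAMPLER", "s3": "ACDEFGHK", "s4": "TESTPEPR"}
predictions = {"s1": "PEPTIDEK", "s2": "IGSAMPLER", "s3": "ACDEFGHR"}  # s4: no identification
condition = {"s1": "low_noise", "s2": "low_noise", "s3": "high_noise", "s4": "high_noise"}

profile = accuracy_by_condition(truth, predictions, condition)
print(profile)
```

Reporting one accuracy value per condition, rather than a single aggregate number, is what would turn such a score into a quality profile across the simulated search space.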

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b8fc/8130870/840ba6dd6bc1/nihms-1698506-f0001.jpg
