

In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics.

Author information

Audain Enrique, Uszkoreit Julian, Sachsenberg Timo, Pfeuffer Julianus, Liang Xiao, Hermjakob Henning, Sanchez Aniel, Eisenacher Martin, Reinert Knut, Tabb David L, Kohlbacher Oliver, Perez-Riverol Yasset

Affiliations

Department of Proteomics, Center of Molecular Immunology, Ciudad de la Habana, Cuba; Center for Bioinformatics, Quantitative Biology Center and Department of Computer Science, University of Tübingen, Sand 14, 72076 Tübingen, Germany.

Medizinisches Proteom-Center, Ruhr-Universität Bochum, Universitätsstr. 150, D-44801 Bochum, Germany.

Publication information

J Proteomics. 2017 Jan 6;150:170-182. doi: 10.1016/j.jprot.2016.08.002. Epub 2016 Aug 4.

Abstract


In mass spectrometry-based shotgun proteomics, protein identifications are usually the desired result. However, most of the analytical methods are based on the identification of reliable peptides and not the direct identification of intact proteins. Thus, assembling peptides identified from tandem mass spectra into a list of proteins, referred to as protein inference, is a critical step in proteomics research. Currently, different protein inference algorithms and tools are available to the proteomics community. Here, we evaluated five software tools for protein inference (PIA, ProteinProphet, Fido, ProteinLP, MSBayesPro) using three popular database search engines: Mascot, X!Tandem, and MS-GF+. All the algorithms were evaluated with a highly customizable KNIME workflow on four different public datasets with varying complexities (different sample preparation, species and analytical instruments). We defined a set of quality control metrics to evaluate the performance of each combination of search engines, protein inference algorithm, and parameters on each dataset. We show that the results for complex samples vary not only regarding the actual numbers of reported protein groups but also concerning the actual composition of groups. Furthermore, the robustness of reported proteins when using databases of differing complexities is strongly dependent on the applied inference algorithm. Finally, merging the identifications of multiple search engines does not necessarily increase the number of reported proteins, but does increase the number of peptides per protein and thus can generally be recommended.
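To make the idea of assembling peptides into a protein list concrete, the following is a minimal sketch of one generic approach, a greedy parsimony (set cover) heuristic. It is an illustration only and is not the algorithm of PIA, ProteinProphet, Fido, ProteinLP or MSBayesPro; the function name `infer_proteins`, the input layout, and the peptide and protein labels are assumptions made for this example.

```python
from collections import defaultdict


def infer_proteins(peptide_to_proteins):
    """Greedy parsimony sketch: repeatedly report the protein that
    explains the most still-unexplained peptides.

    peptide_to_proteins: dict mapping a peptide sequence to the set of
    protein accessions it could originate from (hypothetical layout).
    """
    # Invert the mapping: protein -> set of peptides it can explain.
    protein_to_peptides = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides[protein].add(peptide)

    unexplained = set(peptide_to_proteins)
    reported = []
    while unexplained:
        # Sorting first makes ties resolve to the alphabetically first accession.
        best = max(
            sorted(protein_to_peptides),
            key=lambda p: len(protein_to_peptides[p] & unexplained),
            default=None,
        )
        covered = protein_to_peptides[best] & unexplained if best is not None else set()
        if not covered:
            # Remaining peptides map to no protein; stop instead of looping forever.
            break
        reported.append(best)
        unexplained -= covered
    return reported


# Toy input: PEPB and PEPC are shared peptides, PEPA and PEPD are unique.
example = {
    "PEPA": {"P1"},
    "PEPB": {"P1", "P2"},
    "PEPC": {"P2", "P3"},
    "PEPD": {"P3"},
}
print(infer_proteins(example))
```

In this toy input, P2 is supported only by shared peptides, so a parsimonious report can explain every peptide without it. Probabilistic tools handle such ambiguous cases differently, which is one reason reported protein lists can differ between algorithms.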

SIGNIFICANCE

Protein inference is one of the major challenges in MS-based proteomics today. Currently, a vast number of protein inference algorithms and implementations are available to the proteomics community. Protein assembly affects the final results of the research, the quantitation values and the final claims in the research manuscript. Even though protein inference is a crucial step in proteomics data analysis, a comprehensive evaluation of the many different inference methods has never been performed. The Journal of Proteomics has previously published multiple benchmark studies of other bioinformatics algorithms in proteomics (PMID: 26585461; PMID: 22728601), making clear the importance of such studies for the proteomics community and the journal audience. This manuscript presents a new bioinformatics solution based on the KNIME/OpenMS platform that aims at providing a fair comparison of protein inference algorithms (https://github.com/KNIME-OMICS). Five different algorithms (ProteinProphet, MSBayesPro, ProteinLP, Fido and PIA) were evaluated using the highly customizable workflow on four public datasets with varying complexities. Three popular database search engines (Mascot, X!Tandem, MS-GF+) and combinations thereof were evaluated for every protein inference tool. In total, >186 protein lists were analyzed and carefully compared using three metrics for quality assessment of the protein inference results: 1) the number of reported proteins, 2) the number of peptides per protein, and 3) the number of uniquely reported proteins per inference method. We also examined how many proteins were reported for each combination of search engines, protein inference algorithm and parameters on each dataset.
The results show that: 1) PIA or Fido seems to be a good choice within the analyzed workflow, considering not only the reported proteins and the high-quality identifications but also the required runtime. 2) Merging the identifications of multiple search engines almost always gives more confident results and increases the number of peptides per protein group. 3) Using databases that contain not only the canonical sequences but also known protein isoforms has a small impact on the number of reported proteins. Depending on the question behind the study, the detection of specific isoforms could compensate for the slightly shorter reports obtained with parsimonious reporting. 4) The current workflow can easily be extended to support new algorithms and search engine combinations.
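The three comparison metrics named above can be sketched as follows. This is a simplified illustration, not the KNIME workflow itself: it treats each reported entry as a single protein rather than a protein group, and the function name `summarize`, the input layout, and the method and accession labels are all assumptions made for this example.

```python
def summarize(results):
    """Compute three simple metrics per inference method.

    results: dict mapping a method name to a dict of
    {protein accession: set of supporting peptides} (hypothetical layout).
    """
    summary = {}
    for method, proteins in results.items():
        # Collect every protein reported by any other method.
        reported_elsewhere = set()
        for other_method, other_proteins in results.items():
            if other_method != method:
                reported_elsewhere.update(other_proteins)

        n_proteins = len(proteins)
        n_peptides = sum(len(peptides) for peptides in proteins.values())
        summary[method] = {
            # 1) number of reported proteins
            "reported_proteins": n_proteins,
            # 2) average number of peptides per reported protein
            "peptides_per_protein": n_peptides / n_proteins if n_proteins else 0.0,
            # 3) proteins reported by this method and by no other
            "uniquely_reported": len(set(proteins) - reported_elsewhere),
        }
    return summary


# Toy input with two hypothetical methods.
toy_results = {
    "method_a": {"P1": {"PEPA", "PEPB"}, "P2": {"PEPC"}},
    "method_b": {"P1": {"PEPA"}, "P3": {"PEPD", "PEPE", "PEPF"}},
}
for method, metrics in summarize(toy_results).items():
    print(method, metrics)
```

A comparison on protein groups, as in the study, would additionally need a rule for deciding when two groups from different methods count as the same group; that choice is left open here.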

