Navigating through metaproteomics data: a logbook of database searching.

Author information

Thilo Muth, Carolin A. Kolmeder, Jarkko Salojärvi, Salla Keskitalo, Markku Varjosalo, Froukje J. Verdam, Sander S. Rensen, Udo Reichl, Willem M. de Vos, Erdmann Rapp, Lennart Martens

Affiliations

Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany.

Department of Veterinary Biosciences, University of Helsinki, Helsinki, Finland.

Publication information

Proteomics. 2015 Oct;15(20):3439-53. doi: 10.1002/pmic.201400560. Epub 2015 Apr 27.

DOI:10.1002/pmic.201400560

PMID:25778831

Abstract

Metaproteomic research involves various computational challenges during the identification of fragmentation spectra acquired from the proteome of a complex microbiome. These issues are manifold and range from the construction of customized sequence databases, the optimal setting of search parameters to limitations in the identification search algorithms themselves. In order to assess the importance of these individual factors, we studied the effect of strategies to combine different search algorithms, explored the influence of chosen database search settings, and investigated the impact of the size of the protein sequence database used for identification. Furthermore, we applied de novo sequencing as a complementary approach to classic database searching. All evaluations were performed on a human intestinal metaproteome dataset. Pyrococcus furiosus proteome data were used to contrast database searching of metaproteomic data to a classic proteomic experiment. Searching against subsets of metaproteome databases and the use of multiple search engines increased the number of identifications. The integration of P. furiosus sequences in a metaproteomic sequence database showcased the limitation of the target-decoy-controlled false discovery rate approach in combination with large sequence databases. The selection of varying search engine parameters and the application of de novo sequencing represented useful methods to increase the reliability of the results. Based on our findings, we provide recommendations for the data analysis that help researchers to establish or improve analysis workflows in metaproteomics.
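As an illustration of the target-decoy-controlled false discovery rate mentioned in the abstract, the following is a minimal sketch rather than the authors' method. It assumes each peptide-spectrum match (PSM) is a `(score, is_decoy)` pair where higher scores are better, and it uses the simple estimator decoys/targets, which is one common choice; the abstract does not say which estimator was used. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

# A PSM is modelled as (score, is_decoy); this layout is an assumption
# made for illustration, not a format taken from the paper.
PSM = Tuple[float, bool]


def target_decoy_fdr(psms: List[PSM], threshold: float) -> float:
    """Estimate FDR among PSMs scoring at or above `threshold`.

    Uses the simple ratio (#decoy hits) / (#target hits).
    Returns 0.0 when no target hit passes the threshold.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


def lowest_passing_threshold(psms: List[PSM], max_fdr: float) -> Optional[float]:
    """Return the lowest observed score whose estimated FDR is <= max_fdr.

    Candidate thresholds are the observed scores, tried in ascending order.
    Returns None if the PSM list is empty.
    """
    for candidate in sorted({score for score, _ in psms}):
        if target_decoy_fdr(psms, candidate) <= max_fdr:
            return candidate
    return None


# Toy data: three target hits and two decoy hits.
example_psms: List[PSM] = [
    (10.0, False),
    (9.0, False),
    (8.0, True),
    (7.0, False),
    (6.0, True),
]
```

With a larger sequence database, more decoy sequences compete for each spectrum, so the score threshold needed to reach a fixed FDR tends to rise and fewer target hits pass. This is a plausible reading of the limitation the abstract describes for large databases, not a result reproduced here.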
