An evolving computational platform for biological mass spectrometry: workflows, statistics and data mining with MASSyPup64.

Author information

Winkler Robert

Affiliation

Department of Biotechnology and Biochemistry, CINVESTAV Unidad Irapuato, Mexico.

Publication information

PeerJ. 2015 Nov 17;3:e1401. doi: 10.7717/peerj.1401. eCollection 2015.

Abstract

In biological mass spectrometry, crude instrumental data need to be converted into meaningful theoretical models. Several data processing and data evaluation steps are required to arrive at the final results. These operations are often difficult to reproduce because they depend on highly specific computing platforms. This effect, known as 'workflow decay', can be diminished by using a standardized informatics infrastructure. We therefore compiled an integrated platform that contains ready-to-use tools and workflows for mass spectrometry data analysis. Apart from general unit operations, such as peak picking and the identification of proteins and metabolites, we put a strong emphasis on the statistical validation of results and on Data Mining. MASSyPup64 includes, for example, the OpenMS/TOPPAS framework, the Trans-Proteomic Pipeline programs, the ProteoWizard tools, X!Tandem, Comet and SpiderMass. The statistical computing language R is installed with packages for MS data analysis, such as XCMS/metaXCMS and MetabR. The R package Rattle provides user-friendly access to multiple Data Mining methods. Further, we added the non-conventional spreadsheet program teapot for editing large data sets and a command line tool for transposing large matrices. Individual programs, console commands and modules can be integrated using the Workflow Management System (WMS) Taverna. We illustrate how the tools can be combined with three practical examples: (1) a workflow for protein identification and validation, with subsequent Association Analysis of peptides, (2) cluster analysis and Data Mining in targeted Metabolomics, and (3) raw data processing, Data Mining and identification of metabolites in untargeted Metabolomics. Association Analyses reveal relationships between variables across different sample sets. We present their application for finding co-occurring peptides, which can be used for targeted proteomics and for the discovery of alternative biomarkers and protein-protein interactions. In targeted Metabolomics, models derived by Data Mining classified sample groups with higher robustness and accuracy than cluster analyses. Random Forest models provide not only predictive models that can be deployed on new data sets, but also the variable importance. We demonstrate that the latter is especially useful for tracking down significant signals and affected pathways in untargeted Metabolomics. Thus, Random Forest modeling supports the unbiased search for relevant biological features in Metabolomics. Our results clearly demonstrate the importance of Data Mining methods for disclosing non-obvious information in biological mass spectrometry. The application of a Workflow Management System and the integration of all required programs and data in a consistent platform make the presented data analysis strategies reproducible for non-expert users. The simple remastering process and the Open Source licenses of MASSyPup64 (http://www.bioprocess.org/massypup/) enable the continuous improvement of the system.
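
The idea behind the Association Analysis of co-occurring peptides can be illustrated with a minimal sketch. It is written in plain Python rather than the R/Rattle tooling named in the abstract, and it is not the author's implementation. The function name `pair_rules`, the `min_support` threshold and the peptide labels are hypothetical, chosen only for illustration. For every peptide pair, the sketch computes the support (fraction of samples containing both peptides) and the confidence of the rule "if A is identified, B is identified too".

```python
from collections import Counter
from itertools import combinations


def pair_rules(samples, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for peptide pairs.

    `samples` is a list of sets; each set holds the peptides identified in
    one sample. All names here are illustrative assumptions.
    """
    n = len(samples)
    item_counts = Counter()
    pair_counts = Counter()
    for peptides in samples:
        unique = sorted(set(peptides))
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair co-occurs in too few samples
        # Confidence of a -> b: share of samples with a that also contain b.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    # Highest confidence first; ties broken by support, then by name.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Hypothetical peptide identifications from four samples.
samples = [
    {"PEPA", "PEPB", "PEPC"},
    {"PEPA", "PEPB"},
    {"PEPA", "PEPC"},
    {"PEPB", "PEPD"},
]
rules = pair_rules(samples, min_support=0.5)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data set, every sample containing PEPC also contains PEPA, so the rule PEPC -> PEPA reaches a confidence of 1.0. Rules of this kind are what make a co-occurring peptide a candidate alternative marker.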
