Ulintz Peter J, Zhu Ji, Qin Zhaohui S, Andrews Philip C
National Resource for Proteomics and Pathways, School of Public Health, University of Michigan, Ann Arbor, Michigan 48109, USA.
Mol Cell Proteomics. 2006 Mar;5(3):497-509. doi: 10.1074/mcp.M500233-MCP200. Epub 2005 Nov 30.
Manual analysis of mass spectrometry data is a current bottleneck in high throughput proteomics. In particular, the need to manually validate the results of mass spectrometry database searching algorithms can be prohibitively time-consuming. Development of software tools that attempt to quantify the confidence in the assignment of a protein or peptide identity to a mass spectrum is an area of active interest. We sought to extend work in this area by investigating the potential of recent machine learning algorithms to improve the accuracy of these approaches and to serve as a flexible framework for accommodating new data features. Specifically, we demonstrated the ability of boosting and random forest approaches to improve the discrimination of true hits from false positive identifications in the results of mass spectrometry database search engines compared with thresholding and other machine learning approaches. We accommodated additional attributes obtainable from database search results, including a factor addressing proton mobility. Performance was evaluated using publicly available electrospray data and a new collection of MALDI data generated from purified human reference proteins.
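The contrast the abstract draws between thresholding and boosting can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the three attributes (a search score, a delta score, and a proton mobility factor) are hypothetical stand-ins, and AdaBoost over decision stumps is only one simple form of boosting, not necessarily the variant used in the study. A random forest comparison is omitted for brevity.

```python
import math
import random


def make_synthetic_hits(n, seed=0):
    """Generate hypothetical (features, label) pairs; +1 = true hit, -1 = false positive."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else -1
        shift = 1.0 if label == 1 else 0.0
        # Three made-up attributes: search score, delta score, proton mobility factor.
        features = [rng.gauss(shift, 1.0) for _ in range(3)]
        data.append((features, label))
    return data


def stump_predict(stump, x):
    """A decision stump thresholds a single feature."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity


def best_stump(data, weights):
    """Return the stump with the lowest weighted error, and that error."""
    best, best_err = None, float("inf")
    for feature in range(len(data[0][0])):
        for threshold in sorted({x[feature] for x, _ in data}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(w for (x, y), w in zip(data, weights)
                          if stump_predict(stump, x) != y)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def adaboost(data, rounds):
    """Train a weighted ensemble of stumps, reweighting misclassified hits each round."""
    weights = [1.0 / len(data)] * len(data)
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(data, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for (x, y), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def ensemble_predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score > 0 else -1


def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)


train = make_synthetic_hits(150, seed=1)
test = make_synthetic_hits(300, seed=2)

# Baseline: the single best cutoff on one attribute (plain thresholding).
uniform = [1.0 / len(train)] * len(train)
single_stump, _ = best_stump(train, uniform)
boosted = adaboost(train, rounds=15)

threshold_acc = accuracy(lambda x: stump_predict(single_stump, x), test)
boosted_acc = accuracy(lambda x: ensemble_predict(boosted, x), test)
print(f"single threshold accuracy: {threshold_acc:.3f}")
print(f"boosted stumps accuracy:   {boosted_acc:.3f}")
```

The baseline uses one cutoff on one attribute, whereas the boosted ensemble combines cutoffs across all three, which is the sense in which such methods can accommodate additional attributes. The accuracies printed here describe only this synthetic data and say nothing about the results reported in the paper.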