Feature selection in validating mass spectrometry database search results.

Author information

Fang Jianwen, Dong Yinghua, Williams Todd D, Lushington Gerald H

Affiliation

Bioinformatics Core Facility & Information and Telecommunication Technology Center, University of Kansas, 2099 Constant Dr., Lawrence, Kansas 66047, USA.

Publication information

J Bioinform Comput Biol. 2008 Feb;6(1):223-40. doi: 10.1142/s0219720008003345.

DOI:10.1142/s0219720008003345

PMID:18324754

Abstract

Tandem mass spectrometry (MS/MS) combined with protein database searching has been widely used in protein identification. A validation procedure is generally required to reduce the number of false positives. Advanced tools using statistical and machine learning approaches may provide faster and more accurate validation than manual inspection and empirical filtering criteria. In this study, we use two feature selection algorithms based on random forest and support vector machine to identify peptide properties that can be used to improve validation models. We demonstrate that an improved model based on an optimized set of features reduces the number of false positives by 58% relative to the model which used only search engine scores, at the same sensitivity score of 0.8. In addition, we develop classification models based on the physicochemical properties and protein sequence environment of these peptides without using search engine scores. The performance of the best model based on the support vector machine algorithm is at 0.8 AUC, 0.78 accuracy, and 0.7 specificity, suggesting a reasonably accurate classification. The identified properties important to fragmentation and ionization can be either used in independent validation tools or incorporated into peptide sequencing and database search algorithms to improve existing software programs.
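The comparison "fewer false positives at the same sensitivity" can be illustrated with a short sketch. This is not the authors' code: the function name, the toy scores, and the labels below are all hypothetical, and the resulting numbers are for illustration only and do not reproduce the 58% figure reported in the abstract. The idea is to accept identifications in order of decreasing score until the target fraction of correct identifications is recovered, then count how many incorrect ones were accepted along the way.

```python
import math


def false_positives_at_sensitivity(scores, labels, target_sensitivity):
    """Count false positives accepted by the time the target sensitivity is reached.

    scores: higher means the model considers the identification more likely correct.
    labels: 1 for a correct peptide identification, 0 for an incorrect one.
    Assumes no tied scores (ties would need an explicit tie-breaking rule).
    """
    n_pos = sum(labels)
    # Number of true positives required; the small epsilon guards against
    # floating-point noise pushing an exact product just above an integer.
    needed = math.ceil(target_sensitivity * n_pos - 1e-9)
    # Rank identifications from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    for _, label in ranked:
        if tp >= needed:
            break
        if label == 1:
            tp += 1
        else:
            fp += 1
    return fp


# Hypothetical toy data: 5 correct and 5 incorrect identifications.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# Hypothetical scores from a model using search engine scores only.
baseline_scores = [0.9, 0.8, 0.4, 0.35, 0.2, 0.7, 0.6, 0.5, 0.3, 0.1]
# Hypothetical scores from a model using an optimized feature set.
improved_scores = [0.95, 0.9, 0.8, 0.6, 0.2, 0.7, 0.3, 0.25, 0.15, 0.1]

fp_baseline = false_positives_at_sensitivity(baseline_scores, labels, 0.8)
fp_improved = false_positives_at_sensitivity(improved_scores, labels, 0.8)
# Relative reduction in false positives at the same sensitivity.
reduction = (fp_baseline - fp_improved) / fp_baseline
print(fp_baseline, fp_improved, round(reduction, 2))
```

Fixing the sensitivity before counting false positives keeps the comparison fair: both models are required to recover the same share of correct identifications, so any difference in false positives reflects how well each model ranks correct hits above incorrect ones.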
