Fang Jianwen, Dong Yinghua, Williams Todd D, Lushington Gerald H
Bioinformatics Core Facility & Information and Telecommunication Technology Center, University of Kansas, 2099 Constant Dr., Lawrence, Kansas 66047, USA.
J Bioinform Comput Biol. 2008 Feb;6(1):223-40. doi: 10.1142/s0219720008003345.
Tandem mass spectrometry (MS/MS) combined with protein database searching is widely used for protein identification. A validation procedure is generally required to reduce the number of false positives. Advanced tools based on statistical and machine learning approaches may provide faster and more accurate validation than manual inspection and empirical filtering criteria. In this study, we use two feature selection algorithms, one based on random forest and one on support vector machine, to identify peptide properties that can be used to improve validation models. We demonstrate that an improved model based on an optimized set of features reduces the number of false positives by 58% relative to a model that uses only search engine scores, at the same sensitivity of 0.8. In addition, we develop classification models based on the physicochemical properties and protein sequence environment of these peptides, without using search engine scores. The best of these models, based on the support vector machine algorithm, achieves an AUC of 0.8, an accuracy of 0.78, and a specificity of 0.7, suggesting reasonably accurate classification. The identified properties important to fragmentation and ionization can be used either in independent validation tools or incorporated into peptide sequencing and database search algorithms to improve existing software programs.
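The comparison "fewer false positives at the same sensitivity" can be made concrete with a minimal sketch. It assumes each peptide-spectrum match has a model score (higher means more likely correct) and a known label (1 = correct, 0 = incorrect). The function names and the toy scores below are hypothetical illustrations, not data or code from the study.

```python
import math
from typing import Sequence


def threshold_at_sensitivity(scores: Sequence[float],
                             labels: Sequence[int],
                             target: float) -> float:
    """Highest threshold at which sensitivity is at least `target`.

    A match is accepted when its score is >= the threshold.
    """
    pos_scores = sorted((s for s, y in zip(scores, labels) if y == 1),
                        reverse=True)
    if not pos_scores:
        raise ValueError("at least one correct match (label 1) is required")
    # Number of correct matches that must be accepted to reach the target;
    # the small epsilon guards against floating-point overshoot in the product.
    needed = max(1, math.ceil(target * len(pos_scores) - 1e-9))
    return pos_scores[needed - 1]


def false_positives(scores: Sequence[float],
                    labels: Sequence[int],
                    threshold: float) -> int:
    """Count incorrect matches (label 0) accepted at the given threshold."""
    return sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)


def fp_reduction(baseline_scores: Sequence[float],
                 improved_scores: Sequence[float],
                 labels: Sequence[int],
                 target: float = 0.8) -> float:
    """Fractional drop in false positives when both models are tuned
    to the same target sensitivity."""
    t_base = threshold_at_sensitivity(baseline_scores, labels, target)
    t_impr = threshold_at_sensitivity(improved_scores, labels, target)
    fp_base = false_positives(baseline_scores, labels, t_base)
    fp_impr = false_positives(improved_scores, labels, t_impr)
    if fp_base == 0:
        return 0.0  # nothing to reduce
    return (fp_base - fp_impr) / fp_base


# Toy data (made up for illustration): five correct and five incorrect matches.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
baseline = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.5, 0.45, 0.2, 0.1]
improved = [0.9, 0.85, 0.8, 0.7, 0.3, 0.75, 0.5, 0.4, 0.2, 0.1]

print(f"False-positive reduction at sensitivity 0.8: "
      f"{fp_reduction(baseline, improved, labels, 0.8):.0%}")
```

Fixing the sensitivity first and then comparing false-positive counts keeps the two models on equal footing; comparing them at a shared score threshold would not, because their score scales need not match.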