LeDuc Richard D, Fellers Ryan T, Early Bryan P, Greer Joseph B, Thomas Paul M, Kelleher Neil L
National Center for Genome Analysis Support, Indiana University, 2709 E. 10th Street, Bloomington, Indiana 47408, United States.
J Proteome Res. 2014 Jul 3;13(7):3231-40. doi: 10.1021/pr401277r. Epub 2014 Jun 12.
The automated processing of data generated by top down proteomics would benefit from improved scoring for protein identification and characterization of highly related protein forms (proteoforms). Here we propose the "C-score" (short for Characterization Score), a Bayesian approach to the proteoform identification and characterization problem, implemented within a framework to allow the infusion of expert knowledge into generative models that take advantage of known properties of proteins and top down analytical systems (e.g., fragmentation propensities, "off-by-1 Da" discontinuous errors, and intelligent weighting for site-specific modifications). The performance of the scoring system based on the initial generative models was compared to the current probability-based scoring system used within both ProSightPC and ProSightPTM on a manually curated set of 295 human proteoforms. The current implementation of the C-score framework generated a marked improvement over the existing scoring system as measured by the area under the curve on the resulting ROC chart (AUC of 0.99 versus 0.78).
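As an illustration of the two ideas the abstract relies on, the following minimal sketch shows (1) Bayes' rule applied over a finite set of candidate proteoforms and (2) the area under a ROC curve computed from scores and curated labels. It is not the published C-score formula or the ProSightPC/ProSightPTM implementation. The function names `posterior` and `roc_auc`, the log-likelihood inputs, and the example values are all hypothetical and stand in for whatever a generative model of fragmentation data would supply.

```python
import math


def posterior(log_likelihoods, priors):
    """Posterior probability of each candidate proteoform via Bayes' rule.

    log_likelihoods[i] is log P(observed data | candidate i), assumed to
    come from some generative model (hypothetical here); priors[i] is
    P(candidate i) and must be positive.
    """
    # Unnormalized log posterior: log likelihood + log prior.
    log_joint = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    # Log-sum-exp keeps the normalization numerically stable.
    peak = max(log_joint)
    log_evidence = peak + math.log(sum(math.exp(x - peak) for x in log_joint))
    return [math.exp(x - log_evidence) for x in log_joint]


def roc_auc(scores, labels):
    """Area under the ROC curve.

    Computed as the probability that a randomly chosen correct
    identification scores higher than a randomly chosen incorrect one,
    counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Three hypothetical candidate proteoforms with a uniform prior.
example_posterior = posterior([-10.0, -12.0, -15.0], [1 / 3, 1 / 3, 1 / 3])

# Hypothetical scores against manually curated correct/incorrect labels.
example_auc = roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

With a uniform prior the posterior simply renormalizes the likelihoods, so expert knowledge enters through non-uniform priors or through the likelihood model itself. The rank-based AUC avoids choosing score thresholds, which makes it convenient for comparing two scoring systems on the same curated set.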