Torii Manabu, Liu Hongfang, Hu Zhang-Zhi
ISIS Center.
AMIA Annu Symp Proc. 2009 Nov 14;2009:640-4.
Glycosylation is a common and complex protein post-translational modification (PTM). In particular, mucin-type O-linked glycosylation is abundant and serves important biological functions. The number of experimentally determined glycosylation sites is still small, so there remains a need for accurate computational prediction to support the annotation and functional understanding of proteins. PTM site prediction can be formulated as a machine learning task. An important step in applying machine learning to this task is encoding protein fragments as feature vectors. Here we assess existing encoding methods, as well as an enhanced encoding method named composition of monomer spectrum (CMS), using support vector machines (SVMs). SVMs employing the existing encoding methods achieved an AUC (area under the ROC curve) of 90.3-91.3%, and those employing CMS achieved an AUC of 92.4%. Analysis of the different encoding methods suggests that there is potential to improve the prediction further.
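To make the encoding step concrete, the following is a minimal sketch of one generic way to turn protein fragments into feature vectors: a sparse binary (one-hot) encoding of a fixed-width window centered on each serine or threonine residue. It is not the CMS method, which the abstract does not define, and it is not necessarily one of the encodings the authors evaluated. The function names, the flank size, and the padding symbol `X` are all assumptions made for illustration.

```python
# Illustrative sketch only: a generic one-hot window encoding,
# not the CMS encoding described in the abstract.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "X"  # assumed padding symbol for positions beyond the sequence ends
ALPHABET = AMINO_ACIDS + PAD


def extract_windows(sequence, flank=3):
    """Return (position, fragment) pairs for every S/T residue.

    Each fragment has length 2 * flank + 1 and is centered on the
    candidate site; the sequence ends are padded with PAD.
    """
    padded = PAD * flank + sequence + PAD * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue in "ST":
            windows.append((i, padded[i:i + 2 * flank + 1]))
    return windows


def one_hot_encode(fragment):
    """Encode a fragment as a flat 0/1 vector, one slot group per position."""
    vector = []
    for residue in fragment:
        # Residues outside the alphabet are treated like padding.
        symbol = residue if residue in ALPHABET else PAD
        slot = [0] * len(ALPHABET)
        slot[ALPHABET.index(symbol)] = 1
        vector.extend(slot)
    return vector


windows = extract_windows("MPTSAPG", flank=2)
features = [one_hot_encode(fragment) for _, fragment in windows]
```

Each resulting vector has `(2 * flank + 1) * 21` entries with exactly one nonzero entry per window position. Vectors of this kind, paired with site labels, could then be passed to an SVM implementation for training and evaluated by AUC.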