Prediction of O-glycosylation sites based on multi-scale composition of amino acids and feature selection.

Author information

Chen Yuan, Zhou Wei, Wang Haiyan, Yuan Zheming

Affiliation

Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Hunan Agricultural University, Changsha, 410128, China.

Publication information

Med Biol Eng Comput. 2015 Jun;53(6):535-44. doi: 10.1007/s11517-015-1268-9. Epub 2015 Mar 10.

Abstract

Protein glycosylation is one of the most important and complex post-translational modifications, and it provides greater proteomic diversity than any other post-translational modification. Fast and reliable computational methods for identifying glycosylation sites are in great demand. Two key issues, feature encoding and feature selection, can critically affect the accuracy of a computational method. We present a new method for predicting O-glycosylation sites that uses only amino acid sequence information. The method includes the following components: (1) on the basis of multi-scale theory, features based on the multi-scale composition of amino acids are extracted from training sequences with identified glycosylation sites; (2) a two-stage feature selection removes features that have adverse effects on prediction, consisting of a preliminary first-stage filter using Student's t test and a second-stage screening by iterative elimination, using novel pairwise comparisons conducted in a random subspace with a support vector machine. The important features retained are used to build the prediction model. The method is evaluated with sequence-based tenfold cross-validation tests on balanced datasets. The results of our experiments show that our method significantly outperforms those reported in the literature in terms of sensitivity, specificity, accuracy, and the Matthews correlation coefficient. The prediction accuracy for serine (S) and threonine (T) sites reached 95.7% and 92.7%, respectively. The Matthews correlation coefficient of our method for S and T sites is 0.914 and 0.873, respectively. The method evaluates each feature together with its interactions with the remaining features still included in the model, and it has the advantage of high efficiency.
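The abstract does not give the exact encoding or thresholds, so the following is only a minimal sketch of how the first parts of such a pipeline might look. It assumes that "multi-scale composition" means amino acid frequencies computed over nested windows of several half-widths centred on the candidate S/T site, and that the first-stage filter keeps features whose pooled-variance Student's t statistic between positive and negative samples exceeds a cut-off. The function names, the window half-widths, and the threshold of 2.0 are illustrative assumptions, not values from the paper. The second-stage SVM-based pairwise elimination is not shown.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def multiscale_composition(seq, site, half_widths=(2, 5, 10)):
    """Concatenate amino acid frequencies from nested windows centred on `site`.

    `half_widths` is a hypothetical choice of scales; the windows are
    truncated at the sequence ends.
    """
    features = []
    for w in half_widths:
        window = seq[max(0, site - w): site + w + 1]
        counts = Counter(window)
        features.extend(counts[a] / len(window) for a in AMINO_ACIDS)
    return features


def student_t(xs, ys):
    """Two-sample Student's t statistic with pooled variance.

    Returns 0.0 when the pooled standard error is zero (constant feature).
    """
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    pooled = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    se = math.sqrt(pooled * (1 / nx + 1 / ny))
    return (mx - my) / se if se > 0 else 0.0


def t_filter(pos, neg, threshold=2.0):
    """Return indices of features whose |t| between classes reaches `threshold`.

    The cut-off is an assumed value for illustration only.
    """
    keep = []
    for j in range(len(pos[0])):
        t = student_t([row[j] for row in pos], [row[j] for row in neg])
        if abs(t) >= threshold:
            keep.append(j)
    return keep


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

In use, each candidate site would be encoded with `multiscale_composition`, the surviving column indices from `t_filter` would define the reduced feature set passed to a classifier, and `mcc` would summarise the cross-validated confusion matrix.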
