School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, China.
Math Biosci Eng. 2021 Oct 25;18(6):9132-9147. doi: 10.3934/mbe.2021450.
Protein S-nitrosylation is one of the most important post-translational modifications (PTMs), and a well-grounded understanding of it is valuable because it plays a key role in a variety of biological processes. For an uncharacterized protein sequence, it is meaningful for both basic research and drug development to first identify whether the protein is an S-nitrosylation protein and then predict its specific S-nitrosylation site(s). This work proposes two models, one for identifying S-nitrosylation proteins and one for identifying their PTM sites. First, three kinds of features are extracted from the protein sequence: KNN scoring of functional domain annotation, PseAAC, and bag-of-words based on the physicochemical properties of amino acids. Second, the synthetic minority oversampling technique (SMOTE) is used to balance the data sets, and several state-of-the-art classifiers and feature fusion strategies are evaluated on the balanced data sets. In five-fold cross-validation for predicting S-nitrosylation proteins, the accuracy (ACC), Matthews correlation coefficient (MCC) and area under the ROC curve (AUC) are 81.84%, 0.5178 and 0.8635, respectively. Finally, a model for predicting S-nitrosylation sites is constructed on the basis of tripeptide composition (TPC) and the composition of k-spaced amino acid pairs (CKSAAP). To eliminate redundant information and improve efficiency, the elastic net is employed for feature selection. Five-fold cross-validation tests indicate promising success rates for the proposed model. For the convenience of related researchers, a web server named "RF-SNOPS" has been established at http://www.jci-bioinfo.cn/RF-SNOPS.
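To make the two site-level encodings named above more concrete, the following is a minimal Python sketch of TPC and CKSAAP as they are commonly defined. It is not the authors' implementation: the function names `tpc` and `cksaap`, the maximum gap `k_max=2`, the per-gap frequency normalization, and the example fragment are all assumptions chosen for illustration, since the abstract does not state the settings used in the study.

```python
from itertools import product

# The 20 standard amino acids; any other symbol (e.g. 'X') is skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def tpc(seq):
    """Tripeptide composition: 20^3 = 8000 normalized frequencies."""
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:
            counts[tri] += 1
            total += 1
    # Normalize by the number of valid tripeptides (assumed convention).
    return [counts[k] / total if total else 0.0 for k in keys]


def cksaap(seq, k_max=2):
    """Composition of k-spaced amino acid pairs for k = 0..k_max.

    Produces 400 normalized frequencies per gap value k, concatenated.
    k_max=2 is an illustrative assumption, not a value from the paper.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        for i in range(len(seq) - k - 1):
            # Two residues separated by exactly k intervening positions.
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:
                counts[pair] += 1
                total += 1
        features.extend(counts[p] / total if total else 0.0 for p in pairs)
    return features


# Hypothetical peptide fragment, used only to show the call pattern.
fragment = "ACDCA"
tpc_vector = tpc(fragment)
cksaap_vector = cksaap(fragment, k_max=2)
combined = tpc_vector + cksaap_vector  # simple concatenation of both encodings
```

With these assumptions, each peptide yields 8000 TPC values plus 400 CKSAAP values per gap, which illustrates why a feature selection step such as the elastic net is useful before training a classifier.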