Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India.
Med Biol Eng Comput. 2021 Nov;59(11-12):2297-2310. doi: 10.1007/s11517-021-02436-5. Epub 2021 Sep 20.
Advances in high-throughput techniques have produced a large number of unknown protein sequences (UPS). Functional characterization of UPS is important for investigating disease symptoms and for drug repositioning, and protein subcellular localization is essential to that characterization. Diverse feature extraction techniques are applied to protein sequences; however, a single technique often leads to poor prediction performance. In this paper, two feature augmentations are described that combine sequence-induced, physicochemical, and evolutionary information of the amino acid residues. The augmented features preserve both sequence-order information and protein residue properties. Two bacterial protein datasets, Gram-positive (G+) and Gram-negative (G-), are used for the experimental work. After essential preprocessing of the protein datasets, two sets of feature vectors are obtained. These feature vectors are used separately to train individual and ensemble classifiers, namely decision tree (C4.5), k-nearest neighbor (k-NN), multi-layer perceptron (MLP), Naïve Bayes (NB), support vector machine (SVM), AdaBoost, gradient boosting machine (GBM), and random forest (RF), with fivefold cross-validation. The results show that C4.5 reports the highest overall accuracy on known protein sequences: 99.57% on the G+ dataset and 97.47% on the G- dataset. For the UPS, the overall accuracy is 85.17% on the G+ dataset with SVM and 82.45% on the G- dataset with MLP.
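To make the idea of feature augmentation concrete, the following is a minimal sketch and not the authors' pipeline. It concatenates a sequence-induced descriptor (amino acid composition) with two simple physicochemical descriptors, then assigns samples to five folds. The residue grouping in `GROUPS`, the function names, and the round-robin fold assignment are all assumptions made for illustration. The evolutionary features mentioned in the abstract are omitted because they require external profile data. The sketch also assumes sequences of at least two residues drawn from the 20 standard amino acids.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical coarse physicochemical grouping (an assumption for this
# sketch, not the grouping used in the paper).
GROUPS = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGP",
    "positive": "KRH",
    "negative": "DE",
}
RESIDUE_TO_GROUP = {a: g for g, members in GROUPS.items() for a in members}


def composition(seq):
    """Fraction of each of the 20 standard residues (sequence-induced)."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def group_composition(seq):
    """Fraction of residues falling in each physicochemical group."""
    counts = Counter(RESIDUE_TO_GROUP[a] for a in seq)
    return [counts[g] / len(seq) for g in GROUPS]


def group_transitions(seq):
    """Fraction of adjacent residue pairs whose group differs.

    This keeps a small amount of sequence-order information.
    """
    labels = [RESIDUE_TO_GROUP[a] for a in seq]
    changes = sum(1 for x, y in zip(labels, labels[1:]) if x != y)
    return [changes / (len(seq) - 1)]


def augment(seq):
    """Concatenate the descriptors into one augmented feature vector."""
    return composition(seq) + group_composition(seq) + group_transitions(seq)


def five_fold_indices(n_samples, k=5):
    """Round-robin assignment of sample indices to k folds (unstratified)."""
    return [list(range(i, n_samples, k)) for i in range(k)]


if __name__ == "__main__":
    vector = augment("MKKLLE")
    print(len(vector))            # 20 + 4 + 1 features
    print(five_fold_indices(10))  # each fold serves once as the test set
```

In practice each fold would be held out once while a classifier is trained on the remaining four, and a stratified split would usually be preferred so that every localization class appears in every fold.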