Augmented sequence features and subcellular localization for functional characterization of unknown protein sequences.

Affiliation

Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India.

Publication information

Med Biol Eng Comput. 2021 Nov;59(11-12):2297-2310. doi: 10.1007/s11517-021-02436-5. Epub 2021 Sep 20.

Abstract

Advances in high-throughput techniques have produced a large number of unknown protein sequences (UPS). Functional characterization of UPS is important for investigating disease symptoms and for drug repositioning, and protein subcellular localization is essential to that characterization. Diverse techniques are used to extract features from protein sequences; however, a single feature extraction technique often leads to poor prediction performance. In this paper, two feature augmentations are described that combine sequence-induced, physicochemical, and evolutionary information of the amino acid residues. The augmented features preserve both sequence-order information and protein-residue properties. Two bacterial protein datasets, Gram-positive (G+) and Gram-negative (G-), are used for the experimental work. After essential preprocessing of the protein datasets, two sets of feature vectors are obtained. These feature vectors are used separately to train individual and ensemble classifiers: decision tree (C4.5), k-nearest neighbor (k-NN), multi-layer perceptron (MLP), Naïve Bayes (NB), support vector machine (SVM), AdaBoost, gradient boosting machine (GBM), and random forest (RF), each with fivefold cross-validation. On known protein sequences, C4.5 reports the highest overall accuracy: 99.57% on the G+ dataset and 97.47% on the G- dataset. For the UPS, the overall accuracy is 85.17% on the G+ dataset with SVM and 82.45% on the G- dataset with MLP.
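The abstract does not say which descriptors make up each augmentation, so the following is only a minimal sketch of the general idea: concatenate an order-free descriptor with an order-aware one, then score a classifier with fivefold cross-validation. The specific choices here are assumptions rather than the authors' method: amino acid composition, a lagged product of Kyte-Doolittle hydropathy values, interleaved folds, and a 1-nearest-neighbor classifier standing in for the eight models listed. All function names are hypothetical, and sequences are assumed to contain only the 20 standard residues and to be longer than `max_lag`.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale; chosen here only as an example
# physicochemical property (an assumption, not taken from the abstract).
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def composition(seq):
    """Fraction of each standard residue in seq (ignores residue order)."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def lagged_hydropathy(seq, max_lag=3):
    """Mean product of hydropathy values for residues `lag` positions apart.

    One value per lag in 1..max_lag; this term depends on residue order.
    """
    values = [HYDROPATHY[a] for a in seq]
    feats = []
    for lag in range(1, max_lag + 1):
        pairs = list(zip(values, values[lag:]))
        feats.append(sum(x * y for x, y in pairs) / len(pairs))
    return feats


def augmented_features(seq, max_lag=3):
    """Concatenate the order-free and order-aware descriptors."""
    return composition(seq) + lagged_hydropathy(seq, max_lag)


def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k interleaved folds."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test


def nearest_neighbor_label(x, train_X, train_y):
    """Label of the training vector closest to x (squared Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_X]
    return train_y[dists.index(min(dists))]


def cross_validated_accuracy(X, y, k=5):
    """Overall accuracy of 1-NN pooled over all k held-out folds."""
    correct = 0
    for train, test in k_fold_indices(len(X), k):
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct += sum(
            nearest_neighbor_label(X[i], train_X, train_y) == y[i]
            for i in test
        )
    return correct / len(X)
```

Given a list of sequences `seqs` and their localization labels `labels` (both hypothetical), `cross_validated_accuracy([augmented_features(s) for s in seqs], labels)` returns the pooled fivefold accuracy. In practice the folds would normally be shuffled and stratified by class, and the evolutionary information mentioned in the abstract would need an external alignment step that is omitted here.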
