Augmented sequence features and subcellular localization for functional characterization of unknown protein sequences.

Affiliation

Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India.

Publication information

Med Biol Eng Comput. 2021 Nov;59(11-12):2297-2310. doi: 10.1007/s11517-021-02436-5. Epub 2021 Sep 20.

DOI:10.1007/s11517-021-02436-5

PMID:34545514

Abstract

Advances in high-throughput techniques have produced a large number of unknown protein sequences (UPS). Functional characterization of UPS is important for investigating disease symptoms and for drug repositioning, and protein subcellular localization is essential to that characterization. Diverse techniques are used to extract features from protein sequences, but a single feature extraction technique often leads to poor prediction performance. In this paper, two feature augmentations are described that combine sequence-induced, physicochemical, and evolutionary information of the amino acid residues. The augmented features preserve both sequence-order information and protein-residue properties. Two bacterial protein datasets, Gram-positive (G+) and Gram-negative (G-), are used for the experimental work. After essential preprocessing of the protein datasets, two sets of feature vectors are obtained. These feature vectors are used separately to train individual and ensemble classifiers, namely decision tree (C4.5), k-nearest neighbor (k-NN), multi-layer perceptron (MLP), Naïve Bayes (NB), support vector machine (SVM), AdaBoost, gradient boosting machine (GBM), and random forest (RF), with fivefold cross-validation. On known protein sequences, C4.5 reports the highest overall accuracy: 99.57% on the G+ dataset and 97.47% on the G- dataset. Similarly, for the UPS, overall accuracy is 85.17% on the G+ dataset with SVM and 82.45% on the G- dataset with MLP.
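The abstract does not name the exact descriptors behind the two augmentations, so the following is only a minimal sketch of the general idea: an order-free descriptor (amino acid composition) is concatenated with a physicochemical sequence-order descriptor, and samples are split into five folds for cross-validation. The choice of the Kyte-Doolittle hydropathy scale, the `max_lag` parameter, and all function names are assumptions made for illustration, not the authors' method.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index (assumed here as the physicochemical property).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def clean(seq):
    """Upper-case the sequence and keep only the 20 standard residues."""
    residues = [a for a in seq.upper() if a in HYDROPATHY]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return residues


def composition(residues):
    """20-dimensional amino acid composition (fractions; ignores order)."""
    counts = Counter(residues)
    n = len(residues)
    return [counts[a] / n for a in AMINO_ACIDS]


def order_correlation(residues, max_lag=3):
    """Mean squared hydropathy difference between residues `lag` apart.

    One value per lag from 1 to max_lag; 0.0 if the sequence is too short.
    """
    feats = []
    for lag in range(1, max_lag + 1):
        diffs = [
            (HYDROPATHY[residues[i]] - HYDROPATHY[residues[i + lag]]) ** 2
            for i in range(len(residues) - lag)
        ]
        feats.append(sum(diffs) / len(diffs) if diffs else 0.0)
    return feats


def augmented_features(seq, max_lag=3):
    """Concatenate composition and sequence-order features into one vector."""
    residues = clean(seq)
    return composition(residues) + order_correlation(residues, max_lag)


def kfold_indices(n, k=5):
    """Yield (train, test) index lists for k-fold cross-validation.

    Samples are assigned to folds round-robin; shuffling and class
    stratification are left out to keep the sketch short.
    """
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test
```

Each vector from `augmented_features` has 20 + `max_lag` entries. In practice, each (train, test) pair from `kfold_indices` would be used to fit and score one classifier, and the per-fold accuracies averaged to give the overall accuracy.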

Similar articles

Seminal quality prediction using data mining methods.

Technol Health Care. 2014;22(4):531-45. doi: 10.3233/THC-140816.

Optimizing neural networks for medical data sets: A case study on neonatal apnea prediction.

Artif Intell Med. 2019 Jul;98:59-76. doi: 10.1016/j.artmed.2019.07.008. Epub 2019 Jul 25.

A Study on ML-Based Software Defect Detection for Security Traceability in Smart Healthcare Applications.

Sensors (Basel). 2023 Mar 26;23(7):3470. doi: 10.3390/s23073470.

Self-evoluting framework of deep convolutional neural network for multilocus protein subcellular localization.

Med Biol Eng Comput. 2020 Dec;58(12):3017-3038. doi: 10.1007/s11517-020-02275-w. Epub 2020 Oct 20.

Wrapper method for feature selection to classify cardiac arrhythmia.

Annu Int Conf IEEE Eng Med Biol Soc. 2017 Jul;2017:3656-3659. doi: 10.1109/EMBC.2017.8037650.

IAMPE: NMR-Assisted Computational Prediction of Antimicrobial Peptides.

J Chem Inf Model. 2020 Oct 26;60(10):4691-4701. doi: 10.1021/acs.jcim.0c00841. Epub 2020 Sep 30.

Predict Gram-Positive and Gram-Negative Subcellular Localization via Incorporating Evolutionary Information and Physicochemical Features Into Chou's General PseAAC.

IEEE Trans Nanobioscience. 2015 Dec;14(8):915-26. doi: 10.1109/TNB.2015.2500186. Epub 2015 Nov 12.

Development and validation of multiple machine learning algorithms for the classification of G-protein-coupled receptors using molecular evolution model-based feature extraction strategy.

Amino Acids. 2021 Nov;53(11):1705-1714. doi: 10.1007/s00726-021-03080-x. Epub 2021 Sep 25.

Detection of Iris Presentation Attacks Using Feature Fusion of Thepade's Sorted Block Truncation Coding with Gray-Level Co-Occurrence Matrix Features.

Sensors (Basel). 2021 Nov 8;21(21):7408. doi: 10.3390/s21217408.
