Variable selection from a feature representing protein sequences: a case of classification on bacterial type IV secreted effectors.

Affiliations

College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China.

College of Information and Computer Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, China.

Publication information

BMC Bioinformatics. 2020 Oct 27;21(1):480. doi: 10.1186/s12859-020-03826-6.

DOI:10.1186/s12859-020-03826-6

PMID:33109082

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7590791/

Abstract

BACKGROUND

Classifying proteins with specific functions is important for biological research. Approaches for encoding protein sequences into features play an important role in protein classification, and many computational methods (namely classifiers) are used to classify protein sequences on the basis of various encoding approaches. Commonly, protein sequences carry labels corresponding to different categories of biological function (e.g., bacterial type IV secreted effector or not), which makes genuine protein prediction unrealistic. For protein prediction, a core set of protein sequences with labels certified by biological experiments must exist in advance. However, this has rarely been seen in existing research. Therefore, unsupervised learning rather than supervised learning (e.g., classification) should be considered. For protein classification, various classifiers may help to evaluate the effectiveness of different encoding approaches. In addition, variable selection from an encoded feature representing protein sequences is an important issue that also needs to be considered.

RESULTS

Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Taking a benchmark dataset of 1947 protein sequences (399 T4SE and 1548 non-T4SE) as a case, we conduct experiments to identify bacterial type IV secreted effectors (T4SE) from protein sequences. Comparable, quantified results are obtained using only certain components of the encoded feature, i.e., the position-specific scoring matrix, which indicates the effectiveness of our method.
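To make the idea of selecting variables from an encoded feature concrete, the following is a minimal sketch of a generic filter-style selection. It is not the method proposed in the paper, which the abstract does not describe. The Fisher-style score, the function names `fisher_scores` and `select_top_k`, and the toy data are all assumptions chosen for illustration.

```python
from statistics import mean, pvariance


def fisher_scores(X, y):
    """Score each column by squared class-mean gap over within-class variance."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        p = [row[j] for row in pos]
        n = [row[j] for row in neg]
        spread = pvariance(p) + pvariance(n)
        gap = (mean(p) - mean(n)) ** 2
        # A column with zero spread carries no usable ranking signal here.
        scores.append(gap / spread if spread > 0 else 0.0)
    return scores


def select_top_k(X, y, k):
    """Return the indices of the k highest-scoring columns, best first."""
    scores = fisher_scores(X, y)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


# Toy data (hypothetical): three variables, only column 1 separates the classes.
X = [
    [0.5, 0.9, 0.2],
    [0.4, 0.8, 0.7],
    [0.5, 0.1, 0.3],
    [0.4, 0.2, 0.6],
]
y = [1, 1, 0, 0]  # 1 = T4SE, 0 = non-T4SE
selected = select_top_k(X, y, k=1)
```

In a real setting each row would hold the encoded feature of one protein sequence, and a classifier would then be trained on the selected columns only.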

CONCLUSIONS

Certain variables, rather than the whole encoded feature they belong to, are effective for discriminating between different types of proteins. In addition, ensemble classifiers with an automatic assignment of different base classifiers achieve a better classification result.
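As a rough illustration of an ensemble built from automatically chosen base classifiers, the sketch below keeps the candidates with the highest validation accuracy and combines them by majority vote. The abstract does not specify how the paper assigns base classifiers, so the selection rule, the threshold-rule classifiers, and all names here are hypothetical.

```python
from collections import Counter


def accuracy(clf, X, y):
    """Fraction of samples for which clf predicts the true label."""
    return sum(clf(x) == label for x, label in zip(X, y)) / len(y)


def pick_base_classifiers(candidates, X_val, y_val, n):
    """Keep the n candidates with the highest validation accuracy."""
    ranked = sorted(candidates, key=lambda clf: accuracy(clf, X_val, y_val),
                    reverse=True)
    return ranked[:n]


def majority_vote(classifiers, x):
    """Return the label predicted by most of the classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical candidates: threshold rules on single variables.
candidates = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[2] > 0.5),
    lambda x: 0,  # always predicts the negative class
]
X_val = [[0.9, 0.8, 0.1], [0.7, 0.4, 0.9], [0.2, 0.3, 0.6], [0.1, 0.6, 0.2]]
y_val = [1, 1, 0, 0]

# An odd number of base classifiers avoids tied votes on binary labels.
base = pick_base_classifiers(candidates, X_val, y_val, n=3)
predictions = [majority_vote(base, x) for x in X_val]
```

In practice the selection and the final evaluation would use separate data splits; the toy example reuses one set only to stay short.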

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/567b/7590791/c8e97870f1d5/12859_2020_3826_Fig1_HTML.jpg
