Structural protein fold recognition based on secondary structure and evolutionary information using machine learning algorithms.

Author information

College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.

Publication information

Comput Biol Chem. 2021 Apr;91:107456. doi: 10.1016/j.compbiolchem.2021.107456. Epub 2021 Feb 12.

DOI:10.1016/j.compbiolchem.2021.107456

Abstract

Understanding protein function supports research in advanced fields such as gene therapy and the design and development of new drugs. Determining a protein's tertiary structure is a prerequisite for understanding its function. Protein structure classification is indispensable to that task, and fold recognition is a commonly used classification method. In this work, protein sequences of 40% identity from the ASTRAL protein classification database are used to predict 27 fold types, most of which belong to four protein structural classes: α, β, α/β and α + β. We extract features from the primary structure using DSSP, PSSM and HMM, which capture secondary structure and evolutionary information, to convert each protein sequence into a feature vector that machine learning algorithms can process. We then combine LightGBM feature selection with incremental feature selection (IFS) to find the optimal classifier for each of three tree-based algorithms: Random Forest, XGBoost and LightGBM. Bayesian optimization is used to tune the hyper-parameters of these algorithms, and the resulting fold recognition accuracy reaches 93.45%. This result compares favorably with other work on protein fold recognition.
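The incremental feature selection (IFS) step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the feature ranking is simply passed in (in the paper it would come from LightGBM feature importance), a nearest-centroid classifier stands in for Random Forest, XGBoost and LightGBM so the example needs only the standard library, and all function names and the toy data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def nearest_centroid_accuracy(
    train_x: List[Vector], train_y: List[int],
    test_x: List[Vector], test_y: List[int],
    cols: List[int],
) -> float:
    """Accuracy of a nearest-centroid classifier using only the columns in `cols`.

    Assumption: this is a stand-in for the tree-based classifiers in the abstract.
    """
    # Accumulate per-class sums over the selected columns, then average.
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(cols))
        for j, c in enumerate(cols):
            acc[j] += x[c]
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x: Vector) -> int:
        # Pick the class whose centroid has the smallest squared distance.
        return min(
            centroids,
            key=lambda y: sum((x[c] - m) ** 2 for c, m in zip(cols, centroids[y])),
        )

    hits = sum(predict(x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


def incremental_feature_selection(
    ranked: Sequence[int], score: Callable[[List[int]], float]
) -> Tuple[List[int], float]:
    """Score the top-k prefix of `ranked` for every k and return the best prefix."""
    best_subset: List[int] = []
    best_score = float("-inf")
    for k in range(1, len(ranked) + 1):
        subset = list(ranked[:k])
        s = score(subset)
        if s > best_score:  # strict comparison: ties keep the smaller subset
            best_subset, best_score = subset, s
    return best_subset, best_score


# Toy data (hypothetical): column 0 separates the classes, column 1 is misleading noise.
train_x = [[0.0, 10.0], [0.2, 0.0], [1.0, 0.0], [1.2, -10.0]]
train_y = [0, 0, 1, 1]
test_x = [[0.1, -6.0], [1.1, 6.0]]
test_y = [0, 1]

ranked_features = [0, 1]  # assumed to be sorted by importance, most important first
best, best_score = incremental_feature_selection(
    ranked_features,
    lambda cols: nearest_centroid_accuracy(train_x, train_y, test_x, test_y, cols),
)
print(best, best_score)
```

On this toy data the single-feature prefix wins, because adding the noisy second column pulls both test points toward the wrong centroid. In a real pipeline the score function would typically be cross-validated accuracy rather than accuracy on one held-out split.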

Similar articles

Succinylation Site Prediction Based on Protein Sequences Using the IFS-LightGBM (BO) Model.

Comput Math Methods Med. 2020 Nov 10;2020:8858489. doi: 10.1155/2020/8858489. eCollection 2020.

Predicting structural class for protein sequences of 40% identity based on features of primary and secondary structure using Random Forest algorithm.

Comput Biol Chem. 2020 Feb;84:107164. doi: 10.1016/j.compbiolchem.2019.107164. Epub 2019 Nov 15.

A two-stage approach towards protein secondary structure classification.

Med Biol Eng Comput. 2020 Aug;58(8):1723-1737. doi: 10.1007/s11517-020-02194-w. Epub 2020 May 29.

SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition.

BMC Bioinformatics. 2007 May 22;8 Suppl 4(Suppl 4):S2. doi: 10.1186/1471-2105-8-S4-S2.

Improving protein fold recognition using the amalgamation of evolutionary-based and structural based information.

BMC Bioinformatics. 2014;15 Suppl 16(Suppl 16):S12. doi: 10.1186/1471-2105-15-S16-S12. Epub 2014 Dec 8.

A protein structural classes prediction method based on predicted secondary structure and PSI-BLAST profile.

Biochimie. 2014 Feb;97:60-5. doi: 10.1016/j.biochi.2013.09.013. Epub 2013 Sep 22.

A Composite Approach to Protein Tertiary Structure Prediction: Hidden Markov Model Based on Lattice.

Bull Math Biol. 2019 Mar;81(3):899-918. doi: 10.1007/s11538-018-00542-4. Epub 2018 Dec 10.

Protein fold recognition using HMM-HMM alignment and dynamic programming.

J Theor Biol. 2016 Mar 21;393:67-74. doi: 10.1016/j.jtbi.2015.12.018. Epub 2016 Jan 19.

Extracting features from protein sequences to improve deep extreme learning machine for protein fold recognition.

J Theor Biol. 2017 May 21;421:1-15. doi: 10.1016/j.jtbi.2017.03.023. Epub 2017 Mar 27.

Cited by

Insight into Protein Engineering: From Modelling to Synthesis.

Curr Pharm Des. 2025;31(3):179-202. doi: 10.2174/0113816128349577240927071706.

BioS2Net: Holistic Structural and Sequential Analysis of Biomolecules Using a Deep Neural Network.

Int J Mol Sci. 2022 Mar 9;23(6):2966. doi: 10.3390/ijms23062966.

PMID:33610129