College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
Comput Biol Chem. 2021 Apr;91:107456. doi: 10.1016/j.compbiolchem.2021.107456. Epub 2021 Feb 12.
Understanding protein function supports research in advanced fields such as gene therapy and the design and development of new drugs. A prerequisite for understanding a protein's function is determining its tertiary structure. Protein structure classification is indispensable to this task, and fold recognition is a commonly used method for it. In this work, protein sequences with 40% identity from the ASTRAL protein classification database are used to predict 27 fold types, most of which belong to four protein structural classes: α, β, α/β and α + β. We extract features from the primary structure using DSSP, PSSM and HMM, which capture secondary-structure and evolutionary information, to convert protein sequences into feature vectors that machine learning algorithms can process. We then combine the LightGBM feature selection algorithm with incremental feature selection (IFS) to find the optimal classifier for each of three tree-based machine learning algorithms: Random Forest, XGBoost and LightGBM. Bayesian optimization is used to tune the hyper-parameters of these algorithms, raising fold recognition accuracy to 93.45%. The proposed model thus performs strongly in protein fold recognition.
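As a rough illustration of the incremental feature selection (IFS) step mentioned in the abstract, the sketch below ranks features by an importance score and keeps the top-ranked prefix that scores best. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the `step` parameter, the toy feature names and the toy scoring function are all hypothetical. In practice the importance scores would presumably come from a trained LightGBM model, and `evaluate` would return a cross-validated accuracy for Random Forest, XGBoost or LightGBM.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def incremental_feature_selection(
    importance: Dict[str, float],
    evaluate: Callable[[Sequence[str]], float],
    step: int = 1,
) -> Tuple[List[str], float]:
    """Return the top-k ranked feature prefix with the highest score."""
    if step < 1:
        raise ValueError("step must be at least 1")
    # Rank features from most to least important (ties broken by name).
    ranked = sorted(importance, key=lambda name: (-importance[name], name))
    best_subset: List[str] = []
    best_score = float("-inf")
    # Grow the candidate subset `step` features at a time.
    for k in range(step, len(ranked) + step, step):
        subset = ranked[:k]
        score = evaluate(subset)
        if score > best_score:  # strict ">" keeps the smaller subset on ties
            best_subset, best_score = list(subset), score
    return best_subset, best_score


# Hypothetical importance scores; the names only hint at PSSM/HMM/DSSP features.
toy_importance = {"pssm_1": 0.40, "hmm_3": 0.25, "dssp_2": 0.20, "noise_a": 0.01}


def toy_evaluate(subset: Sequence[str]) -> float:
    # Stand-in for cross-validated accuracy: rewards useful features,
    # penalizes uninformative ones.
    useful = {"pssm_1", "hmm_3", "dssp_2"}
    hits = sum(1 for name in subset if name in useful)
    return hits / 3 - 0.05 * (len(subset) - hits)


selected, score = incremental_feature_selection(toy_importance, toy_evaluate)
```

Passing the scorer as a callback keeps the selection loop independent of any particular classifier, so the same loop could be reused for each of the three tree-based models.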