IEEE/ACM Trans Comput Biol Bioinform. 2021 Nov-Dec;18(6):2409-2419. doi: 10.1109/TCBB.2020.2979430. Epub 2021 Dec 8.
Protein Secondary Structural Class (PSSC) information is important for downstream problems on protein sequences such as protein fold recognition, protein tertiary structure prediction, and analysis of protein function for drug discovery. Identifying PSSC with biological methods is time-consuming and costly. Several computational models have been developed to predict the structural class; however, they generalize poorly. Hence, predicting PSSC from protein sequences remains a difficult task. In this article, we propose an effective, novel, and generalized prediction model consisting of a feature modeling stage and an ensemble of classifiers. The proposed feature modeling extracts discriminating information (features) using three techniques: (i) Embedding: features are extracted on the basis of the spatial residue arrangements of the sequences using word embedding approaches; (ii) SkipXGram Bi-gram: various sets of skipped bi-gram features are extracted from the sequences; and (iii) General Statistical (GS): features are extracted that cover the global information of the structural sequences. The combined effective feature sets are used to train an ensemble of three classifiers: Support Vector Machine (SVM), Random Forest (RF), and Gradient Boosting Machines (GBM). When assessed on five benchmark datasets (high and low sequence similarity), viz. z277, z498, 25PDB, 1189, and FC699, the proposed model reported overall accuracies of 93.55, 97.58, 81.82, 81.11, and 93.93 percent, respectively. The proposed model was further validated on a large-scale updated low-similarity (≤ 25%) dataset, where it achieved an overall accuracy of 81.11 percent. The proposed generalized model is robust and consistently outperformed several state-of-the-art models on all five benchmark datasets.
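To make the skipped bi-gram idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a skipped bi-gram is a pair of symbols separated by a fixed number of intervening positions, and that each feature is the normalized frequency of one such pair. The 20-letter amino-acid alphabet, the function names `skip_bigram_features` and `feature_vector`, and the `max_skip` parameter are all illustrative assumptions; the abstract does not specify the alphabet, the skip distances, or the normalization used.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids; the paper may use another alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIR_KEYS = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def skip_bigram_features(sequence, skip=0):
    """Normalized frequencies of symbol pairs with `skip` positions between them.

    skip=0 gives ordinary adjacent bi-grams. Pairs containing symbols outside
    the assumed alphabet are ignored.
    """
    gap = skip + 1
    counts = Counter(
        sequence[i] + sequence[i + gap] for i in range(len(sequence) - gap)
    )
    total = sum(counts[key] for key in PAIR_KEYS)
    # Fixed key order so every sequence maps to a vector of the same length.
    return [counts[key] / total if total else 0.0 for key in PAIR_KEYS]


def feature_vector(sequence, max_skip=2):
    """Concatenate skipped bi-gram features for skip = 0 .. max_skip (hypothetical choice)."""
    features = []
    for skip in range(max_skip + 1):
        features.extend(skip_bigram_features(sequence, skip))
    return features
```

Under these assumptions, each sequence maps to a fixed-length vector (400 values per skip distance) regardless of its length, which is the shape a classifier such as an SVM or random forest would need as input.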