School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China.
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China.
Biomed Res Int. 2022 Aug 24;2022:9015123. doi: 10.1155/2022/9015123. eCollection 2022.
Predicting the polyproline type II (PPII) helix structure is crucially important in many research areas, such as protein folding mechanisms, drug targets, and protein functions. However, many existing PPII helix prediction algorithms encode protein sequence information in only a single way, so the features of the protein sequence are learned insufficiently. To improve protein sequence encoding, this paper proposes a BERT-based PPII helix structure prediction algorithm (BERT-PPII), which learns protein sequence information with the BERT model. The BERT model's vector fuses information from every amino acid residue in a sample without favoring any one of them, so we use this vector as the global feature that represents the sample's global contextual information. Because interactions among local amino acid residues in the protein chain have an important influence on the formation of the PPII helix, we use a CNN to extract features of local amino acid residues, which further enriches the representation of each protein sequence sample. We then fuse the BERT vector with the CNN local features to improve the performance of PPII structure prediction. Compared with the state-of-the-art PPIIPRED method, experimental results on the unbalanced dataset show that the proposed method improves accuracy by 1% on the strict dataset and by 2% on the less strict dataset. On the balanced dataset, the AUC of the proposed method is 0.826 on the strict dataset and 0.785 on the less strict dataset. On the independent test set, the proposed method achieves an AUC of 0.827 on the strict dataset and 0.783 on the less strict dataset. These results indicate that the proposed BERT-PPII method achieves superior performance in predicting the PPII helix.
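The fusion of a global BERT vector with CNN local features described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the function names (`conv1d_relu`, `global_max_pool`, `fuse`), the toy dimensions, the use of global max pooling, and fusion by simple concatenation are all hypothetical, and random numbers stand in for real BERT outputs.

```python
import random

def conv1d_relu(residues, kernels, biases):
    """Slide each kernel over the per-residue vectors (no padding), then apply ReLU.

    residues: list of L vectors, one per amino acid residue (each of length D)
    kernels:  list of K kernels, each a list of W vectors of length D
    biases:   list of K floats
    Returns K feature maps, each of length L - W + 1.
    """
    width = len(kernels[0])
    feature_maps = []
    for kernel, bias in zip(kernels, biases):
        row = []
        for start in range(len(residues) - width + 1):
            window = residues[start:start + width]
            total = bias + sum(
                w * x
                for k_vec, r_vec in zip(kernel, window)
                for w, x in zip(k_vec, r_vec)
            )
            row.append(max(0.0, total))  # ReLU
        feature_maps.append(row)
    return feature_maps

def global_max_pool(feature_maps):
    """Keep the strongest local response of each kernel."""
    return [max(row) for row in feature_maps]

def fuse(global_vec, residues, kernels, biases):
    """Assumed fusion: concatenate the global vector with pooled local features."""
    local = global_max_pool(conv1d_relu(residues, kernels, biases))
    return list(global_vec) + local

# Toy stand-ins for BERT outputs (hypothetical sizes, random values).
random.seed(0)
seq_len, dim, width, n_kernels = 6, 4, 3, 2
global_vec = [random.uniform(-1, 1) for _ in range(dim)]
residues = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(seq_len)]
kernels = [[[random.uniform(-1, 1) for _ in range(dim)] for _ in range(width)]
           for _ in range(n_kernels)]
biases = [0.0] * n_kernels

fused = fuse(global_vec, residues, kernels, biases)
print(len(fused))  # dim + n_kernels entries would feed a downstream classifier
```

In this sketch the fused vector has one entry per global dimension plus one per convolution kernel; a real model would learn the kernels and pass the fused vector to a classification layer.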