College of Information, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China.
Sci Rep. 2018 Jun 29;8(1):9856. doi: 10.1038/s41598-018-28084-8.
Protein secondary structure prediction is one of the most important and challenging problems in bioinformatics. Machine learning techniques have been applied to the problem with substantial success, but there is still room for improvement toward the theoretical limit. In this paper, we present a novel method for protein secondary structure prediction based on data partitioning and the semi-random subspace method (PSRSM). Data partitioning is a key strategy in our method. First, the protein training dataset is partitioned into several subsets according to protein sequence length. Then, on each subset, base classifiers are trained on subspace data generated by the semi-random subspace method and combined by majority vote into an ensemble classifier. This yields one ensemble classifier per subset, and the secondary structure of a protein is predicted by the ensemble that matches its sequence length. Experiments on the 25PDB, CB513, CASP10, CASP11, CASP12, and T100 datasets give performance of 86.38%, 84.53%, 85.51%, 85.89%, 85.55%, and 85.09%, respectively. The experimental results show that our method outperforms other state-of-the-art methods.
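The pipeline described above (partition by sequence length, train subspace-based base classifiers per partition, combine them by majority vote, route each protein to the ensemble for its length) can be illustrated with a minimal sketch. Several parts are assumptions and not taken from the paper: "semi-random" is read here as keeping a fixed core set of feature indices and drawing the remaining ones at random; the base learner is a simple nearest-centroid classifier used as a stand-in for whatever base classifier the authors use; and all names (`length_bin`, `semi_random_subspace`, `train_partitioned`, `boundaries`, `core`, `n_random`, `n_base`) are hypothetical.

```python
import random
from bisect import bisect_right
from collections import Counter


def length_bin(seq_len, boundaries):
    """Index of the length partition for a protein of length seq_len."""
    return bisect_right(boundaries, seq_len)


def semi_random_subspace(n_features, core, n_random, rng):
    """Assumed reading of 'semi-random': always keep the `core` feature
    indices and add `n_random` further indices drawn at random."""
    rest = [i for i in range(n_features) if i not in core]
    return sorted(core) + sorted(rng.sample(rest, n_random))


class NearestCentroid:
    """Stand-in base classifier that only sees the features in feature_idx."""

    def __init__(self, feature_idx):
        self.feature_idx = feature_idx
        self.centroids = {}

    def _project(self, x):
        return [x[i] for i in self.feature_idx]

    def fit(self, X, y):
        sums, counts = {}, Counter()
        for x, label in zip(X, y):
            proj = self._project(x)
            acc = sums.setdefault(label, [0.0] * len(proj))
            for j, value in enumerate(proj):
                acc[j] += value
            counts[label] += 1
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict_one(self, x):
        proj = self._project(x)
        return min(
            self.centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(proj, self.centroids[lab])),
        )


def train_ensemble(X, y, core, n_random, n_base, seed=0):
    """Train n_base classifiers, each on its own semi-random subspace."""
    rng = random.Random(seed)
    n_features = len(X[0])
    return [
        NearestCentroid(semi_random_subspace(n_features, core, n_random, rng)).fit(X, y)
        for _ in range(n_base)
    ]


def vote(ensemble, x):
    """Majority vote over the base classifiers' predictions."""
    return Counter(clf.predict_one(x) for clf in ensemble).most_common(1)[0][0]


def train_partitioned(samples, boundaries, **ensemble_args):
    """samples: iterable of (protein_length, feature_vector, label) per residue.
    Returns one ensemble per length partition."""
    groups = {}
    for seq_len, x, label in samples:
        X, y = groups.setdefault(length_bin(seq_len, boundaries), ([], []))
        X.append(x)
        y.append(label)
    return {b: train_ensemble(X, y, **ensemble_args) for b, (X, y) in groups.items()}


def predict(models, boundaries, seq_len, x):
    """Route the residue to the ensemble trained on proteins of similar length."""
    return vote(models[length_bin(seq_len, boundaries)], x)
```

In this sketch each training sample is one residue, described by a feature vector and tagged with the length of its parent protein, so `train_partitioned(samples, boundaries=[100], core=[0, 1], n_random=1, n_base=5)` would build two ensembles, one for proteins shorter than 100 residues and one for the rest. An odd `n_base` reduces the chance of tied votes.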