Zhang Xu, Liu Yiwei, Wang Yaming, Zhang Liang, Feng Lin, Jin Bo, Zhang Hongzhe
College of Mechanical Engineering, Dalian University of Technology, Dalian, China.
School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian, China.
Front Genet. 2022 May 23;13:769828. doi: 10.3389/fgene.2022.769828. eCollection 2022.
In bioinformatics, understanding protein secondary structure is important for studying diseases and finding new treatments. Because determining protein secondary structure by physical experiment is time-consuming and expensive, pattern recognition and machine learning methods have been proposed to predict it. However, most of these methods achieve similar performance, which suggests that they have reached a model capacity bottleneck. Since both model design and the learning process affect a model's learning capacity, we focus on the latter. To this end, we propose a framework called the Multistage Combination Classifier Augmented Model (MCCM) for protein secondary structure prediction. First, a feature extraction module extracts features with different levels of learning difficulty. Second, multistage combination classifiers learn separate decision boundaries for easy and hard samples; the hard-sample classifier penalizes the loss on hard samples, which improves prediction performance on them. Third, a sample difficulty discrimination module, based on the Dirichlet distribution and an information entropy measure, assigns samples of different learning difficulty to these classifiers. Experimental results on the publicly available CB513 benchmark dataset show that our method outperforms most state-of-the-art models.
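The abstract does not give the formulas behind the sample difficulty discrimination module, so the following is only a minimal sketch of one way a Dirichlet distribution and an entropy measure could be combined to split samples into easy and hard groups. It assumes that a classifier emits non-negative per-class "evidence" values, that the Dirichlet parameters are `alpha = evidence + 1`, and that a sample counts as hard when the normalized entropy of the Dirichlet mean exceeds a threshold. The function names, the three-class setup, and the threshold of 0.5 are all hypothetical and are not taken from the paper.

```python
import math


def dirichlet_mean(evidence):
    """Expected class probabilities of a Dirichlet with alpha = evidence + 1.

    The alpha = evidence + 1 parameterization is an assumption of this sketch.
    """
    alpha = [e + 1.0 for e in evidence]
    total = sum(alpha)
    return [a / total for a in alpha]


def normalized_entropy(probs):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def is_hard(evidence, threshold=0.5):
    """Flag a sample as hard when its normalized entropy exceeds the threshold.

    The threshold value is hypothetical, not a value reported by the authors.
    """
    return normalized_entropy(dirichlet_mean(evidence)) > threshold


def route(samples, threshold=0.5):
    """Split sample indices into (easy, hard) lists by predictive entropy."""
    easy, hard = [], []
    for idx, evidence in enumerate(samples):
        (hard if is_hard(evidence, threshold) else easy).append(idx)
    return easy, hard


if __name__ == "__main__":
    # Two residues with per-class evidence for three structure classes.
    confident = [20.0, 0.5, 0.5]  # strong evidence for the first class
    ambiguous = [1.0, 1.0, 1.0]   # equal evidence for every class
    print(route([confident, ambiguous]))
```

In a sketch like this, samples in the easy list would go to the first-stage classifier and samples in the hard list to a later-stage classifier trained with a larger loss weight on them; how MCCM actually performs the assignment and weighting is described in the full paper, not in this abstract.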