College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China.
Biomolecules. 2024 Sep 13;14(9):1155. doi: 10.3390/biom14091155.
Protein secondary structure prediction (PSSP) plays a crucial role in elucidating protein functions and properties. The field has made significant progress in recent years, and an important technical route has been to improve prediction accuracy with a variety of protein-related features, including amino acid sequences, position-specific scoring matrices (PSSM), amino acid properties, and secondary structure trend factors. However, current work lacks a comprehensive evaluation of how these factors affect secondary structure prediction. This study quantitatively analyzes the impact of several major factors on secondary structure prediction models using a more explanatory four-class machine learning approach. It examines in detail the applicability of each factor to different types of methods, how effectively the different methods exploit each factor, and the effect of multi-factor combinations. The experiments show that PSSM performs best in methods with strong capabilities for handling high-dimensional features and extracting complex features, while amino acid sequences, although performing poorly overall, do relatively well in methods with strong linear processing capabilities. In addition, combining amino acid properties with trend factors significantly improved prediction performance. This study provides empirical evidence that future researchers can use to optimize multi-factor feature combinations and apply them to protein secondary structure prediction models.
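To make the idea of a multi-factor feature combination concrete, the following is a minimal sketch of how per-residue features might be concatenated before being passed to a classifier. It is not the authors' implementation: the function names (`one_hot`, `combine_features`, `window_features`), the sliding-window step, and the toy PSSM and property values are all assumptions introduced for illustration.

```python
# Illustrative sketch only: names and toy values are hypothetical,
# not taken from the study.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(residue):
    """Encode one residue as a 20-dimensional one-hot vector.

    Residues outside the standard alphabet (e.g. 'X') become all zeros.
    """
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def combine_features(sequence, pssm=None, properties=None):
    """Concatenate the selected factors into one vector per residue.

    sequence   -- amino acid string
    pssm       -- optional list of per-residue score rows (one row per residue)
    properties -- optional dict mapping residue -> list of numeric values
                  (could hold amino acid properties or trend factors);
                  it must contain every residue that occurs in `sequence`
    """
    rows = []
    for i, residue in enumerate(sequence):
        row = one_hot(residue)
        if pssm is not None:
            row += list(pssm[i])
        if properties is not None:
            row += list(properties[residue])
        rows.append(row)
    return rows


def window_features(rows, half_width=1):
    """Flatten a sliding window around each residue, zero-padding the ends."""
    pad = [0.0] * len(rows[0])
    windows = []
    for i in range(len(rows)):
        flat = []
        for j in range(i - half_width, i + half_width + 1):
            flat.extend(rows[j] if 0 <= j < len(rows) else pad)
        windows.append(flat)
    return windows


if __name__ == "__main__":
    seq = "ACG"
    toy_pssm = [[0.0] * 20 for _ in seq]  # placeholder scores
    toy_props = {"A": [1.0, 0.0], "C": [0.0, 1.0], "G": [0.5, 0.5]}  # made-up values
    rows = combine_features(seq, pssm=toy_pssm, properties=toy_props)
    windows = window_features(rows, half_width=1)
    print(len(rows[0]), len(windows[0]))
```

Passing `None` for `pssm` or `properties` drops that factor, so the same helper can produce single-factor and multi-factor inputs for a side-by-side comparison across models.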