Charoenkwan Phasit, Chumnanpuen Pramote, Schaduangrat Nalini, Shoombuatong Watshara
Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Thailand.
Department of Zoology, Faculty of Science, Kasetsart University, Bangkok, Thailand.
J Biomol Struct Dyn. 2024 Feb 22:1-13. doi: 10.1080/07391102.2024.2318482.
Plant-allergenic proteins (PAPs) have the potential to induce allergic reactions in certain individuals. While these proteins are generally innocuous for the majority of people, they can elicit an immune response in those with particular sensitivities. Thus, screening and prioritizing the allergenic potential of plant proteins is indispensable for the development of diagnostic tools, therapeutic interventions or medications to treat allergic reactions. However, investigating the allergenic potential of plant proteins with experimental methods is costly and labour-intensive. Therefore, we developed StackPAP, a three-layer stacking ensemble framework for accurate large-scale identification of PAPs. In the first layer of StackPAP, we conducted a comprehensive analysis of an extensive set of feature descriptors. Subsequently, we selected and fused five potential sequence-based feature descriptors: amphiphilic pseudo-amino acid composition, dipeptide deviation from expected mean, amino acid composition, pseudo-amino acid composition and dipeptide composition. Additionally, we applied an efficient genetic algorithm (GA-SAR) to determine informative feature sets. In the second layer, 12 powerful machine learning (ML) methods, in combination with all the informative feature sets, were employed to construct a pool of base classifiers. Finally, 13 potential base classifiers were selected using the GA-SAR method and combined to develop the final meta-classifier. Our experimental results revealed the promising prediction performance of StackPAP, with an accuracy, Matthews correlation coefficient and AUC of 0.984, 0.969 and 0.993, respectively, on the independent test dataset. In conclusion, both cross-validation and independent test results indicated the superior performance of StackPAP compared with several ML-based classifiers.
To accelerate the identification of the allergenicity of plant proteins, we developed a user-friendly web server for StackPAP (https://pmlabqsar.pythonanywhere.com/StackPAP). We anticipate that StackPAP will be an efficient and useful tool for rapidly screening PAPs from a vast number of plant proteins.
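To make two of the sequence-based descriptors named in the abstract concrete, the following is a minimal Python sketch of amino acid composition (20 residue frequencies) and dipeptide composition (400 residue-pair frequencies), fused by simple concatenation. It is an illustration only and not the StackPAP implementation: the function names are hypothetical, skipping non-standard residues is an assumed simplification, and concatenation is an assumed reading of "fused". The other three descriptors and the GA-SAR selection step are not shown.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order that defines the feature layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def amino_acid_composition(seq):
    """Fraction of each standard residue in seq (20 values summing to 1)."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair in seq (400 values summing to 1).

    Assumption: pairs containing a non-standard residue are skipped.
    """
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[d] / total for d in DIPEPTIDES]


def fused_features(seq):
    """Concatenate both descriptors into one 420-dimensional vector (assumed fusion)."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)


if __name__ == "__main__":
    vector = fused_features("MKTAYIAKQR")  # toy sequence, not a real allergen
    print(len(vector))
```

A vector of this kind, computed for every protein in a dataset, is what a base classifier in a stacking ensemble would take as input.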