MultiToxPred 1.0: a novel comprehensive tool for predicting 27 classes of protein toxins using an ensemble machine learning approach.

Affiliations

Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile.

Departamento de Ciencias Básicas, Facultad de Ciencias, Universidad Santo Tomas, Temuco, Chile.

Publication information

BMC Bioinformatics. 2024 Apr 12;25(1):148. doi: 10.1186/s12859-024-05748-z.

DOI:10.1186/s12859-024-05748-z

PMID:38609877

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11010298/

Abstract

Protein toxins are defense mechanisms and adaptations found in various organisms and microorganisms, and their use in scientific research as therapeutic candidates is gaining relevance due to their effectiveness and specificity against cellular targets. However, discovering these toxins is time-consuming and expensive. In silico tools, particularly those based on machine learning and deep learning, have emerged as valuable resources to address this challenge. Existing tools primarily focus on binary classification, determining whether a protein is a toxin or not, and occasionally identifying specific types of toxins. For the first time, we propose a novel approach capable of classifying protein toxins into 27 distinct categories based on their mode of action within cells. To accomplish this, we assessed multiple machine learning techniques and found that an ensemble model incorporating the Light Gradient Boosting Machine and Quadratic Discriminant Analysis algorithms exhibited the best performance. During the tenfold cross-validation on the training dataset, our model exhibited notable metrics: 0.840 accuracy, 0.827 F1 score, 0.836 precision, 0.840 sensitivity, and 0.989 AUC. In the testing stage, using an independent dataset, the model achieved 0.846 accuracy, 0.838 F1 score, 0.847 precision, 0.849 sensitivity, and 0.991 AUC. These results present a powerful next-generation tool called MultiToxPred 1.0, accessible through a web application. We believe that MultiToxPred 1.0 has the potential to become an indispensable resource for researchers, facilitating the efficient identification of protein toxins. By leveraging this tool, scientists can accelerate their search for these toxins and advance their understanding of their therapeutic potential.
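The abstract does not say how the two classifiers are combined or how the multiclass metrics are averaged. As a rough illustration only, the sketch below assumes soft voting (averaging the per-class probabilities of two models) and macro-averaged precision, sensitivity, and F1. The probability values, the three-class toy setup, and the function names `soft_vote` and `macro_metrics` are hypothetical and are not taken from the paper; AUC is omitted.

```python
from collections import defaultdict


def soft_vote(prob_a, prob_b, weight_a=0.5):
    """Blend two per-class probability lists and return the winning class index.

    Assumption: a simple weighted average (soft voting); the paper's actual
    combination rule is not stated in the abstract.
    """
    combined = [weight_a * pa + (1.0 - weight_a) * pb
                for pa, pb in zip(prob_a, prob_b)]
    return max(range(len(combined)), key=combined.__getitem__)


def macro_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, sensitivity, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "sensitivity": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical class probabilities for four proteins and three toxin classes.
probs_model_a = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.1, 0.3, 0.6], [0.6, 0.3, 0.1]]
probs_model_b = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]]
y_true = [0, 1, 2, 1]

y_pred = [soft_vote(a, b) for a, b in zip(probs_model_a, probs_model_b)]
metrics = macro_metrics(y_true, y_pred)
print(y_pred, metrics)
```

In practice the two probability lists would come from trained Light Gradient Boosting Machine and Quadratic Discriminant Analysis models, and the same metric function would be applied to each cross-validation fold and to the independent test set.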
