Krueger Tanja, Durmaz Damla A, Jimenez-Soto Luisa F
Walther-Straub Institute of Pharmacology and Toxicology, Ludwig-Maximilians-Universität in Munich, Goethestrasse, 80336, Munich, Bavaria, Germany.
Department of Informatics, Unit for Bioinformatics and Computational Biology, Technical University of Munich School of Computation, Information and Technology, Boltzmannstrasse, 85748, Garching/Munich, Bavaria, Germany.
BioData Min. 2025 Aug 8;18(1):52. doi: 10.1186/s13040-025-00469-2.
Bacterial exotoxins are secreted proteins that affect target cells and are associated with disease. Their accurate identification can enhance drug discovery and help ensure the safety of bacteria-based medical applications. However, current toxin predictors prioritize broad coverage by mixing toxins from multiple biological kingdoms and using diverse control sets. This general approach has proven sub-optimal for identifying niche toxins, such as bacterial exotoxins. Recent Protein Language Models offer an opportunity to improve toxin prediction by capturing global sequence context and biochemical properties from protein sequences.
We introduce Exo-Tox, a specialized predictor trained exclusively on curated datasets of bacterial exotoxins and secreted non-toxic bacterial proteins, represented as embeddings by Protein Language Models. Exo-Tox outperforms both Basic Local Alignment Search Tool (BLAST)-based methods and generalized toxin predictors across multiple metrics, achieving a Matthews correlation coefficient > 0.9. Notably, Exo-Tox's performance remains robust regardless of protein length or the presence of signal peptides. We also analyze its limited transferability to bacteriophage proteins and non-secreted proteins.
Exo-Tox reliably identifies bacterial exotoxins, filling a niche overlooked by generalized predictors. Our findings highlight the importance of domain-specific training data and emphasize that specialized predictors are necessary for accurate classification. We provide open access to the model, training data, and usage guidelines via the LMU Munich Open Data repository.
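For readers unfamiliar with the headline metric, the Matthews correlation coefficient can be computed from the four cells of a binary confusion matrix. The sketch below is only illustrative: the function name and the counts are hypothetical and are not taken from the Exo-Tox evaluation.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    tp/fn: toxins predicted as toxin / as non-toxin
    tn/fp: non-toxins predicted as non-toxin / as toxin
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, chosen only to illustrate the calculation.
mcc = matthews_corrcoef(tp=95, tn=96, fp=4, fn=5)
print(f"MCC = {mcc:.3f}")
```

The coefficient ranges from -1 to 1, and unlike accuracy it stays informative when the toxin and non-toxin classes differ in size, because it uses all four confusion-matrix cells.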