González Lady L, Arias-Serrano Isaac, Villalba-Meneses Fernando, Navas-Boada Paulo, Cruz-Varela Jonathan
School of Biological Sciences and Engineering, University Yachay Tech, Urcuqui, Provincia de Imbabura, 100119, Ecuador.
F1000Res. 2025 Jun 20;13:981. doi: 10.12688/f1000research.154432.2. eCollection 2024.
The rise of antibiotic-resistant bacteria creates a pressing need for new natural compounds with novel mechanisms of action to replace existing antibiotics. Bacteriocins are promising alternatives for developing therapeutic and preventive strategies in livestock, aquaculture, and human health. In particular, those produced by lactic acid bacteria (LAB) are recognized as Generally Recognized as Safe (GRAS) and hold Qualified Presumption of Safety (QPS) status. This study aims to develop a deep learning model that classifies bacteriocins by their LAB origin, using interpretable k-mer features and embedding vectors to support applications in antimicrobial discovery.
We developed a deep neural network for binary classification of bacteriocin amino acid sequences (BacLAB vs. Non-BacLAB). Features were extracted using k-mers (k=3,5,7,15,20) and embedding vectors (EV). Ten feature combinations were tested (e.g., EV, EV+5-mers+7-mers). Sequences were filtered by length (50-2000 AA) to ensure uniformity, and the classes were kept approximately balanced (24,964 BacLAB vs. 25,000 Non-BacLAB). The model was trained on Google Colab, demonstrating computational accessibility without specialized hardware.
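The length filter and k-mer feature extraction described above can be illustrated with a minimal Python sketch. It assumes overlapping k-mers counted with a sliding window and a fixed, caller-supplied k-mer vocabulary; the function names (`passes_length_filter`, `kmer_counts`, `kmer_features`) and the example sequence are hypothetical and are not taken from the study's code.

```python
from collections import Counter

MIN_LEN, MAX_LEN = 50, 2000  # length bounds (in amino acids) stated in the abstract


def passes_length_filter(sequence):
    """Return True if the sequence length lies within the 50-2000 AA range."""
    return MIN_LEN <= len(sequence) <= MAX_LEN


def kmer_counts(sequence, k):
    """Count overlapping k-mers using a sliding window of width k (assumed scheme)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_features(sequence, ks, vocabulary):
    """Build a count vector over a fixed vocabulary for several k values at once."""
    counts = Counter()
    for k in ks:
        counts.update(kmer_counts(sequence, k))
    # Counter returns 0 for k-mers that never occur in the sequence
    return [counts[kmer] for kmer in vocabulary]


if __name__ == "__main__":
    seq = "MKKLLMKKLL" * 6  # made-up 60-residue sequence, for illustration only
    vocab = ["MKKLL", "KKLLM", "MKKLLMK"]  # hypothetical 5-mer and 7-mer vocabulary
    if passes_length_filter(seq):
        print(kmer_features(seq, ks=(5, 7), vocabulary=vocab))
```

In a pipeline like the one described, such count vectors would be concatenated with the embedding vector before being passed to the classifier; how the study combined them is not specified here.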
The '5-mers+7-mers+EV' group achieved the best performance, with k-fold cross-validation (k=30) yielding 9.90% loss, 90.14% accuracy, 90.30% precision, and 90.10% for both recall and F1 score. Fold 22 stood out with 8.50% loss, 91.47% accuracy, and 91.00% for precision, recall, and F1 score. Five sets of 100 LAB-specific k-mers were identified, revealing conserved motifs. Despite the high accuracy, the wide range of sequence lengths (50-2000 AA) may bias the k-mer representation in favor of longer sequences. In addition, experimental validation is required to confirm the biological activity of predicted bacteriocins. These limitations point to directions for future research.
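The cross-validation protocol behind these figures can be sketched as follows. This is a generic k-fold split written with the Python standard library only, assuming a simple shuffled, non-stratified partition; the study's actual splitting and training code is not given in the abstract, and `k_fold_indices` and `mean_metric` are hypothetical helper names.

```python
import random


def k_fold_indices(n_samples, n_folds=30, seed=0):
    """Yield (train_indices, test_indices) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def mean_metric(per_fold_values):
    """Average a metric (e.g. accuracy) over all folds."""
    return sum(per_fold_values) / len(per_fold_values)


if __name__ == "__main__":
    # Small toy size for illustration; the study used roughly 50,000 sequences
    for fold_number, (train_idx, test_idx) in enumerate(k_fold_indices(300), start=1):
        # A model would be trained on train_idx and evaluated on test_idx here
        print(fold_number, len(train_idx), len(test_idx))
```

Reported values such as the 90.14% accuracy correspond to an average of this kind over the 30 folds, while a single fold (here, fold 22) can score above or below that mean.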
The model developed in this study achieved results consistent with those reported in the reviewed literature and outperformed some studies by 3-10%. Its implementation in resource-limited settings is feasible via cloud platforms such as Google Colab. The identified k-mers could guide the design of synthetic antimicrobials, pending further in vitro validation.