Ahmed Shafayat, Emon Muhit Islam, Moumi Nazifa Ahmed, Huang Lifu, Zhou Dawei, Vikesland Peter, Pruden Amy, Zhang Liqing
Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, USA.
Department of Civil and Environmental Engineering, Virginia Polytechnic Institute and State University, Blacksburg, USA.
Sci Rep. 2025 Aug 18;15(1):30174. doi: 10.1038/s41598-025-14545-4.
The evolution and spread of antibiotic resistance pose a global health challenge. Whole genome and metagenomic sequencing offer a promising approach to monitoring the spread, but typical alignment-based approaches for antibiotic resistance gene (ARG) detection are inherently limited in the ability to detect new variants. Large protein language models could present a powerful alternative but are limited by databases available for training. Here we introduce ProtAlign-ARG, a novel hybrid model combining a pre-trained protein language model and an alignment scoring-based model to expand the capacity for ARG detection from DNA sequencing data. ProtAlign-ARG learns from vast unannotated protein sequences, utilizing raw protein language model embeddings to improve the accuracy of ARG classification. In instances where the model lacks confidence, ProtAlign-ARG employs an alignment-based scoring method, incorporating bit scores and e-values to classify ARGs according to their corresponding classes of antibiotics. ProtAlign-ARG demonstrated remarkable accuracy in identifying and classifying ARGs, particularly excelling in recall compared to existing ARG identification and classification tools. We also extended ProtAlign-ARG to predict the functionality and mobility of ARGs, highlighting the model's robustness in various predictive tasks. A comprehensive comparison of ProtAlign-ARG with both the alignment-based scoring model and the pre-trained protein language model demonstrated the superior performance of ProtAlign-ARG.
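The confidence-gated fallback described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `AlignmentHit` and `classify_hybrid`, the threshold values, and the "unclassified" outcome when no alignment hit qualifies are all assumptions made for illustration. The class probabilities are assumed to come from some classifier built on protein language model embeddings, and the hits from some alignment tool that reports bit scores and e-values.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class AlignmentHit:
    """One alignment result against a reference ARG database (hypothetical type)."""
    arg_class: str    # antibiotic class of the matched reference sequence
    bit_score: float  # higher means a stronger alignment
    e_value: float    # lower means a more significant alignment


def classify_hybrid(
    class_probs: Dict[str, float],
    hits: List[AlignmentHit],
    confidence_threshold: float = 0.9,  # assumed value, not from the paper
    max_e_value: float = 1e-10,         # assumed value, not from the paper
    min_bit_score: float = 50.0,        # assumed value, not from the paper
) -> Tuple[Optional[str], str]:
    """Return (predicted class, source of the decision)."""
    # Step 1: trust the language-model classifier when it is confident.
    best_class = max(class_probs, key=class_probs.get)
    if class_probs[best_class] >= confidence_threshold:
        return best_class, "language_model"

    # Step 2: otherwise fall back to alignment evidence, keeping only hits
    # that pass both the e-value and the bit-score cutoffs.
    usable = [
        h for h in hits
        if h.e_value <= max_e_value and h.bit_score >= min_bit_score
    ]
    if usable:
        top_hit = max(usable, key=lambda h: h.bit_score)
        return top_hit.arg_class, "alignment"

    # Step 3: neither source is reliable enough (an assumption of this sketch).
    return None, "unclassified"


if __name__ == "__main__":
    uncertain = {"beta-lactam": 0.5, "tetracycline": 0.5}
    hits = [
        AlignmentHit("tetracycline", bit_score=120.0, e_value=1e-30),
        AlignmentHit("beta-lactam", bit_score=40.0, e_value=1e-3),
    ]
    print(classify_hybrid(uncertain, hits))
```

In this sketch the alignment step is consulted only when the top class probability falls below the threshold, which mirrors the "in instances where the model lacks confidence" wording of the abstract; how the published model measures confidence and combines bit scores with e-values is not specified here.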