Prediction of virus-host associations using protein language models and multiple instance learning.

Authors

Liu Dan, Young Francesca, Lamb Kieran D, Robertson David L, Yuan Ke

Affiliations

MRC-University of Glasgow Centre for Virus Research, Glasgow, United Kingdom.

School of Computing Science, University of Glasgow, Glasgow, United Kingdom.

Publication information

PLoS Comput Biol. 2024 Nov 19;20(11):e1012597. doi: 10.1371/journal.pcbi.1012597. eCollection 2024 Nov.

Abstract

Predicting virus-host associations is essential for determining which host species a virus interacts with and for discovering whether new viruses infect humans and animals. Currently, the hosts of most viruses are unknown, particularly in microbiomes. To address this challenge, we introduce EvoMIL, a deep learning method that predicts the host species of a virus from viral sequences only. It also identifies the viral proteins that contribute most to host prediction. The method combines a pre-trained large protein language model (ESM) with attention-based multiple instance learning to allow protein-orientated predictions. Our results show that protein embeddings capture stronger predictive signals than sequence composition features, including amino acids, physicochemical properties, and DNA k-mers. In multi-host prediction tasks, EvoMIL achieves median F1 score improvements of 10.8%, 16.2%, and 4.9% in prokaryotic hosts, and 1.7%, 6.6%, and 11.5% in eukaryotic hosts. EvoMIL binary classifiers achieve an AUC above 0.95 for all prokaryotic hosts and AUCs ranging from roughly 0.8 to 0.9 for eukaryotic hosts. Furthermore, EvoMIL identifies important proteins in the prediction task, capturing key functions involved in virus-host specificity.
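
To make the idea of attention-based multiple instance learning concrete, here is a minimal sketch of the commonly used tanh-attention pooling step, in which a virus is treated as a bag of protein embeddings. It is not the authors' implementation: the abstract does not specify EvoMIL's architecture, so the function names (`attention_mil_pool`, `predict_host_probability`), the toy dimensions, and the random untrained parameters are all assumptions for illustration, and random vectors stand in for real ESM protein embeddings.

```python
import math
import random


def attention_mil_pool(H, V, w):
    """Attention-based MIL pooling over one bag (hypothetical helper).

    H: list of n instance embeddings, each a list of d floats (one per viral protein).
    V: d x k matrix (list of rows) and w: list of k floats are attention parameters.
    Returns (bag_embedding, attention_weights).
    """
    d, k = len(V), len(w)
    scores = []
    for h in H:
        # Project the protein embedding, squash with tanh, reduce to one score.
        hidden = [math.tanh(sum(h[i] * V[i][j] for i in range(d))) for j in range(k)]
        scores.append(sum(hidden[j] * w[j] for j in range(k)))
    top = max(scores)                          # subtract the max for a stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]        # one weight per protein, summing to 1
    # Bag embedding = attention-weighted average of the protein embeddings.
    bag = [sum(a * h[i] for a, h in zip(weights, H)) for i in range(d)]
    return bag, weights


def predict_host_probability(H, V, w, c, b):
    """Binary host score for one virus: sigmoid of a linear layer on the bag embedding."""
    bag, weights = attention_mil_pool(H, V, w)
    logit = sum(x * y for x, y in zip(bag, c)) + b
    return 1.0 / (1.0 + math.exp(-logit)), weights


rng = random.Random(0)
d, k, n = 16, 8, 5                             # toy sizes, not the real ESM embedding width
H = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]    # stand-in protein embeddings
V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(d)]  # untrained random parameters
w = [rng.gauss(0, 1) for _ in range(k)]
c = [rng.gauss(0, 0.1) for _ in range(d)]
b = 0.0

prob, weights = predict_host_probability(H, V, w, c, b)
top_protein = max(range(n), key=lambda i: weights[i])  # index of the most heavily weighted protein
```

In a trained model of this kind, the attention weights are what would let one rank a virus's proteins by their contribution to the host prediction. Because the pooling is a weighted sum, the bag-level score does not depend on the order in which the proteins are listed.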

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c42d/11614202/79b7fb5d6507/pcbi.1012597.g001.jpg
