Prediction of virus-host associations using protein language models and multiple instance learning.

Authors

Liu Dan, Young Francesca, Lamb Kieran D, Robertson David L, Yuan Ke

Affiliations

MRC-University of Glasgow Centre for Virus Research, Glasgow, United Kingdom.

School of Computing Science, University of Glasgow, Glasgow, United Kingdom.

Publication information

PLoS Comput Biol. 2024 Nov 19;20(11):e1012597. doi: 10.1371/journal.pcbi.1012597. eCollection 2024 Nov.

Abstract

Predicting virus-host associations is essential for determining which host species viruses interact with and for discovering whether new viruses infect humans and animals. Currently, the hosts of most viruses are unknown, particularly in microbiomes. To address this challenge, we introduce EvoMIL, a deep learning method that predicts the host species of a virus from viral sequences alone. It also identifies the viral proteins that contribute most to host prediction. The method combines a pre-trained large protein language model (ESM) with attention-based multiple instance learning to allow protein-oriented predictions. Our results show that protein embeddings capture stronger predictive signals than sequence composition features, including amino acids, physicochemical properties, and DNA k-mers. In multi-host prediction tasks, EvoMIL achieves median F1 score improvements of 10.8%, 16.2%, and 4.9% in prokaryotic hosts, and 1.7%, 6.6%, and 11.5% in eukaryotic hosts. EvoMIL binary classifiers achieve AUC values above 0.95 for all prokaryotic hosts and between roughly 0.8 and 0.9 for eukaryotic hosts. Furthermore, EvoMIL identifies important proteins in the prediction task, capturing key functions involved in virus-host specificity.
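
To make the idea of attention-based multiple instance learning concrete, the following is a minimal sketch, not the EvoMIL implementation. It treats one virus as a "bag" of per-protein embedding vectors, scores each protein with a small attention function, and pools the bag into a single vector for a binary host classifier. Everything here is an assumption for illustration: the function names, the tiny embedding dimension, the random untrained weights, and the random vectors standing in for ESM protein embeddings.

```python
import math
import random


def attention_mil_pool(instances, V, w):
    """Pool a bag of instance vectors into one bag vector.

    instances: list of n vectors (one per protein), each of length d
    V: hidden x d projection matrix (list of rows); w: scoring vector of length hidden
    Returns (bag_vector of length d, attention weights of length n).
    """
    # Unnormalised attention score per instance: w . tanh(V h)
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * ti for wi, ti in zip(w, hidden)))

    # Softmax over the instances in the bag (shifted by the max for stability)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Bag vector is the attention-weighted sum of the instance vectors
    dim = len(instances[0])
    bag = [sum(a * h[j] for a, h in zip(weights, instances)) for j in range(dim)]
    return bag, weights


def host_probability(bag, c, b):
    """Logistic classifier on the pooled bag vector (one classifier per host)."""
    logit = sum(cj * zj for cj, zj in zip(c, bag)) + b
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical toy data: 5 proteins with 16-dimensional embeddings.
# Real protein language model embeddings are much larger; these are random stand-ins.
random.seed(0)
n_proteins, dim, hidden_dim = 5, 16, 8
protein_embeddings = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_proteins)]
V = [[random.gauss(0, 0.3) for _ in range(dim)] for _ in range(hidden_dim)]
w = [random.gauss(0, 0.3) for _ in range(hidden_dim)]
c = [random.gauss(0, 0.3) for _ in range(dim)]

bag, weights = attention_mil_pool(protein_embeddings, V, w)
prob = host_probability(bag, c, b=0.0)

# The protein with the largest attention weight contributes most to the bag vector
top_protein = max(range(n_proteins), key=lambda i: weights[i])
print(f"host probability: {prob:.3f}, most-attended protein index: {top_protein}")
```

In a trained model the attention weights would be learned jointly with the classifier, so ranking proteins by weight gives one way to read off which proteins drive a host prediction. With the untrained random weights used here, the ranking carries no biological meaning.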

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c42d/11614202/79b7fb5d6507/pcbi.1012597.g001.jpg
