Prediction of virus-host associations using protein language models and multiple instance learning.

Authors

Liu Dan, Young Francesca, Lamb Kieran D, Robertson David L, Yuan Ke

Affiliations

MRC-University of Glasgow Centre for Virus Research, Glasgow, United Kingdom.

School of Computing Science, University of Glasgow, Glasgow, United Kingdom.

Publication details

PLoS Comput Biol. 2024 Nov 19;20(11):e1012597. doi: 10.1371/journal.pcbi.1012597. eCollection 2024 Nov.

DOI:10.1371/journal.pcbi.1012597
PMID:39561204
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11614202/
Abstract

Predicting virus-host associations is essential to determine the specific host species that viruses interact with, and to discover whether new viruses infect humans and animals. Currently, the host of the majority of viruses is unknown, particularly in microbiomes. To address this challenge, we introduce EvoMIL, a deep learning method that predicts the host species for viruses from viral sequences only. It also identifies important viral proteins that significantly contribute to host prediction. The method combines a pre-trained large protein language model (ESM) and attention-based multiple instance learning to allow protein-orientated predictions. Our results show that protein embeddings capture stronger predictive signals than sequence composition features, including amino acids, physicochemical properties, and DNA k-mers. In multi-host prediction tasks, EvoMIL achieves median F1 score improvements of 10.8%, 16.2%, and 4.9% in prokaryotic hosts, and 1.7%, 6.6%, and 11.5% in eukaryotic hosts. EvoMIL binary classifiers achieve AUC values above 0.95 for all prokaryotic hosts and between roughly 0.8 and 0.9 for eukaryotic hosts. Furthermore, EvoMIL identifies important proteins in the prediction task, capturing key functions involved in virus-host specificity.
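The abstract describes treating each virus as a bag of protein embeddings and pooling them with attention-based multiple instance learning. The sketch below illustrates that general pooling idea in plain Python. It is not the authors' implementation: the function names, the tanh scoring form, the toy 4-dimensional embeddings, and the random parameters are all assumptions made for illustration, and real ESM embeddings are much larger and would come from the pre-trained model.

```python
import math
import random


def attention_mil_pool(instances, V, w):
    """Pool a bag of instance embeddings into a single bag embedding.

    instances: list of K vectors (length D), e.g. one embedding per viral protein.
    V: hidden projection with L rows of length D (hypothetical parameter).
    w: scoring vector of length L (hypothetical parameter).
    Returns (bag_embedding, attention_weights).
    """
    scores = []
    for h in instances:
        # Score each instance as w . tanh(V h).
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wj * uj for wj, uj in zip(w, hidden)))

    # Softmax over the instances, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The bag embedding is the attention-weighted sum of instance embeddings.
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return bag, weights


def host_probability(bag, classifier_weights, bias):
    """Binary host classifier on the bag embedding (logistic regression)."""
    logit = sum(c * b for c, b in zip(classifier_weights, bag)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy example: one virus with three proteins, each a made-up 4-d embedding.
rng = random.Random(0)
proteins = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
V = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
w = [rng.uniform(-1, 1) for _ in range(2)]

bag, weights = attention_mil_pool(proteins, V, w)
prob = host_probability(bag, [0.5, -0.25, 0.1, 0.3], 0.0)

print("attention weights per protein:", weights)
print("predicted probability for this host:", prob)
```

In a setup like this, the attention weights give one score per protein, which is how a model of this kind can point to the proteins that contributed most to a host prediction.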

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c42d/11614202/79b7fb5d6507/pcbi.1012597.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c42d/11614202/55279316a353/pcbi.1012597.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c42d/11614202/ad6ba5e32c8e/pcbi.1012597.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c42d/11614202/c488fb290719/pcbi.1012597.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c42d/11614202/aa4153a14cba/pcbi.1012597.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c42d/11614202/b68b8874b94d/pcbi.1012597.g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c42d/11614202/e19814aa806d/pcbi.1012597.g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c42d/11614202/b729ce143341/pcbi.1012597.g008.jpg
