TransFactor: prediction of pro-viral SARS-CoV-2 host factors using a protein language model.

Authors

An Yang, Bergant Valter, Firmani Samuele, Grünke Corinna, Bonnal Batiste, Henrici Alexander, Pichlmair Andreas, Schubert Benjamin, Marsico Annalisa

Affiliations

Computational Health Center, Helmholtz Center Munich, Neuherberg 85764, Germany.

School of Computation, Information and Technology, Technical University of Munich, Munich 80333, Germany.

Publication information

Bioinformatics. 2025 Sep 1;41(9). doi: 10.1093/bioinformatics/btaf491.

Abstract

MOTIVATION

Recent pandemics have revealed significant gaps in our understanding of viral pathogenesis, exposing an urgent need for methods to identify and prioritize key host proteins (host factors) as potential targets for antiviral treatments. De novo generation of experimental datasets is limited by their heterogeneity, and for looming future pandemics, may not be feasible due to limitations of experimental approaches.

RESULTS

Here, we present TransFactor, a computational framework for predicting and prioritizing candidate host factors using only protein sequence data. It leverages the pre-trained ESM-2 protein language model, fine-tuned on a limited set of experimentally determined host factors aggregated from 33 independent SARS-CoV-2 studies. TransFactor outperforms machine learning and deep learning baselines, and its predictions align with Gene Ontology enrichments of known host factors. It also provides interpretability through a computational alanine scan, enabling the identification of pro-viral protein domains such as COMM, PX, and RRM, which may be used to direct experimental investigations of virus biology and guide the rational design of antiviral therapies. Our findings demonstrate the potential of transformer-based models to advance host factor prediction, providing a framework extendable to orthogonal input modalities and other infectious diseases, enhancing our preparedness for current and future viral threats.
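The computational alanine scan mentioned above can be illustrated with a minimal sketch. It is not the TransFactor implementation: the function names `alanine_scan` and `toy_score` are hypothetical, and `toy_score` is a deliberately trivial stand-in for a fine-tuned model's pro-viral score. The idea is to substitute each residue with alanine in turn and record how much the predicted score drops. Residues with large drops are candidates for being important to the prediction.

```python
from typing import Callable, List, Tuple


def alanine_scan(
    sequence: str, score_fn: Callable[[str], float]
) -> List[Tuple[int, str, float]]:
    """Return (1-based position, original residue, score drop) per residue.

    score_fn is any callable mapping a protein sequence to a score; here it
    is an assumption standing in for a trained host-factor classifier.
    """
    baseline = score_fn(sequence)
    effects = []
    for i, residue in enumerate(sequence):
        if residue == "A":
            continue  # replacing alanine with alanine changes nothing
        mutant = sequence[:i] + "A" + sequence[i + 1:]
        effects.append((i + 1, residue, baseline - score_fn(mutant)))
    return effects


def toy_score(sequence: str) -> float:
    """Hypothetical scorer for illustration only: fraction of K/R residues."""
    return sum(residue in "KR" for residue in sequence) / len(sequence)


effects = alanine_scan("MKRAL", toy_score)
# Residue whose substitution lowers the score the most (first one on ties).
top = max(effects, key=lambda effect: effect[2])
```

With a real model, `score_fn` would wrap a forward pass, and per-residue drops could be aggregated over annotated domain boundaries to rank domains rather than single positions. That aggregation step is likewise an assumption here, not a description of the authors' pipeline.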

AVAILABILITY AND IMPLEMENTATION

Source code is available at https://github.com/marsico-lab/TransFactor. A full reproducibility package, including code, trained models, and data, is archived on Zenodo (https://doi.org/10.5281/zenodo.16793684).

