An Yang, Bergant Valter, Firmani Samuele, Grünke Corinna, Bonnal Batiste, Henrici Alexander, Pichlmair Andreas, Schubert Benjamin, Marsico Annalisa
Computational Health Center, Helmholtz Center Munich, Neuherberg 85764, Germany.
School of Computation, Information and Technology, Technical University of Munich, Munich 80333, Germany.
Bioinformatics. 2025 Sep 1;41(9). doi: 10.1093/bioinformatics/btaf491.
Recent pandemics have revealed significant gaps in our understanding of viral pathogenesis, exposing an urgent need for methods to identify and prioritize key host proteins (host factors) as potential targets for antiviral treatments. Experimental datasets generated de novo are limited by their heterogeneity, and for future pandemics generating them may not be feasible given the limitations of experimental approaches.
Here, we present TransFactor, a computational framework for predicting and prioritizing candidate host factors using only protein sequence data. It leverages the pre-trained ESM-2 protein language model, fine-tuned on a limited set of experimentally determined host factors aggregated from 33 independent SARS-CoV-2 studies. TransFactor outperforms machine learning and deep learning baselines, and its predictions align with Gene Ontology enrichments of known host factors. It also provides interpretability through a computational alanine scan, which enables the identification of pro-viral protein domains such as COMM, PX, and RRM that may be used to direct experimental investigations of virus biology and guide the rational design of antiviral therapies. Our findings demonstrate the potential of transformer-based models to advance host factor prediction, providing a framework extendable to orthogonal input modalities and other infectious diseases and enhancing our preparedness for current and future viral threats.
Source code is available at https://github.com/marsico-lab/TransFactor. A full reproducibility package, including code, trained models, and data, is archived on Zenodo (https://doi.org/10.5281/zenodo.16793684).
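The computational alanine scan mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which is in the repository linked above. The sketch assumes only the general idea of an alanine scan: substitute each residue with alanine in turn and record how much a sequence-level score changes. The names `alanine_scan`, `score_fn`, and `toy_score` are hypothetical, and `toy_score` is a placeholder that stands in for a trained predictor such as a fine-tuned protein language model.

```python
from typing import Callable, List, Tuple


def alanine_scan(
    sequence: str, score_fn: Callable[[str], float]
) -> List[Tuple[int, str, float]]:
    """Return (1-based position, original residue, score change) per residue.

    score_fn is any function mapping a protein sequence to a score. In a real
    setting it would wrap a trained model; that wrapper is assumed here.
    """
    baseline = score_fn(sequence)
    effects = []
    for i, residue in enumerate(sequence):
        if residue == "A":
            continue  # replacing alanine with alanine changes nothing
        mutant = sequence[:i] + "A" + sequence[i + 1:]
        effects.append((i + 1, residue, score_fn(mutant) - baseline))
    return effects


def toy_score(sequence: str) -> float:
    """Placeholder scorer: fraction of K/R residues. NOT a host-factor model."""
    return sum(residue in "KR" for residue in sequence) / len(sequence)


# Short made-up peptide, used only to exercise the function.
scan = alanine_scan("MKRAL", toy_score)
for position, residue, delta in scan:
    print(f"{residue}{position}A: score change {delta:+.3f}")
```

Positions whose substitution lowers the score the most are the ones the scorer relies on. With a real predictor in place of `toy_score`, contiguous runs of such positions could then be compared against annotated domain boundaries.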