Di Gioacchino Andrea, Lecce Ivan, Greenbaum Benjamin D, Monasson Rémi, Cocco Simona
CNRS UMR 8023, Laboratory of Physics of the Ecole Normale Supérieure and PSL Research, Sorbonne Université, Paris, France.
Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA.
Mol Biol Evol. 2025 Jun 4;42(6). doi: 10.1093/molbev/msaf127.
How viruses evolve largely depends on their hosts. To quantitatively characterize this dependence, we introduce Maximum Entropy Nucleotide Bias (MENB) models learned from the single-, di-, and tri-nucleotide usage of viral sequences that infect a given host. We first use MENB to classify the viral family and the host of a virus from its genome, among four families of ssRNA viruses and three hosts. We show that both the viral family and the host leave a fingerprint in nucleotide motif usage that MENB models decode. Benchmarking our approach against state-of-the-art methods based on deep neural networks shows that MENB is rapid, interpretable, and robust. Our approach predicts, with good accuracy, both the viral family and the host from a whole genomic sequence or a portion of it. MENB models also display promising out-of-sample generalization on viral sequences from new host taxa or new viral families. Within the limitations imposed by the three-host setting, our approach can also identify intermediate hosts for well-known pathogenic strains of Influenza A subtypes and Human Coronavirus, as well as recombinations and reassortments in specific genomic regions. Finally, MENB models can be used to track adaptation to a new host, to shed light on the most relevant selective pressures that acted on motif usage during this process, and to design new sequences with altered nucleotide usage at fixed amino-acid content.
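To make the idea of a host "fingerprint" in motif usage concrete, the following is a minimal Python sketch, not the MENB implementation. It computes single-, di-, and tri-nucleotide frequencies and then assigns a sequence to a host with a second-order Markov chain fitted on trinucleotide counts, a deliberately simpler stand-in for the maximum entropy model described above. All function names, the pseudocount, the host labels, and the toy training sequences are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import log

ALPHABET = "ACGU"


def normalize(seq):
    """Upper-case the sequence and write it in the RNA alphabet."""
    return seq.upper().replace("T", "U")


def motif_counts(seq, k):
    """Count overlapping k-mers made only of canonical nucleotides."""
    seq = normalize(seq)
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(m for m in kmers if set(m) <= set(ALPHABET))


def motif_frequencies(seq, k):
    """Relative frequencies of overlapping k-mers (k = 1, 2, 3 here)."""
    counts = motif_counts(seq, k)
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()} if total else {}


def train_markov2(seqs, pseudo=1.0):
    """Fit log P(next | previous two) from trinucleotide counts.

    Stand-in for MENB training; 'pseudo' is an assumed smoothing constant.
    """
    tri = Counter()
    for s in seqs:
        tri.update(motif_counts(s, 3))
    model = {}
    for ctx in map("".join, product(ALPHABET, repeat=2)):
        total = sum(tri[ctx + n] for n in ALPHABET) + 4 * pseudo
        for n in ALPHABET:
            model[ctx + n] = log((tri[ctx + n] + pseudo) / total)
    return model


def log_likelihood(seq, model):
    """Sum of conditional log-probabilities over all trinucleotides."""
    seq = normalize(seq)
    kmers = (seq[i:i + 3] for i in range(len(seq) - 2))
    return sum(model[m] for m in kmers if m in model)


def classify(seq, models):
    """Return the label whose model gives the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(seq, models[label]))


# Toy usage with made-up sequences and hypothetical host labels.
models = {
    "host_A": train_markov2(["ACACACACACAC"]),
    "host_B": train_markov2(["GUGUGUGUGUGU"]),
}
predicted = classify("ACACAC", models)
```

In this sketch each host gets its own model, and a genome or genome fragment is scored against all of them, which mirrors the classification setting of the abstract only loosely; real data would need many sequences per host and a held-out evaluation.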