Davis Phillip E, Russell Joseph A
MRIGlobal, Gaithersburg, MD, United States.
Front Bioinform. 2025 Mar 18;5:1562668. doi: 10.3389/fbinf.2025.1562668. eCollection 2025.
Predicting phenotypic properties of a virus directly from its sequence data is an attractive goal for viral epidemiology. Here, we focus narrowly on the Orthocoronavirinae clade and demonstrate models that strongly predict a human-pathogen phenotype, achieving 76.74% average precision and 85.96% average recall on the withheld test set groups using only Orf1ab codon frequencies. We also show alternative viral coding sequences and feature representations that do not perform well, and discuss what distinguishes the performant models. These models point to a small subset of features, specifically 5 codons, that are critical to their success. We discuss and contextualize how this observation may fit within a larger model for the role of translation in virus-host agreement.
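As an illustration of the feature representation named in the abstract, the following is a minimal sketch of how codon frequencies could be computed from a single coding sequence. It is not the authors' pipeline: the function name `codon_frequencies`, the toy sequence, the choice to skip codons containing ambiguous bases, and the single-frame reading (which ignores the ribosomal frameshift within coronavirus Orf1ab) are all assumptions made for the example.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so every sequence maps to the same feature layout.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(cds: str) -> dict[str, float]:
    """Return the relative frequency of each of the 64 codons in a coding sequence.

    Hypothetical helper: reads the sequence in a single frame starting at
    position 0 and drops any trailing partial codon.
    """
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    triplets = (seq[i:i + 3] for i in range(0, usable, 3))
    # Assumption: codons with ambiguous bases (e.g. N) are skipped, not imputed.
    counts = Counter(t for t in triplets if set(t) <= set("ACGT"))
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in CODONS}
    return {c: counts[c] / total for c in CODONS}


# Toy sequence (not real Orf1ab data): ATG GCT GCT AAA TAA
freqs = codon_frequencies("ATGGCTGCTAAATAA")
```

The resulting 64-value vector per genome is the kind of input a classifier could be trained on; which codons matter most would have to come from the fitted model, not from this step.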