Forecasting SARS-CoV-2 spike protein evolution from small data by deep learning and regression.

Author information

King Samuel, Chen Xinyi E, Ng Sarah W S, Rostin Kimia, Hahn Samuel V, Roberts Tylo, Schwab Janella C, Sekhon Parneet, Kagieva Madina, Reilly Taylor, Qi Ruo Chen, Salman Paarsa, Hong Ryan J, Ma Eric J, Hallam Steven J

Affiliations

International Genetically Engineered Machine (iGEM) Team, University of British Columbia, Vancouver, BC, Canada.

Department of Botany, University of British Columbia, Vancouver, BC, Canada.

Publication information

Front Syst Biol. 2024 Apr 9;4:1284668. doi: 10.3389/fsysb.2024.1284668. eCollection 2024.

Abstract

The emergence of SARS-CoV-2 variants during the COVID-19 pandemic caused frequent global outbreaks that confounded public health efforts across many jurisdictions, highlighting the need for better understanding and prediction of viral evolution. Predictive models have been shown to support disease prevention efforts, such as with the seasonal influenza vaccine, but they require abundant data. For emerging viruses of concern, such models should ideally function with the relatively sparse data typically encountered at the early stages of a viral outbreak. Conventional discrete approaches have proven difficult to develop because amino acid mutations are spurious and reversible, and because the overwhelming number of possible protein sequences adds computational complexity. We hypothesized that these challenges could be addressed by encoding discrete protein sequences into continuous numbers, effectively reducing the data size while enhancing the resolution of evolutionarily relevant differences. To this end, we developed a viral protein evolution prediction model (VPRE), which reduces amino acid sequences into continuous numbers by using an artificial neural network called a variational autoencoder (VAE) and models their most statistically likely evolutionary trajectories over time using Gaussian process (GP) regression. To demonstrate VPRE, we used a small number of early SARS-CoV-2 spike protein sequences. We show that the VAE can be trained on a synthetic dataset based on these data. To recapitulate evolution along a phylogenetic path, we used only 104 spike protein sequences and trained the GP regression with the numerical variables to project evolution up to 5 months into the future. Our predictions contained novel variants, and the most frequent prediction mapped primarily to a sequence that differed by only a single amino acid from the most reported spike protein within the prediction timeframe. Novel variants in the spike receptor binding domain (RBD) were capable of binding human angiotensin-converting enzyme 2 (ACE2), with comparable or better binding than previously resolved RBD-ACE2 complexes. Together, these results indicate the utility and tractability of combining deep learning and regression to model viral protein evolution with relatively sparse datasets, toward developing more effective medical interventions.
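
The encode-then-regress idea described in the abstract can be illustrated with a toy sketch. This is not the authors' VPRE code and makes several loud assumptions: PCA (via SVD) stands in for the VAE, the GP is a bare-bones NumPy implementation with a fixed RBF kernel rather than a tuned model, and the four-residue sequences, month indices, hyperparameters, and all function names are invented purely for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Encode an amino acid string as a flat 0/1 vector of length len(seq) * 20."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, aa in enumerate(seq):
        x[i, AMINO_ACIDS.index(aa)] = 1.0
    return x.ravel()

def decode(vec, length):
    """Map a continuous vector back to the highest-scoring residue per position."""
    idx = vec.reshape(length, len(AMINO_ACIDS)).argmax(axis=1)
    return "".join(AMINO_ACIDS[i] for i in idx)

def rbf_kernel(a, b, length_scale=2.0, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of time points."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predict(t_train, y_train, t_test, noise=1e-2):
    """GP posterior mean and standard deviation (targets centered on their mean)."""
    y_mean = y_train.mean()
    K = rbf_kernel(t_train, t_train) + noise * np.eye(len(t_train))
    K_s = rbf_kernel(t_test, t_train)
    mean = K_s @ np.linalg.solve(K, y_train - y_mean) + y_mean
    cov = rbf_kernel(t_test, t_test) - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

# Made-up toy lineage: one sequence per month (NOT real spike data).
seqs = ["ACDK", "ACDK", "ACDR", "ACER", "ACER", "GCER"]
months = np.arange(6.0)

# Step 1: discrete sequences -> numeric matrix.
X = np.stack([one_hot(s) for s in seqs])

# Step 2: compress to a few continuous coordinates (PCA as a stand-in for the VAE).
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
n_latent = 2
Z = (X - mu) @ Vt[:n_latent].T

# Step 3: regress each latent coordinate on time and extrapolate forward.
future = np.array([6.0, 7.0])
Z_pred = np.column_stack(
    [gp_predict(months, Z[:, k], future)[0] for k in range(n_latent)]
)

# Step 4: map predicted latent points back to sequences.
X_pred = Z_pred @ Vt[:n_latent] + mu
forecast = [decode(x, len(seqs[0])) for x in X_pred]
print(forecast)
```

The GP also returns a standard deviation, which grows as the query time moves away from the observed months; in a sketch like this, that uncertainty is what would limit how far ahead a forecast can reasonably be read.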

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0407/12341966/d49ba18d3520/fsysb-04-1284668-g001.jpg
