Forecasting SARS-CoV-2 spike protein evolution from small data by deep learning and regression.

Author information

King Samuel, Chen Xinyi E, Ng Sarah W S, Rostin Kimia, Hahn Samuel V, Roberts Tylo, Schwab Janella C, Sekhon Parneet, Kagieva Madina, Reilly Taylor, Qi Ruo Chen, Salman Paarsa, Hong Ryan J, Ma Eric J, Hallam Steven J

Affiliations

International Genetically Engineered Machine (iGEM) Team, University of British Columbia, Vancouver, BC, Canada.

Department of Botany, University of British Columbia, Vancouver, BC, Canada.

Publication information

Front Syst Biol. 2024 Apr 9;4:1284668. doi: 10.3389/fsysb.2024.1284668. eCollection 2024.

DOI:10.3389/fsysb.2024.1284668

PMID:40809129

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12341966/

Abstract

The emergence of SARS-CoV-2 variants during the COVID-19 pandemic caused frequent global outbreaks that confounded public health efforts across many jurisdictions, highlighting the need for better understanding and prediction of viral evolution. Predictive models have been shown to support disease prevention efforts, such as with the seasonal influenza vaccine, but they require abundant data. For emerging viruses of concern, such models should ideally function with relatively sparse data typically encountered at the early stages of a viral outbreak. Conventional discrete approaches have proven difficult to develop due to the spurious and reversible nature of amino acid mutations and the overwhelming number of possible protein sequences adding computational complexity. We hypothesized that these challenges could be addressed by encoding discrete protein sequences into continuous numbers, effectively reducing the data size while enhancing the resolution of evolutionarily relevant differences. To this end, we developed a viral protein evolution prediction model (VPRE), which reduces amino acid sequences into continuous numbers by using an artificial neural network called a variational autoencoder (VAE) and models their most statistically likely evolutionary trajectories over time using Gaussian process (GP) regression. To demonstrate VPRE, we used a small number of early SARS-CoV-2 spike protein sequences. We show that the VAE can be trained on a synthetic dataset based on this data. To recapitulate evolution along a phylogenetic path, we used only 104 spike protein sequences and trained the GP regression with the numerical variables to project evolution up to 5 months into the future. Our predictions contained novel variants and the most frequent prediction mapped primarily to a sequence that differed by only a single amino acid from the most reported spike protein within the prediction timeframe.
Novel variants in the spike receptor binding domain (RBD) were capable of binding human angiotensin-converting enzyme 2 (ACE2), with comparable or better binding than previously resolved RBD-ACE2 complexes. Together, these results indicate the utility and tractability of combining deep learning and regression to model viral protein evolution with relatively sparse datasets, toward developing more effective medical interventions.
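
The regression step described in the abstract can be illustrated with a minimal sketch. It assumes the VAE encoder has already mapped each sequence to a single continuous latent coordinate; the encoder itself is not shown, and the latent values, the time grid, the squared-exponential kernel, and all hyperparameters below are hypothetical choices made for illustration, not details taken from the paper. The code fits a zero-mean Gaussian process to that coordinate over time and evaluates the posterior mean at future time points, using only the Python standard library.

```python
import math

def rbf_kernel(a, b, length_scale=2.0, variance=1.0):
    """Squared-exponential covariance between two scalar time points."""
    return variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_posterior_mean(train_t, train_z, test_t, noise=1e-2, **kernel_args):
    """Posterior mean of a zero-mean GP at test_t given noisy observations."""
    K = [[rbf_kernel(a, b, **kernel_args) + (noise if i == j else 0.0)
          for j, b in enumerate(train_t)]
         for i, a in enumerate(train_t)]
    alpha = solve(K, train_z)  # alpha = (K + noise * I)^-1 z
    return [sum(rbf_kernel(t, a, **kernel_args) * w
                for a, w in zip(train_t, alpha))
            for t in test_t]

# Hypothetical data: one latent coordinate observed over five months.
months = [0.0, 1.0, 2.0, 3.0, 4.0]
latent = [0.0, 0.2, 0.35, 0.5, 0.7]

# Project the latent coordinate two months past the last observation.
future = [5.0, 6.0]
forecast = gp_posterior_mean(months, latent, future, noise=1e-2, length_scale=2.0)
print(forecast)
```

In a full pipeline of the kind the abstract outlines, each latent dimension would be forecast in this way and the projected latent vectors passed through the VAE decoder to recover candidate amino acid sequences. A zero-mean prior pulls long-range forecasts back toward zero, so the kernel and mean function would need to be chosen with the forecasting horizon in mind.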


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0407/12341966/d49ba18d3520/fsysb-04-1284668-g001.jpg
