Park Jonathan J, Chen Sidi
Department of Genetics, Yale University School of Medicine, New Haven, CT, USA.
Systems Biology Institute, Yale University, West Haven, CT, USA.
Patterns (N Y). 2022 Feb 11;3(2):100407. doi: 10.1016/j.patter.2021.100407. Epub 2021 Nov 18.
The COVID-19 pandemic caused by SARS-CoV-2 has become a major threat across the globe. Here, we developed machine learning approaches to identify key pathogenic regions in coronavirus genomes. We trained and evaluated 7,562,625 models on 3,665 genomes, including SARS-CoV-2, MERS-CoV, SARS-CoV, and other coronaviruses of human and animal origin, to return quantitative and biologically interpretable signatures at nucleotide and amino acid resolution. We identified hotspots across the SARS-CoV-2 genome, including previously unappreciated features in spike, RdRp, and other proteins. Finally, we integrated pathogenicity genomic profiles with B cell and T cell epitope predictions to enrich for sequence targets that could help guide vaccine development. These results provide a systematic map of predicted pathogenicity in SARS-CoV-2 that incorporates sequence, structural, and immunologic features and offers an unbiased collection of genetic elements for functional studies. This metavirome-based framework can also be applied for rapid characterization of new coronavirus strains or emerging pathogenic viruses.