Tomaszewski Tre, DeVries Ryan S, Dong Mengyi, Bhatia Gitanshu, Norsworthy Miles D, Zheng Xuying, Caetano-Anollés Gustavo
Department of Information Sciences, University of Illinois, Urbana, IL, USA.
Department of Food Science & Human Nutrition, University of Illinois, Urbana, IL, USA.
Evol Bioinform Online. 2020 Oct 23;16:1176934320965149. doi: 10.1177/1176934320965149. eCollection 2020.
The massive worldwide spread of the SARS-CoV-2 virus is fueling the COVID-19 pandemic. Since the first whole-genome sequence was published in January 2020, a growing database of tens of thousands of viral genomes has been constructed. This offers opportunities to study pathways of molecular change in the expanding viral population that can help identify molecular culprits of virulence and virus spread. Here we investigate the genomic accumulation of mutations at various time points of the early pandemic to identify changes in mutationally highly active genomic regions that are occurring worldwide. We used the Wuhan NC_045512.2 sequence as a reference and sampled 15,342 indexed sequences from GISAID, translating them into proteins and grouping them by month of deposition. The per-position amino acid frequencies and Shannon entropies of the coding sequences were calculated for each month, and a map of intrinsic disorder regions and binding sites was generated. The analysis revealed dominant variants, most of which were located in loop regions and on the surface of the proteins. Mutation entropy decreased between March and April of 2020 after steady increases at several sites, including the D614G mutation site of the spike (S) protein, which was previously found to be associated with higher case fatality rates, and sites of the NSP12 polymerase and the NSP13 helicase proteins. Mutations that notably expanded between March and April include R203K and G204R of the nucleocapsid (N) protein inter-domain linker region and G251V of the viroporin encoded by ORF3a. The regions spanning these mutations exhibited significant intrinsic disorder, which was enhanced and decreased by the N-protein and viroporin 3a protein mutations, respectively. These results predict an ongoing mutational shift from the spike and replication complex to other regions, especially to encoded molecules known to represent major β-interferon antagonists. The study provides valuable information for therapeutics and vaccine design, as well as insight into mutation tendencies that could facilitate preventive control.
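The per-position frequency and Shannon entropy calculation described in the abstract can be illustrated with a minimal sketch. It assumes the protein sequences are already aligned to equal length and grouped by month; the function names and the toy three-residue sequences are hypothetical and are not taken from the study's pipeline or data.

```python
from collections import Counter
from math import log2


def column_frequencies(aligned_seqs, position):
    """Relative frequency of each residue at a 0-based alignment column."""
    counts = Counter(seq[position] for seq in aligned_seqs)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}


def shannon_entropy(freqs):
    """Shannon entropy in bits: H = sum over residues of -p * log2(p)."""
    return sum(-p * log2(p) for p in freqs.values() if p > 0)


def entropy_profile(aligned_seqs):
    """Entropy at every column of a set of equal-length aligned sequences."""
    length = len(aligned_seqs[0])
    return [
        shannon_entropy(column_frequencies(aligned_seqs, i))
        for i in range(length)
    ]


# Toy input (hypothetical): aligned protein fragments keyed by deposition month.
by_month = {
    "2020-03": ["MDG", "MDG", "MGG", "MGG"],
    "2020-04": ["MGG", "MGG", "MGG", "MDG"],
}

# One entropy profile per month; comparing profiles across months shows
# where variability is rising or falling.
profiles = {month: entropy_profile(seqs) for month, seqs in by_month.items()}

for month, profile in profiles.items():
    print(month, [round(h, 3) for h in profile])
```

In this toy case the entropy at the second column falls from March to April because one residue comes to dominate, which is the kind of pattern expected at a site where a variant approaches fixation.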