MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA.
Broad Institute of MIT and Harvard, Cambridge, MA, USA.
Nat Commun. 2021 May 11;12(1):2642. doi: 10.1038/s41467-021-22905-7.
Despite its clinical importance, the SARS-CoV-2 gene set remains unresolved, hindering dissection of COVID-19 biology. We use comparative genomics to provide a high-confidence protein-coding gene set, characterize evolutionary constraint, and prioritize functional mutations. We select 44 Sarbecovirus genomes at ideally-suited evolutionary distances, and quantify protein-coding evolutionary signatures and overlapping constraint. We find strong protein-coding signatures for ORFs 3a, 6, 7a, 7b, 8, 9b, and a novel alternate-frame gene, ORF3c, whereas ORFs 2b, 3d/3d-2, 3b, 9c, and 10 lack protein-coding signatures or convincing experimental evidence of protein-coding function. Furthermore, we show no other conserved protein-coding genes remain to be discovered. Mutation analysis suggests ORF8 contributes to within-individual fitness but not person-to-person transmission. Cross-strain and within-strain evolutionary pressures agree, except for fewer-than-expected within-strain mutations in nsp3 and S1, and more-than-expected in nucleocapsid, which shows a cluster of mutations in a predicted B-cell epitope, suggesting immune-avoidance selection. Evolutionary histories of residues disrupted by spike-protein substitutions D614G, N501Y, E484K, and K417N/T provide clues about their biology, and we catalog likely-functional co-inherited mutations. Previously reported RNA-modification sites show no enrichment for conservation. Here we report a high-confidence gene set and evolutionary-history annotations providing valuable resources and insights on SARS-CoV-2 biology, mutations, and evolution.
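The comparison of observed against expected within-strain mutation counts mentioned above can be illustrated with a simple binomial tail test. This is a minimal sketch, not the authors' method: it assumes mutations fall uniformly along the genome under the null, and the function names (`binomial_tail`, `region_mutation_test`) and all counts in the usage lines are hypothetical placeholders rather than values from the paper.

```python
from math import exp, lgamma, log


def binomial_tail(k, n, p, upper=True):
    """Return P(X >= k) if upper else P(X <= k), for X ~ Binomial(n, p), 0 < p < 1."""
    def log_pmf(i):
        # Log-space binomial probability mass; avoids overflow for large n.
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log(1 - p))

    ks = range(k, n + 1) if upper else range(0, k + 1)
    return min(1.0, sum(exp(log_pmf(i)) for i in ks))


def region_mutation_test(region_mutations, total_mutations,
                         region_length, genome_length):
    """Compare a region's mutation count with a uniform-rate expectation.

    Assumption (illustrative only): under the null, each mutation lands in
    the region with probability region_length / genome_length.
    """
    fraction = region_length / genome_length
    return {
        "expected": total_mutations * fraction,
        "p_enriched": binomial_tail(region_mutations, total_mutations, fraction, upper=True),
        "p_depleted": binomial_tail(region_mutations, total_mutations, fraction, upper=False),
    }


# Hypothetical counts: 30 of 100 mutations in a region covering 10% of the genome.
result = region_mutation_test(30, 100, 100, 1000)
print(result)
```

A small `p_enriched` would flag a region with more mutations than expected (as reported for nucleocapsid), and a small `p_depleted` a region with fewer (as reported for nsp3 and S1). A real analysis would also need to account for differences in mutation rate, codon context, and phylogenetic structure, which this sketch ignores.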