Darooneh Amir Hossein, Przedborski Michelle, Kohandel Mohammad
Department of Applied Mathematics, University of Waterloo, Waterloo, ON, Canada.
QRB Discov. 2021 Dec 13;3:e1. doi: 10.1017/qrd.2021.13. eCollection 2022.
SARS-CoV-2 has caused the largest pandemic of the 21st century, with hundreds of millions of cases and millions of fatalities. Scientists around the world are racing to develop vaccines and new pharmaceuticals to overcome the pandemic and offer effective treatments for COVID-19. Consequently, there is an essential need to better understand how the pathogenesis of SARS-CoV-2 is affected by viral mutations and to determine the conserved segments in the viral genome that can serve as stable targets for novel therapeutics. Here, we introduce a text-mining method to estimate the mutability of genomic segments directly from a reference (ancestral) whole genome sequence. The method scores the importance of each genomic segment from its spatial distribution and frequency over the whole genome. To validate our approach, we perform a large-scale analysis of the viral mutations in nearly 80,000 publicly available SARS-CoV-2 predecessor whole genome sequences and show that the mutation results are highly correlated with the segments predicted by the statistical method used for keyword detection. Importantly, these correlations hold at the codon and gene levels, as well as for gene coding regions. Using the text-mining method, we further identify codon sequences that are potential candidates for siRNA-based antiviral drugs. Significantly, one of the candidates identified in this work corresponds to the first seven codons of an epitope of the spike glycoprotein, which is the only SARS-CoV-2 immunogenic peptide without a match to a human protein.
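The idea of scoring genomic segments by how their occurrences are spread over the genome can be illustrated with a minimal sketch. The abstract does not state which statistic is used, so the example below assumes a simple stand-in from the keyword-detection literature: the standard deviation of the gaps between successive occurrences of a word, divided by the mean gap. Treating non-overlapping codons as "words", and the names `codon_positions`, `clustering_score`, `rank_words`, and `min_count`, are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def codon_positions(genome, k=3):
    """Map each non-overlapping k-mer ("word") to the word indices where it occurs.

    A trailing fragment shorter than k is ignored.
    """
    positions = defaultdict(list)
    for i in range(len(genome) // k):
        positions[genome[i * k:(i + 1) * k]].append(i)
    return dict(positions)


def clustering_score(occurrences):
    """Assumed statistic: std(gaps) / mean(gaps) between successive occurrences.

    Evenly spaced occurrences give 0.0; clustered occurrences give larger values.
    Words with fewer than two gaps return 0.0 because no spread can be estimated.
    """
    gaps = [b - a for a, b in zip(occurrences, occurrences[1:])]
    if len(gaps) < 2:
        return 0.0
    return pstdev(gaps) / mean(gaps)


def rank_words(genome, k=3, min_count=3):
    """Rank words by clustering score, highest first, skipping rare words."""
    scores = {
        word: clustering_score(pos)
        for word, pos in codon_positions(genome, k).items()
        if len(pos) >= min_count
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy sequence, not real viral data: GGG is clustered, AAA and CCC alternate evenly.
    toy = "GGG" * 3 + "AAACCC" * 4 + "GGG"
    for word, score in rank_words(toy):
        print(word, round(score, 3))
```

How such a score would relate to mutability is not claimed here. The sketch only shows how the position and frequency of each word in a single reference sequence can be turned into a ranking.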