Department of Computer Engineering, Qom Branch, Islamic Azad University, Qom, Iran.
Department of Biology, School of Basic Sciences, University of Qom, Qom, Iran.
Comput Biol Med. 2021 Jul;134:104471. doi: 10.1016/j.compbiomed.2021.104471. Epub 2021 May 8.
SARS-CoV-2, Severe Acute Respiratory Syndrome (SARS), and the Middle East respiratory syndrome-related coronavirus (MERS) viruses all belong to the Coronaviridae family; the first caused a global pandemic (with a low mortality rate), while the latter two were confined to limited regions (with high mortality rates). To investigate possible structural differences among the three viruses at the sequence level, genomic and proteomic sequences were downloaded and converted to polynomial datasets. Seven attribute weighting (feature selection) models were employed to find the key differences in their genomic nucleotide sequences. Most attribute weighting models selected the final stretch of the genome (from nucleotide position 29,000 to the end) as significantly different among the three virus classes. The genome and proteome sequences of this hot zone (which corresponds to the 3'UTR and encodes the nucleoprotein (N)), together with the Spike (S) protein sequences (the most important viral protein), were converted into binary images and analyzed by image processing techniques and a deep convolutional neural network (CNN). Although the predictive accuracy of the CNN for Spike (S) proteins was low (0.48%), the machine learning algorithms were able to classify the three members of the Coronaviridae family with 100% accuracy based on the 3'UTR. For the first time, the relationship between possible sequence-level structural differences among coronaviruses and their pathogenesis is reported, which paves the way to deciphering the high pathogenicity of the SARS-CoV-2 virus.
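The two processing steps named in the abstract, attribute weighting over nucleotide positions and conversion of a sequence into a binary image, can be illustrated with a minimal sketch. The abstract does not say which seven weighting models were used or how the images were encoded, so everything below is an assumption: information gain stands in for one possible attribute weighting criterion, a one-hot grid (rows A, C, G, T) stands in for one possible binary image encoding, and the sequences are made-up toy fragments rather than real viral data.

```python
import math
from collections import Counter

NUCLEOTIDES = "ACGT"  # row order of the image; an assumption, not from the paper


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain_per_position(sequences, labels):
    """Weight each aligned position by how well its nucleotide separates the classes.

    Information gain is used here as a stand-in for one attribute weighting model.
    """
    base = entropy(labels)
    weights = []
    for pos in range(len(sequences[0])):
        # Group the class labels by the nucleotide observed at this position.
        groups = {}
        for seq, label in zip(sequences, labels):
            groups.setdefault(seq[pos], []).append(label)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        weights.append(base - remainder)
    return weights


def sequence_to_binary_image(sequence):
    """One-hot encode a sequence as a 4 x L grid of 0/1 pixels (rows: A, C, G, T)."""
    return [[1 if base == nt else 0 for base in sequence] for nt in NUCLEOTIDES]


# Toy, made-up aligned fragments (NOT real viral sequences).
toy_sequences = ["ACGTA", "ACGTC", "ACGTG", "ACGTA", "ACGTC", "ACGTG"]
toy_labels = ["SARS-CoV-2", "SARS", "MERS", "SARS-CoV-2", "SARS", "MERS"]

weights = information_gain_per_position(toy_sequences, toy_labels)
image = sequence_to_binary_image(toy_sequences[0])
```

In this toy data only the last position varies between classes, so it receives the full weight while the conserved positions receive none, mirroring in miniature how a weighting model could flag the end of the genome as discriminative. The resulting grid of 0/1 pixels is the kind of input an image classifier such as a CNN could consume.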