Alkady Walaa, ElBahnasy Khaled, Leiva Víctor, Gad Walaa
Department of Bioinformatics, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt.
Department of Information Systems, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt.
Chemometr Intell Lab Syst. 2022 May 15;224:104535. doi: 10.1016/j.chemolab.2022.104535. Epub 2022 Mar 15.
COVID-19 causes serious respiratory illness, so accurate identification of the viral infection cycle plays a key role in designing appropriate vaccines. The risk of this disease depends on proteins that interact with human receptors. In this paper, we formulate a novel model for COVID-19 named "amino acid encoding based prediction" (AAPred). The model accurately classifies the various coronavirus types and distinguishes SARS-CoV-2 from other coronaviruses. With AAPred, we enhance performance by reducing the number of features, selecting the most important ones according to statistical criteria. We analyze the protein sequence of SARS-CoV-2 to understand the viral infection cycle. Six machine learning classifiers (decision tree, k-nearest neighbors, random forest, support vector machine, bagging ensemble, and gradient boosting) are used to evaluate the model in terms of accuracy, precision, sensitivity, and specificity. We implement the obtained results computationally and apply them to real data from the National Genomics Data Center. The experimental results show that AAPred reduces the feature set to seven features. The average accuracy over 10-fold cross-validation is 98.69%, precision is 98.72%, sensitivity is 96.81%, and specificity is 97.72%. The features are selected using information gain and classified with random forest. The proposed model predicts the type of coronavirus while reducing the number of extracted features. We find that SARS-CoV-2 shares physicochemical characteristics with SARS-CoV in some regions. We also report that SARS-CoV-2 has infection cycles and sequences similar to those of SARS-CoV in some regions, indicating that vaccines have an effect on SARS-CoV-2. A comparison with deep learning shows results similar to those of our method.
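The abstract names information gain as the feature-selection criterion and reports accuracy, precision, sensitivity, and specificity. The following is a minimal sketch of how those quantities are commonly computed, not the authors' implementation. The function names, the toy feature names (`feature_a`, `feature_b`), the discretized feature values, and the two-class labels are all hypothetical and chosen only for illustration; the abstract does not describe the actual encoding or how metrics are averaged across coronavirus types.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the weighted entropy after splitting on a
    discrete feature (assumes the feature is already discretized)."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, sensitivity, specificity for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical toy data: four sequences, two discretized features.
labels = ["SARS-CoV-2", "SARS-CoV-2", "other", "other"]
features = {
    "feature_a": ["high", "high", "low", "low"],   # separates the classes
    "feature_b": ["high", "low", "high", "low"],   # carries no class signal
}

# Rank features by information gain, most informative first.
ranked = sorted(features,
                key=lambda name: information_gain(features[name], labels),
                reverse=True)

# Hypothetical predictions for six sequences.
y_true = ["pos", "pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "neg", "neg", "pos"]
metrics = binary_metrics(y_true, y_pred, positive="pos")
```

In this sketch, keeping only the top-ranked entries of `ranked` corresponds to reducing the feature set, and `binary_metrics` would be evaluated on each held-out fold and averaged to obtain cross-validated figures.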