Afify Heba M, Zanaty Muhammad S
Systems and Biomedical Engineering Department, Higher Institute of Engineering in El-Shorouk City, Cairo, Egypt.
Faculty of Computer and Information Sciences, Cairo, Egypt.
Med Biol Eng Comput. 2021 Sep;59(9):1723-1734. doi: 10.1007/s11517-021-02412-z. Epub 2021 Jul 22.
The rapid spread of coronavirus disease (COVID-19) has become a worldwide pandemic, with more than 15 million patients reported in 27 countries. The computational biology of the virus, and how it correlates with the human population, therefore urgently needs to be understood. This paper presents a machine learning classification of human COVID-19 protein sequences by country. The proposed model distinguishes 9238 sequences in three stages: data preprocessing, data labeling, and classification. In the first stage, preprocessing converts the amino acids of each COVID-19 protein sequence into eight groups of numbers according to amino acid volume and dipole, following the conjoint triad (CT) method. In the second stage, data from the 27 countries are labeled from 0 to 26 using one of two methods: the first assigns each country a single number according to its country code, while the second represents each country with binary elements. In the last stage, machine learning algorithms classify the COVID-19 protein sequences according to their countries. The results demonstrate 100% accuracy, 100% sensitivity, and 90% specificity for the country-based binary labeling method with a linear support vector machine (SVM) classifier. Furthermore, because of its large volume of infection data, sequences from the USA are more likely to be classified correctly than those from countries with fewer data. Imbalance in the COVID-19 protein sequence data is a major issue, especially as the US data represent 76% of the 9238 sequences. The proposed model will act as a prediction tool for COVID-19 protein sequences in different countries.
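The preprocessing and labeling stages described in the abstract can be illustrated with a minimal Python sketch. Several details are assumptions, not taken from the paper: the abstract does not list the eight amino acid groups, so the sketch uses the widely used seven-class conjoint triad partition plus a hypothetical eighth catch-all group for any other symbol; triad counts are normalized to frequencies, which is only one possible normalization; "binary elements" is interpreted as one-hot vectors; and integer labels follow alphabetical order because the paper's country code numbers are not given. All function names are illustrative.

```python
# Assumed grouping: the common seven-class conjoint triad partition
# (based on side-chain volume and dipole), plus a hypothetical eighth
# catch-all class for any other symbol, such as X.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: i for i, members in enumerate(GROUPS) for aa in members}
OTHER = len(GROUPS)      # index 7, the assumed catch-all group
N_GROUPS = OTHER + 1     # eight groups in total


def encode_groups(seq):
    """Map each residue of a protein sequence to its group number (0-7)."""
    return [AA_TO_GROUP.get(aa, OTHER) for aa in seq.upper()]


def conjoint_triad(seq):
    """Return the 8**3 = 512 triad frequencies of a protein sequence."""
    codes = encode_groups(seq)
    counts = [0] * N_GROUPS ** 3
    # Slide a window of three consecutive residues over the sequence.
    for a, b, c in zip(codes, codes[1:], codes[2:]):
        counts[(a * N_GROUPS + b) * N_GROUPS + c] += 1
    total = sum(counts)
    # Sequences shorter than three residues yield an all-zero vector.
    return [x / total for x in counts] if total else counts


def integer_labels(countries):
    """Labeling method 1 (assumed form): one integer per country."""
    index = {c: i for i, c in enumerate(sorted(set(countries)))}
    return [index[c] for c in countries], index


def binary_labels(countries):
    """Labeling method 2 (assumed form): one-hot binary vector per country."""
    ids, index = integer_labels(countries)
    n = len(index)
    return [[1 if j == i else 0 for j in range(n)] for i in ids]
```

With 27 countries, `integer_labels` would produce labels from 0 to 26 and `binary_labels` would produce 27-element vectors. The resulting feature vectors and labels could then be passed to a linear SVM implementation, for example `LinearSVC` from scikit-learn, though the abstract does not say which implementation the authors used.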