Computational predictions for protein sequences of COVID-19 virus via machine learning algorithms.

Authors

Afify Heba M, Zanaty Muhammad S

Affiliations

Systems and Biomedical Engineering Department, Higher Institute of Engineering in El-Shorouk City, Cairo, Egypt.

Faculty of Computer and Information Sciences, Cairo, Egypt.

Publication information

Med Biol Eng Comput. 2021 Sep;59(9):1723-1734. doi: 10.1007/s11517-021-02412-z. Epub 2021 Jul 22.

DOI:10.1007/s11517-021-02412-z

PMID:34291385

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8295007/

Abstract

Coronavirus disease (COVID-19) has spread rapidly into a worldwide pandemic, with more than 15 million patients reported in 27 countries. The computational biology of the virus, and how it correlates with the human population, therefore urgently needs to be understood. This paper uses machine learning algorithms to classify human COVID-19 protein sequences by country. The proposed model distinguishes 9238 sequences in three stages: data preprocessing, data labeling, and classification. In the first stage, preprocessing converts the amino acids of each COVID-19 protein sequence into eight numbered groups based on amino acid volume and dipole, following the conjoint triad (CT) method. In the second stage, data from the 27 countries are labeled from 0 to 26 using one of two methods: the first assigns each country a single number according to the countries' code numbers, while the second represents each country with binary elements. In the last stage, machine learning algorithms classify the COVID-19 protein sequences according to their countries. The results show 100% accuracy, 100% sensitivity, and 90% specificity for the country-based binary labeling method with a linear support vector machine (SVM) classifier. Furthermore, the USA, which has the most infection data, is classified correctly more often than countries with fewer data. This imbalance in the COVID-19 protein sequence data is considered a major issue, especially as the US data account for 76% of the 9238 sequences. The proposed model is intended to serve as a prediction tool for COVID-19 protein sequences in different countries.
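
The preprocessing and labeling stages described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract mentions eight amino acid groups but does not list them, so the sketch assumes the seven-class grouping commonly cited for the conjoint triad method. The min-max normalization, the choice to skip non-standard residue symbols, the reading of "binary elements" as one indicator per country, and all function names and example country names are likewise assumptions made for illustration.

```python
from itertools import product

# Assumed grouping: the seven classes commonly cited for the conjoint triad
# method. The abstract mentions eight groups but does not list them.
CT_GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: idx for idx, members in enumerate(CT_GROUPS) for aa in members}
N_GROUPS = len(CT_GROUPS)

# Every ordered triple of group numbers gets a fixed position in the feature vector.
TRIADS = list(product(range(N_GROUPS), repeat=3))
TRIAD_INDEX = {triad: i for i, triad in enumerate(TRIADS)}


def encode_groups(sequence):
    """Map residues to group numbers, skipping symbols outside the 20 standard amino acids."""
    return [AA_TO_GROUP[aa] for aa in sequence.upper() if aa in AA_TO_GROUP]


def conjoint_triad_features(sequence):
    """Return min-max-normalized counts of each triad of consecutive group numbers."""
    groups = encode_groups(sequence)
    counts = [0] * len(TRIADS)
    for i in range(len(groups) - 2):
        counts[TRIAD_INDEX[tuple(groups[i:i + 3])]] += 1
    lo, hi = min(counts), max(counts)
    if hi == lo:  # too short to form a triad, or all counts equal
        return [0.0] * len(counts)
    return [(c - lo) / (hi - lo) for c in counts]


def integer_labels(countries):
    """First labeling scheme: one code number per country (0 .. n-1)."""
    codes = {name: i for i, name in enumerate(sorted(set(countries)))}
    return [codes[name] for name in countries], codes


def binary_labels(countries):
    """Second labeling scheme, read here as one binary indicator per country."""
    labels, codes = integer_labels(countries)
    return [[1 if k == label else 0 for k in range(len(codes))] for label in labels]


# Hypothetical usage with a made-up peptide and made-up country names.
features = conjoint_triad_features("MFVFLVLLPLVSSQ")
print(len(features))                            # 343 features (7 ** 3 triads)
print(integer_labels(["USA", "Egypt", "USA"]))  # ([1, 0, 1], {'Egypt': 0, 'USA': 1})
print(binary_labels(["USA", "Egypt", "USA"]))   # [[0, 1], [1, 0], [0, 1]]
```

Feature vectors of this kind could then be passed to a linear SVM, for example scikit-learn's `LinearSVC`, for the classification stage. With 76% of the sequences coming from one country, per-class sensitivity and specificity are more informative than overall accuracy alone.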

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e80/8295007/27560732e76e/11517_2021_2412_Fig1_HTML.jpg

Similar articles

Machine learning techniques for sequence-based prediction of viral-host interactions between SARS-CoV-2 and human proteins.

Biomed J. 2020 Oct;43(5):438-450. doi: 10.1016/j.bj.2020.08.003. Epub 2020 Sep 3.

Transfer learning-based ensemble support vector machine model for automated COVID-19 detection using lung computerized tomography scan data.

Med Biol Eng Comput. 2021 Apr;59(4):825-839. doi: 10.1007/s11517-020-02299-2. Epub 2021 Mar 18.

Prediction of death status on the course of treatment in SARS-COV-2 patients with deep learning and machine learning methods.

Comput Methods Programs Biomed. 2021 Apr;201:105951. doi: 10.1016/j.cmpb.2021.105951. Epub 2021 Jan 22.

Classification of SARS-CoV-2 and non-SARS-CoV-2 using machine learning algorithms.

Comput Biol Med. 2021 Sep;136:104650. doi: 10.1016/j.compbiomed.2021.104650. Epub 2021 Jul 21.

Assessment and classification of COVID-19 DNA sequence using pairwise features concatenation from multi-transformer and deep features with machine learning models.

SLAS Technol. 2024 Aug;29(4):100147. doi: 10.1016/j.slast.2024.100147. Epub 2024 May 23.

Role of biological Data Mining and Machine Learning Techniques in Detecting and Diagnosing the Novel Coronavirus (COVID-19): A Systematic Review.

J Med Syst. 2020 May 25;44(7):122. doi: 10.1007/s10916-020-01582-x.

Machine learning based COVID -19 disease recognition using CT images of SIRM database.

J Med Eng Technol. 2022 Oct;46(7):590-603. doi: 10.1080/03091902.2022.2080883. Epub 2022 May 31.

An Overview of Supervised Machine Learning Methods and Data Analysis for COVID-19 Detection.

J Healthc Eng. 2021 Nov 22;2021:4733167. doi: 10.1155/2021/4733167. eCollection 2021.

Classification of SARS-CoV-2 viral genome sequences using Neurochaos Learning.

Med Biol Eng Comput. 2022 Aug;60(8):2245-2255. doi: 10.1007/s11517-022-02591-3. Epub 2022 Jun 7.

Cited by

Determining human-coronavirus protein-protein interaction using machine intelligence.

Med Nov Technol Devices. 2023 Jun;18:100228. doi: 10.1016/j.medntd.2023.100228. Epub 2023 Apr 6.

Detection of COVID-19 using deep learning techniques and classification methods.

Inf Process Manag. 2022 Sep;59(5):103025. doi: 10.1016/j.ipm.2022.103025. Epub 2022 Jul 8.

Methodology-Centered Review of Molecular Modeling, Simulation, and Prediction of SARS-CoV-2.

Chem Rev. 2022 Jul 13;122(13):11287-11368. doi: 10.1021/acs.chemrev.1c00965. Epub 2022 May 20.

Applications of artificial intelligence in battling against covid-19: A literature review.

Chaos Solitons Fractals. 2021 Jan;142:110338. doi: 10.1016/j.chaos.2020.110338. Epub 2020 Oct 3.
