Analyzing the effect of data preprocessing techniques using machine learning algorithms on the diagnosis of COVID-19.

Authors

Erol Gizemnur, Uzbaş Betül, Yücelbaş Cüneyt, Yücelbaş Şule

Affiliations

Konya Technical University, Software Engineering Department, Konya, Turkey.

Konya Technical University, Computer Engineering Department, Konya, Turkey.

Publication information

Concurr Comput. 2022 Dec 25;34(28):e7393. doi: 10.1002/cpe.7393. Epub 2022 Oct 18.

Abstract

Real-time polymerase chain reaction (RT-PCR), known as the swab test, is a laboratory diagnostic test that detects COVID-19 from respiratory samples. Because the coronavirus spread rapidly around the world, RT-PCR testing could no longer deliver results quickly enough. This created a need for complementary diagnostic methods, and machine learning studies began in this area. Working with medical data is challenging, however, because such data are often inconsistent, incomplete, difficult to scale, and very large. In addition, poor clinical decisions, irrelevant parameters, and limited medical data adversely affect the accuracy of such studies. Datasets containing COVID-19 blood parameters are currently scarcer than other medical datasets, so this study aims to improve the existing ones. To obtain more consistent results in COVID-19 machine learning studies, the study investigated the effect of data preprocessing techniques on the classification of COVID-19 data. First, categorical feature encoding and feature scaling were applied to a dataset of 15 features containing blood data from 279 patients, including gender and age. Missing values were then filled in using both K-nearest neighbor (KNN) imputation and multiple imputation by chained equations (MICE). The data were balanced with the synthetic minority oversampling technique (SMOTE). The effect of these preprocessing techniques was analyzed on the ensemble learning algorithms bagging, AdaBoost, and random forest, and on the popular classifiers KNN, support vector machine, logistic regression, artificial neural network, and decision tree. With SMOTE applied, the highest accuracies obtained with the bagging classifier were 83.42% with KNN imputation and 83.74% with MICE imputation. Without SMOTE, the highest accuracy reached with the same classifier was 83.91%, with KNN imputation. In conclusion, selected data preprocessing techniques were compared, their effect on classification success was presented, and the experiments demonstrated the importance of choosing the right combination of preprocessing steps.
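To make the preprocessing order in the abstract concrete, the following is a minimal sketch of three of the steps it names (feature scaling, KNN imputation, and SMOTE-style oversampling), written in plain Python. It is not the authors' implementation: the toy rows, the feature choice, the helper names (`min_max_scale`, `knn_impute`, `smote`), the use of min-max scaling, and the values of `k` are all assumptions made for illustration. The MICE and classifier steps are left out.

```python
import math
import random

def min_max_scale(rows):
    """Scale each column to [0, 1], leaving missing values (None) in place."""
    scaled_cols = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        lo, hi = min(present), max(present)
        span = (hi - lo) or 1.0  # avoid dividing by zero for constant columns
        scaled_cols.append([None if v is None else (v - lo) / span for v in col])
    return [list(r) for r in zip(*scaled_cols)]

def distance(a, b):
    """Euclidean distance over features present in both rows,
    rescaled to compensate for the features that were skipped."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))

def knn_impute(rows, k=2):
    """Fill each missing value with the mean of that feature over the
    k nearest rows that have it (assumes at least one such row exists)."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (distance(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and other[j] is not None
            )[:k]
            filled[i][j] = sum(v for _, v in donors) / len(donors)
    return filled

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows, each interpolated between a random
    minority row and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (r for r in minority if r is not base),
            key=lambda r: distance(base, r),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Hypothetical toy rows: [gender (already encoded as 0/1), age, one blood value]
rows = [
    [0, 34, 5.1],
    [1, 61, None],
    [1, 58, 8.9],
    [0, None, 4.7],
    [1, 45, 7.2],
]
labels = [0, 1, 1, 0, 0]  # hypothetical class labels; class 1 is the minority

scaled = min_max_scale(rows)
imputed = knn_impute(scaled, k=2)

minority = [r for r, y in zip(imputed, labels) if y == 1]
n_needed = labels.count(0) - len(minority)
new_rows = smote(minority, n_new=n_needed, k=1)

balanced = imputed + new_rows
balanced_labels = labels + [1] * len(new_rows)
```

In this sketch, scaling runs before imputation so that the neighbour distances are not dominated by the feature with the largest range, and oversampling runs last so that synthetic rows are built only from complete data. In a real experiment the oversampling would normally be applied to the training split only, so that synthetic rows do not leak into evaluation.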


