Department of Tropical Medicine and Parasitology, Seoul National University College of Medicine and Institute of Endemic Diseases, Seoul, 03080, Republic of Korea.
Department of Pharmacology, Yonsei University College of Medicine, Seoul, 03722, Republic of Korea; Severance Biomedical Science Institute, Yonsei University College of Medicine, Seoul, 03722, Republic of Korea.
Comput Biol Med. 2021 Feb;129:104151. doi: 10.1016/j.compbiomed.2020.104151. Epub 2020 Nov 28.
Rapid diagnosis is crucial for controlling malaria. Various studies have aimed to develop machine learning models that diagnose malaria from blood smear images; however, this approach has many limitations. This study developed a machine learning model for malaria diagnosis using patient information.
To construct the datasets, we extracted patient information from PubMed abstracts published from 1956 to 2019. We used two datasets: a dataset containing only parasitic diseases, and a total dataset created by adding information about other diseases. We compared six machine learning models: support vector machine, random forest (RF), multilayer perceptron, AdaBoost, gradient boosting (GB), and CatBoost. In addition, the synthetic minority oversampling technique (SMOTE) was employed to address the data imbalance problem.
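The oversampling step can be illustrated with a minimal sketch of the SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority-class neighbours. This is not the authors' code. The function names `smote` and `balance`, the binary-label assumption, and the assumption that patient features have already been encoded as numeric vectors are all illustrative choices; a real pipeline would more likely rely on an established library implementation.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    `minority` is a list of equal-length numeric feature vectors (assumed
    to be already encoded; hypothetical input format).
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of `base` among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label, k=5, seed=0):
    """Oversample the minority class until both classes have equal size.

    Assumes a binary problem: every label other than `minority_label`
    is counted as the majority class.
    """
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(X) - len(minority)
    n_new = max(n_majority - len(minority), 0)
    new_rows = smote(minority, n_new, k=k, seed=seed)
    return X + new_rows, y + [minority_label] * len(new_rows)
```

Calling `balance(X, y, minority_label=1)` on an imbalanced binary dataset returns the original rows followed by the synthetic minority rows. In practice, oversampling would be applied to the training split only, so that the synthetic samples do not leak into evaluation.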
On the parasitic-disease-only dataset, RF was the best model with or without SMOTE. On the total dataset, GB was the best model without SMOTE, whereas RF performed best after SMOTE was applied. For the imbalanced data, nationality was the most important feature for malaria prediction; for the data balanced with SMOTE, the most important feature was symptom.
The results demonstrated that machine learning techniques can be successfully applied to predict malaria using patient information.