Multiclass classification of thalassemia types using complete blood count and HPLC data with machine learning.

Authors

Nasir Muhammad Umar, Zubair Muhammad, Naseem Muhammad Tahir, Shahzad Tariq, Saeed Ahmed, Adnan Khan Muhammad, Gandomi Amir H

Affiliations

Faculty of Computing, Riphah International University, Islamabad, Pakistan.

School of Computing, IVY CMS, Lahore, Pakistan.

Publication information

Sci Rep. 2025 Jul 21;15(1):26379. doi: 10.1038/s41598-025-06594-6.

DOI:10.1038/s41598-025-06594-6

PMID:40691682

Abstract

Thalassemia is a common genetic disorder, present in more than 100 countries, that causes mild to severe anemia and results from abnormalities in one or more of the four globin genes. It leads to chronic hemolytic anemia, disrupted synthesis of hemoglobin chains, iron overload, and poor erythropoiesis. Although the diagnosis of thalassemia has improved globally alongside treatment and transfusion support, diagnosis remains a major problem in high-prevalence areas such as Pakistan. This work assesses the performance of various combinations of machine learning methods in detecting alpha- and beta-thalassemia in their minor and major forms, using results from complete blood count (CBC) and high-performance liquid chromatography (HPLC) analysis. The models analyzed are K-nearest Neighbor (KNN), Support Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost). The study examines how effectively these models discriminate between thalassemia variants, specifically on data from Pakistani patients. XGBoost achieved the highest performance on both the CBC and HPLC datasets, with training accuracies of roughly 99.5% for CBC and 99.3% for HPLC. Its test accuracy was consistently high across both datasets, making it the best model for detecting thalassemia in this study. The SVM model, although slightly less accurate than XGBoost, still performed strongly, particularly on the HPLC data, where its cumulative testing accuracy was 99.4%. Overall, XGBoost achieved an accuracy above 99% in detecting thalassemia types from CBC and HPLC data for Pakistani patients. To the authors' knowledge, this research is the first to predict alpha- and beta-thalassemia in their major and minor forms using these diagnostic reports. The results indicate that these models can offer significant support in detecting thalassemia in resource-constrained settings such as Pakistan. Incorporating deep learning could yield even greater accuracy.
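The abstract includes no code or data. As a minimal sketch of the kind of multiclass evaluation it describes, the following runs a plain K-nearest-neighbor classifier (one of the three model families named above) on synthetic data. The four class labels follow the abstract; the two-dimensional features, cluster centers, 80/20 split, and k = 5 are assumptions made for illustration, not the authors' settings. Training and test accuracy are reported separately, since the abstract quotes both kinds of figure.

```python
import math
import random
from collections import Counter

# Class labels follow the four categories named in the abstract.
CLASSES = ["alpha_minor", "alpha_major", "beta_minor", "beta_major"]

# ASSUMPTION: arbitrary 2-D cluster centers standing in for scaled CBC/HPLC
# features. These numbers are synthetic and carry no clinical meaning.
CENTERS = {
    "alpha_minor": (0.0, 0.0),
    "alpha_major": (4.0, 0.0),
    "beta_minor": (0.0, 4.0),
    "beta_major": (4.0, 4.0),
}


def make_synthetic(n_per_class, rng):
    """Draw Gaussian points around each class center and shuffle them."""
    rows = []
    for label in CLASSES:
        cx, cy = CENTERS[label]
        for _ in range(n_per_class):
            rows.append(((rng.gauss(cx, 0.7), rng.gauss(cy, 0.7)), label))
    rng.shuffle(rows)
    return rows


def knn_predict(train, point, k=5):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, rows, k=5):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in rows)
    return hits / len(rows)


rng = random.Random(0)  # fixed seed so the run is repeatable
data = make_synthetic(60, rng)

# ASSUMPTION: a simple 80/20 hold-out split; the abstract does not state one.
split = int(0.8 * len(data))
train_rows, test_rows = data[:split], data[split:]

train_acc = accuracy(train_rows, train_rows)
test_acc = accuracy(train_rows, test_rows)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

Training accuracy here is measured on points the classifier has already seen, so the held-out test accuracy is the more informative of the two numbers when comparing models.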
