King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan.
Artif Intell Med. 2018 Jun;88:70-83. doi: 10.1016/j.artmed.2018.04.009. Epub 2018 May 3.
Thalassemia is one of the most common genetic blood disorders and has received considerable attention in medical research worldwide. In this context, one of the greatest challenges for healthcare professionals is to correctly differentiate normal individuals from asymptomatic thalassemia carriers. Thalassemia diagnosis is usually based on measurable characteristic changes in blood cell counts and related indices. These changes can be obtained easily from a complete blood count (CBC) test performed with a fully automated blood analyzer or counter. However, the reliability of the CBC test alone is questionable, because the same candidate characteristics can also appear in other disorders, which leads to misdiagnosis of thalassemia. Other costly and time-consuming tests must therefore be performed, and the resulting delay in reaching the correct diagnosis may have serious consequences. To help overcome these diagnostic issues, this work presents a new dataset collected from the Palestine Avenir Foundation for persons tested for thalassemia. We aim to compile a gold standard dataset for thalassemia and make it available to researchers in this field. Moreover, we use this dataset to predict a specific type of thalassemia, beta thalassemia (β-thalassemia), using a hybrid data mining model. The proposed model consists of two main steps. First, to overcome the highly imbalanced class distribution in the dataset, the synthetic minority oversampling technique (SMOTE) is applied as a balancing step. Second, four classification models, namely k-nearest neighbors (k-NN), naïve Bayes (NB), decision tree (DT), and the multilayer perceptron (MLP) neural network, are used to differentiate between normal persons and β-thalassemia carriers. Different evaluation metrics are used to assess the performance of the proposed model.
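To make the balancing step concrete, the following is a minimal sketch of SMOTE-style oversampling in plain Python. It is not the authors' implementation. It assumes that an oversampling ratio of 400% means four synthetic points per original minority sample, that neighbours are chosen by Euclidean distance, and that `k` and the sample values (which are made up) are illustrative only.

```python
import math
import random


def smote(minority, percent=400, k=5, seed=0):
    """Return synthetic minority samples (originals are not included)."""
    rng = random.Random(seed)
    per_sample = percent // 100  # assumption: 400% -> 4 synthetic points each
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority-class neighbours of x, excluding x itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position on the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Hypothetical two-feature carrier samples (made-up values, not study data).
carriers = [[62.0, 19.5], [64.1, 20.2], [60.3, 18.9], [65.7, 21.0]]
new_points = smote(carriers, percent=400, k=3)
print(len(new_points))  # 4 carriers x 4 synthetic points each = 16
```

Each synthetic point lies on a line segment between a minority sample and one of its minority-class neighbours, so the new points stay inside the region already occupied by the carrier class. In practice the synthetic points would be appended to the training set only, before fitting a classifier such as NB.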
The experimental results show that SMOTE oversampling effectively improves the identification rate of β-thalassemia carriers under a highly imbalanced class distribution. The results also reveal that the NB classifier achieved the best performance in differentiating between normal persons and β-thalassemia carriers at a SMOTE oversampling ratio of 400%. This combination achieved a specificity of 99.47% and a sensitivity of 98.81%.
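For readers less familiar with the two reported metrics, here is a small sketch of how sensitivity and specificity are computed from predictions. Sensitivity is TP / (TP + FN), the fraction of carriers correctly identified, and specificity is TN / (TN + FP), the fraction of normal persons correctly identified. The function name and the label vectors are hypothetical, chosen for illustration, and do not reproduce the study's data.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Compute (sensitivity, specificity) for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn)  # carriers correctly flagged
    specificity = tn / (tn + fp)  # normal persons correctly cleared
    return sensitivity, specificity


# Made-up labels: 1 = carrier, 0 = normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2%}, specificity={spec:.2%}")
```

On an imbalanced dataset both numbers matter: a classifier that labels everyone as normal would reach perfect specificity while missing every carrier.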