Miah Haroon, Kollias Dimitrios, Pedone Giacinto Luca, Provan Drew, Chen Frederick
Centre of Immunobiology, Blizard Institute, Queen Mary University of London, London E1 2AT, UK.
Haematology Department, Barts Health NHS Trust, London E1 1BB, UK.
Diagnostics (Basel). 2024 Jun 26;14(13):1352. doi: 10.3390/diagnostics14131352.
Primary immune thrombocytopenia (ITP) is a rare autoimmune disease characterised by immune-mediated destruction of peripheral blood platelets, leading to low platelet counts and bleeding. Diagnosing and managing ITP effectively is challenging because no established test confirms the disease and no biomarker predicts treatment response or outcome. In this feasibility study, we assess whether machine learning (ML) can be applied effectively to diagnose ITP from routine blood tests and demographic data in a non-acute outpatient setting. Several ML models, namely Logistic Regression, Support Vector Machine, k-Nearest Neighbor, Decision Tree and Random Forest, were applied to data from the UK Adult ITP Registry and a general haematology clinic. Two approaches were investigated: one demographic-unaware and one demographic-aware. We conducted extensive experiments to evaluate the predictive performance of these models and approaches, as well as their bias. The Decision Tree and Random Forest models performed best and were also fair, achieving nearly perfect predictive and fairness scores, with platelet count identified as the most significant variable. Models not provided with demographic information achieved higher predictive accuracy but lower fairness scores, illustrating a trade-off between predictive performance and fairness.
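To make the idea of scoring a model on both predictive performance and fairness concrete, the following is a minimal sketch and not the study's method. The abstract does not name the fairness metric used, so the min/max ratio of per-group accuracy below is an assumption. The toy records, the single-split stump standing in for a Decision Tree, the threshold of 100 x 10^9/L, and all function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy records: (platelet count in 10^9/L, sex, has_itp).
# These values are invented for illustration; they are not registry data.
RECORDS = [
    (25, "F", True),
    (60, "M", True),
    (90, "F", True),
    (140, "M", True),   # an ITP case the stump below misclassifies
    (180, "F", False),
    (220, "M", False),
    (250, "F", False),
    (310, "M", False),
]


def stump_predict(platelets, threshold=100):
    """One-split stump: predict ITP when the platelet count is below threshold.

    The threshold is an illustrative assumption, not a value from the study.
    """
    return platelets < threshold


def accuracy_by_group(records, predict):
    """Return {group: accuracy} for a predictor that sees only the platelet count."""
    hits, totals = defaultdict(int), defaultdict(int)
    for platelets, group, label in records:
        totals[group] += 1
        hits[group] += int(predict(platelets) == label)
    return {group: hits[group] / totals[group] for group in totals}


def fairness_ratio(per_group):
    """Assumed fairness score: worst group accuracy divided by best group accuracy.

    A value of 1.0 means the predictor is equally accurate for every group.
    """
    worst, best = min(per_group.values()), max(per_group.values())
    return worst / best if best > 0 else 0.0


per_group = accuracy_by_group(RECORDS, stump_predict)
overall = sum(stump_predict(p) == y for p, _, y in RECORDS) / len(RECORDS)
print("accuracy by group:", per_group)
print("overall accuracy:", overall)
print("fairness ratio:", fairness_ratio(per_group))
```

The stump is demographic-unaware in the sense that sex is used only to audit the predictions, never as an input. On this toy data the overall accuracy hides a gap between the two groups, which is the kind of disparity a fairness score is meant to surface.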