Drożdż Anna, Duggan Brian, Ruddock Mark W, Reid Cherith N, Kurth Mary Jo, Watt Joanne, Irvine Allister, Lamont John, Fitzgerald Peter, O'Rourke Declan, Curry David, Evans Mark, Boyd Ruth, Sousa Jose
Personal Health Data Science Group, Sano - Centre for Computational Personalised Medicine - International Research Foundation, Krakow, Poland.
South Eastern Health and Social Care Trust, Ulster Hospital Dundonald, Belfast, United Kingdom.
Front Oncol. 2024 May 8;14:1401071. doi: 10.3389/fonc.2024.1401071. eCollection 2024.
Detailed and invasive clinical investigations are required to identify the causes of haematuria. A highly imbalanced patient population (predominantly male) and a wide range of potential causes make it a major challenge to classify patients correctly and to identify patient-specific biomarkers. Studies have shown that applying advanced analytical methods to multi-marker analysis can improve diagnosis, even in imbalanced datasets. Here, we applied several machine learning algorithms to classify patients from the haematuria patient cohort (HaBio) by analysing multiple biomarkers, and to identify the most relevant ones.
We applied several classification and feature selection methods (k-means clustering, decision trees, random forest with the LIME explainer, and the CACTUS algorithm) to stratify patients into two groups: healthy (no clear cause of haematuria) or sick (an identified cause of haematuria, e.g., bladder cancer or infection). The classification performance of the models was compared. Biomarkers identified as important by the algorithms were also analysed in relation to their involvement in pathological processes.
The results showed that high imbalance in the datasets significantly affected classification by random forest and decision trees, leading to overestimation of the sick class and low model performance. The CACTUS algorithm was more robust to the imbalance in the dataset. CACTUS obtained a balanced accuracy of 0.747 for both genders combined, 0.718 for females, and 0.803 for males. For the whole dataset, microalbumin, male gender, and tPSA emerged as the most informative biomarkers. For males, age, microalbumin, tPSA, cystatin C, BTA, HAD, and S100A4 were the most significant biomarkers, while for females they were microalbumin, IL-8, pERK, and CXCL16.
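Balanced accuracy, the metric reported above, is the mean of sensitivity and specificity. The following minimal Python sketch illustrates why it is more informative than plain accuracy when one class dominates. The function names and the toy labels are hypothetical and chosen only for illustration; they are not taken from the HaBio data or from the CACTUS implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred, positive="sick"):
    """Mean of sensitivity and specificity for a two-class labelling."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # recall on the positive ("sick") class
    specificity = tn / (tn + fp)  # recall on the negative ("healthy") class
    return (sensitivity + specificity) / 2


# Invented toy cohort (NOT study data): 90 sick and 10 healthy patients,
# scored against a degenerate model that labels every patient as sick.
y_true = ["sick"] * 90 + ["healthy"] * 10
always_sick = ["sick"] * 100

print(accuracy(y_true, always_sick))           # 0.9
print(balanced_accuracy(y_true, always_sick))  # 0.5
```

In this toy case the degenerate model looks strong on plain accuracy but scores no better than chance on balanced accuracy, which is the kind of majority-class overestimation the results describe for random forest and decision trees.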
The CACTUS algorithm demonstrated improved performance compared with other methods such as decision trees and random forest. Additionally, we identified the most relevant biomarkers for each patient group, which could be considered in future as novel diagnostic biomarkers. Our results have the potential to inform future research and to support new personalised diagnostic approaches tailored to the needs of individual patients.