Pons-Suñer Pedro, Signol François, Alvarez Noemi, Sargas Claudia, Dorado Sara, Ortí Jose Vicente Gil, Delgado Sanchis Juan A, Llop Marta, Arnal Laura, Llobet Rafael, Perez-Cortes Juan-Carlos, Ayala Rosa, Barragán Eva
ITI, Universitat Politècnica de València, Valencia, Spain.
Hospital Universitario 12 de Octubre, Imas12, Department of Medicine, Complutense University, Madrid, Spain.
BMC Med Inform Decis Mak. 2025 May 1;25(1):179. doi: 10.1186/s12911-025-03001-y.
This study has two main objectives. First, to evaluate a feature selection methodology based on SEQENS, an algorithm for identifying relevant variables. Second, to validate machine learning models that predict the risk of complications in patients with acute myeloid leukemia (AML) using data available at diagnosis. Predictions are made at three time points: 90 days, six months, and one year post-diagnosis. These objectives represent fundamental steps toward the development of a tool to assist clinicians in therapeutic decision-making and provide insights into the risk factors associated with AML complications.
A dataset of 568 patients, including demographic, clinical, genetic (VAF), and cytogenetic information, was created by combining data from Hospital 12 de Octubre (Madrid, Spain) and Instituto de Investigación Sanitaria La Fe (Valencia, Spain). Feature selection based on an enhanced version of SEQENS was conducted for each time point, followed by the comparison of four classifiers (XGBoost, Multi-Layer Perceptron, Logistic Regression and Decision Tree) to assess the impact of feature selection on model performance.
SEQENS identified different relevant features for each prediction horizon, with age, TP53, -7/7q, and EZH2 consistently relevant across all time points. The models were evaluated using 5-fold cross-validation; XGBoost achieved the highest average ROC-AUC scores of 0.81, 0.84, and 0.82 for 90-day, 6-month, and 1-year predictions, respectively. Generally, performance remained stable or improved after applying SEQENS-based feature selection. Evaluation on an external test set of 54 patients yielded ROC-AUC scores of 0.72 (90-day), 0.75 (6-month), and 0.68 (1-year).
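As an illustration of how an average ROC-AUC over 5-fold cross-validation can be computed, the following is a minimal standard-library sketch. It is not the authors' pipeline: the helper names (`roc_auc`, `stratified_folds`, `cross_validated_auc`, `fit_single_feature`) are hypothetical, the stratified split is an assumption (the abstract does not say how folds were drawn), and the toy single-feature scorer merely stands in for a real classifier such as XGBoost.

```python
import random
from statistics import mean


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outscores a
    random negative, with ties counted as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds while keeping the class ratio
    roughly equal in every fold (assumed here; not stated in the abstract)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def cross_validated_auc(X, y, fit, k=5, seed=0):
    """Train on k-1 folds, score the held-out fold, and return the mean
    ROC-AUC together with the per-fold values.
    `fit(X_train, y_train)` must return a function mapping a sample to a score."""
    aucs = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        score = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        aucs.append(roc_auc([y[i] for i in test_idx],
                            [score(X[i]) for i in test_idx]))
    return mean(aucs), aucs


def fit_single_feature(X_train, y_train):
    """Toy stand-in for a classifier: score a sample by its first feature,
    signed so that the class with the higher training mean scores higher."""
    pos = mean(x[0] for x, t in zip(X_train, y_train) if t == 1)
    neg = mean(x[0] for x, t in zip(X_train, y_train) if t == 0)
    sign = 1.0 if pos >= neg else -1.0
    return lambda x: sign * x[0]


if __name__ == "__main__":
    # Synthetic data only: one noisy feature that is higher on average for class 1.
    rng = random.Random(42)
    y = [0] * 50 + [1] * 50
    X = [[rng.gauss(1.0 if t == 1 else 0.0, 1.0)] for t in y]
    avg, per_fold = cross_validated_auc(X, y, fit_single_feature, k=5)
    print(f"mean ROC-AUC over 5 folds: {avg:.2f}")
```

The same loop applies to any model that returns a risk score per patient; reporting the per-fold values alongside the mean makes the fold-to-fold variability visible, which matters with cohorts of a few hundred patients.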
The models achieved performance levels that suggest they could serve as therapeutic decision support tools at different times after diagnosis. The selected variables align with the European LeukemiaNet (ELN) 2022 risk classification, and the SEQENS-based feature selection effectively reduced the feature set while maintaining prediction accuracy.