Navarro-Cerdán J Ramón, Pons-Suñer Pedro, Arnal Laura, Arlandis Joaquim, Llobet Rafael, Perez-Cortes Juan-Carlos, Lara-Hernández Francisco, Moya-Valera Celeste, Quiroz-Rodriguez Maria Elena, Rojo-Martinez Gemma, Valdés Sergio, Montanya Eduard, Calle-Pascual Alfonso L, Franch-Nadal Josep, Delgado Elias, Castaño Luis, García-García Ana-Bárbara, Chaves Felipe Javier
Universitat Politècnica de València, Camí de Vera, s/n, 46022, València, Spain.
ITI, Universitat Politècnica de València, Camino de Vera s/n, 46022, València, Spain.
Med Biol Eng Comput. 2025 Apr 8. doi: 10.1007/s11517-025-03355-5.
Type 2 diabetes (T2D) is becoming one of the leading health problems in Western societies, diminishing quality of life and consuming a significant share of healthcare resources. This study presents machine learning models for T2D diagnosis and prognosis, developed using heterogeneous data from a Spanish population dataset (Di@bet.es study). The models were trained exclusively on individuals classified as controls and undiagnosed diabetics, ensuring that the results are not influenced by treatment effects or behavioral changes due to disease awareness. Two data domains are considered: environmental (patient lifestyle questionnaires and measurements) and clinical (biochemical and anthropometric measurements). The preprocessing pipeline consists of four key steps: geospatial data extraction, feature engineering, missing data imputation, and quasi-constancy filtering. Two working scenarios (Environmental and Healthcare) are defined based on the features used, and applied to two targets (diagnosis and prognosis), resulting in four distinct models. The feature subsets that best predict the target have been identified based on permutation importance and sequential backward selection, reducing the number of features and, consequently, the cost of predictions. In the Environmental scenario, models achieved an AUROC of 0.86 for diagnosis and 0.82 for prognosis. The Healthcare scenario performed better, with an AUROC of 0.96 for diagnosis and 0.88 for prognosis. A partial dependence analysis of the most relevant features is also presented. An online demo page showcasing the Environmental and Healthcare T2D prognosis models is available upon request.
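Three of the steps named in the abstract (quasi-constancy filtering, AUROC evaluation, and permutation importance) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration and is not taken from the study: the synthetic data, the fixed scoring function `score_fn` standing in for a trained model, the 0.99 threshold, and the helper names `quasi_constant_columns`, `auroc`, and `permutation_importance`. The abstract does not say how the authors implemented these steps, and sequential backward selection is not covered here.

```python
import random
from collections import Counter


def quasi_constant_columns(rows, threshold=0.99):
    """Indices of columns whose most frequent value covers >= threshold of rows."""
    n = len(rows)
    flagged = []
    for j in range(len(rows[0])):
        top_count = Counter(r[j] for r in rows).most_common(1)[0][1]
        if top_count / n >= threshold:
            flagged.append(j)
    return flagged


def auroc(y_true, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(score_fn, rows, y, n_repeats=5, seed=0):
    """Mean drop in AUROC when one column at a time is shuffled."""
    rng = random.Random(seed)
    baseline = auroc(y, [score_fn(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - auroc(y, [score_fn(p) for p in permuted]))
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


# Synthetic data (hypothetical): column 0 is informative, column 1 is noise,
# column 2 is constant and should be caught by the quasi-constancy filter.
data_rng = random.Random(42)
rows = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1), 0.0] for _ in range(200)]
y = [1 if r[0] + 0.5 * data_rng.gauss(0, 1) > 0 else 0 for r in rows]


def score_fn(row):
    """Stand-in for a trained model's risk score (assumed, not from the study)."""
    return 2.0 * row[0] + 0.1 * row[1]


flagged = quasi_constant_columns(rows)
baseline, importances = permutation_importance(score_fn, rows, y)
print("quasi-constant columns:", flagged)
print("baseline AUROC:", round(baseline, 3))
print("permutation importances:", [round(v, 3) for v in importances])
```

In this toy setting the informative column should show a large drop in AUROC when shuffled, while the noise and constant columns should show little or none. In practice, a library such as scikit-learn offers comparable functionality (for example `sklearn.inspection.permutation_importance`), which would normally be preferable to hand-written helpers.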