IEEE J Biomed Health Inform. 2021 Oct;25(10):4005-4016. doi: 10.1109/JBHI.2021.3077114. Epub 2021 Oct 5.
Diabetes mellitus is a major global public health problem because of its high prevalence and medical costs. Prevention requires reliable risk assessment models that can identify high-risk individuals and enable healthcare practitioners to initiate appropriate preventive interventions. However, data-driven diabetes risk assessment models face multiple challenges, such as class imbalance and low identification rates. To address these challenges, this paper proposes a data-driven analytical framework using large population data from the Henan Rural Cohort Study. A joint bagging-boosting model (JBM) was developed and validated. To make large-scale population screening convenient, our study excluded laboratory variables and collinear variables using the maximum likelihood ratio method, retaining only easily accessible variables. We then compared methods for handling the class imbalance in the available data, including over-sampling and under-sampling. Finally, to improve overall model performance, we constructed a joint model that combines the bagging and boosting algorithms with the stacking algorithm. The resulting model demonstrated good discrimination, with an area under the curve (AUC) of 0.885, and acceptable calibration (Brier score = 0.072). Compared with the benchmark model, the proposed framework improved the AUC by 13.5% and increased recall from 0.744 to 0.847. The proposed model can contribute to the personalized management of diabetes, especially in medical resource-poor settings.
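The pipeline described in the abstract (resampling for class imbalance, then a bagging learner and a boosting learner combined by stacking, then evaluation by AUC, Brier score, and recall) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the base learners (random forest for bagging, gradient boosting for boosting), the logistic-regression meta-learner, the 0.5 decision threshold, and the helper names `random_undersample` and `build_and_evaluate` are all choices made here for illustration, and the abstract does not specify any of them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split


def random_undersample(X, y, seed=0):
    """Randomly drop rows of the larger class until both classes are equal in size."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n = min(len(pos), len(neg))
    keep = np.concatenate(
        [rng.choice(pos, n, replace=False), rng.choice(neg, n, replace=False)]
    )
    rng.shuffle(keep)
    return X[keep], y[keep]


def build_and_evaluate(seed=0):
    # Synthetic stand-in for cohort data; the ~10% positive rate is an assumption.
    X, y = make_classification(
        n_samples=2000,
        n_features=12,
        n_informative=6,
        weights=[0.9, 0.1],
        random_state=seed,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )

    # Resample the training split only, so the test set keeps its original class ratio.
    X_bal, y_bal = random_undersample(X_train, y_train, seed)

    # One bagging-style and one boosting-style learner, combined by a stacked meta-learner.
    model = StackingClassifier(
        estimators=[
            ("bagging", RandomForestClassifier(n_estimators=100, random_state=seed)),
            ("boosting", GradientBoostingClassifier(random_state=seed)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        stack_method="predict_proba",
        cv=5,
    )
    model.fit(X_bal, y_bal)

    prob = model.predict_proba(X_test)[:, 1]
    pred = (prob >= 0.5).astype(int)  # illustrative threshold
    return {
        "auc": roc_auc_score(y_test, prob),        # discrimination
        "brier": brier_score_loss(y_test, prob),   # calibration (lower is better)
        "recall": recall_score(y_test, pred),      # share of true cases identified
    }


metrics = build_and_evaluate()
print(metrics)
```

The numbers this sketch prints come from synthetic data and are not comparable to the values reported in the abstract. Note that under-sampling changes the class ratio seen during training, so predicted probabilities tend to be shifted upward relative to the true prevalence; a Brier score computed on the untouched test set will reflect that unless the probabilities are recalibrated.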