Kim Yeongmin, Choi Wongyung, Choi Woojeong, Ko Grace, Han Seonggyun, Kim Hwan-Cheol, Kim Dokyoon, Lee Dong-Gi, Shin Dong Wook, Lee Younghee
School of Computing, KAIST, Daejeon, Republic of Korea.
College of Veterinary Medicine and Research Institute for Veterinary Science, Seoul National University, Seoul, Republic of Korea.
BioData Min. 2024 May 25;17(1):14. doi: 10.1186/s13040-024-00366-0.
Supervised machine learning models have been widely used to predict diseases and gain insight into them by classifying patients based on personal health records. However, class imbalance hinders the training of these models. In this study, we aimed to address class imbalance with a conditional normalizing flow model, a deep-learning-based semi-supervised model for anomaly detection. This is the first application of the normalizing flow algorithm to tabular biomedical data.
We collected personal health records from South Korean citizens (n = 706), comprising genetic data obtained from a direct-to-consumer service (microarray chip), medical health check-up data, and lifestyle log data. Based on the health check-up data, six chronic diseases were labeled (obesity, diabetes, hypertriglyceridemia, dyslipidemia, liver dysfunction, and hypertension). After preprocessing, supervised classification models and semi-supervised anomaly detection models, including conditional normalizing flow, were evaluated on the classification of diabetes, which had an extreme class imbalance (about 2% positive), using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). In addition, to simulate insufficient data collection from patients with the other chronic diseases, we evaluated the models after undersampling the disease-affected samples.
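The positive undersampling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `undersample_positives`, the seed, and the cohort counts (200 positives out of 706 records) are assumptions chosen for the example. It keeps all unaffected samples and randomly retains only as many disease-affected samples as are needed to reach a target base rate.

```python
import random

def undersample_positives(labels, target_rate, seed=0):
    """Return sorted indices to keep so positives make up roughly target_rate.

    Hypothetical helper: all negatives are kept, positives are randomly dropped.
    """
    if not 0.0 < target_rate < 1.0:
        raise ValueError("target_rate must be in (0, 1)")
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos:
        raise ValueError("labels contain no positive samples")
    # Solve k / (k + n_neg) = target_rate for k, the number of positives to keep.
    k = round(target_rate * len(neg) / (1.0 - target_rate))
    k = max(1, min(k, len(pos)))  # at least one positive, never more than exist
    rng = random.Random(seed)
    kept_pos = rng.sample(pos, k)
    return sorted(kept_pos + neg)

# Illustrative cohort only: 706 records, 200 labeled disease-affected.
labels = [1] * 200 + [0] * 506
kept = undersample_positives(labels, target_rate=0.02)
base_rate = sum(labels[i] for i in kept) / len(kept)
print(len(kept), round(base_rate, 4))
```

Because the number of retained positives must be an integer, the achieved base rate only approximates the target. Repeating the procedure with different seeds would give the kind of repeated evaluation the abstract reports.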
Across fifty evaluations of diabetes classification, for which the base rate was very low (0.02), LightGBM (the best-performing supervised classification model) achieved an AUPRC of 0.16 and an AUROC of 0.82, whereas conditional normalizing flow achieved an AUPRC of 0.34 and an AUROC of 0.83. Moreover, conditional normalizing flow outperformed the supervised model when few disease-affected samples were available for the other five chronic diseases: obesity, hypertriglyceridemia, dyslipidemia, liver dysfunction, and hypertension. For example, when predicting obesity with disease-affected samples undersampled (positive undersampling) to lower the base rate to 0.02, LightGBM achieved an AUPRC of 0.20 and an AUROC of 0.75, whereas conditional normalizing flow achieved an AUPRC of 0.30 and an AUROC of 0.74.
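To help read these numbers: a classifier with uninformative scores has an expected AUROC of about 0.5, while its expected AUPRC is roughly the base rate (here 0.02). This is why similar AUROC values can hide a large AUPRC gap under extreme imbalance. The sketch below computes both metrics in plain Python on a toy example. It assumes AUPRC is estimated as average precision and that scores have no ties; the abstract does not state which estimator was used, and the labels and scores are illustrative, not study data.

```python
def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this cut-off
    return total / hits

# Toy data: higher score means "more likely disease-affected".
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
roc = auroc(y_true, scores)
ap = average_precision(y_true, scores)
print(round(roc, 3), round(ap, 3))
```

For an anomaly detection model such as a normalizing flow, the score passed to these functions would typically be an anomaly score, for example the negative log-likelihood under a model fitted to unaffected samples; that choice is an assumption here rather than something the abstract specifies.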
Our results suggest that conditional normalizing flow is useful for predicting chronic diseases from personal health records, particularly when the available cases are limited. This approach offers an effective way to handle the sparse data and extreme class imbalance commonly encountered in biomedical settings.