A machine learning approach using conditional normalizing flow to address extreme class imbalance problems in personal health records.

Authors

Kim Yeongmin, Choi Wongyung, Choi Woojeong, Ko Grace, Han Seonggyun, Kim Hwan-Cheol, Kim Dokyoon, Lee Dong-Gi, Shin Dong Wook, Lee Younghee

Affiliations

School of Computing, KAIST, Daejeon, Republic of Korea.

College of Veterinary Medicine and Research Institute for Veterinary Science, Seoul National University, Seoul, Republic of Korea.

Publication information

BioData Min. 2024 May 25;17(1):14. doi: 10.1186/s13040-024-00366-0.

DOI:10.1186/s13040-024-00366-0
PMID:38796471
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11127363/
Abstract

BACKGROUND

Supervised machine learning models are widely used to predict diseases and gain insight into them by classifying patients based on personal health records. However, class imbalance hinders the training of these models. In this study, we aimed to address class imbalance with a conditional normalizing flow model, a deep-learning-based semi-supervised model for anomaly detection. This is the first application of the normalizing flow algorithm to tabular biomedical data.
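For readers unfamiliar with the technique, the general idea behind likelihood-based anomaly detection with a conditional normalizing flow can be written as below. This is the standard change-of-variables formulation, not necessarily the exact architecture used in the study; the symbols $f_\theta$ (an invertible map), $c$ (the conditioning input), $p_Z$ (a simple base density such as a standard Gaussian), and $s$ (the anomaly score) are notation assumed here for illustration.

```latex
\log p_X(x \mid c)
  = \log p_Z\bigl(f_\theta(x; c)\bigr)
  + \log \left| \det \frac{\partial f_\theta(x; c)}{\partial x} \right|,
\qquad
s(x) = -\log p_X(x \mid c)
```

Under this formulation the flow is fitted mainly to the majority (unaffected) class, and a sample with a high score $s(x)$, that is, a low likelihood, is flagged as a likely disease case.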

METHODS

We collected personal health records from South Korean citizens (n = 706), comprising genetic data obtained from a direct-to-consumer service (microarray chip), medical health check-ups, and lifestyle log data. Six chronic diseases were labeled from the health check-up data (obesity, diabetes, hypertriglyceridemia, dyslipidemia, liver dysfunction, and hypertension). After preprocessing, supervised classification models and semi-supervised anomaly detection models, including conditional normalizing flow, were evaluated on the classification of diabetes, which had extreme target imbalance (about 2%), using AUROC and AUPRC. In addition, we evaluated their performance on the other chronic diseases under the assumption that too few patients had been collected, simulated by undersampling disease-affected samples.
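The evaluation protocol described above (positive undersampling to a target base rate, then scoring with AUROC and AUPRC) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the synthetic cohort, and the random scores are all hypothetical, and AUPRC is estimated here by step-wise average precision.

```python
import random

def undersample_positives(labels, target_base_rate, seed=0):
    """Keep all negatives and a random subset of positives so that positives
    make up roughly target_base_rate of the retained samples."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Solve k / (k + n_neg) = r for k, the number of positives to keep.
    k = round(target_base_rate * len(neg) / (1.0 - target_base_rate))
    k = max(1, min(k, len(pos)))
    return sorted(neg + rng.sample(pos, k))

def auroc(labels, scores):
    """Probability that a random positive outranks a random negative
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise AUPRC estimate: mean of the precision values measured
    at the rank of each positive sample."""
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits

# Hypothetical cohort: 200 affected and 490 unaffected samples.
demo_labels = [1] * 200 + [0] * 490
demo_rng = random.Random(42)
# Hypothetical anomaly scores: positives are shifted upward on average.
demo_scores = [demo_rng.gauss(1.0 if y else 0.0, 1.0) for y in demo_labels]

kept = undersample_positives(demo_labels, target_base_rate=0.02)
y = [demo_labels[i] for i in kept]
s = [demo_scores[i] for i in kept]
print("base rate:", sum(y) / len(y))
print("AUROC:", auroc(y, s))
print("AUPRC (average precision):", average_precision(y, s))
```

Repeating the undersampling with different seeds and averaging the two metrics would mirror the repeated-evaluation design mentioned in the abstract, although the exact resampling scheme used in the study is not stated here.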

RESULTS

Over fifty evaluations of diabetes classification, where the base rate was very low (0.02), LightGBM (the best-performing supervised classification model) achieved an AUPRC of 0.16 and an AUROC of 0.82, whereas conditional normalizing flow achieved an AUPRC of 0.34 and an AUROC of 0.83. Moreover, conditional normalizing flow performed better than the supervised model when few disease-affected samples were available for the other five chronic diseases: obesity, hypertriglyceridemia, dyslipidemia, liver dysfunction, and hypertension. For example, when predicting obesity with disease-affected samples undersampled (positive undersampling) to a base rate of 0.02, LightGBM achieved an AUPRC of 0.20 and an AUROC of 0.75, whereas conditional normalizing flow achieved an AUPRC of 0.30 and an AUROC of 0.74.
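One way to read these AUPRC values is against the base rate: a classifier with no skill has an expected AUPRC approximately equal to the positive prevalence, here 0.02. The short sketch below computes the resulting lift for the reported figures. The function name `auprc_lift` and the dictionary layout are illustrative assumptions; only the numeric values come from the abstract.

```python
def auprc_lift(auprc, base_rate):
    """Ratio of an observed AUPRC to the no-skill baseline, which is
    approximately the positive base rate."""
    if not 0.0 < base_rate < 1.0:
        raise ValueError("base_rate must lie strictly between 0 and 1")
    return auprc / base_rate

BASE_RATE = 0.02  # reported base rate for diabetes and undersampled obesity

# AUPRC values as reported in the abstract.
reported_auprc = {
    ("diabetes", "LightGBM"): 0.16,
    ("diabetes", "conditional normalizing flow"): 0.34,
    ("obesity", "LightGBM"): 0.20,
    ("obesity", "conditional normalizing flow"): 0.30,
}

for (disease, model), auprc in reported_auprc.items():
    lift = auprc_lift(auprc, BASE_RATE)
    print(f"{disease:9s} {model:30s} AUPRC={auprc:.2f} lift={lift:.1f}x")
```

On this reading, the diabetes AUPRC rises from about 8 times the baseline with LightGBM to about 17 times with conditional normalizing flow, while the AUROC values of the two models are nearly identical.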

CONCLUSIONS

Our research suggests the utility of conditional normalizing flow, particularly when the available cases are limited, for predicting chronic diseases using personal health records. This approach offers an effective solution to deal with sparse data and extreme class imbalances commonly encountered in the biomedical context.
