
A modified and weighted Gower distance-based clustering analysis for mixed type data: a simulation and empirical analyses.

Authors

Liu Pinyan, Yuan Han, Ning Yilin, Chakraborty Bibhas, Liu Nan, Peres Marco Aurélio

Affiliations

Centre for Quantitative Medicine, Duke-NUS Medical School, 8 College Road, Singapore, 169857, Singapore.

Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore.

Publication information

BMC Med Res Methodol. 2024 Dec 18;24(1):305. doi: 10.1186/s12874-024-02427-8.

DOI:10.1186/s12874-024-02427-8
PMID:39696017
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11654179/
Abstract

BACKGROUND

Traditional clustering techniques are typically restricted to either continuous or categorical variables. However, most real-world clinical data are mixed type. This study aims to introduce a clustering technique specifically designed for datasets containing both continuous and categorical variables to offer better clustering compatibility, adaptability, and interpretability than other mixed type techniques.

METHODS

This paper proposes a modified Gower distance that incorporates feature importance as weights to maintain equal contributions between continuous and categorical features. The algorithm (DAFI) was evaluated using five simulated datasets with varying proportions of important features and real-world datasets from the 2011-2014 National Health and Nutrition Examination Survey (NHANES). Effectiveness was demonstrated through comparisons with 13 clustering techniques. Clustering performance was assessed using the adjusted Rand index (ARI) for accuracy in the simulation studies and the silhouette score for cohesion and separation in NHANES. Additionally, multivariable logistic regression was used to estimate the association between periodontitis (PD) and cardiovascular diseases (CVDs), adjusting for clusters in NHANES.
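
To make the idea of a feature-weighted Gower distance concrete, here is a minimal sketch in plain Python. It implements the generic weighted Gower form (range-scaled absolute difference for continuous features, simple mismatch for categorical ones, combined as a weighted average), not the specific modification used by DAFI, which the abstract does not spell out. The function name `weighted_gower`, the example features, and the weights are all hypothetical; in practice the weights would come from whatever feature-importance estimate is used.

```python
from typing import Sequence, Union

Value = Union[float, str]


def weighted_gower(
    a: Sequence[Value],
    b: Sequence[Value],
    is_categorical: Sequence[bool],
    ranges: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Generic weighted Gower distance between two mixed-type records.

    ranges[k] is the observed range (max - min) of continuous feature k
    in the dataset; it is ignored for categorical features.
    weights[k] is a non-negative importance weight (assumed given).
    """
    total = 0.0
    weight_sum = 0.0
    for x, y, cat, rng, w in zip(a, b, is_categorical, ranges, weights):
        if cat:
            # Categorical feature: 0 if the categories match, 1 otherwise.
            d = 0.0 if x == y else 1.0
        else:
            # Continuous feature: absolute difference scaled into [0, 1].
            d = abs(x - y) / rng if rng > 0 else 0.0
        total += w * d
        weight_sum += w
    # Weighted average of per-feature dissimilarities.
    return total / weight_sum if weight_sum > 0 else 0.0


# Hypothetical records: (age in years, BMI, smoker status).
patient_a = (50.0, 25.0, "yes")
patient_b = (60.0, 30.0, "no")
distance = weighted_gower(
    patient_a,
    patient_b,
    is_categorical=(False, False, True),
    ranges=(40.0, 20.0, 0.0),
    weights=(0.5, 0.25, 0.25),
)
print(distance)  # 0.5*0.25 + 0.25*0.25 + 0.25*1.0 = 0.4375
```

Because every per-feature dissimilarity lies in [0, 1], the weights alone control how much each feature contributes, which is what lets an importance-based weighting downplay redundant variables.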

RESULTS

In simulation studies, DAFI-Gower consistently outperformed the baseline methods on the adjusted Rand index across the settings investigated, especially on datasets with more redundant features. In NHANES, 3,760 people were analyzed. DAFI-Gower achieved the highest silhouette score (0.79), and four distinct clusters with diverse health profiles were identified. With feature importance incorporated, cluster formation was more strongly influenced by CVD-related factors. After adjusting for clusters, periodontitis was significantly associated with cardiovascular diseases (adjusted OR 1.95, 95% CI 1.50 to 2.55, p = 0.012), highlighting severe periodontitis as a potential risk factor for cardiovascular diseases.
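
For readers unfamiliar with the adjusted Rand index used in the simulation comparison, the following is a small illustrative sketch of the standard pair-counting formula in plain Python. It is not the authors' evaluation code, and the function name and toy label vectors are made up for the example. The index is 1 when two partitions agree exactly (up to relabeling of clusters) and is near 0 for chance-level agreement.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(
    labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]
) -> float:
    """Adjusted Rand index between two partitions of the same items."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: partitions trivially agree

    # Pairs of items placed together in both partitions, in the
    # reference partition, and in the predicted partition.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values()
    )
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both
    return (together_both - expected) / (maximum - expected)


# Same grouping with swapped label names: perfect agreement.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Groupings that cut across each other: worse than chance.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

The correction for chance is what makes the index comparable across methods that return different numbers of clusters.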

CONCLUSIONS

DAFI performed better than classic clustering baselines on both simulated and real-world datasets. It effectively captures cluster characteristics by considering feature importance, which is crucial in clinical settings where many variables may be similar or irrelevant. We envisage that DAFI offers an effective solution for mixed type clustering.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2a27/11654179/a1ef76017516/12874_2024_2427_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2a27/11654179/7e305becd6ce/12874_2024_2427_Figa_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2a27/11654179/11ae706069c0/12874_2024_2427_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2a27/11654179/038afcfd959b/12874_2024_2427_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2a27/11654179/d00228debb2d/12874_2024_2427_Fig3_HTML.jpg
