Suppr超能文献

一种基于修正加权Gower距离的混合型数据聚类分析:模拟与实证分析

A modified and weighted Gower distance-based clustering analysis for mixed type data: a simulation and empirical analyses.

作者信息

Liu Pinyan, Yuan Han, Ning Yilin, Chakraborty Bibhas, Liu Nan, Peres Marco Aurélio

机构信息

Centre for Quantitative Medicine, Duke-NUS Medical School, 8 College Road, Singapore, 169857, Singapore.

Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore.

出版信息

BMC Med Res Methodol. 2024 Dec 18;24(1):305. doi: 10.1186/s12874-024-02427-8.

Abstract

BACKGROUND

Traditional clustering techniques are typically restricted to either continuous or categorical variables. However, most real-world clinical data are mixed type. This study aims to introduce a clustering technique specifically designed for datasets containing both continuous and categorical variables to offer better clustering compatibility, adaptability, and interpretability than other mixed type techniques.

METHODS

This paper proposed a modified Gower distance incorporating feature importance as weights to maintain equal contributions between continuous and categorical features. The algorithm (DAFI) was evaluated using five simulated datasets with varying proportions of important features and real-world datasets from the 2011-2014 National Health and Nutrition Examination Survey (NHANES). Effectiveness was demonstrated through comparisons with 13 clustering techniques. Clustering performance was assessed using the adjusted Rand index (ARI) for accuracy in simulation studies and the silhouette score for cohesion and separation in NHANES. Additionally, multivariable logistic regression estimated the association between periodontitis (PD) and cardiovascular diseases (CVDs), adjusting for clusters in NHANES.

RESULTS

In simulation studies, the DAFI-Gower algorithm consistently performs better than baseline methods according to the adjusted Rand index in settings investigated, especially on datasets with more redundant features. In NHANES, 3,760 people were analyzed. DAFI-Gower achieves the highest silhouette score (0.79). Four distinct clusters with diverse health profiles were identified. By incorporating feature importance, we found that cluster formations were more strongly influenced by CVD-related factors. The association between periodontitis and cardiovascular diseases, after adjusting for clusters, reveals significant insights (adjusted OR 1.95, 95% CI 1.50 to 2.55, p = 0.012), highlighting severe periodontitis as a potential risk factor for cardiovascular diseases.

CONCLUSIONS

DAFI performed better than classic clustering baselines on both simulated and real-world datasets. It effectively captures cluster characteristics by considering feature importance, which is crucial in clinical settings where many variables may be similar or irrelevant. We envisage that DAFI offers an effective solution for mixed type clustering.

摘要

背景

传统的聚类技术通常仅限于处理连续变量或分类变量。然而,大多数现实世界中的临床数据是混合型的。本研究旨在引入一种专门为同时包含连续变量和分类变量的数据集设计的聚类技术,以提供比其他混合型技术更好的聚类兼容性、适应性和可解释性。

方法

本文提出了一种改进的Gower距离,将特征重要性作为权重纳入其中,以保持连续特征和分类特征之间的同等贡献。使用五个具有不同重要特征比例的模拟数据集以及2011 - 2014年国家健康与营养检查调查(NHANES)的真实世界数据集对该算法(DAFI)进行评估。通过与13种聚类技术进行比较来证明其有效性。在模拟研究中,使用调整后的兰德指数(ARI)评估聚类准确性,在NHANES中使用轮廓系数评估凝聚性和分离性。此外,多变量逻辑回归估计了牙周炎(PD)与心血管疾病(CVD)之间的关联,并在NHANES中对聚类进行了调整。

结果

在模拟研究中,根据调整后的兰德指数,在研究的设置中,DAFI - Gower算法始终比基线方法表现更好,特别是在具有更多冗余特征的数据集上。在NHANES中,对3760人进行了分析。DAFI - Gower获得了最高的轮廓系数(0.79)。识别出了四个具有不同健康状况的不同聚类。通过纳入特征重要性,我们发现聚类形成受心血管疾病相关因素的影响更大。在对聚类进行调整后,牙周炎与心血管疾病之间的关联揭示了重要见解(调整后的比值比为1.95,95%置信区间为1.50至2.55,p = 0.012),突出了重度牙周炎作为心血管疾病的潜在危险因素。

结论

DAFI在模拟数据集和真实世界数据集上的表现均优于经典聚类基线。它通过考虑特征重要性有效地捕捉聚类特征,这在许多变量可能相似或无关的临床环境中至关重要。我们设想DAFI为混合型聚类提供了一种有效的解决方案。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2a27/11654179/7e305becd6ce/12874_2024_2427_Figa_HTML.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验