Park Beomjin, Park Changyi, Hong Sungchul, Choi Hosik
Department of Information and Statistics, Gyeongsang National University, Jinju, South Korea.
Department of Statistics, University of Seoul, Seoul, South Korea.
J Appl Stat. 2024 Jun 5;52(1):158-182. doi: 10.1080/02664763.2024.2362266. eCollection 2025.
Clustering is an essential technique that groups similar data points to uncover the underlying structure and features of the data. Although traditional clustering methods such as k-means are widely utilized, they have limitations in identifying nonlinear clusters. Thus, alternative techniques, such as kernel k-means and spectral clustering, have been developed to address this issue. However, another challenge arises when irrelevant variables are present in the data; this can be mitigated by employing variable selection methods such as the filter, wrapper, and embedded approaches. In this study, with a particular focus on kernel k-means clustering, we propose an embedded variable selection method using a tensor product space along with a general analysis of variance kernel for nonlinear clustering. Comprehensive experiments involving simulations and real data analysis demonstrated that the proposed method achieves competitive performance compared to existing approaches. Thus, the proposed method may serve as a reliable tool for accurate cluster identification and variable selection to gain insights into complex datasets.
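To make the ingredients named in the abstract concrete, the following is a minimal sketch of kernel k-means run on a main-effects ANOVA-type kernel, that is, a weighted sum of one-dimensional Gaussian kernels with one weight per variable. It is an illustration under stated assumptions, not the authors' method: the function names `anova_kernel` and `kernel_kmeans`, the Gaussian component kernel, the bandwidth `gamma`, and the hand-supplied `weights` are all hypothetical choices. The abstract does not describe how the proposed method estimates variable weights, so here they are simply passed in; a weight of zero removes a variable from the kernel, which is the sense in which such weights can encode variable selection.

```python
import numpy as np


def anova_kernel(X, weights, gamma=1.0):
    """Weighted sum of per-variable Gaussian kernels (main effects only).

    Assumption: Gaussian component kernels with a shared bandwidth `gamma`.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    K = np.zeros((n, n))
    for j in range(p):
        # Pairwise differences on variable j only.
        d = X[:, j][:, None] - X[:, j][None, :]
        K += weights[j] * np.exp(-gamma * d ** 2)
    return K


def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Lloyd-style kernel k-means on a precomputed kernel matrix K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(n_clusters, size=n)  # random initial assignment
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            mask = labels == c
            if not mask.any():
                continue  # an empty cluster stays at infinite distance
            # ||phi(x_i) - mu_c||^2
            #   = K_ii - 2 * mean_j K_ij + mean_{j,l} K_jl, with j, l in cluster c
            dist[:, c] = (
                diag
                - 2.0 * K[:, mask].mean(axis=1)
                + K[np.ix_(mask, mask)].mean()
            )
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Variable 0 carries two groups; variable 1 is pure noise.
    signal = np.concatenate([rng.normal(-3, 0.5, 30), rng.normal(3, 0.5, 30)])
    noise = rng.normal(0, 5, 60)
    X = np.column_stack([signal, noise])
    # A zero weight drops the noise variable from the kernel.
    K = anova_kernel(X, weights=[1.0, 0.0], gamma=0.5)
    print(kernel_kmeans(K, n_clusters=2))
```

Because the result depends on the random initial assignment, a practical run would typically repeat the procedure from several seeds and keep the partition with the smallest within-cluster sum of feature-space distances.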