Simultaneous clustering and variable selection: A novel algorithm and model selection procedure.

Affiliations

Section Leadership and Management, University of Amsterdam, Amsterdam, The Netherlands.

Department of Methodology and Statistics, Tilburg University, Tilburg, Netherlands.

Publication information

Behav Res Methods. 2023 Aug;55(5):2157-2174. doi: 10.3758/s13428-022-01795-7. Epub 2022 Sep 9.

DOI:10.3758/s13428-022-01795-7
PMID:36085542
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10439051/
Abstract

The growing availability of high-dimensional data sets offers behavioral scientists an unprecedented opportunity to integrate the information hidden in novel types of data (e.g., genetic data, social media data, and GPS tracks) and thereby obtain a more detailed and comprehensive view of their research questions. In the context of clustering, analyzing a large number of variables could yield a more accurate estimation of underlying subgroups or the discovery of new ones. However, a unique challenge is that high-dimensional data sets are likely to contain a substantial number of irrelevant variables. These variables do not contribute to the separation of clusters and may mask cluster partitions. The current paper addresses this challenge by introducing a new clustering algorithm, called Cardinality K-means or CKM, and by proposing a novel model selection strategy. CKM performs simultaneous clustering and variable selection with high stability. In two simulation studies and an empirical demonstration with genetic data, CKM consistently outperformed competing methods in recovering cluster partitions and identifying signaling variables. In addition, the novel model selection strategy determines the number of clusters based on a subset of variables that are most likely to be signaling variables. In a simulation study, this strategy estimated the number of clusters more accurately than the conventional strategy, which uses the full set of variables. The proposed CKM algorithm, together with the novel model selection strategy, has been implemented in a freely accessible R package.
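The abstract does not spell out CKM's update rules, so the following is only a rough Python sketch of the general idea of clustering under a cardinality constraint, not the authors' algorithm or their R package. It assumes that "cardinality" means keeping a fixed number of variables, here chosen as those with the largest between-cluster sum of squares, and it alternates that selection with ordinary K-means updates. The names `farthest_first`, `ckm_sketch`, and `n_keep`, as well as the farthest-first initialisation, are hypothetical choices made for this illustration.

```python
import numpy as np


def farthest_first(X, k):
    """Pick k starting centroids: observation 0, then repeatedly the
    observation farthest from all centroids chosen so far (assumed init)."""
    idx = [0]
    d = ((X - X[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        nxt = int(d.argmax())
        idx.append(nxt)
        d = np.minimum(d, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[idx].copy()


def ckm_sketch(X, n_clusters, n_keep, n_iter=50):
    """Illustrative only: K-means in which cluster assignment uses just the
    n_keep variables with the largest between-cluster sum of squares."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)  # centre, so the grand mean of every variable is 0
    centroids = farthest_first(X, n_clusters)
    selected = np.arange(X.shape[1])  # start with all variables
    labels = None
    for _ in range(n_iter):
        # Assignment step: squared distances on the selected variables only.
        diff = X[:, None, selected] - centroids[None, :, selected]
        new_labels = (diff ** 2).sum(axis=2).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # partition is stable
        labels = new_labels
        # Centroid step: an empty cluster keeps its previous centroid.
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
        # Selection step: keep the n_keep variables that separate clusters most.
        sizes = np.bincount(labels, minlength=n_clusters)
        between = (sizes[:, None] * centroids ** 2).sum(axis=0)
        selected = np.sort(np.argsort(between)[-n_keep:])
    return labels, selected


# Toy data: two clusters separated on variables 0 and 1, plus 8 noise variables.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 30)
X_demo = rng.normal(size=(60, 10))
X_demo[:, :2] += np.where(truth[:, None] == 0, -5.0, 5.0)
labels, selected = ckm_sketch(X_demo, n_clusters=2, n_keep=2)
print(selected)
```

In this toy setting the number of variables to keep is supplied by hand; choosing it, together with the number of clusters, is the model selection problem that the abstract says the paper addresses.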

Similar articles

1
Simultaneous clustering and variable selection: A novel algorithm and model selection procedure.
Behav Res Methods. 2023 Aug;55(5):2157-2174. doi: 10.3758/s13428-022-01795-7. Epub 2022 Sep 9.
2
Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.
Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.
3
Simultaneous gene clustering and subset selection for sample classification via MDL.
Bioinformatics. 2003 Jun 12;19(9):1100-9. doi: 10.1093/bioinformatics/btg039.
4
Automated variable weighting in k-means type clustering.
IEEE Trans Pattern Anal Mach Intell. 2005 May;27(5):657-68. doi: 10.1109/TPAMI.2005.95.
5
Identifying clusters in genomics data by recursive partitioning.
Stat Appl Genet Mol Biol. 2013 Oct 1;12(5):637-52. doi: 10.1515/sagmb-2013-0016.
6
Simultaneous estimation of cluster number and feature sparsity in high-dimensional cluster analysis.
Biometrics. 2022 Jun;78(2):574-585. doi: 10.1111/biom.13449. Epub 2021 Mar 15.
7
A genetic algorithm using hyper-quadtrees for low-dimensional K-means clustering.
IEEE Trans Pattern Anal Mach Intell. 2006 Apr;28(4):533-43. doi: 10.1109/TPAMI.2006.66.
8
Iterative class discovery and feature selection using Minimal Spanning Trees.
BMC Bioinformatics. 2004 Sep 8;5:126. doi: 10.1186/1471-2105-5-126.
9
caBIG VISDA: modeling, visualization, and discovery for cluster analysis of genomic data.
BMC Bioinformatics. 2008 Sep 18;9:383. doi: 10.1186/1471-2105-9-383.
10
Automatic Spectroscopic Data Categorization by Clustering Analysis (ASCLAN): A Data-Driven Approach for Distinguishing Discriminatory Metabolites for Phenotypic Subclasses.
Anal Chem. 2016 Jun 7;88(11):5670-9. doi: 10.1021/acs.analchem.5b04020. Epub 2016 May 13.

Cited by

1
LACE-UP: An ensemble machine-learning method for health subtype classification on multidimensional binary data.
Proc Natl Acad Sci U S A. 2025 Apr 29;122(17):e2423341122. doi: 10.1073/pnas.2423341122. Epub 2025 Apr 23.
