Simulation-derived best practices for clustering clinical data.

Affiliations

The Ohio State University College of Medicine, 370 W 9th Ave, Columbus, OH 43210, USA.

Department of Biomedical Informatics, The Ohio State University, 1800 Cannon Dr, Columbus, OH 43210, USA.

Publication information

J Biomed Inform. 2021 Jun;118:103788. doi: 10.1016/j.jbi.2021.103788. Epub 2021 Apr 20.

DOI:10.1016/j.jbi.2021.103788
PMID:33862229
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9017600/
Abstract

INTRODUCTION

Clustering analyses in clinical contexts hold promise to improve the understanding of patient phenotype and disease course in chronic and acute clinical medicine. However, work remains to ensure that solutions are rigorous, valid, and reproducible. In this paper, we evaluate best practices for dissimilarity matrix calculation and clustering on mixed-type, clinical data.

METHODS

We simulate clinical data to represent problems in clinical trials, cohort studies, and EHR data, including single-type datasets (binary, continuous, categorical) and 4 data mixtures. We test 5 single distance metrics (Jaccard, Hamming, Gower, Manhattan, Euclidean) and 3 mixed distance metrics (DAISY, Supersom, and Mercator) with 3 clustering algorithms (hierarchical (HC), k-medoids, self-organizing maps (SOM)). We quantitatively and visually validate by Adjusted Rand Index (ARI) and silhouette width (SW). We applied our best methods to two real-world data sets: (1) 21 features collected on 247 patients with chronic lymphocytic leukemia, and (2) 40 features collected on 6000 patients admitted to an intensive care unit.
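The workflow described above (compute a dissimilarity matrix on mixed-type records, then cluster it) can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes a Gower-style average of per-feature distances and a naive average-linkage hierarchical clustering. The patient records, the feature names, and the helper functions `mixed_dissimilarity`, `dissimilarity_matrix`, and `average_linkage` are all hypothetical.

```python
from itertools import combinations


def mixed_dissimilarity(a, b, kinds, ranges):
    """Gower-style dissimilarity: mean of per-feature distances in [0, 1]."""
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "continuous":
            # Range-normalized absolute difference; a constant column adds 0.
            total += abs(x - y) / rng if rng else 0.0
        else:
            # Binary or categorical feature: simple mismatch (0 or 1).
            total += 0.0 if x == y else 1.0
    return total / len(kinds)


def dissimilarity_matrix(rows, kinds):
    """Symmetric n x n matrix of pairwise mixed-type dissimilarities."""
    ranges = []
    for k, kind in enumerate(kinds):
        if kind == "continuous":
            column = [row[k] for row in rows]
            ranges.append(max(column) - min(column))
        else:
            ranges.append(None)
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = mixed_dissimilarity(rows[i], rows[j], kinds, ranges)
    return d


def average_linkage(d, n_clusters):
    """Naive agglomerative clustering: merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > n_clusters:
        best = None
        for p, q in combinations(range(len(clusters)), 2):
            pair_sum = sum(d[i][j] for i in clusters[p] for j in clusters[q])
            dist = pair_sum / (len(clusters[p]) * len(clusters[q]))
            if best is None or dist < best[0]:
                best = (dist, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    labels = [0] * len(d)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical toy records: (age in years, sex, disease stage).
rows = [
    (34.0, "F", "I"), (36.0, "F", "I"), (38.0, "F", "II"),
    (71.0, "M", "III"), (74.0, "M", "III"), (69.0, "M", "II"),
]
kinds = ["continuous", "binary", "categorical"]

d = dissimilarity_matrix(rows, kinds)
labels = average_linkage(d, n_clusters=2)
print(labels)
```

On real data, a library implementation (for example the `daisy` function in the R `cluster` package, or SciPy's hierarchical clustering on a precomputed matrix) would replace these quadratic-time helpers.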

RESULTS

HC outperformed k-medoids and SOM by ARI across data types. DAISY produced the highest mean ARI for mixed data types for all mixtures except unbalanced mixtures dominated by continuous data. Compared to other methods, DAISY with HC uncovered superior, separable clusters in both real-world data sets.
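Because the comparisons above are reported in terms of ARI, a small reference implementation may help readers see what the score measures: agreement between two partitions, corrected for chance. The sketch below uses the standard pair-counting formula. The function name `adjusted_rand_index` and the example label vectors are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Pair-counting Adjusted Rand Index between two labelings of the same items."""
    n = len(truth)
    if n < 2:
        return 1.0  # No pairs to compare; treat as perfect agreement.
    # Pairs placed together in both partitions (from the contingency table).
    pairs_both = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    # Pairs placed together within each partition separately.
    pairs_truth = sum(comb(c, 2) for c in Counter(truth).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = pairs_truth * pairs_pred / comb(n, 2)
    maximum = (pairs_truth + pairs_pred) / 2
    if maximum == expected:
        return 1.0  # Degenerate case, e.g. both partitions are one single cluster.
    return (pairs_both - expected) / (maximum - expected)


truth = [0, 0, 0, 1, 1, 1]
# Label names do not matter, only the grouping does.
print(adjusted_rand_index(truth, [1, 1, 1, 0, 0, 0]))  # 1.0
# One item assigned to the wrong cluster lowers the score (about 0.324).
print(adjusted_rand_index(truth, [0, 0, 1, 1, 1, 1]))
```

An ARI of 1 indicates identical partitions, while values near 0 indicate agreement no better than chance.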

DISCUSSION

Selecting an appropriate mixed-type metric allows the investigator to obtain optimal separation of patient clusters and get maximum use of their data. Superior metrics for mixed-type data handle multiple data types using multiple, type-focused distances. Better subclassification of disease opens avenues for targeted treatments, precision medicine, clinical decision support, and improved patient outcomes.
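As an illustration of what "multiple, type-focused distances" can mean, one widely used construction is Gower's general coefficient, which to our understanding is the basis of the mixed-type option in DAISY. The notation below is ours and is offered as a sketch, not as the paper's own formulation: $x_{ik}$ is the value of feature $k$ for patient $i$, $R_k$ is the observed range of a continuous feature, and $\delta_{ijk}$ is 0 when feature $k$ is missing for either patient and 1 otherwise.

```latex
d(i,j) = \frac{\sum_{k=1}^{p} \delta_{ijk}\, d_{ijk}}{\sum_{k=1}^{p} \delta_{ijk}},
\qquad
d_{ijk} =
\begin{cases}
\dfrac{\lvert x_{ik} - x_{jk} \rvert}{R_k} & \text{if feature } k \text{ is continuous},\\[1.5ex]
0 & \text{if feature } k \text{ is binary or categorical and } x_{ik} = x_{jk},\\[0.5ex]
1 & \text{if feature } k \text{ is binary or categorical and } x_{ik} \neq x_{jk}.
\end{cases}
```

Each feature contributes a distance in $[0, 1]$ computed by a rule suited to its type, so no single data type dominates the overall dissimilarity purely because of its scale.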

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fbc3/9017600/c4cf77f5a7d6/nihms-1786862-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fbc3/9017600/b6f174b85d1d/nihms-1786862-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fbc3/9017600/e260861e01fa/nihms-1786862-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fbc3/9017600/4373857be02f/nihms-1786862-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fbc3/9017600/42a5af90bb72/nihms-1786862-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fbc3/9017600/85a350b4cbcb/nihms-1786862-f0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fbc3/9017600/c2395cda70a8/nihms-1786862-f0007.jpg
