Clustering datasets with demographics and diagnosis codes.

Authors

Zhong Haodi, Loukides Grigorios, Gwadera Robert

Affiliations

Department of Informatics, King's College London, London, UK.

School of Computer Science, Cardiff University, Cardiff, UK.

Publication information

J Biomed Inform. 2020 Feb;102:103360. doi: 10.1016/j.jbi.2019.103360. Epub 2020 Jan 3.

DOI:10.1016/j.jbi.2019.103360

PMID:31904428

Abstract

Clustering data derived from Electronic Health Record (EHR) systems is important both for discovering relationships between the clinical profiles of patients and as a preprocessing step for analysis tasks such as classification. However, the heterogeneity of these data makes existing clustering methods difficult to apply and calls for new clustering approaches. In this paper, we propose the first approach for clustering a dataset in which each record contains a patient's values in demographic attributes and their set of diagnosis codes. Our approach represents the dataset in a binary form in which the features are selected demographic values, as well as combinations (patterns) of frequent and correlated diagnosis codes. This representation enables measuring similarity between records using cosine similarity, an effective measure for binary-represented data, and finding compact, well-separated clusters through hierarchical clustering. Our experiments using two publicly available EHR datasets, comprising over 26,000 and 52,000 records respectively, demonstrate that our approach is able to construct clusters with correlated demographics and diagnosis codes, and that it is efficient and scalable.
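
The pipeline outlined in the abstract (binary features, cosine similarity, hierarchical clustering) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy records, the function names, the `min_support` threshold, and the use of average linkage are all assumptions made for illustration. In particular, the sketch keeps code patterns by frequency alone and omits both the correlation criterion and the selection of demographic values that the abstract mentions.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy records: demographic values plus a set of diagnosis codes.
records = [
    {"demo": {"sex=F", "age=60-69"}, "codes": {"250.00", "401.9"}},
    {"demo": {"sex=F", "age=70-79"}, "codes": {"250.00", "401.9", "272.4"}},
    {"demo": {"sex=M", "age=20-29"}, "codes": {"493.90"}},
    {"demo": {"sex=M", "age=20-29"}, "codes": {"493.90"}},
]

def frequent_patterns(records, min_support=2, max_size=2):
    """Return code combinations occurring in at least min_support records.

    Assumption: frequency only; the paper also requires correlation.
    """
    counts = {}
    for r in records:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(r["codes"]), size):
                counts[combo] = counts.get(combo, 0) + 1
    return sorted(p for p, c in counts.items() if c >= min_support)

def binarize(records, patterns):
    """One 0/1 feature per demographic value and per retained pattern."""
    demo_values = sorted(set().union(*(r["demo"] for r in records)))
    rows = []
    for r in records:
        row = [int(v in r["demo"]) for v in demo_values]
        row += [int(set(p) <= r["codes"]) for p in patterns]
        rows.append(row)
    return rows

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(rows, k):
    """Agglomerative clustering with average-linkage cosine similarity."""
    clusters = [[i] for i in range(len(rows))]

    def linkage(c1, c2):
        total = sum(cosine(rows[i], rows[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the pair of clusters with the highest average similarity.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    return sorted(sorted(c) for c in clusters)

patterns = frequent_patterns(records)
rows = binarize(records, patterns)
result = cluster(rows, k=2)
print(result)  # → [[0, 1], [2, 3]]
```

On this toy input the two female patients with diabetes and hypertension codes fall into one cluster and the two male asthma patients into the other. A dataset at the scale reported in the abstract would need a dedicated pattern-mining and clustering library rather than this quadratic pure-Python loop.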


Similar articles

Clustering datasets with demographics and diagnosis codes.

J Biomed Inform. 2020 Feb;102:103360. doi: 10.1016/j.jbi.2019.103360. Epub 2020 Jan 3.

Clustering Demographics and Sequences of Diagnosis Codes.

IEEE J Biomed Health Inform. 2022 May;26(5):2351-2359. doi: 10.1109/JBHI.2021.3129461. Epub 2022 May 5.

Anonymizing datasets with demographics and diagnosis codes in the presence of utility constraints.

J Biomed Inform. 2017 Jan;65:76-96. doi: 10.1016/j.jbi.2016.11.001. Epub 2016 Nov 8.

A clustering approach for detecting implausible observation values in electronic health records data.

BMC Med Inform Decis Mak. 2019 Jul 23;19(1):142. doi: 10.1186/s12911-019-0852-6.

An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records.

Artif Intell Med. 2015 Oct;65(2):155-66. doi: 10.1016/j.artmed.2015.04.007. Epub 2015 May 15.

Employing heat maps to mine associations in structured routine care data.

Artif Intell Med. 2014 Feb;60(2):79-88. doi: 10.1016/j.artmed.2013.12.003. Epub 2013 Dec 15.

Automated grouping of medical codes via multiview banded spectral clustering.

J Biomed Inform. 2019 Dec;100:103322. doi: 10.1016/j.jbi.2019.103322. Epub 2019 Oct 28.

Identifying and characterizing highly similar notes in big clinical note datasets.

J Biomed Inform. 2018 Jun;82:63-69. doi: 10.1016/j.jbi.2018.04.009. Epub 2018 Apr 19.

EHR problem list clustering for improved topic-space navigation.

BMC Med Inform Decis Mak. 2019 Apr 4;19(Suppl 3):72. doi: 10.1186/s12911-019-0789-9.

Integrating Multimodal Electronic Health Records for Diagnosis Prediction.

AMIA Annu Symp Proc. 2022 Feb 21;2021:726-735. eCollection 2021.

Cited by

Noisecut: a python package for noise-tolerant classification of binary data using prior knowledge integration and max-cut solutions.

BMC Bioinformatics. 2024 Apr 20;25(1):155. doi: 10.1186/s12859-024-05769-8.

Factors associated with resistance to SARS-CoV-2 infection discovered using large-scale medical record data and machine learning.

PLoS One. 2023 Feb 22;18(2):e0278466. doi: 10.1371/journal.pone.0278466. eCollection 2023.

A Framework for Automatic Clustering of EHR Messages Using a Spatial Clustering Approach.

Healthcare (Basel). 2023 Jan 30;11(3):390. doi: 10.3390/healthcare11030390.

Health service research definition builder: An R Shiny application for exploring diagnosis codes associated with services reported in routinely collected health data.

PLoS One. 2023 Jan 12;18(1):e0266154. doi: 10.1371/journal.pone.0266154. eCollection 2023.
