Clustering datasets with demographics and diagnosis codes.

Authors

Zhong Haodi, Loukides Grigorios, Gwadera Robert

Affiliations

Department of Informatics, King's College London, London, UK.

School of Computer Science, Cardiff University, Cardiff, UK.

Publication information

J Biomed Inform. 2020 Feb;102:103360. doi: 10.1016/j.jbi.2019.103360. Epub 2020 Jan 3.

Abstract

Clustering data derived from Electronic Health Record (EHR) systems is important both for discovering relationships between the clinical profiles of patients and as a preprocessing step for analysis tasks such as classification. However, the heterogeneity of these data makes existing clustering methods difficult to apply and calls for new clustering approaches. In this paper, we propose the first approach for clustering a dataset in which each record contains a patient's values in demographic attributes and their set of diagnosis codes. Our approach represents the dataset in a binary form in which the features are selected demographic values, as well as combinations (patterns) of frequent and correlated diagnosis codes. This representation enables measuring similarity between records using cosine similarity, an effective measure for binary-represented data, and finding compact, well-separated clusters through hierarchical clustering. Our experiments using two publicly available EHR datasets, comprising over 26,000 and 52,000 records, respectively, demonstrate that our approach is able to construct clusters with correlated demographics and diagnosis codes, and that it is efficient and scalable.
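The pipeline described in the abstract (binary features, cosine similarity, hierarchical clustering) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy records, the function names, the `min_support` threshold, and the choice of average linkage are all assumptions made for illustration, and the pattern step keeps only frequent code combinations, leaving out the correlation criterion the paper also applies.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy records: demographic values plus a set of diagnosis codes.
# The codes are illustrative placeholders, not data from the paper.
records = [
    {"demo": {"age=60-70", "sex=F"}, "codes": {"250.00", "401.9"}},
    {"demo": {"age=60-70", "sex=F"}, "codes": {"250.00", "401.9", "272.4"}},
    {"demo": {"age=20-30", "sex=M"}, "codes": {"493.90"}},
    {"demo": {"age=20-30", "sex=M"}, "codes": {"493.90", "477.9"}},
]

def frequent_patterns(records, min_support=2, max_size=2):
    """Code combinations occurring in at least min_support records.
    Simplification (assumption): frequency only, no correlation test."""
    counts = {}
    for r in records:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(r["codes"]), k):
                counts[combo] = counts.get(combo, 0) + 1
    return [p for p, c in counts.items() if c >= min_support]

def binarize(records, patterns):
    """One binary feature per demographic value and per code pattern."""
    demo_values = sorted({v for r in records for v in r["demo"]})
    rows = []
    for r in records:
        row = [1 if v in r["demo"] else 0 for v in demo_values]
        row += [1 if set(p) <= r["codes"] else 0 for p in patterns]
        rows.append(row)
    return rows

def cosine(a, b):
    """Cosine similarity; for 0/1 vectors the squared norm equals the sum."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(a)), sqrt(sum(b))
    return dot / (na * nb) if na and nb else 0.0

def agglomerative(rows, n_clusters):
    """Average-linkage agglomerative clustering on cosine similarity.
    The linkage choice is an assumption, not taken from the paper."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            sims = [cosine(rows[a], rows[b])
                    for a in clusters[i] for b in clusters[j]]
            avg = sum(sims) / len(sims)
            if best is None or avg > best[0]:
                best = (avg, i, j)
        _, i, j = best
        clusters[i] += clusters[j]  # merge the most similar pair
        del clusters[j]             # j > i, so index i stays valid
    return [sorted(c) for c in clusters]

patterns = frequent_patterns(records)
rows = binarize(records, patterns)
clusters = agglomerative(rows, n_clusters=2)
```

On this toy input the two clusters separate records 0 and 1 from records 2 and 3, since the two groups share no demographic value or frequent code pattern. The pairwise merge search is quadratic per step, so a sketch like this would not scale to datasets of the size used in the paper's experiments.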
