IEEE J Biomed Health Inform. 2022 May;26(5):2351-2359. doi: 10.1109/JBHI.2021.3129461. Epub 2022 May 5.
A Relational-Sequential dataset (RS-dataset for short) contains records, each composed of a patient's values in demographic attributes and that patient's sequence of diagnosis codes. Clustering an RS-dataset supports analyses ranging from pattern mining to classification. However, existing methods are not appropriate for this task. Thus, we initiate a study of how an RS-dataset can be clustered effectively and efficiently. We formalize the task of clustering an RS-dataset as an optimization problem. At the heart of the problem is a distance measure that we design to quantify how similar two records of an RS-dataset are. Our measure combines a tree structure that encodes hierarchical relationships between records, based on their demographics, with an edit-distance-like measure that captures both the sequentiality and the semantic similarity of diagnosis codes. We also develop an algorithm that, for a given k, first identifies k representative records (centers) and then constructs k clusters, each containing one center and the records that are closer to that center than to any other center. Experiments on two Electronic Health Record datasets demonstrate that our algorithm constructs compact and well-separated clusters that preserve meaningful relationships between demographics and sequences of diagnosis codes, while being efficient and scalable.
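The two ingredients described in the abstract, a pairwise record distance and a center-based assignment, can be illustrated with a minimal sketch. This is not the authors' measure or algorithm; the abstract does not specify them. Everything named here is an assumption made for illustration: the `Record` type, the toy age hierarchy in `PARENT`, the substitution costs in `code_cost` (codes sharing their first three characters are treated as semantically close), the equal weighting of the two components, and the use of greedy farthest-first selection to pick the k centers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    age_group: str  # hypothetical single demographic attribute
    codes: tuple    # ordered diagnosis codes (ICD-like strings)


# Hypothetical generalization hierarchy (child -> parent); all leaves share one depth.
PARENT = {
    "20-29": "adult", "30-39": "adult",
    "70-79": "senior", "80-89": "senior",
    "adult": "any", "senior": "any",
}


def ancestors(value):
    """Return the path from a value up to the root of the hierarchy."""
    path = [value]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def demographic_distance(a, b):
    """Height of the lowest common ancestor, scaled to [0, 1]."""
    pa, pb = ancestors(a), ancestors(b)
    lca_height = next(i for i, v in enumerate(pa) if v in pb)
    return lca_height / (len(pa) - 1)


def code_cost(x, y):
    """Assumed substitution cost: cheaper when codes share a 3-character prefix."""
    if x == y:
        return 0.0
    return 0.5 if x[:3] == y[:3] else 1.0


def sequence_distance(s, t):
    """Edit distance with semantic substitution costs, scaled by the longer length."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, x in enumerate(s, 1):
        cur = [float(i)]
        for j, y in enumerate(t, 1):
            cur.append(min(prev[j] + 1.0,                    # delete x
                           cur[j - 1] + 1.0,                 # insert y
                           prev[j - 1] + code_cost(x, y)))   # substitute x -> y
        prev = cur
    return prev[-1] / max(len(s), len(t), 1)


def record_distance(r1, r2):
    """Equal-weight combination of the two components (the weights are an assumption)."""
    return (0.5 * demographic_distance(r1.age_group, r2.age_group)
            + 0.5 * sequence_distance(r1.codes, r2.codes))


def cluster(records, k):
    """Pick k centers by farthest-first traversal, then assign each record to its nearest center."""
    if not 1 <= k <= len(records):
        raise ValueError("k must be between 1 and the number of records")
    centers = [0]
    while len(centers) < k:
        farthest = max(
            range(len(records)),
            key=lambda i: min(record_distance(records[i], records[c]) for c in centers),
        )
        centers.append(farthest)
    assignment = [
        min(centers, key=lambda c: record_distance(r, records[c])) for r in records
    ]
    return centers, assignment
```

Calling `cluster(records, 2)` on a small list of `Record` objects returns the indices of the chosen centers and, for each record, the index of the center it was assigned to. Scaling both components to [0, 1] before combining them keeps either one from dominating purely because of its range, which is one reasonable design choice among several.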