
Identifying and characterizing highly similar notes in big clinical note datasets.

Affiliations

UCSD Health Department of Biomedical Informatics, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA 92093, USA; Department of Anesthesiology, University of California, San Diego, 200 West Arbor Dr, San Diego, CA 92103, USA.

UCSD Health Department of Biomedical Informatics, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA 92093, USA.

Publication information

J Biomed Inform. 2018 Jun;82:63-69. doi: 10.1016/j.jbi.2018.04.009. Epub 2018 Apr 19.

Abstract

BACKGROUND

Big clinical note datasets found in electronic health records (EHR) present substantial opportunities to train accurate statistical models that identify patterns in patient diagnosis and outcomes. However, near-to-exact duplication in note texts is a common issue in many clinical note datasets. We aimed to use a scalable algorithm to de-duplicate notes and further characterize the sources of duplication.

METHODS

We use an approximation algorithm, consisting of three phases, to minimize pairwise comparisons: (1) MinHashing with Locality Sensitive Hashing; (2) a clustering method using tree-structured disjoint sets; and (3) classification of near-duplicates (exact copies, common machine-output notes, or similar notes) via pairwise comparison of the notes in each cluster. We use the Jaccard Similarity (JS) to measure the similarity between two documents. We analyzed two big clinical note datasets: our institutional dataset and MIMIC-III.
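The three-phase pipeline above can be sketched in a few dozen lines. The following is a minimal, self-contained illustration, not the authors' implementation: the shingle size, number of permutations, and banding parameters are illustrative assumptions, and a production system would use stable hash functions and stream the data rather than hold it in memory.

```python
import random
import re
from collections import defaultdict

def shingles(text, k=5):
    """Word-level k-shingles of a note (k=5 is an illustrative choice)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

class MinHasher:
    """MinHash signatures via random affine hash functions mod a Mersenne prime."""
    def __init__(self, num_perm=128, seed=0):
        rng = random.Random(seed)
        self.prime = (1 << 61) - 1
        self.params = [(rng.randrange(1, self.prime), rng.randrange(self.prime))
                       for _ in range(num_perm)]

    def signature(self, shingle_set):
        return [min((a * hash(s) + b) % self.prime for s in shingle_set)
                for a, b in self.params]

class DisjointSet:
    """Tree-structured disjoint sets (union-find) with path halving."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def deduplicate(notes, num_perm=128, bands=32, threshold=0.7):
    """Three-phase near-duplicate detection sketch; returns verified clusters."""
    hasher = MinHasher(num_perm)
    sets = [shingles(n) for n in notes]
    sigs = [hasher.signature(s) for s in sets]
    rows = num_perm // bands
    ds = DisjointSet(len(notes))
    # Phase 1: LSH banding -- notes sharing any signature band are candidates.
    for b in range(bands):
        buckets = defaultdict(list)
        for i, sig in enumerate(sigs):
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(i)
        # Phase 2: merge candidate pairs into clusters via union-find.
        for ids in buckets.values():
            for i in ids[1:]:
                ds.union(ids[0], i)
    # Phase 3: verify candidates in each cluster with exact Jaccard similarity.
    clusters = defaultdict(list)
    for i in range(len(notes)):
        clusters[ds.find(i)].append(i)
    verified = []
    for ids in clusters.values():
        kept = [i for i in ids[1:] if jaccard(sets[ids[0]], sets[i]) >= threshold]
        if kept:
            verified.append([ids[0]] + kept)
    return verified
```

The banding scheme trades recall for speed: with b bands of r rows each, a pair with true similarity s becomes a candidate with probability 1 − (1 − s^r)^b, so b and r are tuned so this curve rises steeply near the chosen JS threshold.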

RESULTS

There were 1,528,940 notes analyzed from our institution. The de-duplication algorithm completed in 36.3 h. When the JS threshold was set at 0.7, the total number of clusters was 82,371 (total notes = 304,418). Across all JS thresholds, no cluster contained a pair of notes that was incorrectly clustered. When the JS threshold was set at 0.9 or 1.0, the de-duplication algorithm captured 100% of the randomly sampled validation pairs whose JS was at least as high as the set threshold. Similar performance was observed on the MIMIC-III dataset.

CONCLUSIONS

We showed that both the EHR from our institution and the publicly available MIMIC-III dataset contain a significant number of near-to-exact duplicated notes.

