Identifying and characterizing highly similar notes in big clinical note datasets.

Affiliations

UCSD Health Department of Biomedical Informatics, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA 92093, USA; Department of Anesthesiology, University of California, San Diego, 200 West Arbor Dr, San Diego, CA 92103, USA.

UCSD Health Department of Biomedical Informatics, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA 92093, USA.

Publication information

J Biomed Inform. 2018 Jun;82:63-69. doi: 10.1016/j.jbi.2018.04.009. Epub 2018 Apr 19.

DOI:10.1016/j.jbi.2018.04.009
PMID:29679685
Abstract

BACKGROUND

Big clinical note datasets found in electronic health records (EHR) present substantial opportunities to train accurate statistical models that identify patterns in patient diagnosis and outcomes. However, near-to-exact duplication in note texts is a common issue in many clinical note datasets. We aimed to use a scalable algorithm to de-duplicate notes and further characterize the sources of duplication.

METHODS

We use a three-phase approximation algorithm that minimizes the number of pairwise comparisons: (1) Minhashing with Locality Sensitive Hashing; (2) a clustering method using tree-structured disjoint sets; and (3) classification of near-duplicates (exact copies, common machine output notes, or similar notes) via pairwise comparison of notes in each cluster. We use the Jaccard Similarity (JS) to measure similarity between two documents. We analyzed two big clinical note datasets: our institutional dataset and MIMIC-III.
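The phases described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the word-level shingling, the seeded BLAKE2b hash family, the parameter values (64 hashes, 16 bands) and all function names are assumptions chosen for illustration. The sketch also simplifies the pipeline by verifying each candidate pair with an exact Jaccard computation before merging, rather than classifying pairs inside each cluster afterwards.

```python
import hashlib
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a note (assumed tokenization)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, token):
    # Seeded 64-bit hash; stands in for a family of random hash functions.
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_hashes=64):
    """One minimum per hash function; equal entries estimate the JS."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


class DisjointSet:
    """Tree-structured disjoint sets (union-find) with path halving."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx


def cluster_notes(notes, threshold=0.7, num_hashes=64, bands=16):
    """Group notes whose Jaccard similarity reaches the threshold."""
    rows = num_hashes // bands
    sets = [shingles(n) for n in notes]
    sigs = [minhash_signature(s, num_hashes) for s in sets]

    # LSH banding: notes sharing any full band become candidate pairs.
    buckets = {}
    for idx, sig in enumerate(sigs):
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(idx)

    # Verify candidates with exact JS, then merge them into clusters.
    ds = DisjointSet(len(notes))
    for members in buckets.values():
        for i, j in combinations(members, 2):
            if jaccard(sets[i], sets[j]) >= threshold:
                ds.union(i, j)

    clusters = {}
    for idx in range(len(notes)):
        clusters.setdefault(ds.find(idx), []).append(idx)
    return [c for c in clusters.values() if len(c) > 1]


notes = [
    "patient denies chest pain and shortness of breath today",
    "patient denies chest pain and shortness of breath today",
    "follow up visit after knee surgery with good wound healing",
]
print(cluster_notes(notes))  # → [[0, 1]]
```

Only notes that share at least one LSH bucket are ever compared, which is how this kind of approach avoids comparing all pairs. The number of bands and rows controls the trade-off between missed near-duplicates and wasted comparisons.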

RESULTS

There were 1,528,940 notes analyzed from our institution. The de-duplication algorithm completed in 36.3 h. When the JS threshold was set at 0.7, the total number of clusters was 82,371 (total notes = 304,418). Across all JS thresholds, no clusters contained pairs of notes that were incorrectly clustered. When the JS threshold was set at 0.9 or 1.0, the de-duplication algorithm captured 100% of the random pairs in the validation set whose JS was at least as high as the set threshold. Similar performance was noted when analyzing the MIMIC-III dataset.

CONCLUSIONS

We showed that both the EHR notes from our institution and the publicly available MIMIC-III dataset contained a significant number of near-to-exact duplicated notes.

Similar articles

1
Identifying and characterizing highly similar notes in big clinical note datasets.
J Biomed Inform. 2018 Jun;82:63-69. doi: 10.1016/j.jbi.2018.04.009. Epub 2018 Apr 19.
2
Customization scenarios for de-identification of clinical notes.
BMC Med Inform Decis Mak. 2020 Jan 30;20(1):14. doi: 10.1186/s12911-020-1026-2.
3
Comparison of 2 Natural Language Processing Methods for Identification of Bleeding Among Critically Ill Patients.
JAMA Netw Open. 2018 Oct 5;1(6):e183451. doi: 10.1001/jamanetworkopen.2018.3451.
4
Automated feature selection of predictors in electronic medical records data.
Biometrics. 2019 Mar;75(1):268-277. doi: 10.1111/biom.12987. Epub 2019 Apr 2.
5
Building a tobacco user registry by extracting multiple smoking behaviors from clinical notes.
BMC Med Inform Decis Mak. 2019 Jul 25;19(1):141. doi: 10.1186/s12911-019-0863-3.
6
Development of a generalizable natural language processing pipeline to extract physician-reported pain from clinical reports: Generated using publicly-available datasets and tested on institutional clinical reports for cancer patients with bone metastases.
J Biomed Inform. 2021 Aug;120:103864. doi: 10.1016/j.jbi.2021.103864. Epub 2021 Jul 12.
7
Artificial Intelligence Learning Semantics via External Resources for Classifying Diagnosis Codes in Discharge Notes.
J Med Internet Res. 2017 Nov 6;19(11):e380. doi: 10.2196/jmir.8344.
8
Detecting clinically relevant new information in clinical notes across specialties and settings.
BMC Med Inform Decis Mak. 2017 Jul 5;17(Suppl 2):68. doi: 10.1186/s12911-017-0464-y.
9
Prevalence and Sources of Duplicate Information in the Electronic Medical Record.
JAMA Netw Open. 2022 Sep 1;5(9):e2233348. doi: 10.1001/jamanetworkopen.2022.33348.
10
Leveraging existing corpora for de-identification of psychiatric notes using domain adaptation.
AMIA Annu Symp Proc. 2018 Apr 16;2017:1070-1079. eCollection 2017.

Cited by

1
Improving Clinical Documentation with Artificial Intelligence: A Systematic Review.
Perspect Health Inf Manag. 2024 Jun 1;21(2):1d. eCollection 2024 Summer-Fall.
2
An efficient learning based approach for automatic record deduplication with benchmark datasets.
Sci Rep. 2024 Jul 15;14(1):16254. doi: 10.1038/s41598-024-63242-1.
3
Artificial intelligence: revolutionizing cardiology with large language models.
Eur Heart J. 2024 Feb 1;45(5):332-345. doi: 10.1093/eurheartj/ehad838.
4
Integrating Structured and Unstructured EHR Data for Predicting Mortality by Machine Learning and Latent Dirichlet Allocation Method.
Int J Environ Res Public Health. 2023 Feb 28;20(5):4340. doi: 10.3390/ijerph20054340.
5
Development of an Open-Source Annotated Glaucoma Medication Dataset From Clinical Notes in the Electronic Health Record.
Transl Vis Sci Technol. 2022 Nov 1;11(11):20. doi: 10.1167/tvst.11.11.20.
6
RadBERT: Adapting Transformer-based Language Models to Radiology.
Radiol Artif Intell. 2022 Jun 15;4(4):e210258. doi: 10.1148/ryai.210258. eCollection 2022 Jul.
7
Predicting the Mortality of ICU Patients by Topic Model with Machine-Learning Techniques.
Healthcare (Basel). 2022 Jun 11;10(6):1087. doi: 10.3390/healthcare10061087.
8
Impact of Different Approaches to Preparing Notes for Analysis With Natural Language Processing on the Performance of Prediction Models in Intensive Care.
Crit Care Explor. 2021 Jun 11;3(6):e0450. doi: 10.1097/CCE.0000000000000450. eCollection 2021 Jun.
9
CAS: corpus of clinical cases in French.
J Biomed Semantics. 2020 Aug 6;11(1):7. doi: 10.1186/s13326-020-00225-x.
10
The Postencounter Form System: Viewpoint on Efficient Data Collection Within Electronic Health Records.
JMIR Form Res. 2020 Apr 6;4(4):e17429. doi: 10.2196/17429.