Redundancy-aware topic modeling for patient record notes.

Author information

Cohen Raphael, Aviram Iddo, Elhadad Michael, Elhadad Noémie

Affiliations

Department of Computer Science, Ben Gurion University, Beer Sheva, Israel.

Department of Biomedical Informatics, Columbia University, New York, New York, United States of America.

Publication information

PLoS One. 2014 Feb 13;9(2):e87555. doi: 10.1371/journal.pone.0087555. eCollection 2014.

DOI:10.1371/journal.pone.0087555

PMID:24551060

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3923754/

Abstract

The clinical notes in a given patient record contain much redundancy, in large part because clinicians habitually copy text from previous notes in the record and paste it into a new note. Previous work has shown that this redundancy degrades the quality of text mining, and of topic modeling in particular. In this paper we describe a novel variant of Latent Dirichlet Allocation (LDA) topic modeling, Red-LDA, which takes into account the inherent redundancy of patient records when modeling the content of clinical notes. To assess the value of Red-LDA, we compare three baselines with our redundancy-aware topic modeling method. Given a large collection of patient records, we (i) apply vanilla LDA to all documents in all input records; (ii) identify and remove all redundancy by choosing a single representative document for each record as input to LDA; (iii) identify and remove all redundant paragraphs in each record, leaving partial, non-redundant documents as input to LDA; and (iv) apply Red-LDA to all documents in all input records. Both the quantitative evaluation (log-likelihood on held-out data and topic coherence of the produced topics) and the qualitative assessment of the topics by physicians show that Red-LDA produces models superior to those of all three baseline strategies. This research contributes to the emerging field of understanding the characteristics of the electronic health record and how to account for them in the framework of data mining. The code for the two redundancy-elimination baselines and Red-LDA is publicly available to the community.
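The two redundancy-elimination baselines, (ii) and (iii), can be illustrated with a minimal Python sketch. The abstract does not say how redundancy is detected or how the representative document is chosen, so everything here is an assumption made for illustration: redundancy is approximated by exact paragraph match after whitespace and case normalization, the representative note is simply the longest one, and the function names (`representative_note`, `drop_redundant_paragraphs`) are hypothetical rather than taken from the authors' released code.

```python
import re
from typing import List


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace so trivially reformatted copies match."""
    return re.sub(r"\s+", " ", paragraph.strip().lower())


def split_paragraphs(note: str) -> List[str]:
    """Split a note on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", note) if p.strip()]


def representative_note(notes: List[str]) -> str:
    """Stand-in for baseline (ii): keep a single note per record.

    Assumption: the longest note is used as the representative.
    """
    return max(notes, key=len)


def drop_redundant_paragraphs(notes: List[str]) -> List[str]:
    """Stand-in for baseline (iii): keep only the first occurrence of each
    paragraph within one patient record (exact match after normalization)."""
    seen = set()
    reduced = []
    for note in notes:
        kept = []
        for para in split_paragraphs(note):
            key = normalize(para)
            if key not in seen:
                seen.add(key)
                kept.append(para)
        if kept:  # a note made entirely of copied text disappears
            reduced.append("\n\n".join(kept))
    return reduced


# One toy patient record: the second note copies the first and adds a plan.
record = [
    "Chief complaint: chest pain.\n\nHistory: hypertension, diabetes.",
    "Chief complaint: chest pain.\n\nHistory: hypertension, diabetes."
    "\n\nPlan: start aspirin.",
]

single_doc = representative_note(record)
partial_docs = drop_redundant_paragraphs(record)
print(single_doc)
print(partial_docs)
```

Under these assumptions, baseline (ii) feeds one whole note per record to LDA, while baseline (iii) feeds every note but with copied paragraphs stripped, which is why the abstract describes its inputs as partial, non-redundant documents.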

Figure (pone.0087555.g001): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f87/3923754/1837d39ff4f1/pone.0087555.g001.jpg
