Hersh WR, Campbell EM, Malveau SE
Division of Medical Informatics and Outcomes Research, Oregon Health Sciences University, USA.
Proc AMIA Annu Fall Symp. 1997:580-4.
OBJECTIVE: To identify the lexical content of a large corpus of ordinary medical records and thereby assess the feasibility of large-scale natural language processing.
METHODS: A corpus of 560 megabytes of medical record text from an academic medical center was broken into individual words, which were compared with the words in six medical vocabularies, a common word list, and a database of patient names. Unrecognized words were assessed with algorithmic and contextual approaches for recognizing additional words; the remainder were analyzed for spelling correctness.
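As an illustration only (the paper does not publish code), the comparison step might look like the following Python sketch. The file names and the letters-only tokenization are assumptions, with each word source read as a plain word-per-line list.

import re

def load_wordlist(path):
    # One recognized word per line, case-folded for matching.
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

# Union of the six medical vocabularies, the common word list, and the
# patient-name database (file names are hypothetical placeholders).
recognized = set()
for source in ("medical_vocabularies.txt", "common_words.txt", "patient_names.txt"):
    recognized |= load_wordlist(source)

# Stream the corpus line by line rather than loading 560 MB at once,
# keeping only distinct alphabetic tokens.
distinct = set()
with open("medical_records.txt", encoding="utf-8") as f:
    for line in f:
        distinct.update(re.findall(r"[a-z]+", line.lower()))

matched = distinct & recognized
print(f"{len(matched) / len(distinct):.1%} of {len(distinct)} distinct words recognized")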
RESULTS: About 60% of the words occurred in the medical vocabularies, common word list, or names database. Of the remainder, one-third were recognizable by other means. Of the remaining unrecognizable words, over three-fourths were correctly spelled real words and the rest were misspellings. (Taking the reported fractions at face value, roughly 13% of all words were recognizable by other means, about 20% were correctly spelled but unrecognized, and about 7% were misspellings.)
CONCLUSIONS: Large-scale generalized natural language processing methods for the medical record will require expansion of existing vocabularies, spelling error correction, and other algorithmic approaches for mapping words onto those in clinical vocabularies.
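For illustration, one such algorithmic approach, approximate string matching of unrecognized words against a clinical vocabulary, can be sketched with Python's standard-library difflib; the toy vocabulary and the similarity cutoff below are assumptions, not the authors' method.

import difflib

# Toy clinical vocabulary; a real system would use the full vocabularies
# described in the methods.
clinical_vocabulary = ["hypertension", "hyperlipidemia", "pneumonia",
                       "myocardial", "infarction"]

def suggest(word, vocabulary, cutoff=0.8):
    # Return up to three vocabulary entries whose similarity to the word
    # exceeds the cutoff; an empty list leaves the word unmapped.
    return difflib.get_close_matches(word.lower(), vocabulary, n=3, cutoff=cutoff)

print(suggest("hypertenion", clinical_vocabulary))  # ['hypertension']
print(suggest("pnuemonia", clinical_vocabulary))    # ['pneumonia']

A higher cutoff favors precision over recall; a production system would also need context to avoid "correcting" legitimately novel words, which the results show are the majority of unrecognized words.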