Hersh WR, Campbell EM, Malveau SE
Division of Medical Informatics and Outcomes Research, Oregon Health Sciences University, USA.
Proc AMIA Annu Fall Symp. 1997:580-4.
OBJECTIVE: To identify the lexical content of a large corpus of ordinary medical records and thereby assess the feasibility of large-scale natural language processing.
METHODS: A corpus of 560 megabytes of medical record text from an academic medical center was broken into individual words, which were compared with the words in six medical vocabularies, a common word list, and a database of patient names. Unrecognized words were assessed with algorithmic and contextual approaches for recognizing additional words; the remainder were analyzed for spelling correctness.
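As an illustration only (the paper does not publish code), the comparison step might look like the following Python sketch. The file names and the letters-only tokenization are assumptions, with each word source read as a plain word-per-line list.

import re

def load_wordlist(path):
    # One recognized word per line, case-folded for matching.
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

# Union of the six medical vocabularies, the common word list, and the
# patient-name database (file names are hypothetical placeholders).
recognized = set()
for source in ("medical_vocabularies.txt", "common_words.txt", "patient_names.txt"):
    recognized |= load_wordlist(source)

# Stream the corpus line by line rather than loading 560 MB at once,
# keeping only distinct alphabetic tokens.
distinct = set()
with open("medical_records.txt", encoding="utf-8") as f:
    for line in f:
        distinct.update(re.findall(r"[a-z]+", line.lower()))

matched = distinct & recognized
print(f"{len(matched) / len(distinct):.1%} of {len(distinct)} distinct words recognized")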
RESULTS: About 60% of the words occurred in the medical vocabularies, common word list, or names database. Of the remainder, one-third were recognizable by other means. Of the remaining unrecognizable words, over three-fourths were correctly spelled real words and the rest were misspellings. (Taking the reported fractions at face value, roughly 13% of all words were recognizable by other means, about 20% were correctly spelled but unrecognized, and about 7% were misspellings.)
CONCLUSIONS: Large-scale generalized natural language processing methods for the medical record will require expansion of existing vocabularies, spelling error correction, and other algorithmic approaches for mapping words onto those in clinical vocabularies.
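For illustration, one such algorithmic approach, approximate string matching of unrecognized words against a clinical vocabulary, can be sketched with Python's standard-library difflib; the toy vocabulary and the similarity cutoff below are assumptions, not the authors' method.

import difflib

# Toy clinical vocabulary; a real system would use the full vocabularies
# described in the methods.
clinical_vocabulary = ["hypertension", "hyperlipidemia", "pneumonia",
                       "myocardial", "infarction"]

def suggest(word, vocabulary, cutoff=0.8):
    # Return up to three vocabulary entries whose similarity to the word
    # exceeds the cutoff; an empty list leaves the word unmapped.
    return difflib.get_close_matches(word.lower(), vocabulary, n=3, cutoff=cutoff)

print(suggest("hypertenion", clinical_vocabulary))  # ['hypertension']
print(suggest("pnuemonia", clinical_vocabulary))    # ['pneumonia']

A higher cutoff favors precision over recall; a production system would also need context to avoid "correcting" legitimately novel words, which the results show are the majority of unrecognized words.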