Tolentino Herman D, Matters Michael D, Walop Wikke, Law Barbara, Tong Wesley, Liu Fang, Fontelo Paul, Kohl Katrin, Payne Daniel C
Bacterial Vaccine-Preventable Diseases Branch, Epidemiology and Surveillance Division, National Immunization Program, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA.
BMC Med Inform Decis Mak. 2007 Feb 12;7:3. doi: 10.1186/1472-6947-7-3.
The Institute of Medicine has identified patient safety as a key goal for health care in the United States. Detecting vaccine adverse events is an important public health activity that contributes to patient safety. Reports about adverse events following immunization (AEFI) from surveillance systems contain free-text components that can be analyzed using natural language processing. To extract Unified Medical Language System (UMLS) concepts from free text and classify AEFI reports based on concepts they contain, we first needed to clean the text by expanding abbreviations and shortcuts and correcting spelling errors. Our objective in this paper was to create a UMLS-based spelling error correction tool as a first step in the natural language processing (NLP) pipeline for AEFI reports.
We developed spell checking algorithms using open source tools. We used de-identified AEFI surveillance reports to create free-text data sets for analysis. After expansion of abbreviated clinical terms and shortcuts, we performed spelling correction in four steps: (1) error detection, (2) word list generation, (3) word list disambiguation and (4) error correction. We then measured the performance of the resulting spell checker by comparing it to manual correction.
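The four steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy `LEXICON` stands in for a real domain dictionary, the function names are hypothetical, and the standard-library `difflib` similarity measure is an assumed substitute for whatever candidate-generation and ranking algorithms the study actually used.

```python
import difflib
import re

# Hypothetical toy lexicon standing in for a domain dictionary (not real UMLS data).
LEXICON = {
    "patient", "developed", "fever", "and", "swelling",
    "at", "the", "injection", "site", "after", "vaccination",
}

def detect_errors(tokens, lexicon):
    """Step 1 (error detection): flag tokens that are absent from the lexicon."""
    return [t for t in tokens if t.lower() not in lexicon]

def generate_candidates(word, lexicon, n=5, cutoff=0.7):
    """Step 2 (word list generation): collect lexicon words similar to the token."""
    return difflib.get_close_matches(word.lower(), sorted(lexicon), n=n, cutoff=cutoff)

def disambiguate(word, candidates):
    """Step 3 (word list disambiguation): keep the most similar candidate, if any."""
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda c: difflib.SequenceMatcher(None, word.lower(), c).ratio(),
    )

def correct_text(text, lexicon):
    """Step 4 (error correction): replace flagged tokens with the chosen candidate."""
    # Split into alternating alphabetic and non-alphabetic pieces so that
    # joining the pieces reproduces the original spacing and punctuation.
    pieces = re.findall(r"[A-Za-z]+|[^A-Za-z]+", text)
    words = [p for p in pieces if p.isalpha()]
    flagged = set(detect_errors(words, lexicon))
    corrected = []
    for piece in pieces:
        if piece in flagged:
            best = disambiguate(piece, generate_candidates(piece, lexicon))
            corrected.append(best if best is not None else piece)
        else:
            corrected.append(piece)
    return "".join(corrected)

print(correct_text("Patient developed fevr and sweling at the injecton site", LEXICON))
```

Tokens with no sufficiently similar lexicon entry are left unchanged in this sketch, which favors leaving an error in place over introducing a wrong correction.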
We used 12,056 words to train the spell checker and tested its performance on 8,131 words. During testing, the sensitivity, specificity, and positive predictive value (PPV) of the spell checker were 74% (95% CI: 74%-75%), 100% (95% CI: 100%-100%), and 47% (95% CI: 46%-48%), respectively.
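For readers who want to reproduce this kind of evaluation, the sketch below computes the three measures from a confusion matrix. It rests on assumptions not stated in the abstract: a "positive" is taken to be a word the checker flags as misspelled, manual correction is the reference standard, and the Wilson score interval is used for the 95% CI (the abstract does not say which interval method was applied). The counts are made up for illustration and are not the study's data.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (assumed CI method)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def diagnostic_metrics(tp, fp, tn, fn):
    """Return {name: (estimate, (lower, upper))} for the three measures."""
    return {
        # Misspelled words that were flagged, out of all misspelled words.
        "sensitivity": (tp / (tp + fn), wilson_interval(tp, tp + fn)),
        # Correct words left unflagged, out of all correct words.
        "specificity": (tn / (tn + fp), wilson_interval(tn, tn + fp)),
        # Flagged words that were truly misspelled, out of all flagged words.
        "ppv": (tp / (tp + fp), wilson_interval(tp, tp + fp)),
    }

# Made-up counts for illustration only.
metrics = diagnostic_metrics(tp=40, fp=10, tn=940, fn=10)
for name, (estimate, (lower, upper)) in metrics.items():
    print(f"{name}: {estimate:.1%} (95% CI: {lower:.1%}-{upper:.1%})")
```

Because correctly spelled words usually far outnumber misspellings, a checker can show very high specificity and still have a modest PPV: even a small false-positive rate applied to a large pool of correct words can yield a number of false alarms comparable to the number of true detections.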
We created a prototype spell checker that can be used to process AEFI reports. We used the UMLS Specialist Lexicon as the primary, domain-specific source of dictionary terms against which potentially misspelled words in the corpus were compared, and the WordNet lexicon as a secondary source. The sensitivity of the prototype was comparable to that of currently available tools, but its specificity was much higher. The slow processing speed may be improved by trimming the spell checker down to its most useful component algorithms. Other investigators may find the methods we developed useful for cleaning text using lexicons specific to their area of interest.
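The primary and secondary dictionary arrangement can be sketched as a tiered lookup. The two word sets and the `lookup_source` function are hypothetical placeholders; the sketch does not load the actual UMLS Specialist Lexicon or WordNet, and the order of consultation is an assumption based on the "primary" and "secondary" wording.

```python
from typing import Optional

# Hypothetical stand-ins for a domain-specific and a general-purpose lexicon.
PRIMARY_LEXICON = {"erythema", "induration", "pyrexia"}
SECONDARY_LEXICON = {"arm", "red", "swollen", "the"}

def lookup_source(word: str) -> Optional[str]:
    """Report which lexicon recognizes the word, checking the primary one first."""
    w = word.lower()
    if w in PRIMARY_LEXICON:
        return "primary"
    if w in SECONDARY_LEXICON:
        return "secondary"
    # Unknown to both lexicons: a candidate misspelling.
    return None

for token in ["Erythema", "arm", "erythma"]:
    print(token, "->", lookup_source(token))
```

Swapping in a different primary word set is the only change needed to adapt this lookup to another domain.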