Workman T Elizabeth, Shao Yijun, Divita Guy, Zeng-Treitler Qing
The George Washington University, Biomedical Informatics Center, 2600 Virginia Ave, Suite 506, Washington, DC, 20037, USA.
Division of Epidemiology, University of Utah School of Medicine, 295 Chipeta Way, Salt Lake City, UT, 84132, USA.
BMC Res Notes. 2019 Jan 18;12(1):42. doi: 10.1186/s13104-019-4073-y.
Misspellings in clinical free text present challenges to natural language processing. To identify misspellings and their corrections, we developed a prototype spelling analysis method that combines Word2Vec, Levenshtein edit distance constraints, a lexical resource, and corpus term frequencies. We used the prototype to process two corpora extracted from Veterans Health Administration resources: surgical pathology reports, and emergency department progress and visit notes. We evaluated performance by measuring positive predictive value and by performing an error analysis of false positive output using four classifications. We also analyzed the spelling errors in each corpus using common error classifications.
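As a rough illustration of how the four named ingredients could fit together, the following minimal Python sketch filters embedding neighbors by lexicon membership, edit distance, and corpus frequency. It is not the authors' implementation: the function names, the `max_dist=2` threshold, the "correction must be more frequent" rule, and the assumption that `neighbors` comes from a Word2Vec nearest-neighbor query are all hypothetical choices made for this example.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest_correction(word, neighbors, lexicon, freq, max_dist=2):
    """Return a candidate correction for `word`, or None.

    Assumptions (hypothetical, not taken from the paper):
      - `neighbors` are terms close to `word` in an embedding space,
        e.g. the output of a Word2Vec nearest-neighbor query;
      - a word found in `lexicon` is treated as correctly spelled;
      - a correction must be in the lexicon, within `max_dist` edits,
        and more frequent in the corpus than the suspected misspelling.
    """
    if word in lexicon:
        return None
    candidates = [
        n for n in neighbors
        if n in lexicon
        and 0 < levenshtein(word, n) <= max_dist
        and freq.get(n, 0) > freq.get(word, 0)
    ]
    if not candidates:
        return None
    # Prefer the most frequent surviving candidate.
    return max(candidates, key=lambda n: freq.get(n, 0))


# Toy data, invented for illustration only.
lexicon = {"carcinoma", "carcinomas", "adenoma"}
freq = Counter({"carcinoma": 120, "adenoma": 50, "carcinomas": 30, "carcinma": 2})
neighbors = ["carcinoma", "adenoma", "carcinomas"]
print(suggest_correction("carcinma", neighbors, lexicon, freq))
```

Restricting candidates to embedding neighbors is what would let context, rather than string similarity alone, drive the suggestion; the edit distance and frequency checks then prune neighbors that are merely related terms.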
In this small-scale study of 76,786 clinical notes in total, the prototype method achieved positive predictive values of 0.9057 for the surgical pathology reports and 0.8979 for the emergency department progress and visit notes in identifying and correcting misspelled words. False positives varied by corpus. Spelling error types were similar between the two corpora; however, the authors of emergency department progress and visit notes made over four times as many errors. Overall, the results suggest that this method could also perform adequately in identifying misspellings in other clinical document types.
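For readers unfamiliar with the metric, positive predictive value is the share of flagged items that are correct: true positives divided by the sum of true and false positives. A minimal sketch follows; the function name and the counts passed to it are illustrative assumptions, not figures from the study.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP): the fraction of reported corrections that were right."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("PPV is undefined when nothing was flagged")
    return true_positives / total


# Hypothetical counts chosen only to show the arithmetic.
print(positive_predictive_value(9, 1))
```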