Kim Juyong, Weiss Jeremy C, Ravikumar Pradeep
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213.
Heinz College of Information Systems and Public Policy, Carnegie Mellon University, Pittsburgh, PA 15213.
Proc Mach Learn Res. 2022 Apr;174:234-247.
Spelling correction is particularly important in clinical natural language processing because misspellings are abundant in medical records. However, the scarcity of labeled datasets in the clinical domain makes it hard to build a machine learning system for clinical spelling correction. In this work, we present a probabilistic model for correcting misspellings based on a simple conditional independence assumption, which leads to a modular decomposition into a language model and a corruption model. With a deep character-level language model trained on a large clinical corpus and a simple edit-based corruption model, we can build a spelling correction model with little or no real data. Experimental results show that our model significantly outperforms baselines on two healthcare spelling correction datasets.
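The decomposition into a language model and a corruption model can be illustrated with a minimal sketch. This is not the authors' implementation: the deep character-level language model is replaced here by a hypothetical toy table of word log-probabilities (`toy_log_prior`), and the edit-based corruption model is assumed to charge a fixed log-penalty per Levenshtein edit (`log_edit_penalty`, an arbitrary illustrative value). All function names are assumptions made for this example. A candidate is scored by adding the language-model term and the corruption term, and the highest-scoring candidate is returned.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def log_corruption(typo: str, word: str,
                   log_edit_penalty: float = math.log(0.01)) -> float:
    """Assumed corruption model: unnormalized log p(typo | word),
    with a constant penalty for every edit between the two strings."""
    return edit_distance(typo, word) * log_edit_penalty


def correct(typo: str, log_prior: dict) -> str:
    """Return the candidate maximizing log p(word) + log p(typo | word)."""
    return max(log_prior,
               key=lambda w: log_prior[w] + log_corruption(typo, w))


# Hypothetical stand-in for a context-conditioned language model.
toy_log_prior = {
    "patient": math.log(0.6),
    "patent": math.log(0.3),
    "parent": math.log(0.1),
}

print(correct("patiant", toy_log_prior))
```

Because the two terms are separate, the prior can be trained on unlabeled clinical text while the corruption term needs few or no labeled misspellings, which is the modularity the abstract describes.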