Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence.

Authors

Kim Juyong, Weiss Jeremy C, Ravikumar Pradeep

Affiliations

Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213.

Heinz College of Information Systems and Public Policy, Carnegie Mellon University, Pittsburgh, PA 15213.

Publication

Proc Mach Learn Res. 2022 Apr;174:234-247.

Abstract

Spelling correction is a particularly important problem in clinical natural language processing because misspellings are abundant in medical records. However, the scarcity of labeled datasets in a clinical context makes it hard to build a machine learning system for clinical spelling correction. In this work, we present a probabilistic model for correcting misspellings based on a simple conditional independence assumption, which leads to a modular decomposition into a language model and a corruption model. With a deep character-level language model trained on a large clinical corpus, and a simple edit-based corruption model, we can build a spelling correction model with little or no real data. Experimental results show that our model significantly outperforms baselines on two healthcare spelling correction datasets.
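
The decomposition described in the abstract can be illustrated with a toy sketch. One plausible reading, stated here as an assumption and not taken from the abstract, is that the observed typo is independent of the context given the intended word, so that P(word | typo, context) is proportional to P(word | context) × P(typo | word). Everything in the snippet is hypothetical: the paper uses a deep character-level language model, whereas `TOY_LM` is a hand-written probability table, and `log_corruption` is an unnormalized stand-in that simply charges a fixed `penalty` per Levenshtein edit.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def log_corruption(typo: str, candidate: str, penalty: float = 3.0) -> float:
    """Unnormalized log P(typo | candidate); an assumed stand-in, not the paper's model."""
    return -penalty * edit_distance(typo, candidate)


# Hypothetical language-model table: P(word | context). A real system would
# query a trained language model here instead.
TOY_LM = {
    ("patient denies chest", "pain"): 0.6,
    ("patient denies chest", "pan"): 0.001,
    ("patient denies chest", "paid"): 0.001,
}


def toy_log_lm(candidate: str, context: str) -> float:
    """Log-probability of the candidate in context, with a small floor for unseen pairs."""
    return math.log(TOY_LM.get((context, candidate), 1e-6))


def correct(typo: str, context: str, candidates: list[str]) -> str:
    """Pick the candidate maximizing log P(c | context) + log P(typo | c)."""
    return max(
        candidates,
        key=lambda c: toy_log_lm(c, context) + log_corruption(typo, c),
    )


best = correct("pian", "patient denies chest", ["pain", "pan", "paid"])
print(best)
```

In this toy setup "pan" is closer to the typo in edit distance than "pain", yet the language-model term outweighs the extra edit, which is the kind of context sensitivity the two-part decomposition is meant to provide. Because the two terms are separate functions, either could be swapped out without touching the other.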
