Finley Gregory P, Pakhomov Serguei V S, McEwan Reed, Melton Genevieve B
Institute for Health Informatics; Department of Surgery.
Institute for Health Informatics; College of Pharmacy, University of Minnesota, Minneapolis, MN.
AMIA Annu Symp Proc. 2017 Feb 10;2016:560-569. eCollection 2016.
Abbreviation disambiguation in clinical texts is a problem handled well by fully supervised machine learning methods. Acquiring training data, however, is expensive and would be impractical for large numbers of abbreviations in specialized corpora. An alternative is a semi-supervised approach, in which training data are automatically generated by replacing long forms in natural text with their corresponding abbreviations. Most prior implementations of this method either focus on very few abbreviations or do not test on real-world data. We present a realistic use case by testing several semi-supervised classification algorithms on a large hand-annotated set of occurrences of 74 ambiguous abbreviations in medical records. Despite notable differences between training and test corpora, classifiers achieve up to 90% accuracy. Our tests demonstrate that semi-supervised abbreviation disambiguation is a viable and extensible option for medical NLP systems.
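The training-data generation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sense inventory `SENSES`, the function name `make_training_examples`, and the example sentences are all hypothetical, and a real system would draw long forms from a curated inventory and a large corpus. The sketch finds a known long form in a sentence, replaces it with its abbreviation, and keeps the long form as the label.

```python
import re

# Hypothetical sense inventory (assumption): abbreviation -> candidate long forms.
SENSES = {
    "RA": ["rheumatoid arthritis", "right atrium"],
}

def make_training_examples(sentences, senses):
    """Build labeled examples by replacing long forms with their abbreviation.

    Returns a list of (abbreviated_sentence, abbreviation, long_form_label).
    """
    examples = []
    for sentence in sentences:
        for abbr, long_forms in senses.items():
            for long_form in long_forms:
                # Match the long form as a whole phrase, ignoring case.
                pattern = re.compile(
                    r"\b" + re.escape(long_form) + r"\b", re.IGNORECASE
                )
                if pattern.search(sentence):
                    # Replace only the first match so each example has one target.
                    replaced = pattern.sub(abbr, sentence, count=1)
                    examples.append((replaced, abbr, long_form))
    return examples

# Illustrative sentences (assumption), not taken from any real corpus.
sentences = [
    "The patient has a history of rheumatoid arthritis.",
    "A thrombus was seen in the right atrium.",
    "No acute distress.",
]
examples = make_training_examples(sentences, SENSES)
for text, abbr, label in examples:
    print(f"{abbr} -> {label}: {text}")
```

Each resulting tuple pairs a sentence containing the abbreviation with the sense it originally expressed, so the surrounding words can serve as features for any standard classifier without manual annotation.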