Oleynik Michel, Kreuzthaler Markus, Schulz Stefan
Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria.
Stud Health Technol Inform. 2017;245:539-543.
Clinical narratives are typically produced under time pressure, which encourages the use of abbreviations and acronyms. Expanding such short forms correctly eases text comprehension and further semantic processing. We propose a completely unsupervised and data-driven algorithm for the resolution of non-lexicalised and potentially ambiguous abbreviations. Based on the lookup of word bigrams and unigrams extracted from a corpus of 30,000 pseudonymised cardiology reports in German, our method achieved an F1 score of 0.91, evaluated on a test set of 200 text excerpts. The results are statistically significantly better than a baseline method (p < 0.001) and show that a simple and domain-independent strategy may be enough to resolve abbreviations when a large corpus of similar texts is available. Further work is needed to combine this strategy with sentence and abbreviation detection modules, to adapt it to acronym resolution, and to evaluate it with different datasets.
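The abstract gives no implementation details, so the following Python sketch is only one plausible reading of a bigram and unigram lookup, not the authors' algorithm. It assumes that abbreviations are already detected (here, any token ending in a period), that candidate expansions are corpus words starting with the abbreviation stem, and that the left neighbour is the only context used. The function names `build_counts` and `expand` and the toy corpus are hypothetical.

```python
from collections import Counter


def build_counts(sentences):
    """Count unigrams and bigrams over tokenised sentences.

    Assumption: tokens ending in a period are short forms and are
    excluded, so only full words can become expansion candidates.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        words = [t.lower() for t in tokens]
        unigrams.update(w for w in words if not w.endswith("."))
        bigrams.update(
            (a, b) for a, b in zip(words, words[1:]) if not b.endswith(".")
        )
    return unigrams, bigrams


def expand(abbrev, left_word, unigrams, bigrams):
    """Return the most likely expansion of `abbrev`, or None.

    Candidates are corpus words that start with the abbreviation stem.
    A candidate seen after `left_word` (bigram lookup) wins; otherwise
    the most frequent candidate overall (unigram lookup) is returned.
    """
    stem = abbrev.rstrip(".").lower()
    candidates = [w for w in unigrams if w.startswith(stem) and len(w) > len(stem)]
    if not candidates:
        return None
    left = left_word.lower()
    in_context = [w for w in candidates if bigrams[(left, w)] > 0]
    if in_context:
        return max(in_context, key=lambda w: (bigrams[(left, w)], unigrams[w]))
    return max(candidates, key=lambda w: unigrams[w])


# Invented toy corpus standing in for the cardiology reports.
sentences = [
    ["linker", "Ventrikel", "normal"],
    ["linker", "Ventrikel", "dilatiert"],
    ["rechter", "Ventrikel", "normal"],
    ["linker", "Vorhof", "normal"],
    ["rechter", "Vorhof", "normal"],
    ["rechter", "Vorhof", "dilatiert"],
    ["rechter", "Vorhof", "normal"],
]
unigrams, bigrams = build_counts(sentences)

print(expand("V.", "linker", unigrams, bigrams))       # ventrikel (bigram evidence)
print(expand("V.", "rechter", unigrams, bigrams))      # vorhof (bigram evidence)
print(expand("Ventr.", "grosser", unigrams, bigrams))  # ventrikel (unigram fallback)
```

In this sketch the ambiguous short form "V." resolves differently depending on its left neighbour, which is the kind of disambiguation a bigram lookup makes possible. The unigram fallback covers contexts never seen in the corpus. Expansions are returned in lower case because the counts are case-folded.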