Shinohara E Y, Aramaki E, Imai T, Miura Y, Tonoike M, Ohkuma T, Masuichi H, Ohe K
Department of Planning, Information and Management, The University of Tokyo Hospital, Tokyo, Japan.
Methods Inf Med. 2013;52(1):51-61. doi: 10.3414/ME12-01-0040. Epub 2012 Dec 7.
One barrier to the effective use of computerized health-care text is the ambiguity of abbreviations. To date, abbreviation disambiguation has been treated as a classification task based on the surrounding words. Applying this framework to languages that have no word boundaries requires pre-processing to segment each sentence into a sequence of words. Although this segmentation step is often a source of errors, it is unknown whether word information is actually required for abbreviation expansion.
As a preliminary study, we examined and compared abbreviation expansion methods with and without word information.
We implemented two abbreviation expansion methods: 1) a morpheme-based method that relied on word information and therefore required pre-processing, and 2) a character-based method that relied on simple character information. We compared the expansion accuracies of the two methods on eight medical abbreviations. The experimental data, a pseudo-annotated corpus, were built automatically from text on the Internet.
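To make the idea of "simple character information" concrete, the following is a minimal sketch of character-level context features, assuming a fixed window of characters on each side of the abbreviation. The function name `char_context_features`, the `window` parameter, and the feature naming scheme are illustrative assumptions; the abstract does not specify the actual feature design or classifier.

```python
def char_context_features(text, abbr, window=3):
    """Collect position-tagged characters around the first occurrence of abbr.

    Hypothetical feature design for illustration only. No word segmentation
    is needed: the context is read directly as raw characters.
    """
    pos = text.find(abbr)
    if pos < 0:
        return {}  # abbreviation not present in this text
    end = pos + len(abbr)
    left = text[max(0, pos - window):pos]
    right = text[end:end + window]
    feats = {}
    # L1 is the character immediately to the left of the abbreviation.
    for i, ch in enumerate(reversed(left), start=1):
        feats[f"L{i}={ch}"] = 1
    # R1 is the character immediately to the right of the abbreviation.
    for i, ch in enumerate(right, start=1):
        feats[f"R{i}={ch}"] = 1
    return feats


# Usage: features for an abbreviation embedded in unsegmented Japanese text.
features = char_context_features("患者はCKD治療中", "CKD", window=2)
```

A morpheme-based variant would replace the raw characters with tokens from a morphological analyzer, which is the pre-processing step that the character-based approach avoids.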
Accuracies for the character-based method ranged from 0.890 to 0.942, while those for the morpheme-based method ranged from 0.796 to 0.932. The character-based method significantly outperformed the morpheme-based method for three of the eight abbreviations (p < 0.05). For the remaining five abbreviations, no significant differences were found between the two methods.
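The abstract does not state which statistical test produced these p-values. As one illustration of how two methods evaluated on the same instances can be compared, the sketch below computes a two-sided exact McNemar test from the counts of instances on which the methods disagree. The function name `mcnemar_exact` and the example counts are hypothetical and are not taken from the study.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value for paired classification results.

    b: instances that only method A classified correctly
    c: instances that only method B classified correctly
    Instances where both methods agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # the methods never disagree
    k = min(b, c)
    # Under the null hypothesis, disagreements follow Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Usage with made-up disagreement counts (not from the paper).
p_value = mcnemar_exact(12, 3)
significant = p_value < 0.05
```

Because the test uses only the discordant instances, two methods with similar overall accuracy can still differ significantly if their errors fall on different instances.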
Character information may be a good alternative, in terms of simplicity, to morphological information for expanding English medical abbreviations that appear in Japanese texts on the Internet.