Moon Sungrim, McInnes Bridget, Melton Genevieve B
School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA.
Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA.
Healthc Inform Res. 2015 Jan;21(1):35-42. doi: 10.4258/hir.2015.21.1.35. Epub 2015 Jan 31.
Although acronyms and abbreviations are widely used in clinical text every day, relatively little research has focused on word sense disambiguation (WSD) of acronyms and abbreviations in the healthcare domain. Because clinical notes have distinctive characteristics, it is unclear whether techniques that are effective for acronym and abbreviation WSD in biomedical literature are sufficient for clinical text.
The authors discuss feature selection for automated techniques and challenges with WSD of acronyms and abbreviations in the clinical domain.
There are significant challenges associated with the informal nature of clinical text, such as typographical errors and incomplete sentences; with insufficient clinical resources, such as clinical sense inventories; and with privacy and security obstacles to conducting research on clinical text. Although we anticipated that sophisticated features, such as those derived from biomedical terminologies, semantic types, part-of-speech tags, and language modeling, would be needed for feature selection with automated machine learning approaches, we found instead that simple techniques, such as bag-of-words, were quite effective in many cases. Factors such as majority sense prevalence and the degree of separateness between sense meanings were also important considerations.
The first lesson is that a comprehensive understanding of the unique characteristics of clinical text is important for automatic acronym and abbreviation WSD. The second lesson is that simple approaches can be an effective starting point for these tasks. Finally, as with other WSD tasks, an understanding of baseline majority sense rates and of the separateness between senses is important. Further studies and practical solutions are needed to better address these issues.
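To make the bag-of-words and majority sense ideas concrete, the following is a minimal sketch in Python and not the authors' implementation. It assumes a hypothetical abbreviation ("RA"), a handful of invented toy notes, a five-token context window, and a multinomial naive Bayes classifier with add-one smoothing; none of these choices comes from the study itself.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text, target, window=5):
    """Count the words within `window` tokens on each side of `target`."""
    tokens = tokenize(text)
    bag = Counter()
    for i, token in enumerate(tokens):
        if token == target.lower():
            bag.update(tokens[max(0, i - window):i])
            bag.update(tokens[i + 1:i + 1 + window])
    return bag


def majority_sense(labels):
    """Return the most frequent sense and its share of the labels."""
    sense, count = Counter(labels).most_common(1)[0]
    return sense, count / len(labels)


class NaiveBayesWSD:
    """Multinomial naive Bayes over bag-of-words counts (add-one smoothing)."""

    def fit(self, bags, labels):
        self.sense_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for bag, label in zip(bags, labels):
            self.word_counts[label].update(bag)
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, bag):
        total = sum(self.sense_counts.values())
        scores = {}
        for sense, n in self.sense_counts.items():
            denom = sum(self.word_counts[sense].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior of the sense
            for word, count in bag.items():
                if word in self.vocab:  # ignore words never seen in training
                    seen = self.word_counts[sense][word]
                    score += count * math.log((seen + 1) / denom)
            scores[sense] = score
        return max(scores, key=scores.get)


# Invented toy notes for illustration only; not data from the study.
TOY_NOTES = [
    ("Pt with RA on methotrexate, joint pain improved", "rheumatoid arthritis"),
    ("RA flare with joint swelling, continue methotrexate", "rheumatoid arthritis"),
    ("Hx of RA, joint stiffness in morning", "rheumatoid arthritis"),
    ("Echo shows dilated RA and elevated pressure", "right atrium"),
    ("Catheter tip in RA on echo", "right atrium"),
]

bags = [bag_of_words(text, "RA") for text, _ in TOY_NOTES]
labels = [sense for _, sense in TOY_NOTES]
model = NaiveBayesWSD().fit(bags, labels)

print(majority_sense(labels))
print(model.predict(bag_of_words("echo: RA mildly dilated", "RA")))
```

Comparing the classifier's accuracy against the share returned by `majority_sense` is one simple way to check whether context features add anything beyond always guessing the most common sense, which matters most when one sense dominates.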