Moon Sungrim, Pakhomov Serguei, Melton Genevieve B
Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA.
AMIA Annu Symp Proc. 2012;2012:1310-9. Epub 2012 Nov 3.
Acronyms and abbreviations are widespread in electronic clinical texts and often have multiple senses. Automated acronym sense disambiguation, a form of word sense disambiguation (WSD) that assigns the context-appropriate sense to ambiguous clinical acronyms and abbreviations, remains an active problem for medical natural language processing (NLP) systems. In this paper, fifty clinical acronyms and abbreviations with 500 samples each were studied using supervised machine-learning techniques (Support Vector Machines (SVM), Naïve Bayes (NB), and Decision Trees (DT)) to optimize the context window size and orientation and to determine the minimum training sample size needed for optimal performance. Our analysis of window size and orientation showed the best performance with a larger left-sided and smaller right-sided window. To achieve an accuracy of over 90%, the minimum required training sample size was approximately 125 samples for SVM classifiers with inverted cross-validation. These findings support future work in clinical acronym and abbreviation WSD but require validation with other clinical texts.
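To make the notion of window size and orientation concrete, the following is a minimal sketch of how an asymmetric context window around an ambiguous acronym might be turned into bag-of-words features for a classifier. It is not the authors' implementation: the function name `window_features`, the `L:`/`R:` side prefixes, the default window sizes, and the example sentence are all assumptions made for illustration.

```python
from collections import Counter


def window_features(tokens, target_index, left=4, right=2):
    """Return bag-of-words counts for the tokens around tokens[target_index].

    `left` and `right` are the number of tokens taken before and after the
    target. The defaults are illustrative, not values reported in the paper.
    """
    start = max(0, target_index - left)
    left_ctx = tokens[start:target_index]
    right_ctx = tokens[target_index + 1:target_index + 1 + right]
    # Prefix each word with its side so a classifier can tell left-context
    # evidence from right-context evidence (an assumption of this sketch).
    feats = Counter(f"L:{w.lower()}" for w in left_ctx)
    feats.update(f"R:{w.lower()}" for w in right_ctx)
    return feats


# Hypothetical sentence containing an ambiguous acronym.
sentence = "Patient with history of MS presents with worsening gait".split()
features = window_features(sentence, sentence.index("MS"), left=4, right=2)
print(sorted(features))
```

Varying `left` and `right` independently is one way to explore the window orientation the abstract describes; the resulting counts could then be passed to any supervised classifier.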