Liu Hongfang, Teller Virginia, Friedman Carol
Department of Information Systems, University of Maryland at Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA.
J Am Med Inform Assoc. 2004 Jul-Aug;11(4):320-31. doi: 10.1197/jamia.M1533. Epub 2004 Apr 2.
The aim of this study was to investigate how different aspects of supervised word sense disambiguation (WSD; supervised machine learning for disambiguating the sense of a term in context) relate to one another, and to compare supervised WSD in the biomedical domain with that in the general English domain.
The study involved three data sets: a biomedical abbreviation data set, a general biomedical term data set, and a general English data set. The authors implemented three groups of machine-learning algorithms: (1) naïve Bayes (NBL) and traditional decision lists (TDLL), (2) their adaptation of decision lists (ODLL), and (3) their mixed supervised learning (MSL). Six feature representations (various combinations of collocations, bag of words, oriented bag of words, etc.) and five window sizes (2, 4, 6, 8, and 10) were evaluated.
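The feature representations and window sizes named above can be illustrated with a minimal sketch. The abstract does not define the features, so the definitions here are assumptions: a bag of words is the unordered set of neighbors within the window, an oriented bag of words also records whether each neighbor lies to the left or right of the target, and a collocation is an immediately adjacent word together with its offset. The function name `context_features` is hypothetical and is not taken from the study.

```python
from typing import Dict, List, Set


def context_features(tokens: List[str], target: int, window: int) -> Dict[str, Set[str]]:
    """Build three illustrative feature sets for the token at index `target`.

    `window` is the number of tokens considered on each side of the target.
    The feature definitions are assumptions made for this sketch.
    """
    left = tokens[max(0, target - window):target]
    right = tokens[target + 1:target + 1 + window]

    # Bag of words: neighbors within the window, side and order ignored.
    bag = set(left) | set(right)

    # Oriented bag of words: each neighbor is tagged with its side.
    oriented = {f"L:{w}" for w in left} | {f"R:{w}" for w in right}

    # Collocations: the words directly adjacent to the target, with offsets.
    collocations = set()
    if left:
        collocations.add(f"-1:{left[-1]}")
    if right:
        collocations.add(f"+1:{right[0]}")

    return {"bag": bag, "oriented": oriented, "collocations": collocations}


if __name__ == "__main__":
    sentence = "the patient received pca after surgery".split()
    for name, feats in context_features(sentence, target=3, window=2).items():
        print(name, sorted(feats))
```

Increasing `window` enlarges the two bag-style feature sets while leaving the collocations unchanged, which is one way the choice of window size and the choice of feature representation can interact.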
Supervised WSD is suitable only when there are enough sense-tagged instances, with at least a few dozen instances for each sense. Collocations combined with neighboring words are an appropriate choice of context features. For terms with unrelated biomedical senses, a large window, such as the whole paragraph, should be used, while for general English words a moderate window size between 4 and 10 should be used. For abbreviations, the authors' implementation of decision list classifiers outperformed traditional decision list classifiers; for the other two data sets, the opposite held. The authors' mixed supervised learning was stable and generally better than the other classifiers on all three data sets.
The study found that the different aspects of supervised WSD depend on one another. The experimental method presented in the study can be used to select the best supervised WSD classifier for each ambiguous term.
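Selecting a classifier per ambiguous term, as described above, could be sketched as follows. This is not the authors' procedure: it assumes k-fold cross-validated accuracy as the selection criterion, and the learners (`majority_learner`, `naive_bayes_learner`) and helper names (`cv_accuracy`, `select_best`) are hypothetical stand-ins rather than the NBL, TDLL, ODLL, or MSL implementations from the study.

```python
import math
from collections import Counter, defaultdict
from typing import Callable, Dict, Sequence, Set, Tuple

Instance = Tuple[Set[str], str]            # (context features, sense label)
Predictor = Callable[[Set[str]], str]
Learner = Callable[[Sequence[Instance]], Predictor]


def majority_learner(train: Sequence[Instance]) -> Predictor:
    """Baseline: always predict the most frequent sense in the training data."""
    top = Counter(sense for _, sense in train).most_common(1)[0][0]
    return lambda feats: top


def naive_bayes_learner(train: Sequence[Instance]) -> Predictor:
    """Small naive Bayes over set-valued features with add-one smoothing."""
    sense_counts = Counter(sense for _, sense in train)
    feat_counts: Dict[str, Counter] = defaultdict(Counter)
    vocab: Set[str] = set()
    for feats, sense in train:
        feat_counts[sense].update(feats)
        vocab |= feats

    def predict(feats: Set[str]) -> str:
        def score(sense: str) -> float:
            total = sum(feat_counts[sense].values()) + len(vocab)
            log_p = math.log(sense_counts[sense] / len(train))
            for f in feats:
                if f in vocab:  # features unseen in training are ignored
                    log_p += math.log((feat_counts[sense][f] + 1) / total)
            return log_p
        return max(sorted(sense_counts), key=score)

    return predict


def cv_accuracy(learner: Learner, data: Sequence[Instance], k: int = 5) -> float:
    """Accuracy under k-fold cross-validation (fold = instance index mod k)."""
    correct = 0
    for fold in range(k):
        test = data[fold::k]
        train = [x for i, x in enumerate(data) if i % k != fold]
        predict = learner(train)
        correct += sum(predict(feats) == sense for feats, sense in test)
    return correct / len(data)


def select_best(learners: Dict[str, Learner], data: Sequence[Instance], k: int = 5):
    """Return the name of the best-scoring learner and all scores for one term."""
    scores = {name: cv_accuracy(learner, data, k) for name, learner in learners.items()}
    return max(scores, key=scores.get), scores
```

In practice the candidate set would also vary the feature representation and window size, so that each (classifier, features, window) combination is scored separately for every ambiguous term.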