National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA.
BMC Bioinformatics. 2010 Nov 22;11:569. doi: 10.1186/1471-2105-11-569.
Word sense disambiguation (WSD) algorithms attempt to select the proper sense of ambiguous terms in text. Resources like the UMLS provide a reference thesaurus that can be used to annotate the biomedical literature. Statistical learning approaches have produced good results, but the size of the UMLS makes it infeasible to produce training data that covers the whole domain.
We present research on existing knowledge-based WSD approaches, which complement the studies performed on statistical learning. We compare four approaches that rely on the UMLS Metathesaurus as the source of knowledge. The first approach measures the overlap between the context of the ambiguous word and each candidate sense, where each sense is represented by its definitions, synonyms and related terms. The second approach collects training data for each candidate sense automatically: queries built from monosemous synonyms and related terms are used to retrieve MEDLINE citations, and a machine learning method is then trained on the resulting corpus. The third approach is a graph-based method that exploits the structure of the Metathesaurus network of relations to perform unsupervised WSD, ranking nodes in the graph according to their relative structural importance. The last approach uses the semantic types assigned to the concepts in the Metathesaurus. The context of the ambiguous word and the semantic types of the candidate concepts are both mapped to Journal Descriptors, and these mappings are compared to decide among the candidate concepts. We report the accuracy of each method on the WSD test collection available from the NLM.
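As an illustration of the first approach only, the following minimal sketch scores each candidate sense by counting the words it shares with the context of the ambiguous term. It is not the implementation evaluated here: the sense labels, the sense profiles and the `disambiguate` function are hypothetical, the profile text is invented rather than taken from the Metathesaurus, and the plain word-overlap count is an assumed stand-in for whatever overlap measure is actually used.

```python
import re


def tokens(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def disambiguate(context, sense_profiles):
    """Pick the sense whose profile shares the most words with the context.

    sense_profiles maps a sense label to a text profile (here imagined as
    concatenated definitions, synonyms and related terms).
    """
    context_words = tokens(context)
    scores = {
        sense: len(context_words & tokens(profile))
        for sense, profile in sense_profiles.items()
    }
    best_sense = max(scores, key=scores.get)
    return best_sense, scores


# Hypothetical profiles for two senses of "cold" (illustrative text only).
profiles = {
    "cold_temperature": "cold temperature low temperature freezing weather exposure hypothermia",
    "common_cold": "common cold acute viral infection upper respiratory tract nasal congestion cough",
}
context = "The patient presented with a cold, cough and nasal congestion after a viral infection."

best, scores = disambiguate(context, profiles)
print(best, scores)
```

A raw overlap count favours senses with longer profiles, so a practical variant would likely normalise the score or weight terms; that choice is left open in this sketch.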
We found that the last approach achieves better results than the other methods. The graph-based approach, which uses the structure of the Metathesaurus network to estimate the relevance of the Metathesaurus concepts, does not perform well compared to the first two methods. In addition, combining the methods improves performance over the individual approaches. On the other hand, performance is still below that of statistical learning trained on manually produced data and below the maximum frequency sense baseline. Finally, we propose several directions for improving the existing methods and for making the Metathesaurus more effective for WSD.