Jimeno Yepes Antonio
IBM Research Australia, Melbourne, VIC, Australia.
J Biomed Inform. 2017 Sep;73:137-147. doi: 10.1016/j.jbi.2017.08.001. Epub 2017 Aug 7.
Word sense disambiguation helps identify the proper sense of ambiguous words in text. Large terminologies such as the UMLS Metathesaurus introduce ambiguities, so highly effective disambiguation methods are required. Supervised learning is one approach to disambiguation: features extracted from the context of an ambiguous word are used to identify the proper sense of that word. The type of features affects the machine learning methods and therefore disambiguation performance. In this work, we evaluated several types of features derived from the context of the ambiguous word, and we also explored more global features derived from MEDLINE using word embeddings. Results show that word embeddings improve the performance of more traditional features and also allow the use of recurrent neural network classifiers based on Long Short-Term Memory (LSTM) nodes. The combination of unigrams and word embeddings with an SVM sets a new state-of-the-art performance, with a macro accuracy of 95.97 on the MSH WSD data set.
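As a rough illustration of the feature combination described in the abstract, the sketch below builds a unigram count vector from the context of an ambiguous word and concatenates it with an averaged word-embedding vector. Averaging the context embeddings is an assumption made here for simplicity; the abstract does not state how the embeddings are aggregated. The function names, the toy vocabulary, and the two-dimensional embeddings are all hypothetical. The resulting vector is the kind of input that could be passed to an SVM, which is not implemented here. The `macro_accuracy` helper assumes that macro accuracy means the mean of per-term accuracies over the ambiguous terms.

```python
from collections import Counter


def unigram_features(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the context tokens."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]


def mean_embedding(tokens, embeddings, dim):
    """Average the embeddings of the context tokens that have a vector.

    Assumption: embeddings are aggregated by averaging. Tokens without
    a vector are skipped; an all-zero vector is returned if none match.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def combined_features(tokens, vocabulary, embeddings, dim):
    """Concatenate unigram counts with the averaged embedding vector."""
    return unigram_features(tokens, vocabulary) + mean_embedding(tokens, embeddings, dim)


def macro_accuracy(results):
    """Mean of per-term accuracies.

    `results` maps each ambiguous term to a list of (gold, predicted)
    sense pairs. Assumption: every term counts equally, regardless of
    how many instances it has.
    """
    per_term = [
        sum(gold == pred for gold, pred in pairs) / len(pairs)
        for pairs in results.values()
    ]
    return sum(per_term) / len(per_term)


# Toy data, purely illustrative (not taken from MEDLINE or MSH WSD).
toy_vocabulary = ["cold", "virus", "temperature"]
toy_embeddings = {
    "virus": [1.0, 0.0],
    "temperature": [0.0, 1.0],
    "winter": [0.0, 1.0],
}
context = ["the", "cold", "virus", "spreads", "in", "winter"]
features = combined_features(context, toy_vocabulary, toy_embeddings, dim=2)
```

In this toy case the first three entries of `features` are the unigram counts and the last two are the averaged embedding, so a classifier sees both the local lexical evidence and the more global distributional signal in a single vector.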