Preiss Judita, Stevenson Mark
Advanced Computing Research Center, Department of Computer Science, The University of Sheffield, 211 Portobello, Sheffield, S1 4DP, UK.
BMC Med Inform Decis Mak. 2016 Jul 18;16 Suppl 1(Suppl 1):57. doi: 10.1186/s12911-016-0296-1.
The volume of research published in the biomedical domain has increasingly led researchers to focus on specific areas of interest, with the result that connections between findings are missed. Literature-based discovery (LBD) attempts to address this problem by searching for previously unnoticed connections between published information (also known as "hidden knowledge"). A common approach is to identify hidden knowledge via shared linking terms. However, biomedical documents are highly ambiguous, which can lead LBD systems to over-generate hidden knowledge by hypothesising connections through different meanings of linking terms. Word Sense Disambiguation (WSD) aims to resolve ambiguities in text by identifying the meaning of ambiguous terms. This study explores the effect of WSD accuracy on LBD performance.
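The linking-term idea, and the way an ambiguous term can over-generate hidden knowledge, can be illustrated with a minimal sketch. It is not the LBD system used in the study: the function names, the toy documents and the `#illness` / `#temperature` sense tags are all assumptions made for illustration, and a document is reduced to a plain list of terms.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrences(documents):
    """Map each term to the set of terms it co-occurs with in any document."""
    related = defaultdict(set)
    for terms in documents:
        for a, b in combinations(set(terms), 2):
            related[a].add(b)
            related[b].add(a)
    return dict(related)


def hidden_knowledge(documents):
    """Return sorted pairs (a, c) that share a linking term but never co-occur."""
    related = cooccurrences(documents)
    candidates = set()
    for linking_term, neighbours in related.items():
        for a, c in combinations(sorted(neighbours), 2):
            if c not in related[a]:
                candidates.add((a, c))
    return candidates


# Hypothetical toy corpus: "cold" is used in two different senses.
raw_docs = [
    ["rhinovirus", "cold"],   # the illness
    ["cold", "frostbite"],    # low temperature
]

# The same corpus after an (assumed) WSD step has tagged each sense.
disambiguated_docs = [
    ["rhinovirus", "cold#illness"],
    ["cold#temperature", "frostbite"],
]

print(hidden_knowledge(raw_docs))
print(hidden_knowledge(disambiguated_docs))
```

In this toy corpus the untagged term "cold" links "rhinovirus" to "frostbite", a spurious candidate that disappears once the two senses are kept apart.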
An existing LBD system is employed and four approaches to WSD of biomedical documents are integrated with it. The accuracy of each WSD approach is determined by comparing its output against a standard benchmark. Evaluation of the LBD output is carried out using a timeslicing approach, where hidden knowledge is generated from articles published prior to a certain cutoff date and a gold standard is extracted from publications after the cutoff date.
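The timeslicing evaluation can be sketched as follows. This is an illustrative outline only: the abstract does not say how the gold standard is built or which scores are reported, so the sketch assumes that the gold standard is the set of term pairs that first co-occur after the cutoff, and that performance is summarised as precision and recall. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def cooccurring_pairs(documents):
    """Return every sorted pair of terms that share at least one document."""
    pairs = set()
    for terms in documents:
        pairs.update(combinations(sorted(set(terms)), 2))
    return pairs


def timeslice_evaluate(candidates, pre_docs, post_docs):
    """Score candidate pairs against connections first seen after the cutoff.

    Assumption: the gold standard is the set of pairs that co-occur in
    post-cutoff documents but not in pre-cutoff documents.
    """
    seen_before = cooccurring_pairs(pre_docs)
    gold = cooccurring_pairs(post_docs) - seen_before
    predicted = {tuple(sorted(pair)) for pair in candidates} - seen_before
    hits = predicted & gold
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: documents published before and after the cutoff.
pre_docs = [
    ["fish oil", "blood viscosity"],
    ["blood viscosity", "raynaud disease"],
    ["fish oil", "cold"],
    ["cold", "frostbite"],
]
post_docs = [["fish oil", "raynaud disease"]]

# Candidate hidden knowledge generated from the pre-cutoff documents.
candidates = {("fish oil", "raynaud disease"), ("fish oil", "frostbite")}

precision, recall = timeslice_evaluate(candidates, pre_docs, post_docs)
print(precision, recall)
```

Of the two candidates, only one is confirmed by the later literature, so a spurious connection of the kind produced by an ambiguous linking term lowers precision without affecting recall.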
WSD accuracy varies depending on the approach used. The connection between the performance of the LBD and WSD systems is analysed and reveals a correlation between WSD accuracy and LBD performance.
This study reveals that LBD performance is sensitive to WSD accuracy. It is therefore concluded that WSD has the potential to improve the output of LBD systems by reducing the amount of spurious hidden knowledge that is generated. It is also suggested that further improvements in WSD accuracy have the potential to improve LBD accuracy.