Choi Miji, Zobel Justin, Verspoor Karin
Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia; National ICT Australia (NICTA), Victoria Research Laboratory, Australia.
Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia.
J Biomed Inform. 2016 Apr;60:309-18. doi: 10.1016/j.jbi.2016.02.015. Epub 2016 Feb 27.
Coreference resolution is an essential task in information extraction from the published biomedical literature. It supports the discovery of complex information by linking referring expressions such as pronouns and appositives to their referents, which are typically entities that play a central role in biomedical events. Correctly establishing these links allows detailed understanding of all the participants in events, and connecting events together through their shared participants.
As an initial step towards the development of a novel coreference resolution system for the biomedical domain, we have categorised the characteristics of coreference relations by type of anaphor as well as broader syntactic and semantic characteristics, and have compared the performance of a domain adaptation of a state-of-the-art general system to published results from domain-specific systems in terms of this categorisation. We also develop a rule-based system for anaphoric coreference resolution in the biomedical domain with simple modules derived from available systems. Our results show that the domain-specific systems outperform the general system overall. Whilst this result is unsurprising, our proposed categorisation enables a detailed quantitative analysis of the system performance. We identify limitations of each system and find that there remain important gaps in the state-of-the-art systems, which are clearly identifiable with respect to the categorisation.
We have analysed in detail the performance of existing coreference resolution systems for the biomedical literature and have demonstrated that there are clear gaps in their coverage. The approach developed in the general domain needs to be tailored for portability to the biomedical domain. The specific framework for class-based error analysis of existing systems that we propose has benefits for identifying specific limitations of those systems. This in turn provides insights for further system development.