Pradhan Sameer, Luo Xiaoqiang, Recasens Marta, Hovy Eduard, Ng Vincent, Strube Michael
Harvard Medical School, Boston, MA.
Google Inc., New York, NY.
Proc Conf Assoc Comput Linguist Meet. 2014 Jun;2014:30-35. doi: 10.3115/v1/P14-2006.
The definitions of two coreference scoring metrics, B³ and CEAF, are underspecified with respect to predicted, as opposed to key (or gold), mentions. Several variations have been proposed that manipulate the key mentions, the predicted mentions, or both, in order to obtain a one-to-one mapping. The BLANC metric, on the other hand, was until recently limited to scoring partitions of key mentions. In this paper, we (i) argue that mention manipulation for scoring predicted mentions is unnecessary, and potentially harmful because it can produce unintuitive results; (ii) illustrate the application of all these measures to scoring predicted mentions; (iii) make available an open-source, thoroughly tested reference implementation of the main coreference evaluation measures; and (iv) rescore the results of the CoNLL-2011/2012 shared task systems with this implementation. This will help the community accurately measure and compare new end-to-end coreference resolution algorithms.
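To make the idea of scoring predicted mentions without manipulating them concrete, the following is a minimal Python sketch of B³ computed directly over entity overlaps. It is not the reference implementation mentioned above; the function name `b_cubed` and the toy partitions are hypothetical, and the sketch assumes the overlap-based formulation in which recall sums |K ∩ R|² / |K| over all key entities K and response entities R and divides by the total number of key mentions, with precision defined symmetrically over response entities.

```python
def b_cubed(key, response):
    """Illustrative B-cubed over entity overlaps (assumed formulation).

    key, response: lists of entities, each entity a collection of mention ids.
    Mentions found in only one of the two partitions are left untouched;
    they contribute nothing to the overlaps but still count in the
    denominators.
    """
    key = [set(k) for k in key]
    response = [set(r) for r in response]

    n_key = sum(len(k) for k in key)            # total key mentions
    n_response = sum(len(r) for r in response)  # total predicted mentions

    # Squared overlap of every key/response entity pair, normalized by the
    # size of the key entity (recall) or of the response entity (precision).
    recall_sum = sum(len(k & r) ** 2 / len(k) for k in key for r in response)
    precision_sum = sum(len(k & r) ** 2 / len(r) for k in key for r in response)

    recall = recall_sum / n_key if n_key else 0.0
    precision = precision_sum / n_response if n_response else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy partitions: the response misses mention "e" and
# introduces the spurious mentions "h" and "i".
key = [{"a", "b", "c"}, {"d", "e", "f", "g"}]
response = [{"a", "b"}, {"c", "d"}, {"f", "g", "h", "i"}]
precision, recall, f1 = b_cubed(key, response)
```

Because the two partitions are compared only through their intersections, no mention has to be added to or removed from either side before scoring.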