Mayla Boguslav, Kevin Bretonnel Cohen
Computational Bioscience Program, University of Colorado School of Medicine, Aurora, CO, USA.
Stud Health Technol Inform. 2017;245:298-302.
Human-annotated data is a fundamental part of natural language processing system development and evaluation. The quality of that data is typically assessed by calculating the agreement between the annotators. It is widely assumed that this agreement between annotators is the upper limit on system performance in natural language processing: if humans can't agree with each other about the classification more than some percentage of the time, we don't expect a computer to do any better. We trace the logical positivist roots of the motivation for measuring inter-annotator agreement, demonstrate the prevalence of the widely-held assumption about the relationship between inter-annotator agreement and system performance, and present data that suggest that inter-annotator agreement is not, in fact, an upper bound on language processing system performance.
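As a concrete illustration of the agreement calculation the abstract refers to (this sketch is not taken from the paper; Cohen's kappa is only one common choice of agreement measure, and the labels below are invented for the example):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    labels_a, labels_b: equal-length sequences of category labels,
    one entry per annotated item.
    """
    assert len(labels_a) == len(labels_b), "annotators must label the same items"
    n = len(labels_a)

    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)

    # Kappa rescales observed agreement by how much better it is than chance.
    return (observed - expected) / (1 - expected)

# Hypothetical example: two annotators, ten items, two classes.
a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos"]
b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # observed 0.80, chance 0.52, kappa ~0.58
```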