Ian Soboroff
National Institute of Standards and Technology, Gaithersburg, Maryland, USA.
Inf Retr Res J. 2025 Mar 25;1(1). doi: 10.54195/irrj.19625.
Relevance judgments and other truth data for information retrieval (IR) evaluations are created manually. There is a strong temptation to use large language models (LLMs) as proxies for human judges. However, letting an LLM write your truth data handicaps the evaluation by setting that LLM as a ceiling on measurable performance. There are ways to use LLMs in the relevance assessment process, but simply generating relevance judgments with a prompt isn't one of them.