Ian Soboroff
National Institute of Standards and Technology, Gaithersburg, Maryland, USA.
Inf Retr Res J. 2025 Mar 25;1(1). doi: 10.54195/irrj.19625.
Relevance judgments and other truth data for information retrieval (IR) evaluations are created manually. There is a strong temptation to use large language models (LLMs) as proxies for human judges. However, letting an LLM write your truth data handicaps the evaluation by setting that LLM as a ceiling on measurable performance. There are ways to use LLMs in the relevance assessment process, but simply generating relevance judgments with a prompt isn't one of them.