Hersh William R, Müller Henning, Jensen Jeffery R, Yang Jianji, Gorman Paul N, Ruch Patrick
Department of Medical Informatics & Clinical Epidemiology, Oregon Health & Science University, BICC, Portland, OR 97239, USA.
J Am Med Inform Assoc. 2006 Sep-Oct;13(5):488-96. doi: 10.1197/jamia.M2082. Epub 2006 Jun 23.
To develop an image retrieval test collection and analyze the results obtained from it.
After participating research groups obtained and assessed results from their systems in the image retrieval task of the Cross-Language Evaluation Forum, we assessed the results for common themes and trends. In addition to overall performance, results were analyzed on the basis of topic categories (those most amenable to visual, textual, or mixed approaches) and run categories (those employing queries entered by automated or manual means as well as those using visual, textual, or mixed indexing and retrieval methods). We also assessed results on the different topics and compared the impact of duplicate relevance judgments.
A total of 13 research groups participated. Analysis was limited to the best run submitted by each group in each run category. The best results were obtained by systems that combined visual and textual methods. There was substantial variation in performance across topics. Systems employing textual methods were more resilient to visually oriented topics than those using visual methods were to textually oriented topics. The primary performance measure of mean average precision (MAP) was not necessarily associated with other measures, including those possibly more pertinent to real users, such as precision at 10 or 30 images.
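The measures named above can be made concrete with a small sketch. The following Python code is an illustration only, not the evaluation software used in the study; it assumes the standard information retrieval definitions of precision at k and (uninterpolated) average precision, and the function names, the topic and image identifiers, and the data layout (a ranked list of image IDs per topic plus a set of relevant IDs per topic) are hypothetical.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved images that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for image_id in ranked[:k] if image_id in relevant)
    return hits / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant image appears.

    Relevant images that are never retrieved contribute zero, because the
    sum is divided by the total number of relevant images for the topic.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, image_id in enumerate(ranked, start=1):
        if image_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Average of per-topic average precision over all judged topics."""
    if not qrels:
        return 0.0
    scores = [
        average_precision(runs.get(topic, []), relevant)
        for topic, relevant in qrels.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: two topics, each with a ranked result list.
example_runs = {
    "topic1": ["a", "b", "c", "d"],
    "topic2": ["x", "y"],
}
example_qrels = {
    "topic1": {"a", "c"},
    "topic2": {"y"},
}

example_map = mean_average_precision(example_runs, example_qrels)
example_p2 = precision_at_k(example_runs["topic1"], example_qrels["topic1"], 2)
```

Because MAP rewards relevant images found anywhere in the ranking while precision at 10 or 30 looks only at the top of the list, two runs can be ordered differently by the two measures, which is consistent with the weak association reported above.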
We developed a test collection amenable to assessing visual and textual methods for image retrieval. Future work must focus on how varying topic and run types affect retrieval performance. User studies are also necessary to determine the best measures for evaluating the efficacy of image retrieval systems.