De-Arteaga Maria, Eggel Ivan, Kahn Charles E, Müller Henning
Carnegie Mellon University, 4800 Forbes Ave., Pittsburgh, PA, 15213, USA.
HES-SO, Rue de TechnoPole 3, 3960, Sierre, Switzerland.
J Digit Imaging. 2015 Oct;28(5):537-46. doi: 10.1007/s10278-015-9792-6.
Log files of information retrieval systems that record user behavior have been used to improve the outcomes of retrieval systems, understand user behavior, and predict events. In this article, a log file of the ARRS GoldMiner search engine containing 222,005 consecutive queries is analyzed. Time stamps are available for each query, as well as masked IP addresses, which makes it possible to identify queries from the same person. This article describes the ways in which physicians (or Internet searchers interested in medical images) search and proposes potential improvements by suggesting query modifications. For example, many queries contain only a few terms and are therefore not specific; others contain spelling mistakes or non-medical terms that likely lead to poor or empty results. One of the goals of this report is to predict the number of results a query will return, since such a model allows search engines to automatically propose query modifications that avoid result lists that are empty or too large. This prediction is based on characteristics of the query terms themselves. Prediction of empty results has an accuracy above 88% and can thus be used to automatically modify a query to avoid empty result sets for a user. Semantic analysis and data on reformulations made by users in the past can aid the development of better search systems, particularly to improve results for novice users. This paper therefore offers ideas for better understanding how people search and for using this knowledge to improve the performance of specialized medical search engines.