Myers Skatje, Miller Timothy A, Gao Yanjun, Churpek Matthew M, Mayampurath Anoop, Dligach Dmitriy, Afshar Majid
Department of Medicine, University of Wisconsin-Madison, Madison, WI 53726, United States.
Computational Health Informatics Program, Boston Children's Hospital, Boston, MA 02215, United States.
J Am Med Inform Assoc. 2025 Feb 1;32(2):357-364. doi: 10.1093/jamia/ocae308.
Applying large language models (LLMs) to the clinical domain is challenging due to the context-heavy nature of processing medical records. Retrieval-augmented generation (RAG) offers a solution by enabling reasoning over large text sources. However, the retrieval system alone presents many parameters to optimize. This paper presents an ablation study of how different embedding models and pooling methods affect information retrieval in the clinical domain.
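For readers unfamiliar with the retrieval step of RAG, the sketch below shows dense retrieval in its simplest form: embed a query and a set of note snippets, then rank snippets by cosine similarity. This is illustrative only, not the paper's pipeline; the snippet texts and query are invented, and the BGE checkpoint name is an assumption based on the publicly available model family.

```python
# Minimal dense-retrieval sketch (illustrative; not the paper's pipeline).
# Assumes the sentence-transformers package and a public BGE checkpoint.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # small general-domain embedder

# Hypothetical EHR-like snippets standing in for retrievable note segments.
corpus = [
    "Patient admitted with community-acquired pneumonia, started on ceftriaxone.",
    "History of type 2 diabetes mellitus, metformin 1000 mg twice daily.",
    "Echocardiogram shows ejection fraction of 35%, consistent with heart failure.",
]
query = "What antibiotics is the patient receiving?"

# Embed query and corpus, then rank snippets by cosine similarity.
corpus_emb = model.encode(corpus, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)
scores = util.cos_sim(query_emb, corpus_emb)[0]
best = scores.argmax().item()
print(f"Top snippet ({scores[best]:.3f}): {corpus[best]}")
```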
Evaluating 3 retrieval tasks on 2 electronic health record (EHR) data sources, we compared 7 models, including medical- and general-domain models, specialized encoder embedding models, and off-the-shelf decoder LLMs. We also examined each model's embedding pooling strategy, chosen independently for the query and for the text to be retrieved; a sketch of the common pooling options follows.
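To make the pooling choices concrete, here is a minimal sketch of three common strategies for collapsing a transformer's token-level hidden states into one embedding vector. The model choice and input text are assumptions for illustration; the paper evaluates these strategies empirically rather than prescribing one.

```python
# Sketch of three common embedding-pooling strategies over transformer
# hidden states (illustrative; model and input text are assumptions).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5")

text = "Patient presents with shortness of breath and chest pain."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (batch, seq_len, hidden_dim)

mask = inputs["attention_mask"].unsqueeze(-1)    # zero out padding positions

cls_emb = hidden[:, 0]                           # first-token ([CLS]) pooling
mean_emb = (hidden * mask).sum(1) / mask.sum(1)  # mean pooling over real tokens
last_pos = inputs["attention_mask"].sum(1) - 1   # index of last real token
last_emb = hidden[torch.arange(hidden.size(0)), last_pos]  # last-token pooling
```

Last-token pooling is the natural choice for decoder LLMs, whose final token attends to the whole sequence, while first-token and mean pooling are typical for encoder models.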
We found that the choice of embedding model significantly impacts retrieval performance, with BGE, a comparatively small general-domain model, consistently outperforming all others, including medical-specific models. However, our findings also revealed substantial variability across datasets and query phrasings. We also determined the best pooling method for each model to guide the future design of retrieval systems.
The choice of embedding model, pooling strategy, and query formulation can significantly impact retrieval performance, and these models' performance on public benchmarks does not necessarily transfer to new domains. The high variability in performance across query phrasings suggests that the query may need to be tuned and validated for each task, or even for each institution's EHR, as sketched below.
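As one concrete way to act on that point, a retrieval query could be validated against a small labeled example before being fixed; in this hypothetical sketch, the candidate phrasings, snippets, and gold label are all invented.

```python
# Sketch: sanity-checking alternative query phrasings against a small
# labeled example before committing to one (all data here is made up).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
corpus = [
    "Patient admitted with community-acquired pneumonia, started on ceftriaxone.",
    "History of type 2 diabetes mellitus, metformin 1000 mg twice daily.",
]
gold_idx = 0  # the snippet a correct retrieval should rank first

phrasings = [
    "What antibiotics is the patient receiving?",
    "List current antimicrobial therapy.",
    "medications for infection",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True)
for q in phrasings:
    scores = util.cos_sim(model.encode(q, normalize_embeddings=True), corpus_emb)[0]
    print(f"top-1 correct: {scores.argmax().item() == gold_idx}  query={q!r}")
```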
This study provides empirical evidence to guide the selection of models and pooling strategies for RAG frameworks in healthcare applications. Further studies of this kind are vital for the empirically grounded development of retrieval frameworks, such as RAG, for the clinical domain.