

Extracting Clinical Guideline Information Using Two Large Language Models: Evaluation Study.

Authors

Hsu Hsing-Yu, Chen Lu-Wen, Hsu Wan-Tseng, Hsieh Yow-Wen, Chang Shih-Sheng

Affiliations

Graduate Institute of Clinical Pharmacy, College of Medicine, National Taiwan University, Taipei, Taiwan.

Department of Pharmacy, China Medical University Hospital, Taichung, Taiwan.

Publication

J Med Internet Res. 2025 Sep 5;27:e73486. doi: 10.2196/73486.

DOI: 10.2196/73486
PMID: 40911841
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12413144/
Abstract

BACKGROUND

The effective implementation of personalized pharmacogenomics (PGx) requires the integration of released clinical guidelines into decision support systems to facilitate clinical applications. Large language models (LLMs) can be valuable tools for automating information extraction and updates.

OBJECTIVE

This study aimed to assess the effectiveness of repeated cross-comparison and an agreement-threshold strategy, applied across 2 advanced LLMs, as supportive tools for updating guideline information.

METHODS

The study evaluated the performance of 2 LLMs, GPT-4o and Gemini-1.5-Pro, in extracting PGx clinical guidelines and comparing their outputs with expert-annotated evaluations. The 2 LLMs classified 385 PGx clinical guidelines, with each recommendation tested 20 times per model. Accuracy was assessed by comparing the results with manually labeled data. Two prospectively defined strategies were used to identify inconsistent predictions. The first involved repeated cross-comparison, flagging discrepancies between the most frequent classifications from each model. The second used a consistency threshold strategy, which designated predictions appearing in less than 60% of the 40 combined outputs as unstable. Cases flagged by either strategy were subjected to manual review. This study also estimated the overall cost of model use and was conducted between October 1 and November 30, 2024.
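The two flagging strategies described above (cross-comparison of each model's modal answer, plus a 60% agreement threshold over the 40 combined outputs) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function names and label values are assumptions.

```python
from collections import Counter

# Predictions appearing in less than 60% of the 40 combined outputs are "unstable"
CONSISTENCY_THRESHOLD = 0.60

def most_frequent(outputs):
    """Return the most frequent classification among repeated runs."""
    return Counter(outputs).most_common(1)[0][0]

def flag_for_review(gpt_outputs, gemini_outputs):
    """Apply both strategies to one guideline item.

    gpt_outputs / gemini_outputs: the 20 repeated classifications per model.
    Returns True if the item should be prioritized for manual review.
    """
    # Strategy 1: repeated cross-comparison between the models' modal answers
    gpt_mode = most_frequent(gpt_outputs)
    gemini_mode = most_frequent(gemini_outputs)
    if gpt_mode != gemini_mode:
        return True

    # Strategy 2: agreement threshold over the 40 combined outputs
    combined = list(gpt_outputs) + list(gemini_outputs)
    share = Counter(combined)[gpt_mode] / len(combined)
    return share < CONSISTENCY_THRESHOLD

# Example (invented labels): models agree, modal label covers 39/40 outputs
gpt = ["actionable"] * 19 + ["informative"]
gemini = ["actionable"] * 20
print(flag_for_review(gpt, gemini))  # False -> no priority review needed
```

Items flagged by either test are the ones routed to pharmacist review; everything else is accepted automatically.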

RESULTS

GPT-4o and Gemini-1.5-Pro yielded reproducibility rates of 97.8% (7534/7700) and 98.9% (7612/7700), respectively, based on the most frequent classification for each query. Compared with expert labels, GPT-4o achieved 93.5% accuracy (Cohen κ=0.90; P<.001) and Gemini-1.5-Pro 92.7% accuracy (Cohen κ=0.89; P<.001). Both models demonstrated high overall performance, with comparable weighted average F1-scores (GPT-4o: 0.929; Gemini: 0.935). The models generated consistent predictions for 341 of 385 guideline items, reducing the need for manual review by 88.6%. Among these agreed-upon cases, only one (0.3%) diverged from expert labels. Applying a predefined agreement-threshold strategy further reduced the number of priority manual review cases to 2.9% (11/385), although the error rate slightly increased to 0.5% (2/374). The inconsistencies identified through these methods prompted the prioritization of manual review to minimize errors and enhance clinical applicability. The total combined cost of using both LLMs was only US $0.76.
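The Cohen κ values reported above measure chance-corrected agreement between each model's modal output and the expert labels. A minimal stdlib computation looks like this; the toy labels are invented for illustration and are not the study's data.

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: chance-corrected agreement between two labelings."""
    assert len(y_true) == len(y_pred)
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    # Observed agreement: fraction of items where the labelings match
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected agreement if the two labelings were independent
    expected = sum((y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Toy example with invented labels
truth = ["A", "A", "B", "B", "A", "B"]
pred  = ["A", "A", "B", "A", "A", "B"]
print(round(cohens_kappa(truth, pred), 3))  # 0.667
```

A κ near 0.9, as in the study, indicates agreement far above what label frequencies alone would produce.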

CONCLUSIONS

These findings suggest that using 2 LLMs can effectively streamline PGx guideline integration into clinical decision support systems while maintaining high performance and minimal cost. Although selective manual review remains necessary, this approach offers a practical and scalable solution for PGx guideline classification in clinical workflows.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8a8/12413144/b79613a11aaa/jmir-v27-e73486-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8a8/12413144/d4242e86b260/jmir-v27-e73486-g001.jpg

Similar Articles

1. Extracting Clinical Guideline Information Using Two Large Language Models: Evaluation Study.
J Med Internet Res. 2025 Sep 5;27:e73486. doi: 10.2196/73486.
2. Prescription of Controlled Substances: Benefits and Risks.
3. Classifying Patient Complaints Using Artificial Intelligence-Powered Large Language Models: Cross-Sectional Study.
J Med Internet Res. 2025 Aug 6;27:e74231. doi: 10.2196/74231.
4. Comparison of Two Modern Survival Prediction Tools, SORG-MLA and METSSS, in Patients With Symptomatic Long-bone Metastases Who Underwent Local Treatment With Surgery Followed by Radiotherapy and With Radiotherapy Alone.
Clin Orthop Relat Res. 2024 Dec 1;482(12):2193-2208. doi: 10.1097/CORR.0000000000003185. Epub 2024 Jul 23.
5. Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study.
J Med Internet Res. 2025 May 20;27:e69910. doi: 10.2196/69910.
6. Assessment of Recommendations Provided to Athletes Regarding Sleep Education by GPT-4o and Google Gemini: Comparative Evaluation Study.
JMIR Form Res. 2025 Jul 8;9:e71358. doi: 10.2196/71358.
7. Leveraging Retrieval-Augmented Large Language Models for Dietary Recommendations With Traditional Chinese Medicine's Medicine Food Homology: Algorithm Development and Validation.
JMIR Med Inform. 2025 Aug 21;13:e75279. doi: 10.2196/75279.
8. Data extraction from free-text stroke CT reports using GPT-4o and Llama-3.3-70B: the impact of annotation guidelines.
Eur Radiol Exp. 2025 Jun 19;9(1):61. doi: 10.1186/s41747-025-00600-2.
9. Cost-effectiveness of using prognostic information to select women with breast cancer for adjuvant systemic therapy.
Health Technol Assess. 2006 Sep;10(34):iii-iv, ix-xi, 1-204. doi: 10.3310/hta10340.
10. Intravenous magnesium sulphate and sotalol for prevention of atrial fibrillation after coronary artery bypass surgery: a systematic review and economic evaluation.
Health Technol Assess. 2008 Jun;12(28):iii-iv, ix-95. doi: 10.3310/hta12280.
