Extracting Clinical Guideline Information Using Two Large Language Models: Evaluation Study.

Author Information

Hsu Hsing-Yu, Chen Lu-Wen, Hsu Wan-Tseng, Hsieh Yow-Wen, Chang Shih-Sheng

Affiliations

Graduate Institute of Clinical Pharmacy, College of Medicine, National Taiwan University, Taipei, Taiwan.

Department of Pharmacy, China Medical University Hospital, Taichung, Taiwan.

Publication Information

J Med Internet Res. 2025 Sep 5;27:e73486. doi: 10.2196/73486.

Abstract

BACKGROUND

The effective implementation of personalized pharmacogenomics (PGx) requires the integration of released clinical guidelines into decision support systems to facilitate clinical applications. Large language models (LLMs) can be valuable tools for automating information extraction and updates.

OBJECTIVE

This study aimed to assess the effectiveness of repeated cross-comparisons and an agreement-threshold strategy in 2 advanced LLMs as supportive tools for updating information.

METHODS

The study evaluated the performance of 2 LLMs, GPT-4o and Gemini-1.5-Pro, in extracting PGx clinical guidelines and comparing their outputs with expert-annotated evaluations. The 2 LLMs classified 385 PGx clinical guidelines, with each recommendation tested 20 times per model. Accuracy was assessed by comparing the results with manually labeled data. Two prospectively defined strategies were used to identify inconsistent predictions. The first involved repeated cross-comparison, flagging discrepancies between the most frequent classifications from each model. The second used a consistency threshold strategy, which designated predictions appearing in less than 60% of the 40 combined outputs as unstable. Cases flagged by either strategy were subjected to manual review. This study also estimated the overall cost of model use and was conducted between October 1 and November 30, 2024.
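The two flagging strategies described above can be sketched in a few lines. This is a minimal illustration, not the study's code: the function name, label strings, and the assumption that each model's 20 repeated outputs arrive as a list of classification labels are all hypothetical.

```python
from collections import Counter

AGREEMENT_THRESHOLD = 0.60  # top predictions below 60% of the 40 combined outputs are unstable


def flag_for_review(gpt4o_outputs, gemini_outputs):
    """Return True if a guideline item should be prioritized for manual review.

    Each argument is a list of 20 classification labels, one per repeated query
    of the same guideline recommendation (names are illustrative).
    """
    # Strategy 1: repeated cross-comparison -- flag a discrepancy between the
    # most frequent classification from each model.
    top_gpt = Counter(gpt4o_outputs).most_common(1)[0][0]
    top_gemini = Counter(gemini_outputs).most_common(1)[0][0]
    if top_gpt != top_gemini:
        return True

    # Strategy 2: consistency threshold -- flag the item if its most frequent
    # prediction appears in fewer than 60% of the 40 combined outputs.
    combined = gpt4o_outputs + gemini_outputs
    _, top_count = Counter(combined).most_common(1)[0]
    return top_count / len(combined) < AGREEMENT_THRESHOLD


# Hypothetical example: the models agree and the top label is stable (39/40),
# so the item is not flagged.
print(flag_for_review(["A"] * 20, ["A"] * 19 + ["B"]))  # False
```

Items where either check fires would go to the manual-review queue; all others are accepted as the consensus classification.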

RESULTS

GPT-4o and Gemini-1.5-Pro yielded reproducibility rates of 97.8% (7534/7700) and 98.9% (7612/7700), respectively, based on the most frequent classification for each query. Compared with expert labels, GPT-4o achieved 93.5% accuracy (Cohen κ=0.90; P<.001) and Gemini-1.5-Pro 92.7% accuracy (Cohen κ=0.89; P<.001). Both models demonstrated high overall performance, with comparable weighted average F1-scores (GPT-4o: 0.929; Gemini: 0.935). The models generated consistent predictions for 341 of 385 guideline items, reducing the need for manual review by 88.6%. Among these agreed-upon cases, only one (0.3%) diverged from expert labels. Applying a predefined agreement-threshold strategy further reduced the number of priority manual review cases to 2.9% (11/385), although the error rate slightly increased to 0.5% (2/374). The inconsistencies identified through these methods prompted the prioritization of manual review to minimize errors and enhance clinical applicability. The total combined cost of using both LLMs was only US $0.76.
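The agreement statistic reported above, Cohen κ, corrects observed accuracy for the agreement expected by chance. A minimal pure-Python sketch of the calculation (the toy labels are hypothetical, not the study's data):

```python
from collections import Counter


def cohens_kappa(expert, model):
    """Cohen's kappa: agreement beyond chance between two label sequences."""
    n = len(expert)
    # Observed agreement: fraction of items where the labels match.
    p_o = sum(e == m for e, m in zip(expert, model)) / n
    # Expected chance agreement from the two marginal label distributions.
    exp_counts = Counter(expert)
    mod_counts = Counter(model)
    p_e = sum(exp_counts[c] * mod_counts.get(c, 0) for c in exp_counts) / n ** 2
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy example: 5 of 6 labels agree.
expert = ["actionable", "actionable", "none", "none", "none", "actionable"]
model = ["actionable", "actionable", "none", "none", "actionable", "actionable"]
print(round(cohens_kappa(expert, model), 2))  # 0.67
```

A κ near 0.90, as both models achieved here, indicates almost perfect agreement with the expert labels rather than agreement inflated by an imbalanced label distribution.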

CONCLUSIONS

These findings suggest that using 2 LLMs can effectively streamline PGx guideline integration into clinical decision support systems while maintaining high performance and minimal cost. Although selective manual review remains necessary, this approach offers a practical and scalable solution for PGx guideline classification in clinical workflows.

Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8a8/12413144/d4242e86b260/jmir-v27-e73486-g001.jpg