

Extracting Clinical Guideline Information Using Two Large Language Models: Evaluation Study.

Authors

Hsu Hsing-Yu, Chen Lu-Wen, Hsu Wan-Tseng, Hsieh Yow-Wen, Chang Shih-Sheng

Affiliations

Graduate Institute of Clinical Pharmacy, College of Medicine, National Taiwan University, Taipei, Taiwan.

Department of Pharmacy, China Medical University Hospital, Taichung, Taiwan.

Publication

J Med Internet Res. 2025 Sep 5;27:e73486. doi: 10.2196/73486.

DOI: 10.2196/73486
PMID: 40911841
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12413144/
Abstract

BACKGROUND

The effective implementation of personalized pharmacogenomics (PGx) requires the integration of released clinical guidelines into decision support systems to facilitate clinical applications. Large language models (LLMs) can be valuable tools for automating information extraction and updates.

OBJECTIVE

This study aimed to assess the effectiveness of repeated cross-comparison and an agreement-threshold strategy, applied across 2 advanced LLMs, as supportive tools for updating guideline information.

METHODS

The study evaluated the performance of 2 LLMs, GPT-4o and Gemini-1.5-Pro, in extracting PGx clinical guidelines and comparing their outputs with expert-annotated evaluations. The 2 LLMs classified 385 PGx clinical guidelines, with each recommendation tested 20 times per model. Accuracy was assessed by comparing the results with manually labeled data. Two prospectively defined strategies were used to identify inconsistent predictions. The first involved repeated cross-comparison, flagging discrepancies between the most frequent classifications from each model. The second used a consistency threshold strategy, which designated predictions appearing in less than 60% of the 40 combined outputs as unstable. Cases flagged by either strategy were subjected to manual review. This study also estimated the overall cost of model use and was conducted between October 1 and November 30, 2024.
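The two flagging strategies described above (cross-comparison of each model's modal answer, plus a 60% agreement threshold over the 40 combined outputs) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function names and label values are assumptions.

```python
from collections import Counter

# Predictions appearing in less than 60% of the 40 combined outputs are "unstable"
CONSISTENCY_THRESHOLD = 0.60

def most_frequent(outputs):
    """Return the most frequent classification among repeated runs."""
    return Counter(outputs).most_common(1)[0][0]

def flag_for_review(gpt_outputs, gemini_outputs):
    """Apply both strategies to one guideline item.

    gpt_outputs / gemini_outputs: the 20 repeated classifications per model.
    Returns True if the item should be prioritized for manual review.
    """
    # Strategy 1: repeated cross-comparison between the models' modal answers
    gpt_mode = most_frequent(gpt_outputs)
    gemini_mode = most_frequent(gemini_outputs)
    if gpt_mode != gemini_mode:
        return True

    # Strategy 2: agreement threshold over the 40 combined outputs
    combined = list(gpt_outputs) + list(gemini_outputs)
    share = Counter(combined)[gpt_mode] / len(combined)
    return share < CONSISTENCY_THRESHOLD

# Example (invented labels): models agree, modal label covers 39/40 outputs
gpt = ["actionable"] * 19 + ["informative"]
gemini = ["actionable"] * 20
print(flag_for_review(gpt, gemini))  # False -> no priority review needed
```

Items flagged by either test are the ones routed to pharmacist review; everything else is accepted automatically.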

RESULTS

GPT-4o and Gemini-1.5-Pro yielded reproducibility rates of 97.8% (7534/7700) and 98.9% (7612/7700), respectively, based on the most frequent classification for each query. Compared with expert labels, GPT-4o achieved 93.5% accuracy (Cohen κ=0.90; P<.001) and Gemini-1.5-Pro 92.7% accuracy (Cohen κ=0.89; P<.001). Both models demonstrated high overall performance, with comparable weighted average F1-scores (GPT-4o: 0.929; Gemini: 0.935). The models generated consistent predictions for 341 of 385 guideline items, reducing the need for manual review by 88.6%. Among these agreed-upon cases, only one (0.3%) diverged from expert labels. Applying a predefined agreement-threshold strategy further reduced the number of priority manual review cases to 2.9% (11/385), although the error rate slightly increased to 0.5% (2/374). The inconsistencies identified through these methods prompted the prioritization of manual review to minimize errors and enhance clinical applicability. The total combined cost of using both LLMs was only US $0.76.
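The Cohen κ values reported above measure chance-corrected agreement between each model's modal output and the expert labels. A minimal stdlib computation looks like this; the toy labels are invented for illustration and are not the study's data.

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: chance-corrected agreement between two labelings."""
    assert len(y_true) == len(y_pred)
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    # Observed agreement: fraction of items where the labelings match
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected agreement if the two labelings were independent
    expected = sum((y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Toy example with invented labels
truth = ["A", "A", "B", "B", "A", "B"]
pred  = ["A", "A", "B", "A", "A", "B"]
print(round(cohens_kappa(truth, pred), 3))  # 0.667
```

A κ near 0.9, as in the study, indicates agreement far above what label frequencies alone would produce.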

CONCLUSIONS

These findings suggest that using 2 LLMs can effectively streamline PGx guideline integration into clinical decision support systems while maintaining high performance and minimal cost. Although selective manual review remains necessary, this approach offers a practical and scalable solution for PGx guideline classification in clinical workflows.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8a8/12413144/b79613a11aaa/jmir-v27-e73486-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c8a8/12413144/d4242e86b260/jmir-v27-e73486-g001.jpg

Similar Articles

1. Extracting Clinical Guideline Information Using Two Large Language Models: Evaluation Study.
J Med Internet Res. 2025 Sep 5;27:e73486. doi: 10.2196/73486.
2. Prescription of Controlled Substances: Benefits and Risks.
3. Classifying Patient Complaints Using Artificial Intelligence-Powered Large Language Models: Cross-Sectional Study.
J Med Internet Res. 2025 Aug 6;27:e74231. doi: 10.2196/74231.
4. Comparison of Two Modern Survival Prediction Tools, SORG-MLA and METSSS, in Patients With Symptomatic Long-bone Metastases Who Underwent Local Treatment With Surgery Followed by Radiotherapy and With Radiotherapy Alone.
Clin Orthop Relat Res. 2024 Dec 1;482(12):2193-2208. doi: 10.1097/CORR.0000000000003185. Epub 2024 Jul 23.
5. Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study.
J Med Internet Res. 2025 May 20;27:e69910. doi: 10.2196/69910.
6. Assessment of Recommendations Provided to Athletes Regarding Sleep Education by GPT-4o and Google Gemini: Comparative Evaluation Study.
JMIR Form Res. 2025 Jul 8;9:e71358. doi: 10.2196/71358.
7. Leveraging Retrieval-Augmented Large Language Models for Dietary Recommendations With Traditional Chinese Medicine's Medicine Food Homology: Algorithm Development and Validation.
JMIR Med Inform. 2025 Aug 21;13:e75279. doi: 10.2196/75279.
8. Data extraction from free-text stroke CT reports using GPT-4o and Llama-3.3-70B: the impact of annotation guidelines.
Eur Radiol Exp. 2025 Jun 19;9(1):61. doi: 10.1186/s41747-025-00600-2.
9. Cost-effectiveness of using prognostic information to select women with breast cancer for adjuvant systemic therapy.
Health Technol Assess. 2006 Sep;10(34):iii-iv, ix-xi, 1-204. doi: 10.3310/hta10340.
10. Intravenous magnesium sulphate and sotalol for prevention of atrial fibrillation after coronary artery bypass surgery: a systematic review and economic evaluation.
Health Technol Assess. 2008 Jun;12(28):iii-iv, ix-95. doi: 10.3310/hta12280.
