

Performance Evaluation of Large Language Models in Cervical Cancer Management Based on a Standardized Questionnaire: Comparative Study.

Authors

Kuerbanjiang Warisijiang, Peng Shengzhe, Jiamaliding Yiershatijiang, Yi Yuexiong

Affiliations

Department of Gynecology, Zhongnan Hospital of Wuhan University, Wuhan, Hubei Province, China.

Publication

J Med Internet Res. 2025 Feb 5;27:e63626. doi: 10.2196/63626.

DOI:10.2196/63626
PMID:39908540
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11840365/
Abstract

BACKGROUND

Cervical cancer remains the fourth leading cause of death among women globally, with a particularly severe burden in low-resource settings. A comprehensive approach, from screening to diagnosis and treatment, is essential for effective prevention and management. Large language models (LLMs) have emerged as potential tools to support health care, though their specific role in cervical cancer management remains underexplored.

OBJECTIVE

This study aims to systematically evaluate the performance and interpretability of LLMs in cervical cancer management.

METHODS

Models were selected from the AlpacaEval leaderboard version 2.0 and based on the capabilities of our computer. The questions inputted into the models cover aspects of general knowledge, screening, diagnosis, and treatment, according to guidelines. The prompt was developed using the Context, Objective, Style, Tone, Audience, and Response (CO-STAR) framework. Responses were evaluated for accuracy, guideline compliance, clarity, and practicality, graded as A, B, C, and D with corresponding scores of 3, 2, 1, and 0. The effective rate was calculated as the ratio of A and B responses to the total number of designed questions. Local Interpretable Model-Agnostic Explanations (LIME) was used to explain and enhance physicians' trust in model outputs within the medical context.
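The grading scheme and effective-rate formula described above can be sketched in a few lines of Python (a minimal illustration; the function names and the example grade distribution are hypothetical, not taken from the study's materials):

```python
# Sketch of the scoring scheme: grades A-D map to scores 3/2/1/0, and the
# effective rate is the share of A and B responses among all designed questions.
GRADE_SCORES = {"A": 3, "B": 2, "C": 1, "D": 0}

def mean_score(grades):
    """Average numeric score over all graded responses."""
    return sum(GRADE_SCORES[g] for g in grades) / len(grades)

def effective_rate(grades):
    """Fraction of responses graded A or B."""
    return sum(1 for g in grades if g in ("A", "B")) / len(grades)

# Hypothetical example: 100 graded responses for one model
grades = ["A"] * 70 + ["B"] * 24 + ["C"] * 5 + ["D"] * 1
print(mean_score(grades))      # 2.63
print(effective_rate(grades))  # 0.94
```

Under this scheme, a model like ChatGPT-4.0 Turbo with a mean score of 2.67 would have most responses graded A, consistent with its 94.00% effective rate.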

RESULTS

Nine models were included in this study, and a set of 100 standardized questions covering general information, screening, diagnosis, and treatment was designed based on international and national guidelines. Seven models (ChatGPT-4.0 Turbo, Claude 2, Gemini Pro, Mistral-7B-v0.2, Starling-LM-7B alpha, HuatuoGPT, and BioMedLM 2.7B) provided stable responses. Among all the models included, ChatGPT-4.0 Turbo ranked first with a mean score of 2.67 (95% CI 2.54-2.80; effective rate 94.00%) with a prompt and 2.52 (95% CI 2.37-2.67; effective rate 87.00%) without a prompt, outperforming the other 8 models (P<.001). Regardless of prompts, QiZhenGPT consistently ranked among the lowest-performing models, with P<.01 in comparisons against all models except BioMedLM. Interpretability analysis showed that prompts improved alignment with human annotations for proprietary models (median intersection over union 0.43), while medical-specialized models exhibited limited improvement.
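The intersection-over-union (IoU) agreement reported above, between the tokens LIME highlights and the tokens human annotators mark as relevant, can be computed as a set-overlap ratio. A hedged sketch, with illustrative token sets (not drawn from the study's data):

```python
def token_iou(model_tokens, human_tokens):
    """Intersection over union of two token sets (0 = no overlap, 1 = identical)."""
    a, b = set(model_tokens), set(human_tokens)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical example: tokens LIME flags as influential vs. tokens a
# physician marks as clinically key for the same question.
model_tokens = {"HPV", "screening", "cytology", "colposcopy"}
human_tokens = {"HPV", "screening", "biopsy"}
print(token_iou(model_tokens, human_tokens))  # 0.4
```

A median IoU of 0.43, as reported for prompted proprietary models, means that on a typical question slightly under half of the union of highlighted and annotated tokens overlap.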

CONCLUSIONS

Proprietary LLMs, particularly ChatGPT-4.0 Turbo and Claude 2, show promise in clinical decision-making involving logical analysis. The use of prompts can enhance the accuracy of some models in cervical cancer management to varying degrees. Medical-specialized models, such as HuatuoGPT and BioMedLM, did not perform as well as expected in this study. By contrast, proprietary models, particularly those augmented with prompts, demonstrated notable accuracy and interpretability in medical tasks, such as cervical cancer management. However, this study underscores the need for further research to explore the practical application of LLMs in medical practice.

Figures (PMC):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c066/11840365/fbcab7dbd5e3/jmir_v27i1e63626_fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c066/11840365/f6d7b90e82e8/jmir_v27i1e63626_fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c066/11840365/d4e1dfd719d1/jmir_v27i1e63626_fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c066/11840365/ecb7aa407dd8/jmir_v27i1e63626_fig4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c066/11840365/58ba8c168d85/jmir_v27i1e63626_fig5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c066/11840365/c5ff54b37b7e/jmir_v27i1e63626_fig6.jpg

Similar Articles

1
Performance Evaluation of Large Language Models in Cervical Cancer Management Based on a Standardized Questionnaire: Comparative Study.
J Med Internet Res. 2025 Feb 5;27:e63626. doi: 10.2196/63626.
2
Enhanced Artificial Intelligence in Bladder Cancer Management: A Comparative Analysis and Optimization Study of Multiple Large Language Models.
J Endourol. 2025 May;39(5):494-499. doi: 10.1089/end.2024.0860. Epub 2025 Mar 18.
3
Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5 edition.
Diagn Interv Radiol. 2025 Mar 3;31(2):111-129. doi: 10.4274/dir.2024.242876. Epub 2024 Sep 9.
4
Proficiency, Clarity, and Objectivity of Large Language Models Versus Specialists' Knowledge on COVID-19's Impacts in Pregnancy: Cross-Sectional Pilot Study.
JMIR Form Res. 2025 Feb 5;9:e56126. doi: 10.2196/56126.
5
Accuracy of Large Language Models for Infective Endocarditis Prophylaxis in Dental Procedures.
Int Dent J. 2025 Feb;75(1):206-212. doi: 10.1016/j.identj.2024.09.033. Epub 2024 Oct 12.
6
Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis.
Acad Radiol. 2025 Feb;32(2):888-898. doi: 10.1016/j.acra.2024.09.041. Epub 2024 Sep 30.
7
Currently Available Large Language Models Do Not Provide Musculoskeletal Treatment Recommendations That Are Concordant With Evidence-Based Clinical Practice Guidelines.
Arthroscopy. 2025 Feb;41(2):263-275.e6. doi: 10.1016/j.arthro.2024.07.040. Epub 2024 Aug 22.
8
Accuracy, consistency, and contextual understanding of large language models in restorative dentistry and endodontics.
J Dent. 2025 Jun;157:105764. doi: 10.1016/j.jdent.2025.105764. Epub 2025 Apr 15.
9
Optimizing ChatGPT's Interpretation and Reporting of Delirium Assessment Outcomes: Exploratory Study.
JMIR Form Res. 2024 Oct 1;8:e51383. doi: 10.2196/51383.
10
Accuracy of latest large language models in answering multiple choice questions in dentistry: A comparative study.
PLoS One. 2025 Jan 29;20(1):e0317423. doi: 10.1371/journal.pone.0317423. eCollection 2025.

Cited By

1
Exploring the possibilities and limitations of customized large language model to support and improve cervical cancer screening.
BMC Med Inform Decis Mak. 2025 Jul 1;25(1):242. doi: 10.1186/s12911-025-03088-3.
2
High-risk HPV genotypes in women with abnormal cytology: a 12-year retrospective study.
Infect Agent Cancer. 2025 May 26;20(1):34. doi: 10.1186/s13027-025-00664-0.
