Kuerbanjiang Warisijiang, Peng Shengzhe, Jiamaliding Yiershatijiang, Yi Yuexiong
Department of Gynecology, Zhongnan Hospital of Wuhan University, Wuhan, Hubei Province, China.
J Med Internet Res. 2025 Feb 5;27:e63626. doi: 10.2196/63626.
Cervical cancer remains the fourth leading cause of cancer death among women globally, with a particularly severe burden in low-resource settings. A comprehensive approach, spanning screening, diagnosis, and treatment, is essential for effective prevention and management. Large language models (LLMs) have emerged as potential tools to support health care, though their specific role in cervical cancer management remains underexplored.
This study aims to systematically evaluate the performance and interpretability of LLMs in cervical cancer management.
Models were selected from the AlpacaEval leaderboard version 2.0, subject to the computing capacity available to us. The questions input into the models covered general knowledge, screening, diagnosis, and treatment, in accordance with published guidelines. The prompt was developed using the Context, Objective, Style, Tone, Audience, and Response (CO-STAR) framework. Responses were evaluated for accuracy, guideline compliance, clarity, and practicality and graded A, B, C, or D, with corresponding scores of 3, 2, 1, and 0. The effective rate was calculated as the proportion of A and B responses among the total number of designed questions. Local Interpretable Model-Agnostic Explanations (LIME) was used to explain model outputs and to enhance physicians' trust in them within the medical context.
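The grading and effective-rate scheme described above can be sketched in a few lines; the grade distribution below is illustrative only, not the study's data.

```python
# Letter grades map to scores as described: A=3, B=2, C=1, D=0.
GRADE_SCORES = {"A": 3, "B": 2, "C": 1, "D": 0}

def summarize(grades):
    """Return (mean score, effective rate) for a list of letter grades.

    The effective rate is the share of A and B responses among all
    designed questions.
    """
    scores = [GRADE_SCORES[g] for g in grades]
    mean_score = sum(scores) / len(scores)
    effective_rate = sum(g in ("A", "B") for g in grades) / len(grades)
    return mean_score, effective_rate

# Hypothetical distribution over 100 questions (illustrative only).
grades = ["A"] * 70 + ["B"] * 24 + ["C"] * 5 + ["D"] * 1
mean_score, effective_rate = summarize(grades)
```

With this made-up distribution, the mean score is 2.63 and the effective rate 94%, mirroring the reporting format used in the results.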
Nine models were included in this study, and a set of 100 standardized questions covering general information, screening, diagnosis, and treatment was designed based on international and national guidelines. Seven models (ChatGPT-4.0 Turbo, Claude 2, Gemini Pro, Mistral-7B-v0.2, Starling-LM-7B alpha, HuatuoGPT, and BioMedLM 2.7B) provided stable responses. Among all the models included, ChatGPT-4.0 Turbo ranked first, with a mean score of 2.67 (95% CI 2.54-2.80; effective rate 94.00%) with a prompt and 2.52 (95% CI 2.37-2.67; effective rate 87.00%) without a prompt, outperforming the other 8 models (P<.001). Regardless of prompts, QiZhenGPT consistently ranked among the lowest-performing models, with P<.01 in comparisons against all models except BioMedLM. Interpretability analysis showed that prompts improved alignment with human annotations for proprietary models (median intersection over union 0.43), whereas medical-specialized models showed limited improvement.
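The intersection-over-union metric used in the interpretability analysis compares the tokens a model's explanation highlights with those annotated by humans. A minimal sketch, assuming token-set comparison (the exact tokenization is not specified in the abstract):

```python
def token_iou(model_tokens, human_tokens):
    """Intersection over union between two collections of highlighted tokens."""
    a, b = set(model_tokens), set(human_tokens)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical example: model and human annotations share 2 of 4 tokens.
iou = token_iou({"HPV", "screening", "cytology"}, {"HPV", "cytology", "biopsy"})
```

A median IoU of 0.43, as reported for prompted proprietary models, would mean that just under half of the combined highlighted tokens are shared between model and human annotations.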
Proprietary LLMs, particularly ChatGPT-4.0 Turbo and Claude 2, show promise for clinical decision-making involving logical analysis, and the use of prompts enhanced the accuracy of some models in cervical cancer management to varying degrees. Medical-specialized models, such as HuatuoGPT and BioMedLM, did not perform as well as expected in this study, whereas proprietary models, particularly when augmented with prompts, demonstrated notable accuracy and interpretability in medical tasks such as cervical cancer management. Nonetheless, further research is needed to explore the practical application of LLMs in medical practice.