Rubinstein Samuel, Mohsin Aleenah, Banerjee Rahul, Ma Will, Mishra Sanjay, Kwok Mary, Yang Peter, Warner Jeremy L, Cowan Andrew J
Division of Hematology, Department of Medicine, University of North Carolina, Chapel Hill, NC, United States.
Brown University Health Cancer Institute, Rhode Island Hospital, Providence, RI, United States.
Front Digit Health. 2025 Apr 29;7:1569554. doi: 10.3389/fdgth.2025.1569554. eCollection 2025.
Concise synopses of clinical evidence support treatment decision-making but are time-consuming to curate. Large language models (LLMs) offer a potential solution, but they may provide inaccurate information. We objectively assessed the abilities of four commercially available LLMs to generate synopses for six treatment regimens in multiple myeloma and amyloid light chain (AL) amyloidosis.
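For illustration only, the sketch below shows how such a synopsis request might be issued programmatically, here via the OpenAI Python client; the prompt wording, regimen, and model name are assumptions for the example, not the study's actual protocol.

```python
# Hypothetical prompting sketch; the prompt text, regimen, and model
# choice are illustrative assumptions, not the study's protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

regimen = "daratumumab, lenalidomide, and dexamethasone (DRd)"
prompt = (
    "Write a concise, evidence-based clinical synopsis of the "
    f"{regimen} regimen for newly diagnosed multiple myeloma, "
    "citing the key supporting trials."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```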
We compared the performance of four LLMs: Claude 3.5, ChatGPT 4.0, Gemini 1.0, and Llama-3.1. Each LLM was prompted to write synopses for six regimens. Two hematologists independently assessed accuracy, completeness, relevance, clarity, coherence, and hallucinations using Likert scales. Mean scores with 95% confidence intervals (CI) were calculated across all domains, and inter-rater reliability was evaluated using Cohen's quadratic weighted kappa.
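As a minimal sketch of this analysis (not the authors' code), the snippet below computes a per-domain mean Likert score with a 95% confidence interval and Cohen's quadratic weighted kappa between two raters; the rating arrays are invented for illustration.

```python
# Hypothetical analysis sketch; the ratings below are invented for
# illustration and are not the study's data.
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

# Likert ratings (1-5) for one domain, one per synopsis, from two raters.
rater_a = np.array([4, 4, 3, 5, 4, 4])
rater_b = np.array([4, 3, 3, 5, 4, 5])

# Mean score with a 95% confidence interval (t-distribution).
scores = (rater_a + rater_b) / 2  # average the two raters per synopsis
mean = scores.mean()
ci_low, ci_high = stats.t.interval(
    0.95, df=len(scores) - 1, loc=mean, scale=stats.sem(scores)
)
print(f"mean {mean:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")

# Inter-rater reliability via Cohen's quadratic weighted kappa.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"quadratic weighted kappa: {kappa:.2f}")
```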
Claude demonstrated the highest performance in all domains, outperforming the other LLMs in accuracy (mean Likert score 3.92, 95% CI 3.54-4.29, vs. ChatGPT 3.25 [2.76-3.74], Gemini 3.17 [2.54-3.80], and Llama 1.92 [1.41-2.43]); completeness (4.00 [3.66-4.34] vs. ChatGPT 2.58 [2.02-3.15], Gemini 2.58 [2.02-3.15], and Llama 1.67 [1.39-1.95]); and extent of hallucinations (4.00 [4.00-4.00] vs. ChatGPT 2.75 [2.06-3.44], Gemini 3.25 [2.65-3.85], and Llama 1.92 [1.26-2.57]). Llama performed considerably worse across all studied domains; ChatGPT and Gemini had intermediate performance. Notably, none of the LLMs achieved perfect accuracy, completeness, or relevance.
Although Claude performed at a consistently higher level than the other LLMs, all tested LLMs required careful editing from a domain expert to become usable. More time will be needed to determine the suitability of LLMs to independently generate clinical synopses.