Lee Jong Kwon, Choi Sooin, Park Sholhui, Hwang Sang-Hyun, Cho Duck
Department of Laboratory Medicine and Genetics, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Korea.
Department of Laboratory Medicine and Genetics, Soonchunhyang University Bucheon Hospital, Soonchunhyang University College of Medicine, Bucheon, Korea.
Ann Lab Med. 2025 Sep 1;45(5):520-529. doi: 10.3343/alm.2024.0588. Epub 2025 Apr 28.
Background: Large language models (LLMs) have potential for clinical decision support; however, their use in specific tasks, such as determining the RhD blood type for transfusion, remains underexplored. We therefore evaluated the accuracy of six LLMs in addressing RhD blood type-related issues in Korean healthcare.
Methods: Fifteen multiple-choice and true/false questions, based on real-world transfusion scenarios and reviewed by specialists, were developed. The questions were administered twice to six LLMs (Clova X, Gemini 1.0, Gemini 1.5, ChatGPT-3.5, GPT-4.0, and GPT-4o) in both Korean and English. Results were compared against the performance of 22 transfusion medicine experts. For particularly challenging questions, prompt engineering was applied, and the questions were reevaluated.
Results: GPT-4o demonstrated the highest accuracy rate in Korean (0.6), differing significantly from Clova X and Gemini (P<0.05). In English, the results were similar across all models. The transfusion experts achieved a higher accuracy rate (0.8). Among the five questions subjected to prompt engineering, only GPT-4o answered one correctly, whereas the other models answered none. All LLMs changed their responses or did not respond when the same question was repeated.
Conclusions: GPT-4o showed the best overall performance among the models tested and may be beneficial in RhD blood product transfusion decision-making. However, its performance suggests that it may serve best in a supportive role rather than as a primary decision-making tool.
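The two measures reported above, the accuracy rate against an answer key and whether a model's answer changes when the same question is repeated, can be illustrated with a minimal sketch. This is not the authors' scoring code: the function names `accuracy` and `changed_items`, the treatment of a non-response as incorrect, and the five-item toy data are all assumptions made for illustration only.

```python
from typing import List, Optional, Sequence


def accuracy(responses: Sequence[Optional[str]], key: Sequence[str]) -> float:
    """Fraction of questions answered correctly.

    Assumption: a missing response (None) is scored as incorrect.
    """
    if len(responses) != len(key):
        raise ValueError("responses and key must have the same length")
    correct = sum(r == k for r, k in zip(responses, key))
    return correct / len(key)


def changed_items(run1: Sequence[Optional[str]],
                  run2: Sequence[Optional[str]]) -> List[int]:
    """1-based question numbers whose answer differed between two runs."""
    return [i for i, (a, b) in enumerate(zip(run1, run2), start=1) if a != b]


# Hypothetical toy data (5 questions), not taken from the study.
key = ["A", "True", "C", "False", "B"]
run1 = ["A", "True", "B", "False", None]   # None = model did not respond
run2 = ["A", "False", "B", "False", "B"]

print(accuracy(run1, key))          # 0.6
print(changed_items(run1, run2))    # [2, 5]
```

In this toy case both runs score 0.6 even though two answers changed between them, which shows why repeat consistency is worth reporting separately from the accuracy rate.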