Gumilar Khanisyah Erza, Indraprasta Birama R, Faridzi Ach Salman, Wibowo Bagus M, Herlambang Aditya, Rahestyningtyas Eccita, Irawan Budi, Tambunan Zulkarnain, Bustomi Ahmad Fadhli, Brahmantara Bagus Ngurah, Yu Zih-Ying, Hsu Yu-Cheng, Pramuditya Herlangga, Putra Very Great E, Nugroho Hari, Mulawardhana Pungky, Tjokroprawiro Brahmana A, Hedianto Tri, Ibrahim Ibrahim H, Huang Jingshan, Li Dongqi, Lu Chien-Hsing, Yang Jer-Yen, Liao Li-Na, Tan Ming
Graduate Institute of Biomedical Science, China Medical University, Taichung, Taiwan.
Department of Obstetrics and Gynecology, Hospital of Universitas Airlangga - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia.
Comput Struct Biotechnol J. 2024 Oct 31;23:4019-4026. doi: 10.1016/j.csbj.2024.10.050. eCollection 2024 Dec.
This study investigated the ability of Large Language Models (LLMs) to provide accurate and consistent answers by focusing on their performance in complex gynecologic cancer cases.
LLMs are advancing rapidly and require a thorough evaluation to ensure that they can be safely and effectively used in clinical decision-making. Such evaluations are essential for confirming LLM reliability and accuracy in supporting medical professionals in casework.
We assessed three prominent LLMs: ChatGPT-4 (CG-4), Gemini Advanced (GemAdv), and Copilot, comparing their accuracy, consistency, and overall performance. Fifteen clinical vignettes of varying difficulty and five open-ended questions based on real patient cases were used. The responses were coded, randomized, and evaluated blindly by six expert gynecologic oncologists using a 5-point Likert scale for relevance, clarity, depth, focus, and coherence.
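The coding, randomization, and blinded scoring workflow described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual pipeline: all function names, codes, and data are hypothetical.

```python
import random
import statistics

# Likert criteria named in the study design.
CRITERIA = ["relevance", "clarity", "depth", "focus", "coherence"]

def blind_and_shuffle(responses, seed=42):
    """Assign anonymous codes to (model, text) pairs and shuffle their order,
    so reviewers cannot infer which model produced which answer."""
    rng = random.Random(seed)
    coded = [{"code": f"R{i:03d}", "model": model, "text": text}
             for i, (model, text) in enumerate(responses)]
    rng.shuffle(coded)
    # Reviewers see only code and text; the key maps codes back to models.
    key = {item["code"]: item["model"] for item in coded}
    blinded = [{"code": item["code"], "text": item["text"]} for item in coded]
    return blinded, key

def mean_scores(ratings, key):
    """Average 5-point Likert ratings per model and per criterion.

    ratings: list of dicts like {"code": "R001", "relevance": 4, ...}
    """
    by_model = {}
    for r in ratings:
        model = key[r["code"]]
        for c in CRITERIA:
            by_model.setdefault(model, {}).setdefault(c, []).append(r[c])
    return {m: {c: statistics.mean(v) for c, v in crit.items()}
            for m, crit in by_model.items()}

# Toy run with two made-up model responses and one reviewer's scores:
blinded, key = blind_and_shuffle([("ModelA", "answer 1"), ("ModelB", "answer 2")])
ratings = [dict(code=b["code"], relevance=4, clarity=5, depth=3,
                focus=4, coherence=4) for b in blinded]
print(mean_scores(ratings, key))
```

Unblinding happens only after scoring, via the code-to-model key, which mirrors the blinded design the abstract reports.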
GemAdv demonstrated superior accuracy (81.87%) compared with both CG-4 (61.60%) and Copilot (70.67%) across all difficulty levels, and it answered correctly more consistently, exceeding 60% accuracy on every day of the testing period. Although CG-4 showed a slight advantage in adhering to the National Comprehensive Cancer Network (NCCN) treatment guidelines, GemAdv excelled in the depth and focus of its answers, which are crucial aspects of clinical decision-making.
LLMs, especially GemAdv, show potential in supporting clinical practice by providing accurate, consistent, and relevant information for gynecologic cancer. However, further refinement is needed for more complex scenarios. This study highlights the promise of LLMs in gynecologic oncology, emphasizing the need for ongoing development and rigorous evaluation to maximize their clinical utility and reliability.