Comparative analysis of the performance of the large language models ChatGPT-3.5, ChatGPT-4 and Open AI-o1 in the field of Programmed Cell Death in myeloma.

Author Information

Kun Wu, Bo Tao, Yuntao Li, Shenju Cheng, Yanhong Li, Shan Luo, Yun Zeng, Bo Nie, Mingxia Shi

Affiliations

Yunnan Key Laboratory of Laboratory Medicine, Yunnan Province Clinical Research Center for Laboratory Medicine, Department of Clinical Laboratory, The First Affiliated Hospital of Kunming Medical University, Kunming, 650032, China.

Information Center, The First Affiliated Hospital of Kunming Medical University, Kunming, 650032, China.

Publication Information

Discov Oncol. 2025 May 23;16(1):870. doi: 10.1007/s12672-025-02648-3.

Abstract

OBJECTIVE

This study aimed to compare the performance of three large language models (LLMs)-ChatGPT-3.5, ChatGPT-4, and Open AI-o1-in addressing clinical questions related to Programmed Cell Death in multiple myeloma. By evaluating each model's accuracy, comprehensiveness, and self-correcting capabilities, the investigation sought to determine the most effective tool for supporting clinical decision-making in this specialized oncological context.

METHODS

A comprehensive set of forty clinical questions was curated from recent high-impact oncology journals, International Myeloma Working Group (IMWG) guidelines, and reputable medical databases, covering various aspects of Programmed Cell Death in multiple myeloma. These questions were refined and validated by a panel of four hematologist-oncologists with expertise in the field. Each question was individually posed to ChatGPT-3.5, ChatGPT-4, and Open AI-o1 in controlled sessions. Responses were anonymized and evaluated by the same panel using a five-point Likert scale assessing accuracy, depth, and completeness. Responses were categorized as "excellent," "satisfactory," or "insufficient" based on cumulative scores. Additionally, the models' self-correcting abilities were assessed by providing feedback on initially insufficient responses and re-evaluating the revised answers. Interrater reliability was measured using Cohen's Kappa coefficients.
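
The abstract does not specify how the cumulative Likert scores were mapped to categories or how interrater reliability was computed in practice. A minimal sketch of one plausible workflow is shown below; the category cutoffs and rater data are illustrative assumptions, and Cohen's Kappa is computed with scikit-learn's `cohen_kappa_score`.

```python
# Hypothetical sketch of the scoring workflow described in METHODS.
# Thresholds and rater data are illustrative, not taken from the paper.
from sklearn.metrics import cohen_kappa_score

def cumulative_score(accuracy: int, depth: int, completeness: int) -> int:
    """Sum of the three 1-5 Likert ratings (range 3-15)."""
    return accuracy + depth + completeness

def categorize(score: int) -> str:
    """Map a cumulative score to a category (cutoffs are assumed, not reported)."""
    if score >= 13:
        return "excellent"
    if score >= 9:
        return "satisfactory"
    return "insufficient"

# Illustrative categorical ratings from two of the four panelists for ten responses.
rater_a = ["excellent", "satisfactory", "excellent", "insufficient", "excellent",
           "satisfactory", "excellent", "satisfactory", "insufficient", "excellent"]
rater_b = ["excellent", "satisfactory", "satisfactory", "insufficient", "excellent",
           "satisfactory", "excellent", "excellent", "insufficient", "excellent"]

# Interrater reliability between the two panelists on the categorical labels.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```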

RESULTS

Open AI-o1 consistently generated the most extensive and detailed responses, achieving significantly higher total scores across all domains than ChatGPT-3.5 and ChatGPT-4. It had the lowest proportion of "insufficient" responses and the highest percentage of "excellent" answers, excelling particularly on guideline-based questions. Open AI-o1 also exhibited superior self-correcting capacity, effectively improving its responses upon receiving feedback. Its responses yielded the highest Cohen's Kappa coefficient among the three models, indicating the greatest consistency among the clinical experts' evaluations. User satisfaction surveys revealed that 85% of hematologist-oncologists rated Open AI-o1 as "highly satisfactory," compared with 60% for ChatGPT-4 and 45% for ChatGPT-3.5.

CONCLUSION

Open AI-o1 outperforms ChatGPT-3.5 and ChatGPT-4 in accuracy, depth, and reliability when addressing complex clinical questions related to Programmed Cell Death in multiple myeloma. Its advanced "thinking" ability facilitates comprehensive and evidence-based responses, making it a more dependable tool for clinical decision support. These findings suggest that Open AI-o1 holds significant potential for enhancing clinical practices in specialized oncological fields, though ongoing validation and integration with human expertise remain essential.
