Evaluating Large Language Model (LLM) Performance on Established Breast Classification Systems

Authors

Haider Syed Ali, Pressman Sophia M, Borna Sahar, Gomez-Cabello Cesar A, Sehgal Ajai, Leibovich Bradley C, Forte Antonio Jorge

Affiliations

Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA.

Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA.

Publication information

Diagnostics (Basel). 2024 Jul 11;14(14):1491. doi: 10.3390/diagnostics14141491.

DOI:10.3390/diagnostics14141491

PMID:39061628

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11275570/

Abstract

Medical researchers are increasingly utilizing advanced LLMs like ChatGPT-4 and Gemini to enhance diagnostic processes in the medical field. This research focuses on their ability to comprehend and apply complex medical classification systems for breast conditions, which can significantly aid plastic surgeons in making informed decisions for diagnosis and treatment, ultimately leading to improved patient outcomes. Fifty clinical scenarios were created to evaluate the classification accuracy of each LLM across five established breast-related classification systems. Scores from 0 to 2 were assigned to LLM responses to denote incorrect, partially correct, or completely correct classifications. Descriptive statistics were employed to compare the performances of ChatGPT-4 and Gemini. Gemini exhibited superior overall performance, achieving 98% accuracy compared to ChatGPT-4's 71%. While both models performed well in the Baker classification for capsular contracture and UTSW classification for gynecomastia, Gemini consistently outperformed ChatGPT-4 in other systems, such as the Fischer Grade Classification for gender-affirming mastectomy, Kajava Classification for ectopic breast tissue, and Regnault Classification for breast ptosis. With further development, integrating LLMs into plastic surgery practice will likely enhance diagnostic support and decision making.
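The abstract does not state how the 0 to 2 scores were turned into a percentage accuracy. One plausible reading, offered here only as an assumption, is that the summed score is divided by the maximum attainable score (2 points per scenario). The minimal Python sketch below illustrates that reading. The function name `summarize` and the sample scores are hypothetical and are not taken from the study.

```python
from collections import Counter

# Rubric described in the abstract: 0 = incorrect, 1 = partially correct,
# 2 = completely correct.
SCORE_LABELS = {0: "incorrect", 1: "partially correct", 2: "completely correct"}
MAX_SCORE = 2


def summarize(scores):
    """Return simple descriptive statistics for a list of 0-2 rubric scores.

    Assumption: percentage accuracy = summed score / maximum attainable score.
    The abstract does not confirm this formula.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s not in SCORE_LABELS for s in scores):
        raise ValueError("each score must be 0, 1, or 2")
    counts = Counter(scores)
    return {
        "n": len(scores),
        "counts": {label: counts.get(value, 0) for value, label in SCORE_LABELS.items()},
        "percent_of_max": 100.0 * sum(scores) / (MAX_SCORE * len(scores)),
    }


# Made-up scores for illustration only; these are not data from the study.
example_scores = [2, 2, 1, 0, 2]
summary = summarize(example_scores)
print(summary)
```

Under this assumed formula, 50 scenarios give a maximum of 100 points, so a model's summed score would equal its percentage directly.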

