

Assessing the performance of large language models (LLMs) in answering medical questions regarding breast cancer in the Chinese context.

Author information

Piao Ying, Chen Hongtao, Wu Shihai, Li Xianming, Li Zihuang, Yang Dong

Affiliations

Department of Radiation Oncology, Shenzhen People's Hospital (The Second Clinical Medical College, Jinan University; The First Affiliated Hospital, Southern University of Science and Technology), Shenzhen, Guangdong, People's Republic of China.

Publication information

Digit Health. 2024 Oct 7;10:20552076241284771. doi: 10.1177/20552076241284771. eCollection 2024 Jan-Dec.

Abstract

PURPOSE

Large language models (LLMs), deep learning models designed to comprehend and generate meaningful responses, have gained public attention in recent years. The purpose of this study is to evaluate and compare the performance of LLMs in answering questions regarding breast cancer in the Chinese context.

MATERIAL AND METHODS

ChatGPT, ERNIE Bot, and ChatGLM were chosen to answer 60 questions related to breast cancer posed by two oncologists. Responses were scored as comprehensive, correct but inadequate, mixed with correct and incorrect data, completely incorrect, or unanswered. The accuracy, length, and readability among answers from different models were evaluated using statistical software.

RESULTS

ChatGPT answered 60 questions, with 40 (66.7%) comprehensive answers and six (10.0%) correct but inadequate answers. ERNIE Bot answered 60 questions, with 34 (56.7%) comprehensive answers and seven (11.7%) correct but inadequate answers. ChatGLM generated 60 answers, with 35 (58.3%) comprehensive answers and six (10.0%) correct but inadequate answers. The differences in the chosen accuracy metrics among the three LLMs did not reach statistical significance, but only ChatGPT demonstrated a sense of human compassion. All three models were least accurate on questions about breast cancer treatment, with an average accuracy of 44.4%. ERNIE Bot's responses were significantly shorter than those of ChatGPT and ChatGLM (P < .001 for both). The readability scores of the three models did not differ significantly.
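The reported percentages can be reproduced from the counts, and the "no significant difference" statement can be illustrated with a simple test. The sketch below is only an illustration: the abstract does not say which statistical test or which accuracy metric the authors used, so the Pearson chi-square test on the proportion of comprehensive answers is an assumption, and the function names are hypothetical.

```python
import math

# Counts taken from the abstract: 60 questions per model.
TOTAL = 60
comprehensive = {"ChatGPT": 40, "ERNIE Bot": 34, "ChatGLM": 35}
correct_but_inadequate = {"ChatGPT": 6, "ERNIE Bot": 7, "ChatGLM": 6}


def pct(count, total=TOTAL):
    """Percentage rounded to one decimal place, as reported in the abstract."""
    return round(100 * count / total, 1)


def chi_square_k_by_2(successes, total):
    """Pearson chi-square statistic for a k x 2 table (success vs. other).

    Assumes every group has the same number of trials (`total`).
    """
    k = len(successes)
    grand = k * total
    col_success = sum(successes)
    col_other = grand - col_success
    chi2 = 0.0
    for s in successes:
        exp_success = total * col_success / grand
        exp_other = total * col_other / grand
        chi2 += (s - exp_success) ** 2 / exp_success
        chi2 += ((total - s) - exp_other) ** 2 / exp_other
    return chi2


def p_value_df2(chi2):
    """Upper-tail p-value of a chi-square variable with exactly 2 degrees
    of freedom, where the survival function reduces to exp(-x / 2)."""
    return math.exp(-chi2 / 2)


for model in comprehensive:
    print(model, pct(comprehensive[model]), pct(correct_but_inadequate[model]))

# Three models give a 3 x 2 table, hence 2 degrees of freedom.
chi2 = chi_square_k_by_2(list(comprehensive.values()), TOTAL)
p = p_value_df2(chi2)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```

Under these assumptions the p-value is well above .05, which agrees with the abstract's statement, although the authors' actual analysis may have differed.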

CONCLUSIONS

In the Chinese context, ChatGPT, ERNIE Bot, and ChatGLM currently show similar capabilities in answering breast cancer-related questions. These three LLMs may serve as adjunct informational tools for breast cancer patients in the Chinese context, offering guidance for general inquiries. However, for highly specialized issues, particularly in the realm of breast cancer treatment, LLMs cannot deliver reliable performance. They should be used under the supervision of healthcare professionals.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6524/11462564/7fca26682113/10.1177_20552076241284771-fig1.jpg
