Evaluation of the quality and quantity of artificial intelligence-generated responses about anesthesia and surgery: using ChatGPT 3.5 and 4.0.

Author Information

Choi Jisun, Oh Ah Ran, Park Jungchan, Kang Ryung A, Yoo Seung Yeon, Lee Dong Jae, Yang Kwangmo

Affiliations

Department of Anesthesiology and Pain Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea.

Center for Health Promotion, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea.

Publication Information

Front Med (Lausanne). 2024 Jul 11;11:1400153. doi: 10.3389/fmed.2024.1400153. eCollection 2024.

Abstract

INTRODUCTION

The large-scale artificial intelligence (AI) language model chatbot, Chat Generative Pre-Trained Transformer (ChatGPT), is renowned for its ability to provide data quickly and efficiently. This study aimed to assess the medical responses of ChatGPT regarding anesthetic procedures.

METHODS

Two anesthesiologist authors selected 30 questions representing inquiries patients might have about surgery and anesthesia. These questions were entered, in English, into two versions of ChatGPT. A total of 31 anesthesiologists then evaluated each response for quality, quantity, and overall assessment, using 5-point Likert scales. Descriptive statistics summarized the scores, and a paired-sample t-test compared ChatGPT 3.5 and 4.0.
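For readers unfamiliar with the analysis, the sketch below illustrates a paired-sample t-test of this kind using SciPy. The data are simulated placeholders, not the study's ratings, and the 1-5 coding of quality scores is an assumption based on the scales described above.

    # Minimal sketch of the paper's comparison, using simulated placeholder
    # data (NOT the study's ratings). Assumes quality is coded 1-5.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Hypothetical per-question mean quality ratings for the 30 questions,
    # averaged over the 31 raters, for each ChatGPT version.
    quality_35 = rng.normal(3.40, 0.5, size=30)  # ChatGPT 3.5
    quality_40 = rng.normal(3.73, 0.5, size=30)  # ChatGPT 4.0

    # Descriptive statistics, as in the paper's summary.
    print(f"mean 3.5 = {quality_35.mean():.2f}, mean 4.0 = {quality_40.mean():.2f}")

    # Paired-sample t-test: both versions answered the same 30 questions,
    # so the scores are paired by question.
    result = stats.ttest_rel(quality_35, quality_40)
    print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")

The pairing matters here: because each question is answered by both versions, ttest_rel tests the per-question score differences rather than treating the two sets of ratings as independent samples.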

RESULTS

Regarding quality, "appropriate" was the most common rating for both ChatGPT 3.5 and 4.0 (40% and 48%, respectively). For quantity, responses were deemed "insufficient" in 59% of cases for 3.5 and "adequate" in 69% for 4.0. In the overall assessment, 3 points was the most common score for 3.5 (36%), while 4 points predominated for 4.0 (42%). Mean quality scores were 3.40 and 3.73, and mean quantity scores were −0.31 (between insufficient and adequate) and 0.03 (between adequate and excessive), respectively. The mean overall score was 3.21 for 3.5 and 3.67 for 4.0. Responses from 4.0 showed statistically significant improvement in three areas.

CONCLUSION

ChatGPT generated responses that mostly ranged from appropriate to slightly insufficient, providing an overall average amount of information. Version 4.0 outperformed 3.5, and further research is warranted to investigate the potential utility of AI chatbots in assisting patients with medical information.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5363/11269144/a16773126f72/fmed-11-1400153-g001.jpg
