A comprehensive evaluation of ChatGPT consultation quality for augmentation mammoplasty: A comparative analysis between plastic surgeons and laypersons.

Author information

Yun Ji Young, Kim Dong Jin, Lee Nara, Kim Eun Key

Affiliations

Department of Plastic and Reconstructive Surgery, Busan Paik Hospital, Inje University School of Medicine, Busan, Republic of Korea.

Department of Plastic Surgery, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea.

Publication information

Int J Med Inform. 2023 Nov;179:105219. doi: 10.1016/j.ijmedinf.2023.105219. Epub 2023 Sep 20.

Abstract

OBJECTIVES

ChatGPT has gained significant popularity as a source of healthcare information among the general population. Evaluating the quality of chatbot responses is therefore crucial and requires comprehensive, qualitative analysis. This study assesses the answers provided by ChatGPT during hypothetical breast augmentation consultations across various question categories and depths, using validated tools and comparing scores between plastic surgeons and laypersons.

METHODS

A panel of five plastic surgeons and five laypersons evaluated ChatGPT's responses to 25 questions spanning consultation, procedure, recovery, and sentiment categories. The DISCERN and PEMAT tools were used to assess the responses, and emotional context was examined through ten specific questions. Readability was measured with the Flesch Reading Ease score. Qualitative analysis was performed to identify overall strengths and weaknesses.
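For context, the Flesch Reading Ease score is a standard readability formula: FRE = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words), with higher scores indicating easier text. Below is a minimal illustrative sketch in Python; the paper does not specify its tooling, and the vowel-group syllable counter here is a rough assumed heuristic, not a validated method.

import re

def count_syllables(word: str) -> int:
    # Naive heuristic (assumption): count runs of vowels as syllables.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # FRE = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

# Example: scores near 90-100 read as very easy; below 30 as very difficult.
print(round(flesch_reading_ease("ChatGPT answered the question. The reply was clear."), 1))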

RESULTS

Plastic surgeons generally scored the responses lower than laypersons across most domains. Scores varied by question category: the consultation category received lower scores for DISCERN reliability, quality of information, and overall DISCERN score. Plastic surgeons rated the overall quality of the procedure category significantly lower than that of the other categories, and they also gave lower emotion scores in the procedure category than laypersons did. Question depth did not affect the scores.

CONCLUSIONS

Existing health information evaluation tools may not be entirely suitable for comprehensively evaluating the quality of individual responses generated by ChatGPT. Purpose-built evaluation tools are therefore needed to assess the appropriateness and quality of AI consultations.
