Establishing a novel score system and using it to assess and compare the quality of ChatGPT-4 consultation with physician consultation for obstetrics and gynecology: A pilot study.

Authors

Lan Lan, Yang Ling, Li Jinyan, Hou Jia, Yan Yunsheng, Zhang Yaozong

Affiliations

Department of Intensive Care Medicine, Women and Children's Hospital of Chongqing Medical University, Chongqing, China.

Emergency Department, Women and Children's Hospital of Chongqing Medical University, Chongqing, China.

Publication information

Int J Gynaecol Obstet. 2025 Mar;168(3):1251-1257. doi: 10.1002/ijgo.15934. Epub 2024 Sep 28.

DOI:10.1002/ijgo.15934

PMID:39340470

Abstract

OBJECTIVES

In this study, we aimed to establish a quantitative scoring system for evaluating consultation quality. We then used this scoring system to assess the quality of ChatGPT-4 consultations and compared them with physician consultations on the same obstetric and gynecologic clinical cases.

METHODS

This study was conducted at the Women and Children's Hospital of Chongqing Medical University, a tertiary-care hospital with approximately 16 000-20 000 deliveries and 8500-12 000 gynecologic surgeries per year. Detailed data from obstetric and gynecologic medical records were analyzed separately by ChatGPT-4 and by physicians, and each then generated consultation opinions. All consultation opinions were graded by eight junior doctors using the novel scoring system; the correlation and agreement between the two types of consultation opinions were then evaluated, and their scores were compared.

RESULTS

A total of 100 obstetric and 100 gynecologic medical records were randomly selected. Pearson correlation analysis suggested no correlation or only a weak correlation between the consultations from ChatGPT-4 and those from physicians. Bland-Altman plots showed unacceptable agreement between the two types of consultation opinions. Paired t tests showed that the scores of physician consultations were significantly higher than those of ChatGPT-4 consultations in both obstetric and gynecologic patients.
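
For readers unfamiliar with the three analyses named above, the following is a minimal sketch of how paired consultation scores could be compared. It is not the authors' code: the abstract does not report the software used, the score scale, or the agreement threshold. The function names and the sample scores are hypothetical, and the 1.96 SD limits of agreement are the conventional Bland-Altman choice, assumed here.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient between two paired score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bland_altman_limits(x, y):
    """Return (bias, lower limit, upper limit) of agreement.

    Assumes the conventional limits of bias +/- 1.96 * SD of the differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def paired_t_statistic(x, y):
    """Return (t statistic, degrees of freedom) for a paired t test."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical scores for six cases, invented purely for illustration.
physician_scores = [8.0, 7.5, 9.0, 8.5, 7.0, 9.5]
chatgpt_scores = [6.0, 7.0, 6.5, 5.5, 7.5, 6.0]

print("Pearson r:", pearson_r(physician_scores, chatgpt_scores))
print("Bland-Altman:", bland_altman_limits(physician_scores, chatgpt_scores))
print("Paired t:", paired_t_statistic(physician_scores, chatgpt_scores))
```

In this sketch a positive bias or t statistic means the first list (physician scores) is higher on average. Turning the t statistic into a p value requires the t distribution with the returned degrees of freedom, for example from a statistics package.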

CONCLUSION

At present, ChatGPT-4 may not be a substitute for physicians in consultations for obstetric and gynecologic patients. Therefore, it is crucial to pay careful attention and conduct ongoing evaluations to ensure the quality of consultation opinions generated by ChatGPT-4.
