Lan Lan, Yang Ling, Li Jinyan, Hou Jia, Yan Yunsheng, Zhang Yaozong
Department of Intensive Care Medicine, Women and Children's Hospital of Chongqing Medical University, Chongqing, China.
Emergency Department, Women and Children's Hospital of Chongqing Medical University, Chongqing, China.
Int J Gynaecol Obstet. 2025 Mar;168(3):1251-1257. doi: 10.1002/ijgo.15934. Epub 2024 Sep 28.
In the current study, we aimed to establish a quantified scoring system for evaluating consultation quality. Using this scoring system, we then assessed the quality of ChatGPT-4 consultations and compared them with physician consultations given the same clinical cases from obstetrics and gynecology.
This study was conducted at the Women and Children's Hospital of Chongqing Medical University, a tertiary-care hospital with approximately 16 000-20 000 deliveries and 8500-12 000 gynecologic surgeries per year. The detailed data from obstetric and gynecologic medical records were analyzed by ChatGPT-4 and by physicians, each of which generated consultation opinions. All consultation opinions were graded by eight junior doctors using the novel scoring system; the correlation, agreement, and differences between the two types of consultation opinions were then evaluated.
A total of 100 obstetric and 100 gynecologic medical records were randomly selected. Pearson correlation analysis suggested no correlation, or only a weak correlation, between consultations from ChatGPT-4 and those from physicians. The Bland-Altman plot showed unacceptable agreement between the two types of consultation opinions. Paired t tests showed that physician consultations scored significantly higher than those generated by ChatGPT-4 in both obstetric and gynecologic patients.
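The analysis pipeline described above (Pearson correlation, Bland-Altman limits of agreement, and a paired t test on matched consultation scores) can be sketched as follows. This is an illustrative sketch only: the score arrays are synthetic placeholders, not the study's data, and the distribution parameters are assumptions for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100  # records per specialty, matching the study's sample size

# Synthetic consultation scores (hypothetical means/SDs, not study data)
physician = rng.normal(85, 5, n)
chatgpt = rng.normal(78, 6, n)

# Pearson correlation between the paired consultation scores
r, p_corr = stats.pearsonr(physician, chatgpt)

# Bland-Altman statistics: mean difference and 95% limits of agreement
diff = physician - chatgpt
mean_diff = diff.mean()
sd_diff = diff.std(ddof=1)
loa = (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)

# Paired t test comparing physician vs ChatGPT-4 scores on the same cases
t_stat, p_ttest = stats.ttest_rel(physician, chatgpt)

print(f"Pearson r = {r:.2f} (p = {p_corr:.3g})")
print(f"Bland-Altman mean diff = {mean_diff:.1f}, LoA = ({loa[0]:.1f}, {loa[1]:.1f})")
print(f"Paired t = {t_stat:.2f} (p = {p_ttest:.3g})")
```

In practice, agreement would be judged by whether the limits of agreement fall within a clinically acceptable range defined in advance, which is how a Bland-Altman comparison is typically interpreted.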
At present, ChatGPT-4 may not be a substitute for physicians in consultations for obstetric and gynecologic patients. Therefore, careful attention and ongoing evaluation are crucial to ensuring the quality of consultation opinions generated by ChatGPT-4.