Lois Alex, Yates Robert, Ivy Megan, Inaba Colette, Tatum Roger, Cetrulo Lawrence, Parr Zoe, Chen Judy, Khandelwal Saurabh, Wright Andrew
Department of Surgery, University of Chicago, 5841 S. Maryland, MC 5095, Chicago, IL, 60637, USA.
Department of Surgery, University of Washington Medical Center, University of Washington, 1959 NE Pacific St, Box 356410, Seattle, WA, 98195, USA.
Surg Endosc. 2024 Dec;38(12):7409-7415. doi: 10.1007/s00464-024-11221-y. Epub 2024 Oct 23.
Natural language processing programs (NLPs) such as ChatGPT are novel sources of online healthcare information that are readily accessible and integrated into internet search tools. The accuracy of NLP-generated responses to health information questions is unknown.
We queried four NLPs (ChatGPT 3.5 and 4, Bard, and Claude 2.0) for responses to simulated patient questions about inguinal hernias and their management. Responses were graded on a Likert scale (1 poor to 5 excellent) for relevance, completeness, and accuracy. Responses were compiled and scored collectively for readability using the Flesch-Kincaid score and for educational quality using the DISCERN instrument, a validated tool for evaluating patient information materials. Responses were also compared to two gold-standard educational materials provided by the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) and the American College of Surgeons (ACS). Evaluations were performed by six hernia surgeons.
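For context, the Flesch-Kincaid metrics are simple functions of average sentence length and average syllables per word. The following minimal Python sketch illustrates the calculation (this is not the authors' scoring code; the regex-based syllable counter is a rough heuristic, whereas published readability tools typically use dictionary-based syllable counts):

    import re

    def count_syllables(word):
        # Rough heuristic: count vowel groups, discounting a trailing silent 'e'.
        word = word.lower()
        groups = re.findall(r"[aeiouy]+", word)
        n = len(groups)
        if word.endswith("e") and n > 1:
            n -= 1
        return max(n, 1)

    def flesch_kincaid(text):
        # Words per sentence and syllables per word drive both metrics.
        sentences = max(len(re.findall(r"[.!?]+", text)), 1)
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        wps = len(words) / sentences
        spw = syllables / len(words)
        reading_ease = 206.835 - 1.015 * wps - 84.6 * spw  # higher = easier
        grade_level = 0.39 * wps + 11.8 * spw - 15.59      # approximate US school grade
        return reading_ease, grade_level

Higher reading-ease scores indicate easier text, and grade level approximates the US school grade needed to understand it; the American Medical Association generally recommends that patient materials be written at about a sixth-grade reading level.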
The average NLP response scores for relevance, completeness, and accuracy were 4.76 (95% CI 4.70-4.80), 4.11 (95% CI 4.02-4.20), and 4.14 (95% CI 4.03-4.24), respectively. ChatGPT 4 received higher accuracy scores (mean 4.43 [95% CI 4.37-4.50]) than Bard (mean 4.06 [95% CI 3.88-4.26]) and Claude 2.0 (mean 3.85 [95% CI 3.63-4.08]). The ACS document received the best scores for reading ease (55.2) and grade level (9.2); however, none of the documents achieved the readability thresholds recommended by the American Medical Association. The ACS document also received the highest DISCERN score, 63.5 (95% CI 57.0-70.1), significantly higher than ChatGPT 4 (50.8 [95% CI 46.2-55.4]) and Claude 2.0 (48.0 [95% CI 41.6-54.4]).
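The abstract does not state how the intervals above were computed; with only six raters, a common construction is a two-sided t-based confidence interval over the pooled ratings. A minimal sketch, assuming NumPy and SciPy are available (the ratings shown are hypothetical, not study data):

    import numpy as np
    from scipy import stats

    def mean_ci(ratings, confidence=0.95):
        # Mean and two-sided t-based confidence interval for a set of Likert ratings.
        ratings = np.asarray(ratings, dtype=float)
        mean = ratings.mean()
        sem = stats.sem(ratings)  # standard error of the mean (ddof=1)
        lo, hi = stats.t.interval(confidence, df=ratings.size - 1, loc=mean, scale=sem)
        return mean, (lo, hi)

    # Hypothetical example: six surgeons' accuracy ratings for one response.
    print(mean_ci([5, 4, 5, 4, 4, 5]))

With samples this small, the t distribution rather than a normal approximation is the appropriate basis for the interval.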
The evaluated NLPs provided relevant and reasonably accurate responses to questions about inguinal hernias. Compiled NLP responses received relatively low readability and DISCERN scores, although results may improve as NLPs evolve or with adjustments in question wording. As surgical patients increasingly use NLPs for healthcare information, surgeons should be aware of the benefits and limitations of NLPs as patient education tools.