Large Language Models for Pediatric Differential Diagnoses in Rural Health Care: Multicenter Retrospective Cohort Study Comparing GPT-3 With Pediatrician Performance.

Author Information

Mansoor Masab, Ibrahim Andrew F, Grindem David, Baig Asad

Affiliations

Edward Via College of Osteopathic Medicine, 4408 Bon Aire Dr, Monroe, LA, 71203, United States, 1 5045213500.

Texas Tech University Health Sciences Center School of Medicine, Lubbock, TX, United States.

Publication Information

JMIRx Med. 2025 Mar 19;6:e65263. doi: 10.2196/65263.

Abstract

BACKGROUND

Rural health care providers face unique challenges such as limited specialist access and high patient volumes, making accurate diagnostic support tools essential. Large language models like GPT-3 have demonstrated potential in clinical decision support but remain understudied in pediatric differential diagnosis.

OBJECTIVE

This study aims to evaluate the diagnostic accuracy and reliability of a fine-tuned GPT-3 model compared to board-certified pediatricians in rural health care settings.

METHODS

This multicenter retrospective cohort study analyzed 500 pediatric encounters (ages 0-18 years; n=261, 52.2% female) from rural health care organizations in Central Louisiana between January 2020 and December 2021. The GPT-3 model (DaVinci version) was fine-tuned using the OpenAI application programming interface and trained on 350 encounters, with 150 reserved for testing. Five board-certified pediatricians (mean experience: 12, SD 5.8 years) provided reference standard diagnoses. Model performance was assessed using accuracy, sensitivity, specificity, and subgroup analyses.
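The abstract states only that the DaVinci model was fine-tuned through the OpenAI application programming interface on 350 encounters, with 150 held out for testing; the authors' actual code and prompt format are not given here. As a rough illustration, the sketch below shows how such a fine-tune could be set up with the legacy OpenAI fine-tuning workflow (openai Python library before v1.0, which was current for the GPT-3/DaVinci models). The file name, prompt wording, hyperparameters, and the placeholder fine-tuned model ID are all assumptions for illustration, not the study's actual pipeline.

```python
# Minimal sketch, NOT the authors' code: fine-tuning the GPT-3 "davinci" base model
# on prompt/completion pairs via the legacy OpenAI fine-tuning API (openai < 1.0),
# then querying the resulting model for a diagnosis on a held-out encounter.
import json
import openai

openai.api_key = "sk-..."  # placeholder credential

# 1. Write the training encounters as JSONL prompt/completion pairs
#    (prompt structure and labels below are invented for illustration).
train_records = [
    {"prompt": "Age: 4 y; Sex: F; Chief complaint: fever and cough for 2 days.\n\nDiagnosis:",
     "completion": " viral upper respiratory infection"},
    # ... remaining training encounters
]
with open("train_encounters.jsonl", "w") as f:
    for rec in train_records:
        f.write(json.dumps(rec) + "\n")

# 2. Upload the file and start a fine-tune job on the davinci base model.
upload = openai.File.create(file=open("train_encounters.jsonl", "rb"),
                            purpose="fine-tune")
job = openai.FineTune.create(training_file=upload.id, model="davinci")
print("Started fine-tune job:", job.id)

# 3. After the job finishes, query the fine-tuned model on a test encounter.
fine_tuned_model = "davinci:ft-..."  # ID returned when the job completes (placeholder)
resp = openai.Completion.create(
    model=fine_tuned_model,
    prompt="Age: 9 y; Sex: M; Chief complaint: right lower quadrant abdominal pain.\n\nDiagnosis:",
    max_tokens=20,
    temperature=0,
)
print(resp.choices[0].text.strip())
```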

RESULTS

The GPT-3 model achieved an accuracy of 87.3% (131/150 cases), sensitivity of 85% (95% CI 82%-88%), and specificity of 90% (95% CI 87%-93%), comparable to pediatricians' accuracy of 91.3% (137/150 cases; P=.47). Performance was consistent across age groups (0-5 years: 54/62, 87%; 6-12 years: 47/53, 89%; 13-18 years: 30/35, 86%) and common complaints (fever: 36/39, 92%; abdominal pain: 20/23, 87%). For rare diagnoses (n=20), accuracy was slightly lower (16/20, 80%) but comparable to pediatricians (17/20, 85%; P=.62).
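As a worked check of the arithmetic behind these figures (not taken from the paper), the snippet below recomputes the two accuracy proportions with normal-approximation 95% CIs and compares them in a 2x2 chi-square test. The abstract does not report which test produced P=.47, so the choice of test here is an assumption and the printed P value need not match the published one.

```python
# Illustrative recomputation of the reported accuracies; the test choice is assumed.
from math import sqrt
from scipy.stats import chi2_contingency

def wald_ci(correct, total, z=1.96):
    """Proportion with a normal-approximation (Wald) 95% CI."""
    p = correct / total
    half = z * sqrt(p * (1 - p) / total)
    return p, p - half, p + half

gpt3 = wald_ci(131, 150)   # GPT-3: 131/150 correct
peds = wald_ci(137, 150)   # pediatricians: 137/150 correct
print(f"GPT-3 accuracy:        {gpt3[0]:.3f} (95% CI {gpt3[1]:.3f}-{gpt3[2]:.3f})")
print(f"Pediatrician accuracy: {peds[0]:.3f} (95% CI {peds[1]:.3f}-{peds[2]:.3f})")

# 2x2 table: rows = rater (GPT-3, pediatricians), columns = (correct, incorrect).
# A paired test such as McNemar's would also be reasonable, since both raters
# assessed the same 150 cases.
table = [[131, 150 - 131],
         [137, 150 - 137]]
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, P = {p_value:.2f}")
```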

CONCLUSIONS

This study demonstrates that a fine-tuned GPT-3 model can provide diagnostic support comparable to pediatricians, particularly for common presentations, in rural health care. Further validation in diverse populations is necessary before clinical implementation.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d037/11939124/e90fd4ec679b/xmed-v6-e65263-g001.jpg
