Similar Articles

1. Evaluating the Appropriateness, Consistency, and Readability of ChatGPT in Critical Care Recommendations.
J Intensive Care Med. 2025 Feb;40(2):184-190. doi: 10.1177/08850666241267871. Epub 2024 Aug 8.

2. Harnessing artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in generating clinician-level bariatric surgery recommendations.
Surg Obes Relat Dis. 2024 Jul;20(7):603-608. doi: 10.1016/j.soard.2024.03.011. Epub 2024 Mar 24.

3. Assessing the Responses of Large Language Models (ChatGPT-4, Claude 3, Gemini, and Microsoft Copilot) to Frequently Asked Questions in Retinopathy of Prematurity: A Study on Readability and Appropriateness.
J Pediatr Ophthalmol Strabismus. 2025 Mar-Apr;62(2):84-95. doi: 10.3928/01913913-20240911-05. Epub 2024 Oct 28.

4. Large language models: a new frontier in paediatric cataract patient education.
Br J Ophthalmol. 2024 Sep 20;108(10):1470-1476. doi: 10.1136/bjo-2024-325252.

5. Appropriateness and Readability of ChatGPT-4-Generated Responses for Surgical Treatment of Retinal Diseases.
Ophthalmol Retina. 2023 Oct;7(10):862-868. doi: 10.1016/j.oret.2023.05.022. Epub 2023 Jun 3.

6. Assessing the Quality and Reliability of ChatGPT's Responses to Radiotherapy-Related Patient Queries: Comparative Study With GPT-3.5 and GPT-4.
JMIR Cancer. 2025 Apr 16;11:e63677. doi: 10.2196/63677.

7. Evaluating the Effectiveness of Large Language Models in Providing Patient Education for Chinese Patients With Ocular Myasthenia Gravis: Mixed Methods Study.
J Med Internet Res. 2025 Apr 10;27:e67883. doi: 10.2196/67883.

8. Using Large Language Models to Generate Educational Materials on Childhood Glaucoma.
Am J Ophthalmol. 2024 Sep;265:28-38. doi: 10.1016/j.ajo.2024.04.004. Epub 2024 Apr 16.

9. Appropriateness and readability of Google Bard and ChatGPT-3.5 generated responses for surgical treatment of glaucoma.
Rom J Ophthalmol. 2024 Jul-Sep;68(3):243-248. doi: 10.22336/rjo.2024.45.

10. Leveraging large language models to improve patient education on dry eye disease.
Eye (Lond). 2025 Apr;39(6):1115-1122. doi: 10.1038/s41433-024-03476-5. Epub 2024 Dec 16.

Cited By

1. AI assisted prediction of unplanned intensive care admissions using natural language processing in elective neurosurgery.
NPJ Digit Med. 2025 Aug 27;8(1):549. doi: 10.1038/s41746-025-01952-0.

2. Primer on large language models: an educational overview for intensivists.
Crit Care. 2025 Jun 12;29(1):238. doi: 10.1186/s13054-025-05479-4.

3. A large language model improves clinicians' diagnostic performance in complex critical illness cases.
Crit Care. 2025 Jun 6;29(1):230. doi: 10.1186/s13054-025-05468-7.

4. Comparative evaluation and performance of large language models on expert level critical care questions: a benchmark study.
Crit Care. 2025 Feb 10;29(1):72. doi: 10.1186/s13054-025-05302-0.

References

1. Evaluation of ChatGPT in Predicting 6-Month Outcomes After Traumatic Brain Injury.
Crit Care Med. 2024 Jun 1;52(6):942-950. doi: 10.1097/CCM.0000000000006236. Epub 2024 Mar 6.

2. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study.
J Med Internet Res. 2023 Aug 22;25:e48659. doi: 10.2196/48659.

3. Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot.
J Am Coll Radiol. 2023 Oct;20(10):990-997. doi: 10.1016/j.jacr.2023.05.003. Epub 2023 Jun 21.

4. ChatGPT Answers Common Patient Questions About Colonoscopy.
Gastroenterology. 2023 Aug;165(2):509-511.e7. doi: 10.1053/j.gastro.2023.04.033. Epub 2023 May 5.

5. Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers.
NPJ Digit Med. 2023 Apr 26;6(1):75. doi: 10.1038/s41746-023-00819-6.

6. Appropriateness of Breast Cancer Prevention and Screening Recommendations Provided by ChatGPT.
Radiology. 2023 May;307(4):e230424. doi: 10.1148/radiol.230424. Epub 2023 Apr 4.

7. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns.
Healthcare (Basel). 2023 Mar 19;11(6):887. doi: 10.3390/healthcare11060887.

8. Large language models and the perils of their hallucinations.
Crit Care. 2023 Mar 21;27(1):120. doi: 10.1186/s13054-023-04393-x.

9. Revolutionizing radiology with GPT-based models: Current applications, future possibilities and limitations of ChatGPT.
Diagn Interv Imaging. 2023 Jun;104(6):269-274. doi: 10.1016/j.diii.2023.02.003. Epub 2023 Feb 28.

10. The potential impact of ChatGPT in clinical and translational medicine.
Clin Transl Med. 2023 Mar;13(3):e1216. doi: 10.1002/ctm2.1216.

Evaluating the Appropriateness, Consistency, and Readability of ChatGPT in Critical Care Recommendations.

Authors

Balta Kaan Y, Javidan Arshia P, Walser Eric, Arntfield Robert, Prager Ross

Affiliations

Schulich School of Medicine & Dentistry, Western University, London, Ontario, Canada.

Division of Vascular Surgery, Department of Surgery, University of Toronto, Toronto, Ontario, Canada.

Publication Information

J Intensive Care Med. 2025 Feb;40(2):184-190. doi: 10.1177/08850666241267871. Epub 2024 Aug 8.

DOI: 10.1177/08850666241267871
PMID: 39118320
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11639400/
Abstract

We assessed two versions of the large language model (LLM) ChatGPT (versions 3.5 and 4.0) in generating appropriate, consistent, and readable recommendations on core critical care topics. How do successive large language models compare in terms of generating appropriate, consistent, and readable recommendations on core critical care topics? A set of 50 LLM-generated responses to clinical questions was evaluated by 2 independent intensivists on a 5-point Likert scale for appropriateness, consistency, and readability. ChatGPT 4.0 showed significantly higher median appropriateness scores than ChatGPT 3.5 (4.0 vs 3.0, P < .001). However, there was no significant difference in consistency between the 2 versions (40% vs 28%, P = .291). Readability, assessed by the Flesch-Kincaid Grade Level, was also not significantly different between the 2 models (14.3 vs 14.4, P = .93). Both models produced "hallucinations" (misinformation delivered with high confidence), which highlights the risk of relying on these tools without domain expertise. Despite potential for clinical application, both models lacked consistency, producing different results when asked the same question multiple times. The study underscores the need for clinicians to understand the strengths and limitations of LLMs for safe and effective implementation in critical care settings. https://osf.io/8chj7/.
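The readability metric the abstract cites, the Flesch-Kincaid Grade Level, is a simple function of average sentence length and average syllables per word. Below is a minimal Python sketch of the formula; the vowel-group syllable counter is a naive heuristic of our own, not the validated tooling the study presumably used:

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: one syllable per run of consecutive vowels, minimum 1.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    A score of ~14 (as reported for both models) corresponds to
    college-level reading difficulty."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)
```

A grade level of 14.3-14.4, as reported for both ChatGPT versions, is well above the sixth-to-eighth-grade level commonly recommended for patient-facing health materials.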
