The performance of artificial intelligence chatbot large language models to address skeletal biology and bone health queries.

Affiliations

Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York, NY 10065, United States.

Skeletal Diseases Program, The Garvan Institute of Medical Research, Darlinghurst, 2010, Australia.

Publication information

J Bone Miner Res. 2024 Mar 22;39(2):106-115. doi: 10.1093/jbmr/zjad007.

DOI: 10.1093/jbmr/zjad007

PMID: 38477743

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11184616/

Abstract

Artificial intelligence (AI) chatbots utilizing large language models (LLMs) have recently garnered significant interest due to their ability to generate humanlike responses to user inquiries in an interactive dialog format. While these models are being increasingly utilized to obtain medical information by patients, scientific and medical providers, and trainees to address biomedical questions, their performance may vary from field to field. The opportunities and risks these chatbots pose to the widespread understanding of skeletal health and science are unknown. Here we assess the performance of 3 high-profile LLM chatbots, Chat Generative Pre-Trained Transformer (ChatGPT) 4.0, BingAI, and Bard, to address 30 questions in 3 categories: basic and translational skeletal biology, clinical practitioner management of skeletal disorders, and patient queries to assess the accuracy and quality of the responses. Thirty questions in each of these categories were posed, and responses were independently graded for their degree of accuracy by four reviewers. While each of the chatbots was often able to provide relevant information about skeletal disorders, the quality and relevance of these responses varied widely, and ChatGPT 4.0 had the highest overall median score in each of the categories. Each of these chatbots displayed distinct limitations that included inconsistent, incomplete, or irrelevant responses, inappropriate utilization of lay sources in a professional context, a failure to take patient demographics or clinical context into account when providing recommendations, and an inability to consistently identify areas of uncertainty in the relevant literature. Careful consideration of both the opportunities and risks of current AI chatbots is needed to formulate guidelines for best practices for their use as a source of information about skeletal health and biology.
