Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York, NY 10065, United States.
Skeletal Diseases Program, The Garvan Institute of Medical Research, Darlinghurst, 2010, Australia.
J Bone Miner Res. 2024 Mar 22;39(2):106-115. doi: 10.1093/jbmr/zjad007.
Artificial intelligence (AI) chatbots utilizing large language models (LLMs) have recently garnered significant interest due to their ability to generate humanlike responses to user inquiries in an interactive dialog format. While these models are increasingly used by patients, scientific and medical providers, and trainees to address biomedical questions, their performance may vary from field to field. The opportunities and risks these chatbots pose to the widespread understanding of skeletal health and science are unknown. Here we assess the performance of 3 high-profile LLM chatbots, Chat Generative Pre-Trained Transformer (ChatGPT) 4.0, BingAI, and Bard, in addressing 30 questions in 3 categories, basic and translational skeletal biology, clinical practitioner management of skeletal disorders, and patient queries, evaluating the accuracy and quality of the responses. Thirty questions in each of these categories were posed, and responses were independently graded for their degree of accuracy by 4 reviewers. While each of the chatbots was often able to provide relevant information about skeletal disorders, the quality and relevance of these responses varied widely, and ChatGPT 4.0 had the highest overall median score in each category. Each chatbot displayed distinct limitations, including inconsistent, incomplete, or irrelevant responses; inappropriate use of lay sources in a professional context; failure to take patient demographics or clinical context into account when providing recommendations; and an inability to consistently identify areas of uncertainty in the relevant literature. Careful consideration of both the opportunities and risks of current AI chatbots is needed to formulate guidelines for best practices for their use as sources of information about skeletal health and biology.