Cao Mingde, Wang Qianwen, Zhang Xueyou, Liang Zuru, Qiu Jihong, Yung Patrick Shu-Hang, Ong Michael Tim-Yun
Department of Orthopaedics and Traumatology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong 999077, China; Center for Neuromusculoskeletal Restorative Medicine (CNRM), The Chinese University of Hong Kong, Hong Kong 999077, China.
Department of Orthopaedics and Traumatology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong 999077, China.
J Sport Health Sci. 2024 Nov 28;14:101016. doi: 10.1016/j.jshs.2024.101016.
Large Language Models (LLMs) have gained much attention and have, in part, replaced common search engines as a popular channel for obtaining information, owing to their contextually relevant responses. Osteoarthritis (OA) is a common topic among musculoskeletal disorders, and patients often seek information about it online. Our study evaluated the ability of 3 LLMs (ChatGPT-3.5, ChatGPT-4.0, and Perplexity) to answer common OA-related queries accurately.
We defined 6 themes (pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis) derived from 25 frequently asked questions about OA. Three consultant-level orthopedic specialists independently rated the LLMs' responses on a 4-point accuracy scale, and the final rating for each response was determined by majority consensus. Responses classified as "satisfactory" were then evaluated for comprehensiveness on a 5-point scale.
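To make the aggregation rule concrete, here is a minimal sketch of the majority-consensus step, assuming three independent raters per response. The tie-handling behavior (returning no consensus) is our assumption; the paper does not state how a three-way disagreement was resolved.

```python
from collections import Counter

def consensus_rating(ratings: list[int]) -> int | None:
    """Return the rating assigned by a strict majority of raters,
    or None when no majority exists (e.g., three different scores)."""
    value, count = Counter(ratings).most_common(1)[0]
    return value if count > len(ratings) / 2 else None

# Three specialists score one LLM response on the 4-point accuracy scale.
print(consensus_rating([4, 4, 3]))  # 4 -- two of three raters agree
print(consensus_rating([4, 3, 2]))  # None -- no majority; needs adjudication
```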
ChatGPT-4.0 demonstrated superior accuracy, with 64% of responses rated "excellent", compared to 40% for ChatGPT-3.5 and 28% for Perplexity (Pearson's χ² test with Fisher's exact test, all p < 0.001). All 3 LLM-chatbots received high mean comprehensiveness ratings (Perplexity = 3.88; ChatGPT-4.0 = 4.56; ChatGPT-3.5 = 3.96, out of a maximum score of 5). The LLM-chatbots performed reliably across domains except for "treatment and prevention"; even in that domain, however, ChatGPT-4.0 outperformed ChatGPT-3.5 and Perplexity, garnering 53.8% "excellent" ratings (Pearson's χ² test with Fisher's exact test, all p < 0.001).
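For illustration, a single pairwise accuracy comparison can be run as a Fisher's exact test on a 2×2 contingency table. The sketch below is an assumption-laden reconstruction: it takes 25 questions per model from the methods and back-calculates "excellent" counts from the reported percentages; the study's actual contingency tables and testing procedure may differ.

```python
from scipy.stats import fisher_exact

# Hypothetical reconstruction of one pairwise comparison
# (ChatGPT-4.0 vs. Perplexity). Counts are back-calculated from the
# reported percentages over 25 questions; they are assumptions,
# not the study's raw data.
TOTAL = 25
excellent_gpt4 = round(0.64 * TOTAL)   # 16 responses rated "excellent"
excellent_pplx = round(0.28 * TOTAL)   # 7 responses rated "excellent"

table = [
    [excellent_gpt4, TOTAL - excellent_gpt4],  # ChatGPT-4.0
    [excellent_pplx, TOTAL - excellent_pplx],  # Perplexity
]
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```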
Our findings underscore the potential of LLMs, specifically ChatGPT-4.0 and Perplexity, to deliver accurate and thorough responses to OA-related queries. Targeted correction of specific misconceptions to improve the accuracy of LLMs remains crucial.