Khalil Mahmoud, Mohamed Fatma, Shoufan Abdulhadi
Computer Science Department, Western University, London, ON, Canada.
Center for Cyber-Physical Systems (C2PS), Computer Science Department, Khalifa University, Abu Dhabi, UAE.
Sci Rep. 2025 Mar 22;15(1):9906. doi: 10.1038/s41598-025-94208-6.
YouTube has become a dominant source of medical information and health-related decision-making. Yet many videos on the platform contain inaccurate or biased information. Although expert review could help mitigate this problem, the vast number of daily uploads makes that solution impractical. In this study, we explored the potential of Large Language Models (LLMs) to assess the quality of medical content on YouTube. We collected a set of videos previously evaluated by experts and prompted twenty models to rate their quality using the DISCERN instrument. We then analyzed the inter-rater agreement between the language models' and the experts' ratings using Brennan-Prediger's (BP) kappa. The models' agreement with the experts varied widely, with BP kappa ranging from -1.10 to 0.82. All models tended to give higher scores than the human experts. Agreement on individual questions tended to be lower, with some questions showing significant disagreement between models and experts. Including scoring guidelines in the prompt improved model performance. We conclude that some LLMs are capable of evaluating the quality of medical videos. Used as stand-alone expert systems or embedded into traditional recommender systems, these models could help mitigate the quality problem of health-related online videos.
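
For readers unfamiliar with the agreement statistic, the following is a minimal sketch (not the authors' code) of unweighted Brennan-Prediger kappa between two raters, assuming DISCERN's five-point response scale; the example ratings are hypothetical.

    def bp_kappa(rater_a, rater_b, n_categories=5):
        # Brennan-Prediger kappa fixes chance agreement at 1/q,
        # where q is the number of rating categories.
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
        chance = 1.0 / n_categories
        return (observed - chance) / (1.0 - chance)

    # Hypothetical expert and LLM scores for one DISCERN question (1-5 scale)
    expert = [4, 5, 3, 2, 4, 5]
    llm    = [5, 5, 3, 3, 4, 5]
    print(f"BP kappa: {bp_kappa(expert, llm):.2f}")  # 0.58

The study may have used a weighted variant for the ordinal DISCERN scores; this unweighted form conveys the core idea that chance agreement is fixed at 1/q rather than estimated from the raters' marginal distributions, as it would be for Cohen's kappa.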