Evaluating the quality of medical content on YouTube using large language models.

Authors

Khalil Mahmoud, Mohamed Fatma, Shoufan Abdulhadi

Affiliations

Computer Science Department, Western University, London, ON, Canada.

Center for Cyber-Physical Systems (C2PS), Computer Science Department, Khalifa University, Abu Dhabi, UAE.

Publication information

Sci Rep. 2025 Mar 22;15(1):9906. doi: 10.1038/s41598-025-94208-6.

DOI:10.1038/s41598-025-94208-6
PMID:40121315
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11929840/
Abstract

YouTube has become a dominant source of medical information and an influence on health-related decision-making. Yet many videos on the platform contain inaccurate or biased information. Although expert review could help mitigate this problem, the vast number of daily uploads makes it impractical. In this study, we explored the potential of large language models (LLMs) to assess the quality of medical content on YouTube. We collected a set of videos previously evaluated by experts and prompted twenty models to rate their quality using the DISCERN instrument. We then analyzed the inter-rater agreement between the models' and the experts' ratings using Brennan-Prediger (BP) kappa. The LLMs showed a wide range of agreement with the experts (BP kappa from -1.10 to 0.82), and all models tended to give higher scores than the human experts. Agreement on individual questions tended to be lower, with some questions showing significant disagreement between models and experts. Including scoring guidelines in the prompt improved model performance. We conclude that some LLMs are capable of evaluating the quality of medical videos. Used as stand-alone expert systems or embedded into traditional recommender systems, these models could help mitigate the quality problem of health-related online videos.
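
The agreement statistic named in the abstract can be illustrated with a minimal sketch. It assumes the unweighted two-rater form of Brennan-Prediger kappa, kappa = (p_o - 1/q) / (1 - 1/q), where p_o is the observed proportion of identical ratings and q is the number of rating categories. The abstract does not say whether the authors used a weighted variant, so this is not necessarily their exact computation. The function name and the sample ratings are hypothetical, and q = 5 assumes the five-point scale of DISCERN items.

```python
def brennan_prediger_kappa(ratings_a, ratings_b, num_categories):
    """Unweighted Brennan-Prediger kappa for two raters (illustrative sketch).

    Chance agreement is fixed at 1/q, i.e. every category is assumed
    equally likely, regardless of how the raters actually used the scale.
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("rating lists must be non-empty and of equal length")
    if num_categories < 2:
        raise ValueError("at least two categories are required")

    # Observed agreement: share of items on which both raters gave the same score.
    matches = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b)
    p_observed = matches / len(ratings_a)

    # Chance agreement under a uniform distribution over categories.
    p_chance = 1.0 / num_categories

    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical ratings on a 1-5 scale, not data from the study.
expert_scores = [1, 2, 3, 4, 5, 3, 2, 4]
model_scores = [1, 2, 4, 4, 5, 3, 3, 4]

kappa = brennan_prediger_kappa(expert_scores, model_scores, num_categories=5)
print(f"BP kappa: {kappa:.4f}")  # 6 of 8 ratings match -> (0.75 - 0.2) / 0.8 = 0.6875
```

In this unweighted form the statistic cannot fall below -1/(q - 1), which is -0.25 for five categories. Weighted variants for ordinal scales give partial credit to near misses and can take lower values.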

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/17a7/11929840/e3dd3d28aaf6/41598_2025_94208_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/17a7/11929840/d79a807ab9c2/41598_2025_94208_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/17a7/11929840/7ae54e28eaa3/41598_2025_94208_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/17a7/11929840/f72f07b9629b/41598_2025_94208_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/17a7/11929840/088f89a999df/41598_2025_94208_Fig5_HTML.jpg
