Risk of Bias Assessment of Diagnostic Accuracy Studies Using QUADAS 2 by Large Language Models.

Authors

Leucuța Daniel-Corneliu, Urda-Cîmpean Andrada Elena, Istrate Dan, Drugan Tudor

Affiliation

Department of Medical Informatics and Biostatistics, Iuliu Hațieganu University of Medicine and Pharmacy, 400349 Cluj-Napoca, Romania.

Publication information

Diagnostics (Basel). 2025 Jun 6;15(12):1451. doi: 10.3390/diagnostics15121451.

DOI:10.3390/diagnostics15121451
PMID:40564772
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12191753/
Abstract

Background/Objectives: Diagnostic accuracy studies are essential for evaluating the performance of medical tests. The risk of bias (RoB) in these studies is commonly assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool. This study aimed to assess the capabilities and reasoning accuracy of large language models (LLMs) in evaluating the RoB in diagnostic accuracy studies using QUADAS 2, compared with human experts.

Methods: Four LLMs were used for the AI assessment: ChatGPT 4o, X.AI Grok 3, Gemini 2.0 Flash, and DeepSeek V3. Ten recent open-access diagnostic accuracy studies were selected. Each article was independently assessed by human experts and by the LLMs using QUADAS 2.

Results: Out of 110 signaling question assessments (11 questions for each of the 10 articles) by each of the four AI models, the mean percentage of correct assessments across all models was 72.95%. The most accurate model was Grok 3, followed by ChatGPT 4o, DeepSeek V3, and Gemini 2.0 Flash, with accuracies ranging from 74.45% to 67.27%. When analyzed by domain, the most accurate responses were for "flow and timing", followed by "index test", with "patient selection" and "reference standard" performing similarly to each other. An extensive list of reasoning errors was documented.

Conclusions: This study demonstrates that LLMs can achieve a moderate level of accuracy in evaluating the RoB in diagnostic accuracy studies. However, they are not yet a substitute for expert clinical and methodological judgment. LLMs may serve as complementary tools in systematic reviews, with compulsory human supervision.
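The accuracy figures in the abstract are percentages of signaling-question answers that agree with the expert reference. The following is a minimal sketch of that arithmetic, not the authors' analysis code. The function names (`accuracy_pct`, `accuracy_by_domain`) and the toy answers are hypothetical and made up for illustration. The per-domain question counts follow the standard QUADAS-2 layout (3 + 2 + 2 + 4 = 11), which is consistent with the 11 questions per article stated in the abstract.

```python
from collections import defaultdict

# Standard QUADAS-2 signaling questions per domain (3 + 2 + 2 + 4 = 11).
QUESTIONS_PER_DOMAIN = {
    "patient selection": 3,
    "index test": 2,
    "reference standard": 2,
    "flow and timing": 4,
}


def accuracy_pct(model_answers, expert_answers):
    """Percentage of model answers that match the expert reference."""
    if len(model_answers) != len(expert_answers):
        raise ValueError("answer lists must have the same length")
    correct = sum(m == e for m, e in zip(model_answers, expert_answers))
    return round(100 * correct / len(expert_answers), 2)


def accuracy_by_domain(records):
    """records: iterable of (domain, model_answer, expert_answer) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for domain, model_answer, expert_answer in records:
        totals[domain] += 1
        hits[domain] += model_answer == expert_answer
    return {d: round(100 * hits[d] / totals[d], 2) for d in totals}


# 10 articles x 11 signaling questions = 110 assessments per model.
n_articles = 10
n_assessments = n_articles * sum(QUESTIONS_PER_DOMAIN.values())
print(n_assessments)  # 110

# Toy data (invented, NOT from the study): answers are yes / no / unclear.
expert = ["yes", "no", "unclear", "yes"]
model = ["yes", "no", "yes", "yes"]
print(accuracy_pct(model, expert))  # 75.0
```

As a sanity check on the reported range, 74 matching answers out of 110 gives 67.27%, which equals the lowest accuracy quoted above. The abstract does not state the underlying counts, so this is only an arithmetic observation.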

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1495/12191753/4afaa9342e7e/diagnostics-15-01451-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1495/12191753/a8af65b16e5f/diagnostics-15-01451-g001.jpg
