Comparison of Artificial Intelligence Models and Human Experts in Managing Dyslipidemia: Assessment of Adherence to Clinical Guidelines.

Author information

Ucdal Mete, Yurtsever Karya, Yildiz Pinar, Akalin Aysen, Mert Kadir Ugur, Guven Gulay S

Affiliations

Department of Internal Medicine, Hacettepe University Faculty of Medicine, Ankara, TUR.

Department of Internal Medicine, Faculty of Medicine, Eskişehir Osmangazi University, Eskişehir, TUR.

Publication information

Cureus. 2025 Aug 31;17(8):e91363. doi: 10.7759/cureus.91363. eCollection 2025 Aug.

DOI:10.7759/cureus.91363

PMID:40904968

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12402675/

Abstract

Objective: This study compares guideline adherence between artificial intelligence (AI) models (Claude-3 (Anthropic, San Francisco, CA), DeepSeek-V2 (DeepSeek, Hangzhou, China), GPT-4 (OpenAI, San Francisco, CA)) and human experts in dyslipidemia management, using standardized clinical scenarios based on the 2019 European Society of Cardiology (ESC)/European Atherosclerosis Society (EAS) guidelines and the 2021 ESC prevention guidelines. The study employed a comprehensive evaluation framework to capture the holistic nature of dyslipidemia management across multiple interconnected domains.

Methods: Thirty fictitious but clinically representative cases were developed by lipid specialists across five domains: cardiovascular risk assessment, lipid management, lifestyle modifications, pharmacotherapy, and special populations. This broad scope was deliberately chosen to evaluate the full complexity of integrated cardiovascular risk management as it occurs in clinical practice. Cases included all variables required for objective guideline application. AI models and clinicians (professors, specialists, residents) provided management recommendations. Assessment was blinded: responses were labeled with alphanumeric codes so that evaluators could not identify their source. Responses were scored with standardized rubrics (0-3 scales) on four equally weighted parameters: accuracy (guideline concordance), comprehensiveness (clinical coverage), applicability (implementation feasibility), and efficacy (simulated low-density lipoprotein cholesterol (LDL-C) target attainment). Composite scores were calculated by summing all four parameters (maximum 12 points).

Results: Correct response rates were 91% for AI, 72% for professors, 50% for specialists, and 21-32% for residents. Composite scores (mean ± SD, out of 12) were 10.3 ± 1.0 for AI, 8.1-9.2 for professors, 7.4 ± 1.5 for specialists, and 5.2-6.2 for residents. AI excelled in literal guideline application, while professors considered contextual factors (frailty, life expectancy). Professors primarily erred in LDL-C targets (using <100 vs. <55 mg/dL), while AI primarily erred in nuanced risk stratification. Simulated outcomes showed LDL-C target attainment of 83% with AI, 64% with professors, and 92% with a combined approach.

Conclusion: AI demonstrated superior guideline adherence in standardized scenarios but may miss contextual clinical factors. The hybrid AI-human approach optimized outcomes, suggesting that augmented intelligence represents the most promising implementation strategy. Limitations include simulated cases (n = 30), potential performance bias favoring literal interpretation, and lack of real-world complexity. Prospective clinical validation is warranted.
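
The composite score described in the Methods (four parameters, each rated 0-3 and summed to a maximum of 12) can be illustrated with a minimal Python sketch. This is not the authors' code: the class name `RubricScore`, its field names, and the example ratings are hypothetical and chosen only to mirror the four parameters named in the abstract.

```python
from dataclasses import dataclass

# The four equally weighted rubric parameters named in the abstract.
PARAMETERS = ("accuracy", "comprehensiveness", "applicability", "efficacy")
MAX_PER_PARAMETER = 3  # each parameter is rated on a 0-3 scale


@dataclass(frozen=True)
class RubricScore:
    """Hypothetical container for one evaluator's rating of one response."""
    accuracy: int
    comprehensiveness: int
    applicability: int
    efficacy: int

    def __post_init__(self):
        # Reject ratings outside the 0-3 scale described in the Methods.
        for name in PARAMETERS:
            value = getattr(self, name)
            if not 0 <= value <= MAX_PER_PARAMETER:
                raise ValueError(f"{name} must be between 0 and 3, got {value}")

    def composite(self) -> int:
        # Equal weighting: the composite is the plain sum (maximum 12).
        return sum(getattr(self, name) for name in PARAMETERS)


# Made-up ratings for illustration only; these are not study data.
example = RubricScore(accuracy=3, comprehensiveness=2, applicability=3, efficacy=2)
print(example.composite())  # 10
```

Because the parameters are equally weighted, no weighting factors appear in the sum; a weighted variant would be a different scoring scheme from the one the abstract describes.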