

The RepVig framework for designing use-case specific representative vignettes and evaluating triage accuracy of laypeople and symptom assessment applications.

Author Information

Kopka Marvin, Napierala Hendrik, Privoznik Martin, Sapunova Desislava, Zhang Sizhuo, Feufel Markus A

Affiliations

Division of Ergonomics, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany.

Institute of General Practice and Family Medicine, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany.

Publication Information

Sci Rep. 2024 Dec 23;14(1):30614. doi: 10.1038/s41598-024-83844-z.

Abstract

Most studies evaluating symptom-assessment applications (SAAs) rely on a common set of case vignettes that are authored by clinicians and devoid of context, which may be representative of clinical settings but not of situations where patients use SAAs. Assuming the use case of self-triage, we used representative design principles to sample case vignettes from online platforms where patients describe their symptoms to obtain professional advice and compared triage performance of laypeople, SAAs (e.g., WebMD or NHS 111), and Large Language Models (LLMs, e.g., GPT-4 or Claude) on representative versus standard vignettes. We found performance differences in all three groups depending on vignette type: When using representative vignettes, accuracy was higher (OR = 1.52 to 2.00, p < .001 to .03 in binary decisions, i.e., correct or incorrect), safety was higher (OR = 1.81 to 3.41, p < .001 to .002 in binary decisions, i.e., safe or unsafe), and the inclination to overtriage was also higher (OR = 1.80 to 2.66, p < .001 to p = .035 in binary decisions, overtriage or undertriage error). Additionally, we found changed rankings of best-performing SAAs and LLMs. Based on these results, we argue that our representative vignette sampling approach (that we call the RepVig Framework) should replace the practice of using a fixed vignette set as standard for SAA evaluation studies.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/df77/11666565/e0d9ccecb5ff/41598_2024_83844_Fig1_HTML.jpg
