How suitable are clinical vignettes for the evaluation of symptom checker apps? A test theoretical perspective.

Author information

Kopka Marvin, Feufel Markus A, Berner Eta S, Schmieding Malte L

Affiliations

Department of Psychology and Ergonomics (IPA), Division of Ergonomics, Technische Universität Berlin, Berlin, Germany.

Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany.

Publication information

Digit Health. 2023 Aug 21;9:20552076231194929. doi: 10.1177/20552076231194929. eCollection 2023 Jan-Dec.

DOI:10.1177/20552076231194929

PMID:37614591

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10444026/

Abstract

OBJECTIVE

To evaluate the ability of case vignettes to assess the performance of symptom checker applications and to suggest refinements to the methodology used in case vignette-based audit studies.

METHODS

We re-analyzed the publicly available data of two prominent case vignette-based symptom checker audit studies by calculating common metrics of test theory. Furthermore, we developed a new metric, the Capability Comparison Score (CCS), which compares symptom checker capability while controlling for the difficulty of the set of cases each symptom checker evaluated. We then scrutinized whether applying test theory and the CCS altered the performance ranking of the investigated symptom checkers.
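To make the idea of "difficulty of the set of cases each symptom checker evaluated" concrete, the following is a minimal sketch of item difficulty (ID) in the classical test theory sense: the share of correct answers per vignette. The abstract does not state the CCS formula, so this sketch does not attempt to reproduce it. The data, the app names, and the function names are all hypothetical, and `None` is an assumed encoding for a vignette an app did not evaluate.

```python
# Hypothetical toy data (not from the re-analyzed studies).
# Rows: symptom checkers. Columns: case vignettes.
# 1 = correct triage advice, 0 = incorrect, None = vignette not evaluated.
responses = {
    "app_a": [1, 1, 0, None],
    "app_b": [1, 0, 0, 0],
    "app_c": [1, 1, None, None],
}


def item_difficulty(responses):
    """Share of correct answers per vignette, among apps that evaluated it.

    Higher values mean easier vignettes. Returns None for a vignette
    that no app evaluated.
    """
    n_items = len(next(iter(responses.values())))
    difficulties = []
    for j in range(n_items):
        scored = [row[j] for row in responses.values() if row[j] is not None]
        difficulties.append(sum(scored) / len(scored) if scored else None)
    return difficulties


def raw_accuracy(row):
    """Accuracy over the vignettes the app actually evaluated."""
    scored = [x for x in row if x is not None]
    return sum(scored) / len(scored)


def mean_difficulty_of_evaluated(row, difficulties):
    """Average ID of the vignettes the app actually evaluated."""
    vals = [d for x, d in zip(row, difficulties) if x is not None and d is not None]
    return sum(vals) / len(vals)


difficulties = item_difficulty(responses)
for name, row in responses.items():
    print(name, round(raw_accuracy(row), 2),
          round(mean_difficulty_of_evaluated(row, difficulties), 2))
```

In this toy data, `app_c` has the highest raw accuracy but also evaluated only the easiest vignettes, which is the kind of pattern a difficulty-adjusted comparison is meant to expose.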

RESULTS

In both studies, most symptom checkers changed their rank order when adjusting the triage capability for item difficulty (ID) with the CCS. The previously reported triage accuracies commonly overestimated the capability of symptom checkers because they did not account for the fact that symptom checkers tend to selectively appraise easier cases (i.e., with high ID values). Also, many case vignettes in both studies showed insufficient (very low and even negative) values of item-total correlation (ITC), suggesting that individual items or the composition of item sets are of low quality.
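As an illustration of what a negative ITC means, here is a minimal sketch of a corrected item-total correlation: the Pearson correlation between the scores on one vignette and the total score on the remaining vignettes, computed across apps. Whether the authors used this corrected variant is an assumption; the matrix below is hypothetical toy data with no missing values, and the function names are invented for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns None if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)


def corrected_item_total_correlation(matrix, item):
    """Correlate one vignette's scores with the total over all other vignettes."""
    item_scores = [row[item] for row in matrix]
    rest_totals = [sum(row) - row[item] for row in matrix]
    return pearson(item_scores, rest_totals)


# Hypothetical toy data: rows are apps, columns are vignettes (1 = correct).
matrix = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]

for j in range(len(matrix[0])):
    print(j, corrected_item_total_correlation(matrix, j))
```

Here the last vignette is solved mainly by the apps that do worst on the other vignettes, so its correlation with the rest of the set is negative. A vignette like this ranks apps in the opposite direction from the remaining items, which is why such values point to a problem with the item or with the composition of the set.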

CONCLUSIONS

A test-theoretic perspective helps identify previously undetected threats to the validity of case vignette-based symptom checker assessments and provides guidance and specific metrics to improve the quality of case vignettes, in particular by controlling for the difficulty of the vignettes an app was (not) able to evaluate correctly. Such measures might prove more meaningful than accuracy alone for the competitive assessment of symptom checkers. Our approach helps elaborate and standardize the methodology used for appraising symptom checker capability, which, ultimately, may yield more reliable results.
