Quantifying error in OSCE standard setting for varying cohort sizes: A resampling approach to measuring assessment quality.

Author information

Homer Matt, Pell Godfrey, Fuller Richard, Patterson John

Affiliations

a University of Leeds, UK.

b Queen Mary University of London, UK.

Publication information

Med Teach. 2016;38(2):181-8. doi: 10.3109/0142159X.2015.1029898. Epub 2015 Apr 24.

DOI:10.3109/0142159X.2015.1029898

PMID:25909810

Abstract

BACKGROUND

The borderline regression method (BRM) is a widely accepted standard setting method for OSCEs. However, it is unclear whether this method is appropriate for use with small cohorts (e.g. specialist post-graduate examinations).

AIMS AND METHODS

This work applies resampling methods in an innovative way to four pre-existing OSCE data sets (between 17 and 21 stations each) from two institutions to investigate how the robustness of the BRM changes as the cohort size varies. Using a variety of metrics, the 'quality' of an OSCE is evaluated for cohorts from approximately n = 300 down to n = 15. Estimates of the standard error in station-level and overall pass marks, the R² coefficient, and Cronbach's alpha are all calculated as cohort size varies.
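The general idea of estimating pass mark error by resampling can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes global grades coded 0 to 4 with 1 as the borderline grade, draws candidates with replacement (a bootstrap, which the abstract does not specify), and uses synthetic data. The names `brm_pass_mark`, `resampled_se`, and `simulate_station` are hypothetical.

```python
import random
import statistics

# Assumed grade coding (not from the paper): 0=fail, 1=borderline, 2..4=pass or better.
BORDERLINE_GRADE = 1


def brm_pass_mark(grades, scores, borderline=BORDERLINE_GRADE):
    """Regress checklist score on global grade by ordinary least squares
    and return the predicted score at the borderline grade."""
    mean_g = statistics.fmean(grades)
    mean_s = statistics.fmean(scores)
    sxx = sum((g - mean_g) ** 2 for g in grades)
    if sxx == 0:
        return None  # every candidate got the same grade: slope undefined
    sxy = sum((g - mean_g) * (s - mean_s) for g, s in zip(grades, scores))
    slope = sxy / sxx
    intercept = mean_s - slope * mean_g
    return intercept + slope * borderline


def resampled_se(grades, scores, cohort_size, n_resamples=500, seed=0):
    """Estimate the standard error of a station pass mark as the standard
    deviation of pass marks over resampled cohorts of the given size."""
    rng = random.Random(seed)
    indices = range(len(grades))
    marks = []
    for _ in range(n_resamples):
        sample = rng.choices(indices, k=cohort_size)  # with replacement (assumption)
        mark = brm_pass_mark([grades[i] for i in sample],
                             [scores[i] for i in sample])
        if mark is not None:
            marks.append(mark)
    return statistics.stdev(marks)


def simulate_station(n_candidates, seed=0):
    """Synthetic station data for illustration only (not real OSCE data)."""
    rng = random.Random(seed)
    grades = rng.choices([0, 1, 2, 3, 4], weights=[1, 3, 8, 6, 2], k=n_candidates)
    scores = [40 + 10 * g + rng.gauss(0, 8) for g in grades]
    return grades, scores


if __name__ == "__main__":
    grades, scores = simulate_station(300)
    for n in (300, 100, 50, 15):
        print(n, round(resampled_se(grades, scores, n), 2))
```

Under these assumptions, the spread of the resampled pass marks widens as the cohort size shrinks. An overall pass mark could be handled the same way by aggregating the station-level pass marks within each resample.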

RESULTS AND CONCLUSION

For larger cohorts (n > 200), the standard error in the overall pass mark is small (less than 0.5%), and for individual stations is of the order of 1-2%. These errors grow as the sample size reduces, with cohorts of fewer than 50 candidates showing unacceptably large standard error. Alpha and R² also become unstable for small cohorts. The resampling methodology is shown to be robust and has the potential to be more widely applied in standard setting and medical assessment quality assurance and research.
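The instability of alpha in small cohorts can be illustrated with the same resampling idea. The sketch below computes Cronbach's alpha from a candidates-by-stations score table and reports its spread over resampled cohorts. It is an assumption-laden illustration on synthetic data, not the study's analysis; `cronbach_alpha`, `alpha_spread`, and `simulate_scores` are hypothetical names, and sampling with replacement is again an assumption.

```python
import random
import statistics


def cronbach_alpha(score_rows):
    """Cronbach's alpha for a table with one row of station scores per candidate:
    alpha = k/(k-1) * (1 - sum of station variances / variance of total scores)."""
    k = len(score_rows[0])
    item_vars = [statistics.variance([row[j] for row in score_rows])
                 for j in range(k)]
    total_var = statistics.variance([sum(row) for row in score_rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def alpha_spread(score_rows, cohort_size, n_resamples=300, seed=0):
    """Standard deviation of alpha over resampled cohorts of the given size."""
    rng = random.Random(seed)
    alphas = []
    for _ in range(n_resamples):
        sample = rng.choices(score_rows, k=cohort_size)  # with replacement (assumption)
        alphas.append(cronbach_alpha(sample))
    return statistics.stdev(alphas)


def simulate_scores(n_candidates, n_stations=18, seed=0):
    """Synthetic scores for illustration only: a shared ability term plus station noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_candidates):
        ability = rng.gauss(0, 1)
        rows.append([60 + 8 * ability + rng.gauss(0, 10) for _ in range(n_stations)])
    return rows


if __name__ == "__main__":
    rows = simulate_scores(300)
    print("alpha, full cohort:", round(cronbach_alpha(rows), 3))
    for n in (300, 100, 50, 15):
        print(n, round(alpha_spread(rows, n), 3))
```

The same loop could track the station-level R² from the regression of checklist score on global grade, if that metric is of interest.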
