Downing Steven M
Department of Medical Education, College of Medicine, University of Illinois at Chicago, 808 South Wood Street, Chicago, IL 60612-7309, USA.
Med Educ. 2004 Sep;38(9):1006-12. doi: 10.1111/j.1365-2929.2004.01932.x.
All assessment data, like other scientific experimental data, must be reproducible in order to be meaningfully interpreted.
The purpose of this paper is to discuss applications of reliability to the most common assessment methods in medical education. Typical methods of estimating reliability are discussed intuitively and non-mathematically.
Reliability refers to the consistency of assessment outcomes. The exact type of consistency of greatest interest depends on the type of assessment, its purpose and the consequential use of the data. Written tests of cognitive achievement look to internal test consistency, using estimation methods derived from the test-retest design. Rater-based assessment data, such as ratings of clinical performance on the wards, require interrater consistency or agreement. Objective structured clinical examinations, simulated patient examinations and other performance-type assessments generally require generalisability theory analysis to account for various sources of measurement error in complex designs and to estimate the consistency of the generalisations to a universe or domain of skills.
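As a rough illustration of two of the consistency types named above, the sketch below computes Cronbach's alpha (a common internal-consistency coefficient for written tests) and Cohen's kappa (a common chance-corrected agreement index for two raters). The abstract names neither statistic; they are chosen here as familiar examples, and the function names and toy data are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Internal consistency for a test.

    scores: one row per examinee, each row a list of item scores.
    Population variances are used throughout; the ratio is the same
    as with sample variances as long as the choice is consistent.
    """
    k = len(scores[0])  # number of items
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same cases."""
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal proportions.
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)


# Hypothetical data: 4 examinees x 3 dichotomous items.
item_scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(cronbach_alpha(item_scores), 2))  # 0.75

# Hypothetical pass/fail ratings of 4 students by two clinical raters.
ratings_a = ["pass", "pass", "fail", "fail"]
ratings_b = ["pass", "fail", "fail", "fail"]
print(round(cohen_kappa(ratings_a, ratings_b), 2))  # 0.5
```

Generalisability theory analyses for performance examinations involve variance-component estimation across several facets (raters, stations, occasions) and are not attempted in this sketch.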
Reliability is a major source of validity evidence for assessments. Low reliability indicates that large variations in scores can be expected upon retesting. Inconsistent assessment scores are difficult or impossible to interpret meaningfully and thus reduce validity evidence. Reliability coefficients allow the quantification and estimation of the random errors of measurement in assessments, such that overall assessment can be improved.
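One common way a reliability coefficient is turned into an estimate of random error is the classical test theory standard error of measurement, SEM = SD × sqrt(1 − reliability). The abstract does not give this formula; the sketch below assumes it, and the function name and numbers are illustrative only.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory SEM: the expected spread of an examinee's
    observed scores around their true score, in score units."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


# Hypothetical test with a score standard deviation of 8 points.
for r in (0.36, 0.75, 0.96):
    sem = standard_error_of_measurement(8.0, r)
    print(f"reliability={r:.2f}  SEM={sem:.1f} points")
```

With the standard deviation held fixed, the SEM shrinks as reliability rises, which is the sense in which low reliability implies large score variation on retesting.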