S.A.W. Andersen is postdoctoral researcher, Copenhagen Academy for Medical Education and Simulation (CAMES), Center for Human Resources and Education, Capital Region of Denmark, and Department of Otolaryngology, The Ohio State University, Columbus, Ohio, and resident in otorhinolaryngology, Department of Otorhinolaryngology-Head & Neck Surgery, Rigshospitalet, Copenhagen, Denmark; ORCID: https://orcid.org/0000-0002-3491-9790.
L.J. Nayahangan is researcher, CAMES, Center for Human Resources and Education, Capital Region of Denmark, Copenhagen, Denmark; ORCID: https://orcid.org/0000-0002-6179-1622.
Acad Med. 2021 Nov 1;96(11):1609-1619. doi: 10.1097/ACM.0000000000004150.
Competency-based education relies on the validity and reliability of assessment scores. Generalizability (G) theory is well suited to exploring the reliability of assessment tools in medical education but has so far been applied only to a limited extent. This study aimed to systematically review the literature on the use of G theory to explore the reliability of structured assessment of medical and surgical technical skills and to assess the relative contributions of different factors to variance.
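For orientation, a minimal sketch of the framework in standard G theory notation (not specific to the included studies): in a fully crossed person-by-rater design, each observed score is decomposed into variance components attributable to persons (p), raters (r), and the person-by-rater interaction confounded with residual error (pr,e). The G coefficient for relative (rank-ordering) decisions, and the Phi coefficient for absolute decisions, express person (true-score) variance as a proportion of itself plus the relevant error variance, which shrinks as scores are averaged over n_r raters:

X_{pr} = \mu + \nu_p + \nu_r + \nu_{pr,e}, \qquad \sigma^2(X_{pr}) = \sigma^2_p + \sigma^2_r + \sigma^2_{pr,e}

E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pr,e}/n_r}, \qquad \Phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_r/n_r + \sigma^2_{pr,e}/n_r}

Designs with additional facets (e.g., performances, cases, occasions) extend the decomposition with further variance components but follow the same logic.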
In June 2020, 11 databases, including PubMed, were searched from inception through May 31, 2020. Eligible studies used G theory to explore reliability in the context of assessment of medical and surgical technical skills. Descriptive information on study characteristics, assessment context, assessment protocol, participants being assessed, and G analyses was extracted. These data were used to map the use of G theory and to explore variance components analyses. A meta-analysis was conducted to synthesize the extracted data on sources of variance and reliability.
Forty-four studies were included; of these, 39 had sufficient data for meta-analysis. The total pool comprised 35,284 unique assessments of 31,496 unique performances by 4,154 participants. Person variance had a pooled effect of 44.2% (95% confidence interval [CI], 36.8%-51.5%). Only assessment tool type (Objective Structured Assessment of Technical Skills-type vs task-based checklist-type) had a significant effect on person variance. The pooled reliability (G coefficient) was 0.65 (95% CI, 0.59-0.70). Most studies (39, 88.6%) included decision studies, which generally favored higher ratios of performances to assessors to achieve sufficiently reliable assessment.
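As a rough, back-of-the-envelope illustration of how these pooled estimates relate (our simplification, not a calculation reported in the review): if person variance is 44.2% of total variance and, in the simplest single-facet case, all remaining variance is treated as relative error, a decision study averaging over n observations gives

E\rho^2(n) = \frac{0.442}{0.442 + 0.558/n}, \qquad E\rho^2(1) \approx 0.44, \quad E\rho^2(2) \approx 0.61, \quad E\rho^2(3) \approx 0.70.

On this simplification, the pooled G coefficient of 0.65 is consistent with designs averaging over roughly two to three observations per participant; actual designs partition error across multiple facets, so this is illustrative only.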
G theory is increasingly being used to examine the reliability of technical skills assessment in medical education, but more rigor in reporting is warranted. Contextual factors can affect variance components and thereby reliability estimates and should therefore be considered, especially in high-stakes assessment. Reliability analysis should be best practice when developing assessments of technical skills.