Technical Medical Centre, University of Twente, Enschede, The Netherlands.
J Surg Educ. 2020 Jan-Feb;77(1):189-201. doi: 10.1016/j.jsurg.2019.07.007. Epub 2019 Aug 20.
Reliable performance assessment is a necessary prerequisite for outcome-based assessment of surgical technical skill. Numerous observational instruments for technical skill assessment have been developed in recent years. However, methodological shortcomings in the reported studies may undermine the interpretation of their inter-rater reliability.
To synthesize the evidence about the inter-rater reliability of observational instruments for technical skill assessment for high-stakes decisions.
A systematic review and meta-analysis were performed. We searched Scopus (including MEDLINE) and PubMed, as well as key publications, through December 2016. We included original studies that evaluated the reliability of instruments for the observational assessment of technical skills. Two reviewers independently extracted information on the primary outcome (the reliability statistic), secondary outcomes, and general information. We calculated pooled estimates using multilevel random effects meta-analyses where appropriate.
A total of 247 documents met our inclusion criteria and provided 491 inter-rater reliability estimates. Inappropriate inter-rater reliability indices were reported for 40% of the checklist estimates, 50% of the rating-scale estimates, and 41% of the estimates for other types of assessment instruments. Only 14 documents provided sufficient information to be included in the meta-analyses. The pooled Cohen's kappa was 0.78 (95% CI 0.69-0.89, p < 0.001) and the pooled proportion agreement was 0.84 (95% CI 0.71-0.96, p < 0.001). A moderator analysis was performed to explore the type of assessment instrument as a possible source of heterogeneity.
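The results above report both Cohen's kappa and raw proportion agreement, two indices that can diverge because kappa corrects for chance agreement. As an illustration of that distinction (hypothetical pass/fail ratings, not data from this review), a minimal sketch:

```python
from collections import Counter

def cohen_kappa(r1, r2):
    """Return (proportion agreement, Cohen's kappa) for two raters
    who each rated the same items."""
    n = len(r1)
    # Observed proportion agreement: fraction of items rated identically.
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected chance agreement from each rater's marginal frequencies.
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))
    # Kappa rescales observed agreement relative to chance agreement.
    return p_o, (p_o - p_e) / (1 - p_e)

# Hypothetical example: two raters scoring ten performances as pass/fail.
rater1 = ["pass", "pass", "fail", "pass", "fail",
          "pass", "pass", "fail", "pass", "pass"]
rater2 = ["pass", "pass", "fail", "fail", "fail",
          "pass", "pass", "fail", "pass", "pass"]
p_o, kappa = cohen_kappa(rater1, rater2)
# Here p_o = 0.90 while kappa ≈ 0.78: the raters agree on 9 of 10 items,
# but part of that agreement is expected by chance alone.
```

This is why the review treats the choice of reliability index as relevant to interpretation: on the same ratings, proportion agreement is typically higher than kappa.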
For high-stakes decisions, there was often insufficient information on which to base conclusions. The use of suboptimal statistical methods and the incomplete reporting of reliability estimates do not support the use of observational technical skill assessment instruments for high-stakes decisions. Interpretations of inter-rater reliability should consider both the reliability index and the assessment instrument used. Reporting of inter-rater reliability needs to be improved through detailed descriptions of the assessment process.