Travis T A, Colliver J A, Robbs R S, Barnhart A J, Barrows H S, Giannone L, Henkle J Q, Kelly D P, Nichols-Johnson V, Rabinovitch S, Ramsey D E, Riseman J, Rockey P H, Ross D S, Schrage J P, Steward D E
SIU School of Medicine, Springfield, IL 62794-9230, USA.
Acad Med. 1996 Jan;71(1 Suppl):S84-6. doi: 10.1097/00001888-199601000-00051.
The results are disappointing, providing little support for the validity of case-passing decisions based on this simple approach to scoring and standard setting. The case-passing decisions predicted what the case author intended for only about 73% to 74% of the students on average and, with agreement expected by chance removed, for only about 25% of the students. Even with optimal pass/fail cutoffs and with students whose borderline global ratings were ambiguous dropped from the analysis, the case-passing decisions failed to agree with the case authors' global ratings for 15% to 30% of the students. The findings might be dismissed as simply reflecting the low reliability of passing decisions and global ratings based on a single case. Although that concern would apply to intercase reliabilities, which are subject to case specificity, the appropriate reliabilities here would seem to be intracase (i.e., intrarater), which should be fairly high (if they could be computed). Moreover, it seems reasonable to expect much better agreement between the scoring and standard setting developed by the case author and that same author's global ratings of performance on the case, given that the author might recall the checklist, the weight assigned to each item, and so forth. Case-passing decisions might also agree more closely with global ratings of live or videotaped performances than with ratings of written summaries of performance; that question, however, remains a challenge for further research. In conclusion, the study provides only weak evidence, at best, for the validity of the scoring and standard setting commonly used with SP assessment. The results do not undermine claims about the realism of the SP approach, nor do they call into question the standardization this method of assessing clinical competence affords. They do, however, raise serious concerns about this simple approach to scoring and standard setting for SP-based assessments and suggest that the development of valid scoring and standard setting should rest more on the observation and evaluation of actual student performance on SP cases.
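The abstract does not name the chance-correction index, but the reported figures are consistent with a kappa-type statistic; the following is a sketch under that assumption. With observed agreement p_o and chance-expected agreement p_e, Cohen's kappa is κ = (p_o − p_e) / (1 − p_e). Taking p_o ≈ 0.74 and a chance-agreement rate of roughly p_e ≈ 0.65 (plausible when most students pass under both the case-passing decision and the global rating), κ ≈ (0.74 − 0.65) / (1 − 0.65) ≈ 0.26, which matches the reported drop from roughly 74% raw agreement to about 25% chance-corrected agreement. The exact value of p_e here is an illustrative assumption, not a figure reported in the study.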