E. Ötleş is a Medical Scientist Training Program fellow, Department of Industrial and Operations Engineering, University of Michigan Medical School, Ann Arbor, Michigan.
D.E. Kendrick is an assistant professor, Department of Surgery, University of Minnesota Medical School, Minneapolis, Minnesota.
Acad Med. 2021 Oct 1;96(10):1457-1460. doi: 10.1097/ACM.0000000000004153.
Learning is markedly improved with high-quality feedback, yet ensuring feedback quality is difficult to achieve at scale. Natural language processing (NLP) algorithms may be useful in this context because they can automatically classify large volumes of narrative data. However, it is unknown whether NLP models can accurately evaluate surgical trainee feedback. This study evaluated which NLP techniques best classify the quality of surgical trainee formative feedback recorded as part of a workplace assessment.
During the 2016-2017 academic year, the SIMPL (Society for Improving Medical Professional Learning) app was used to record narrative feedback on operative performance for residents at 3 university-based general surgery residency training programs. Feedback comments were collected for a sample of residents representing all 5 postgraduate year levels and coded for quality. In May 2019, the coded comments were used to train NLP models to automatically classify the quality of feedback across 4 categories (effective, mediocre, ineffective, or other). Models included support vector machines (SVM), logistic regression, gradient boosted trees, naive Bayes, and random forests. The primary outcome was mean classification accuracy.
The authors manually coded the quality of 600 recorded feedback comments. Those data were used to train NLP models to automatically classify the quality of feedback across 4 categories. The NLP model using an SVM algorithm yielded a maximum mean accuracy of 0.64 (standard deviation, 0.01). When the classification task was modified to distinguish only high-quality vs low-quality feedback, maximum mean accuracy was 0.83, again with SVM.
To the authors' knowledge, this is the first study to examine the use of NLP for classifying feedback quality. SVM NLP models demonstrated the ability to automatically classify the quality of surgical trainee evaluations. Larger training datasets would likely further increase accuracy.
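The modeling approach described above can be illustrated with a minimal sketch: a text-vectorization step feeding an SVM classifier, evaluated by cross-validated mean accuracy. This is not the authors' pipeline; the toy comments, labels, and TF-IDF features below are illustrative assumptions, and the study's 600 manually coded comments are stood in for by a small synthetic set.

```python
# Illustrative sketch (assumed pipeline, not the study's code): classify
# narrative feedback comments with TF-IDF features and a linear SVM,
# reporting mean cross-validated accuracy as the outcome metric.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data; the study coded 600 real feedback comments.
comments = [
    "Great camera control; keep practicing tissue handling",
    "Needs to improve knot tying under tension",
    "Good case",
    "Review the relevant anatomy and port placement before the next case",
    "Fine",
    "Excellent dissection technique; work on efficiency of movement",
] * 10
labels = ["effective", "effective", "other",
          "effective", "other", "effective"] * 10

# TF-IDF unigrams/bigrams -> linear SVM, as one plausible instantiation
# of the SVM models described in the abstract.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
scores = cross_val_score(model, comments, labels, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.2f} (sd {scores.std():.2f})")
```

The binary high-quality vs low-quality task reported in the results corresponds to collapsing the 4 labels into 2 before fitting the same pipeline.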