Stoehr Fabian, Kämpgen Benedikt, Müller Lukas, Zufiría Laura Oleaga, Junquero Vanesa, Merino Cristina, Mildenberger Peter, Kloeckner Roman
Department of Diagnostic and Interventional Radiology, University Medical Center, Johannes Gutenberg-University Mainz, Langenbeckstr. 1, 55131 Mainz, Germany.
Empolis Information Management GmbH, Leightonstraße 2, 97074, Würzburg, Germany.
Insights Imaging. 2023 Sep 19;14(1):150. doi: 10.1186/s13244-023-01507-5.
Written medical examinations consist of multiple-choice questions and/or questions requiring free-text answers. The latter require manual evaluation and rating, which is time-consuming and potentially error-prone. We tested whether natural language processing (NLP) can be used to automatically analyze free-text answers to support the review process.
The European Board of Radiology of the European Society of Radiology provided representative datasets comprising sample questions, answer keys, participant answers, and reviewer markings from European Diploma in Radiology (EDiR) examinations. The three free-text questions with the highest number of corresponding answers were selected: questions 1 and 2 were "unstructured" and required a typical free-text answer, whereas question 3 was "structured" and offered a selection of predefined wordings/phrases for participants to use in their free-text answer. The NLP engine was designed using word lists, rule-based synonyms, and decision tree learning based on the answer keys, and its performance was tested against the gold standard of reviewer markings.
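The abstract does not detail the engine's implementation; the following is a minimal, illustrative sketch (in Python, as used in the study) of how word lists, rule-based synonyms, and decision tree learning could be combined to mark free-text answers. All concepts, synonym lists, and sample answers below are hypothetical and are not taken from the EDiR material.

# Minimal sketch, not the authors' actual engine: match answer-key concepts
# via rule-based synonym lists, then let a decision tree learn which matched
# concepts predict the reviewer's "correct" marking.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical answer key: concept -> synonym list
ANSWER_KEY = {
    "pneumothorax": ["pneumothorax", "collapsed lung"],
    "chest_tube": ["chest tube", "thoracostomy", "drain"],
}

def features(answer):
    """One binary feature per concept: 1 if any synonym occurs in the answer."""
    text = answer.lower()
    return [int(any(s in text for s in syns)) for syns in ANSWER_KEY.values()]

# Toy training data: participant answers with reviewer markings (1 = point awarded)
answers = [
    "collapsed lung treated with a chest tube",
    "I would order an MRI",
    "pneumothorax, insert a drain",
]
marks = [1, 0, 1]

clf = DecisionTreeClassifier(random_state=0)
clf.fit([features(a) for a in answers], marks)

print(clf.predict([features("pneumothorax managed by thoracostomy")]))  # -> [1]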
After implementing the NLP approach in Python, F1 scores were calculated as a measure of NLP performance: 0.26 (unstructured question 1, n = 96), 0.33 (unstructured question 2, n = 327), and 0.5 (structured question 3, n = 111). The respective precision/recall values were 0.26/0.27, 0.4/0.32, and 0.62/0.55.
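F1 is the harmonic mean of precision (P) and recall (R), F1 = 2PR / (P + R). A minimal sketch of such an evaluation step, comparing the NLP engine's markings against the reviewer gold standard, is shown below; the labels are illustrative and not data from the study.

# Sketch of the evaluation step: compare NLP markings against reviewer markings.
from sklearn.metrics import precision_recall_fscore_support

gold = [1, 0, 1, 1, 0, 1]  # reviewer markings (gold standard, 1 = point awarded)
pred = [1, 0, 0, 1, 1, 1]  # NLP engine markings

p, r, f1, _ = precision_recall_fscore_support(gold, pred, average="binary")
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")  # precision=0.75 recall=0.75 F1=0.75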
This study demonstrated the successful design of an NLP-based approach for the automatic evaluation of free-text answers in the EDiR examination. As a future field of application, NLP could thus serve as a decision-support system for reviewers and inform the design of examinations adapted to the requirements of an automated, NLP-based review process.
Natural language processing can be successfully used to automatically evaluate free-text answers, performing better with more structured question-answer formats. Furthermore, this study provides a baseline for further work applying, for example, more sophisticated NLP approaches such as large language models.
• Free-text answers require manual evaluation, which is time-consuming and potentially error-prone.
• We developed a simple NLP-based approach, requiring only minimal effort/modeling, to automatically analyze and mark free-text answers.
• Our NLP engine has the potential to support the manual evaluation process.
• NLP performance is better on a more structured question-answer format.