Making manual scoring of typed transcripts a thing of the past: a commentary on Herrmann (2025).

Author information

Bosker Hans Rutger

Affiliation

Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, the Netherlands.

Publication information

Speech Lang Hear. 2025 Jun 9;28(1):2514395. doi: 10.1080/2050571X.2025.2514395. eCollection 2025.

DOI:10.1080/2050571X.2025.2514395

PMID:40757149

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12312738/

Abstract

Coding the accuracy of typed transcripts from experiments testing speech intelligibility is an arduous endeavour. A recent study in this journal [Herrmann, B. 2025. Leveraging natural language processing models to automate speech-intelligibility scoring. (1)] presents a novel approach for automating the scoring of such listener transcripts, leveraging Natural Language Processing (NLP) models. It involves the calculation of the semantic similarity between transcripts and target sentences using high-dimensional vectors, generated by such NLP models as ADA2, GPT2, BERT, and USE. This approach demonstrates exceptional accuracy, with negligible underestimation of intelligibility scores (by about 2-4%), numerically outperforming simpler computational tools like Autoscore and TSR. The method uniquely relies on semantic representations generated by large language models. At the same time, these models also form the Achilles heel of the technique: the transparency, accessibility, data security, ethical framework, and cost of the selected model directly impact the suitability of the NLP-based scoring method. Hence, working with such models can raise serious risks regarding the reproducibility of scientific findings. This in turn emphasises the need for fair, ethical, and evidence-based open source models. With such models, Herrmann's new tool represents a valuable addition to the speech scientist's toolbox.
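To make the idea of scoring a transcript by vector similarity concrete, here is a minimal sketch. It is not Herrmann's implementation. It assumes cosine similarity as the measure, which the abstract does not specify, and it swaps the NLP models (ADA2, GPT2, BERT, USE) for a hypothetical word-count embedding called `toy_embed`, which captures word overlap rather than meaning. The function names are illustrative only.

```python
import math
from collections import Counter


def toy_embed(sentence, vocabulary):
    """Hypothetical stand-in for an NLP model: word counts over a fixed vocabulary.

    A real embedding model would return a dense, high-dimensional vector that
    reflects meaning; this toy version only reflects which words occur.
    """
    counts = Counter(sentence.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_transcript(target, transcript):
    """Score a listener transcript against the target sentence (assumed measure: cosine)."""
    vocabulary = sorted(set(target.lower().split()) | set(transcript.lower().split()))
    return cosine_similarity(
        toy_embed(target, vocabulary), toy_embed(transcript, vocabulary)
    )


if __name__ == "__main__":
    target = "the cat sat on the mat"
    for transcript in ["the cat sat on the mat", "the cat sat", ""]:
        print(repr(transcript), round(score_transcript(target, transcript), 3))
```

Replacing `toy_embed` with vectors from an actual language model would give credit for meaning-preserving responses rather than only exact word matches, which is the property the commentary attributes to the NLP-based method. It would also bring in the transparency, cost, and reproducibility concerns the abstract raises.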
