Biron Tirza, Barboy Moshe, Ben-Artzy Eran, Golubchik Alona, Marmor Yanir, Marron Assaf, Szekely Smadar, Winter Yaron, Harel David
Weizmann Institute of Science, Faculty of Mathematics and Computer Science, Rehovot 7610001, Israel.
Proc Natl Acad Sci U S A. 2025 Sep 16;122(37):e2500510122. doi: 10.1073/pnas.2500510122. Epub 2025 Sep 12.
We propose a theoretical framework and a cost-effective automated method for the interpretation of prosodic messages (e.g., chunking of information, emphasis, conversation action, emotion). At the core of the proposal is a hierarchy of layered prosodic messages that co-occur within the same intonation unit (0.5 to 2 s long). Motivated by this hierarchy, a procedure for the differential detection of three such co-occurring nonverbal messages is then described. By way of implementation, we produce a variant model of the WHISPER automatic speech recognition system that flags intonation unit boundaries, intonation unit prototypes, and emphases therein. The procedure required us to alter WHISPER's token combinations and significantly adjust its prediction process. The variant model was tested on four datasets that contain spontaneous and read speech, where it performs on a par with similar human annotation, and often better, using relatively modest training data. Several insights regarding this implementation, such as model size and encoding methods, are described as well. We believe that the proposed framework, coupled with the results of its application herein, can greatly improve the analysis of speech and language, integrating contextual information and speaker intentions into linguistic descriptions for a large array of purposes with modest means.
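As an illustration only, the idea of flagging intonation unit boundaries and emphases inside a transcript can be sketched as inline marker tokens in the text a model is trained to emit. The marker strings `<|iu|>` and `<|emph|>`, the `Word` class, and the `encode`/`decode` helpers below are hypothetical and are not taken from the paper. The abstract does not specify the actual token inventory or encoding, and intonation unit prototypes are left out of this sketch.

```python
from dataclasses import dataclass

# Hypothetical marker strings (assumption): the abstract does not state
# which tokens the variant model actually uses.
IU_BOUNDARY = "<|iu|>"   # placed after the last word of an intonation unit
EMPHASIS = "<|emph|>"    # placed before an emphasized word


@dataclass
class Word:
    text: str
    emphasized: bool = False
    ends_unit: bool = False


def encode(words):
    """Serialize annotated words into a single marked-up transcript string."""
    parts = []
    for w in words:
        if w.emphasized:
            parts.append(EMPHASIS)
        parts.append(w.text)
        if w.ends_unit:
            parts.append(IU_BOUNDARY)
    return " ".join(parts)


def decode(marked):
    """Recover the annotated words from a marked-up transcript string."""
    words = []
    pending_emphasis = False
    for piece in marked.split():
        if piece == EMPHASIS:
            pending_emphasis = True
        elif piece == IU_BOUNDARY:
            if words:  # a boundary attaches to the preceding word
                words[-1].ends_unit = True
        else:
            words.append(Word(piece, emphasized=pending_emphasis))
            pending_emphasis = False
    return words


if __name__ == "__main__":
    sample = [
        Word("so"),
        Word("I"),
        Word("really", emphasized=True),
        Word("tried", ends_unit=True),
    ]
    print(encode(sample))
```

A round trip through `encode` and `decode` returns the original annotations, which is the property a text-side encoding of this kind would need so that model output can be mapped back to prosodic labels.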