Quantifying and Improving the Performance of Speech Recognition Systems on Dysphonic Speech.

Author Information

Hidalgo Lopez Julio C, Sandeep Shelly, Wright MaKayla, Wandell Grace M, Law Anthony B

Affiliations

Emory University School of Medicine, Atlanta, Georgia, USA.

Georgia State University, Atlanta, Georgia, USA.

Publication Information

Otolaryngol Head Neck Surg. 2023 May;168(5):1130-1138. doi: 10.1002/ohn.170. Epub 2023 Jan 24.

Abstract

OBJECTIVE

This study seeks to quantify how current speech recognition systems perform on dysphonic speech input and to determine whether they can be improved.

STUDY DESIGN

Experimental machine learning methods based on a retrospective database.

SETTING

Single academic voice center.

METHODS

A database of dysphonic speech recordings was created and tested against 3 speech recognition platforms. Platform performance on dysphonic voice input was compared to platform performance on normal voice input. A custom speech recognition model was trained on voice from patients with spasmodic dysphonia or vocal cord paralysis. Custom model performance was compared to base model performance.
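The abstract does not publish the scoring code, but a word-level accuracy comparison of the kind described above can be sketched as follows. This is a minimal illustration, assuming the open-source `jiwer` package; the example transcripts and the equation accuracy = 1 - word error rate are assumptions, not the authors' exact metric.

```python
# Minimal sketch of scoring a platform's transcript against a reference
# transcript, assuming the open-source `jiwer` package. The strings below
# are hypothetical; the abstract does not specify the exact metric formula.
import jiwer

def word_accuracy(reference: str, hypothesis: str) -> float:
    # One common convention: accuracy = 1 - word error rate (WER).
    return 1.0 - jiwer.wer(reference, hypothesis)

reference = "the quick brown fox jumps over the lazy dog"   # clinician-verified transcript
hypothesis = "the quick brown fox jump over a lazy dog"     # platform output for the same recording
print(f"word accuracy: {word_accuracy(reference, hypothesis):.2%}")
```

Averaging this per-recording score over the dysphonic and normal-voice test sets would yield platform-level figures comparable to those reported in the Results.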

RESULTS

All platforms performed well on normal voice, but 2 platforms performed significantly worse on dysphonic speech. Accuracy metrics on dysphonic speech returned values of 84.55%, 88.57%, and 93.56% for International Business Machines (IBM) Watson, Amazon Transcribe, and Microsoft Azure, respectively. The secondary analysis demonstrated that the lower performance of IBM Watson and Amazon Transcribe was driven by performance on spasmodic dysphonia and vocal fold paralysis. Thus, a custom model was built to increase accuracy for these pathologies on the Microsoft platform. Overall, the accuracy of the custom model was 96.43% on dysphonic voices and 97.62% on normal voices.

CONCLUSION

Current speech recognition systems generally perform worse on dysphonic speech than on normal speech. We theorize that this poor performance is a consequence of a lack of dysphonic voices in each platform's original training dataset. We address this limitation by using transfer learning to increase the performance of these systems on dysphonic speech.
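The custom-model training itself was done with Microsoft's platform tooling and is not detailed in the abstract. The transfer-learning idea it describes, starting from a model trained on typical speech and adapting it with a small set of dysphonic recordings, can be sketched with an open-source acoustic model. The model name, data handling, and hyperparameters below are illustrative assumptions, not the authors' configuration.

```python
# Illustrative transfer-learning step with an open-source ASR model
# (Hugging Face `transformers` + `torch`); NOT the authors' Microsoft
# custom-model pipeline. The waveform and transcript are assumed inputs.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Transfer learning: keep the general acoustic front end frozen and adapt
# only the upper layers to the dysphonic-speech distribution.
model.freeze_feature_encoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def fine_tune_step(waveform, transcript):
    """One gradient step on a single (16 kHz waveform, reference text) pair."""
    inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
    # This checkpoint's character vocabulary is uppercase.
    labels = processor.tokenizer(transcript.upper(), return_tensors="pt").input_ids
    loss = model(input_values=inputs.input_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

In this sketch the pretrained model supplies the general acoustic knowledge, and only a small amount of pathology-specific speech is needed to adapt it, which mirrors the rationale stated in the conclusion.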

