Varshney Suvi, Farias Dana, Brandman David M, Stavisky Sergey D, Miller Lee M
Department of Neurological Surgery, University of California, Davis.
Computer Science Graduate Group, University of California, Davis.
Int IEEE EMBS Conf Neural Eng. 2023 Apr;2023. doi: 10.1109/ner52421.2023.10123751. Epub 2023 May 19.
Brain-computer interfaces (BCIs) can potentially restore lost function in patients with neurological injury. A promising new application of BCI technology has focused on speech restoration. One approach is to synthesize speech from the neural correlates of a person who cannot speak, as they attempt to do so. However, there is no established gold standard for quantifying the quality of BCI-synthesized speech. Quantitative metrics, such as correlation coefficients between true and decoded speech, are not applicable to anarthric users and fail to capture intelligibility to actual human listeners; by contrast, methods in which human listeners complete forced-choice questionnaires are imprecise, impractical at scale, and cannot be used as cost functions for improving speech decoding algorithms. Here, we present a deep learning-based "AI Listener" that can be used to evaluate BCI speech intelligibility objectively, rapidly, and automatically. We begin by adapting several leading automatic speech recognition (ASR) deep learning models - DeepSpeech, Wav2vec 2.0, and Kaldi - to suit our application. We then evaluate the performance of these ASRs on multiple speech datasets with varying levels of intelligibility, including healthy speech, speech from people with dysarthria, and synthesized BCI speech. Our results demonstrate that the multilingual ASR model XLSR-Wav2vec 2.0, trained to output phonemes, yields the best speech transcription accuracy. Notably, the AI Listener reports that several previously published BCI output datasets are not intelligible, consistent with the judgments of human listeners.
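As a rough illustration of the kind of pipeline the abstract describes, the sketch below scores an utterance with a phoneme-output XLSR-Wav2vec 2.0 checkpoint and a phoneme error rate. The checkpoint name ("facebook/wav2vec2-xlsr-53-espeak-cv-ft"), the file names, and the scoring choices are assumptions for illustration only; the authors' actual model adaptation, training data, and intelligibility metric may differ.

```python
# Minimal sketch of an "AI Listener"-style intelligibility check.
# Assumes a public phoneme-output XLSR-Wav2vec 2.0 checkpoint; not the paper's exact model.
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "facebook/wav2vec2-xlsr-53-espeak-cv-ft"  # assumed checkpoint
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

def transcribe_phonemes(wav_path: str) -> list[str]:
    """Return the model's phoneme transcription of one utterance."""
    audio, _ = librosa.load(wav_path, sr=16000)  # load and resample to 16 kHz
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0].split()  # space-separated phoneme labels

def phoneme_error_rate(ref: list[str], hyp: list[str]) -> float:
    """Levenshtein distance between phoneme sequences, normalized by reference length."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical usage: compare BCI-synthesized audio against a reference recording
# of the cued sentence; a lower phoneme error rate indicates higher intelligibility.
hyp = transcribe_phonemes("bci_output.wav")           # hypothetical file name
ref = transcribe_phonemes("reference_recording.wav")  # hypothetical file name
print(f"phoneme error rate: {phoneme_error_rate(ref, hyp):.2f}")
```

Scoring at the phoneme level, as the abstract suggests, avoids penalizing utterances that are acoustically intelligible but lexically out of an ASR language model's vocabulary, which matters for short or unusual BCI-cued sentences.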