Liu Xiaofeng, Xing Fangxu, Zhuo Jiachen, Stone Maureen, Prince Jerry L, El Fakhri Georges, Woo Jonghye
Gordon Center for Medical Imaging, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114 USA.
Dept. of Radiology, University of Maryland School of Medicine, Baltimore, MD 21201 USA.
Proc SPIE Int Soc Opt Eng. 2024 Feb;12926. doi: 10.1117/12.3006874. Epub 2024 May 1.
Understanding the relationship between tongue motion patterns during speech and the resulting speech acoustics (i.e., the articulatory-acoustic relation) is of great importance in assessing speech quality and developing innovative treatment and rehabilitative strategies. This is especially important when evaluating and detecting abnormal articulatory features in patients with speech-related disorders. In this work, we aim to develop a framework for detecting speech motion anomalies in conjunction with their corresponding speech acoustics. This is achieved through the use of a deep cross-modal translator trained on data from healthy individuals only, which bridges the gap between 4D motion fields obtained from tagged MRI and 2D spectrograms derived from speech acoustic data. The trained translator is used as an anomaly detector by measuring the spectrogram reconstruction quality on healthy individuals or patients. In particular, the cross-modal translator is likely to generalize poorly to patient data, which contains unseen out-of-distribution patterns, and thus to yield lower reconstruction quality than it does for healthy individuals. A one-class SVM is then used to distinguish the spectrograms of healthy individuals from those of patients. To validate our framework, we collected a total of 39 paired tagged MRI scans and speech waveforms, comprising data from 36 healthy individuals and 3 tongue cancer patients. We used both 3D convolutional and transformer-based deep translation models, training them on the healthy training set and then applying them to both the healthy and patient testing sets. Our framework demonstrates a capability to detect abnormal patient data, thereby illustrating its potential in enhancing the understanding of the articulatory-acoustic relation for both healthy individuals and patients.
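The final stage of the framework described above can be sketched in code. The snippet below is a minimal, hypothetical illustration (not the authors' implementation) of fitting a one-class SVM on spectrogram reconstruction errors from the healthy training set and then flagging test samples as healthy-like or anomalous; the variable names, error values, and hyperparameters are assumptions for demonstration only.

```python
# Hypothetical sketch of the one-class SVM stage: the cross-modal
# translator is assumed to reconstruct healthy spectrograms with low
# error and patient spectrograms (out-of-distribution) with high error.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Stand-in per-utterance reconstruction errors (illustrative values,
# not measurements from the paper).
healthy_train_errors = rng.normal(loc=0.1, scale=0.02, size=(30, 1))
healthy_test_errors = rng.normal(loc=0.1, scale=0.02, size=(6, 1))
patient_test_errors = rng.normal(loc=0.4, scale=0.05, size=(3, 1))

# Fit the one-class SVM on healthy reconstruction errors only,
# mirroring the healthy-only training protocol of the translator.
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
ocsvm.fit(healthy_train_errors)

# predict() returns +1 for inliers (healthy-like) and -1 for anomalies.
print("healthy test:", ocsvm.predict(healthy_test_errors))
print("patient test:", ocsvm.predict(patient_test_errors))
```

In practice the feature fed to the one-class SVM could be a scalar reconstruction error or a richer embedding of the reconstructed spectrogram; the scalar version shown here is the simplest instance of the idea.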