Xiaofeng Liu, Fangxu Xing, Jerry L. Prince, Maureen Stone, Georges El Fakhri, Jonghye Woo
Gordon Center for Medical Imaging, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114 USA.
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218 USA.
Proc SPIE Int Soc Opt Eng. 2023 Feb;12464. doi: 10.1117/12.2653345. Epub 2023 Apr 3.
Investigating the relationship between internal tissue point motion of the tongue and oropharyngeal muscle deformation measured from tagged MRI and intelligible speech can aid in advancing speech motor control theories and developing novel treatment methods for speech-related disorders. However, elucidating the relationship between these two sources of information is challenging, due in part to the disparity in data structure between spatiotemporal motion fields (i.e., 4D motion fields) and one-dimensional audio waveforms. In this work, we present an efficient encoder-decoder translation network for exploring the predictive information inherent in 4D motion fields via 2D spectrograms as a surrogate for the audio data. Specifically, our encoder is based on 3D convolutional spatial modeling and transformer-based temporal modeling. The extracted features are processed by an asymmetric 2D convolution decoder to generate spectrograms that correspond to 4D motion fields. Furthermore, we incorporate a generative adversarial training approach into our framework to further improve the synthesis quality of our generated spectrograms. We experiment on 63 paired motion field sequences and speech waveforms, demonstrating that our framework enables the generation of clear audio waveforms from a sequence of motion fields. Thus, our framework has the potential to improve our understanding of the relationship between these two modalities and inform the development of treatments for speech disorders.
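The abstract's key data-representation choice is using a 2D spectrogram as a surrogate for the 1D audio waveform, which gives the decoder an image-like target. A minimal sketch of this transformation, using a generic short-time Fourier transform (the window length, hop size, and Hann window here are illustrative assumptions, not the authors' exact preprocessing):

```python
import numpy as np

def spectrogram(wav, n_fft=512, hop=128):
    """Magnitude spectrogram via a simple STFT with a Hann window.

    Returns an array of shape (n_fft // 2 + 1, n_frames): frequency bins
    on the vertical axis, time frames on the horizontal axis.
    """
    win = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack(
        [wav[i * hop : i * hop + n_fft] * win for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T

# 1 s of a 440 Hz tone at 16 kHz: the waveform is a length-16000 1D
# array, while its spectrogram is a 2D image-like array that a
# convolutional decoder can be trained to emit.
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (257, 122)
```

The energy of the tone concentrates in the frequency bin nearest 440 Hz (bin 440 / (16000 / 512) ≈ 14), which is why a spectrogram preserves the spectral content needed to later recover an audible waveform (e.g., via Griffin-Lim or a neural vocoder, neither of which is specified in the abstract).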