Xiaofeng Liu, Fangxu Xing, Jerry L. Prince, Jiachen Zhuo, Maureen Stone, Georges El Fakhri, Jonghye Woo
Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA.
Johns Hopkins University, Baltimore, MD, USA.
Med Image Comput Comput Assist Interv. 2022 Sep;13436:376-386. doi: 10.1007/978-3-031-16446-0_36. Epub 2022 Sep 17.
Understanding the underlying relationship between the tongue and oropharyngeal muscle deformation seen in tagged-MRI and intelligible speech plays an important role in advancing speech motor control theories and the treatment of speech-related disorders. Because of their heterogeneous representations, however, direct mapping between the two modalities, i.e., a two-dimensional (mid-sagittal slice) plus time tagged-MRI sequence and its corresponding one-dimensional waveform, is not straightforward. Instead, we resort to two-dimensional spectrograms as an intermediate representation, which contain both pitch and resonance information, from which we develop an end-to-end deep learning framework to translate a sequence of tagged-MRI into its corresponding audio waveform with a limited dataset size. Our framework is based on a novel fully convolutional asymmetric translator guided by a self-residual attention strategy that specifically exploits the moving muscular structures during speech. In addition, we leverage the pairwise correlation of samples with the same utterance through a latent-space representation disentanglement strategy. Furthermore, we incorporate adversarial training with generative adversarial networks to improve the realism of the generated spectrograms. Our experimental results, obtained with a total of 63 tagged-MRI sequences alongside speech acoustics, show that our framework generates clear audio waveforms from a sequence of tagged-MRI, surpassing competing methods. Thus, our framework holds great potential for better understanding the relationship between the two modalities.
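The intermediate representation at the heart of the framework, a two-dimensional spectrogram capturing both pitch and resonance, can be illustrated with a minimal short-time Fourier transform. This is a generic sketch using NumPy, not the authors' implementation; the window length, hop size, and sampling rate are illustrative assumptions.

```python
import numpy as np

def stft_magnitude(wave, n_fft=512, hop=128):
    """Magnitude spectrogram of a 1-D waveform via a windowed short-time FFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Real FFT per frame -> (n_frames, n_fft // 2 + 1) frequency bins
    return np.abs(np.fft.rfft(frames, axis=1))

# A 1-second synthetic "utterance": a 220 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 220 * t)
spec = stft_magnitude(wave)
print(spec.shape)  # 2-D time-frequency representation: (122, 257)
```

A network mapping tagged-MRI frames to such a 2-D array faces an image-to-image-like problem rather than a direct image-to-waveform one, which is why the spectrogram serves as a convenient translation target; a phase-recovery step is then needed to obtain the final audio waveform.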