Erattakulangara Subin, Kelat Karthika, Meyer David, Priya Sarv, Lingala Sajan Goud
Roy J. Carver Department of Biomedical Engineering, University of Iowa, Iowa City, IA 52242, USA.
Janette Ogg Voice Research Center, Shenandoah University, Winchester, VA 22601, USA.
Bioengineering (Basel). 2023 May 22;10(5):623. doi: 10.3390/bioengineering10050623.
Dynamic magnetic resonance imaging has emerged as a powerful modality for investigating upper-airway function during speech production. Analyzing the changes in the vocal tract airspace, including the position of soft-tissue articulators (e.g., the tongue and velum), enhances our understanding of speech production. The advent of various fast speech MRI protocols based on sparse sampling and constrained reconstruction has led to the creation of dynamic speech MRI datasets on the order of 80-100 image frames/second. In this paper, we propose a stacked transfer learning U-NET model to segment the deforming vocal tract in 2D mid-sagittal slices of dynamic speech MRI. Our approach leverages (a) low- and mid-level features and (b) high-level features. The low- and mid-level features are derived from models pre-trained on labeled open-source brain tumor MR and lung CT datasets, and an in-house airway-labeled dataset. The high-level features are derived from labeled protocol-specific MR images. The applicability of our approach to segmenting dynamic datasets is demonstrated on data acquired from three fast speech MRI protocols. Protocol 1: a 3 T-based radial acquisition scheme coupled with a non-linear temporal regularizer, where speakers produced French speech tokens; Protocol 2: a 1.5 T-based uniform-density spiral acquisition scheme coupled with a temporal finite difference (FD) sparsity regularization, where speakers produced fluent speech tokens in English; and Protocol 3: a 3 T-based variable-density spiral acquisition scheme coupled with manifold regularization, where speakers produced various speech tokens from the International Phonetic Alphabet (IPA). Segmentations from our approach were compared to those from an expert human user (a vocologist) and to those from a conventional U-NET model without transfer learning. Segmentations from a second expert human user (a radiologist) were used as ground truth.
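The stacked transfer-learning idea above, keeping pretrained low- and mid-level weights fixed while fine-tuning high-level layers on a handful of protocol-specific images, can be sketched in miniature. This is an illustrative NumPy sketch, not the authors' implementation: the block names (`enc1`, `enc2`, `dec1`, `dec2`) and the frozen/fine-tuned split are assumptions for demonstration.

```python
import numpy as np

# Hypothetical parameter store: one weight array per named U-NET block.
rng = np.random.default_rng(0)
params = {
    "enc1": rng.standard_normal((8, 8)),  # low-level features (pretrained)
    "enc2": rng.standard_normal((8, 8)),  # mid-level features (pretrained)
    "dec1": rng.standard_normal((8, 8)),  # high-level features (fine-tuned)
    "dec2": rng.standard_normal((8, 8)),  # high-level features (fine-tuned)
}
FROZEN = {"enc1", "enc2"}  # keep pretrained low/mid-level weights fixed

def sgd_step(params, grads, lr=0.01):
    """Apply a gradient step only to the non-frozen (protocol-specific) layers."""
    return {
        name: w if name in FROZEN else w - lr * grads[name]
        for name, w in params.items()
    }

# Dummy gradients stand in for backprop through a segmentation loss.
grads = {name: np.ones_like(w) for name, w in params.items()}
new_params = sgd_step(params, grads)
assert np.array_equal(new_params["enc1"], params["enc1"])      # frozen, unchanged
assert not np.array_equal(new_params["dec1"], params["dec1"])  # fine-tuned, updated
```

Freezing the transferred layers is what lets the model adapt to a new protocol from only ~20 labeled images: only the high-level parameters need to be estimated from the small protocol-specific set.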
Evaluations were performed using the quantitative DICE similarity metric, the Hausdorff distance metric, and a segmentation count metric. The approach was successfully adapted to different speech MRI protocols with only a handful of protocol-specific images (on the order of 20 images), and provided accurate segmentations similar to those of an expert human.
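For readers unfamiliar with the two overlap metrics named above, both have standard definitions on binary masks: DICE is 2|A∩B|/(|A|+|B|), and the (symmetric) Hausdorff distance is the larger of the two directed worst-case nearest-neighbor distances between the foreground pixel sets. A minimal NumPy sketch, with illustrative 10×10 masks (not data from the paper):

```python
import numpy as np

def dice(a, b):
    """DICE similarity between two binary masks: 2|A∩B| / (|A| + |B|)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def hausdorff(a, b):
    """Symmetric Hausdorff distance between the foreground pixel coordinates."""
    pa, pb = np.argwhere(a), np.argwhere(b)
    # Pairwise Euclidean distances between all foreground pixels (brute force).
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Toy example: two overlapping 4x4 square masks offset by one pixel.
pred = np.zeros((10, 10), int); pred[2:6, 2:6] = 1
gt   = np.zeros((10, 10), int); gt[3:7, 3:7] = 1
print(dice(pred, gt))       # 0.5625  (intersection 9, sizes 16 + 16)
print(hausdorff(pred, gt))  # 1.4142... (corner pixels are sqrt(2) apart)
```

In practice the Hausdorff distance is often computed on mask boundaries with distance transforms (e.g., `scipy.spatial.distance.directed_hausdorff` or `scipy.ndimage.distance_transform_edt`) rather than this brute-force pairwise form, which scales quadratically with foreground size.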