Clinical Audiology, Speech and Language Research Centre, Queen Margaret University, Musselburgh EH21 6UU, UK.
Articulate Instruments Ltd., Musselburgh EH21 6UU, UK.
Sensors (Basel). 2022 Feb 2;22(3):1133. doi: 10.3390/s22031133.
Automatic feature extraction from images of speech articulators is currently achieved by detecting edges. Here, we investigate the use of pose estimation deep neural nets with transfer learning to perform markerless estimation of speech articulator keypoints using only a few hundred hand-labelled images as training input. Midsagittal ultrasound images of the tongue, jaw, and hyoid and camera images of the lips were hand-labelled with keypoints; networks were trained on these images using DeepLabCut and evaluated on unseen speakers and systems. Tongue surface contours interpolated from estimated and hand-labelled keypoints produced an average mean sum of distances (MSD) of 0.93 mm (s.d. 0.46 mm), compared with 0.96 mm (s.d. 0.39 mm) between two human labellers and 2.3 mm (s.d. 1.5 mm) for the best performing edge detection algorithm. A pilot set of simultaneous electromagnetic articulography (EMA) and ultrasound recordings demonstrated partial correlation between three physical sensor positions and the corresponding estimated keypoints; this requires further investigation. The accuracy of estimating lip aperture from camera video was high, with a mean MSD of 0.70 mm (s.d. 0.56 mm), compared with 0.57 mm (s.d. 0.48 mm) between two human labellers. DeepLabCut was found to be a fast, accurate and fully automatic method of providing unique kinematic data for the tongue, hyoid, jaw, and lips.
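The abstract reports contour agreement as MSD but does not define it. As an illustration only, the sketch below assumes the symmetric nearest-neighbour form often used for tongue contours: each point on one contour is matched to its closest point on the other, and the distances in both directions are averaged. The function name `mean_sum_of_distances` and the toy coordinates are hypothetical and are not taken from the paper; the exact formula the authors used may differ.

```python
import numpy as np


def mean_sum_of_distances(contour_a, contour_b):
    """Symmetric nearest-neighbour distance between two 2D contours.

    Assumed definition (not confirmed by the abstract): for every point on
    each contour, take the distance to the closest point on the other
    contour, then average over all points of both contours.
    Inputs are sequences of (x, y) coordinates in the same unit, e.g. mm.
    """
    a = np.asarray(contour_a, dtype=float)
    b = np.asarray(contour_b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    a_to_b = pairwise.min(axis=1).sum()  # nearest point on b for each point of a
    b_to_a = pairwise.min(axis=0).sum()  # nearest point on a for each point of b
    return (a_to_b + b_to_a) / (len(a) + len(b))


# Toy example: two flat contours exactly 1 mm apart.
estimated = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
hand_labelled = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
msd_mm = mean_sum_of_distances(estimated, hand_labelled)  # 1.0
```

Under this assumed definition the measure is symmetric in its two arguments and is zero only when every point of each contour coincides with a point of the other, which is why it can be used both for comparing estimated with hand-labelled contours and for comparing two human labellers.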