Talaat Mohamed, Barari Kian, Si Xiuhua April, Xi Jinxiang
Department of Biomedical Engineering, University of Massachusetts, Lowell, MA, 01854, USA.
Department of Aerospace, Industrial, and Mechanical Engineering, California Baptist University, Riverside, CA, 92504, USA.
Vis Comput Ind Biomed Art. 2024 May 22;7(1):12. doi: 10.1186/s42492-024-00163-w.
Speech is a highly coordinated process that requires precise control over vocal tract morphology and motion to produce intelligible sounds while simultaneously generating unique exhaled flow patterns. The schlieren imaging technique visualizes airflows with subtle density variations. It is hypothesized that speech flows captured by schlieren imaging, when analyzed with a hybrid convolutional neural network (CNN) and long short-term memory (LSTM) network, can be used to recognize letter pronunciations, thus facilitating automatic speech recognition and speech disorder therapy. This study evaluated the feasibility of using a CNN-based video classification network to differentiate speech flows corresponding to the first four letters of the alphabet: /A/, /B/, /C/, and /D/. A schlieren optical system was developed, and the speech flows of letter pronunciations were recorded for two participants at an acquisition rate of 60 frames per second. A total of 640 video clips, each lasting 1 s, were used to train and test a hybrid CNN-LSTM network. Acoustic analyses of the recorded sounds were conducted to understand the phonetic differences among the four letters. The hybrid CNN-LSTM network was trained separately on four datasets of varying sizes (i.e., 20, 30, 40, and 50 videos per letter), all achieving over 95% accuracy when classifying videos of the same participant. However, the network's performance declined when tested on speech flows from a different participant, with accuracy dropping to around 44%, indicating significant inter-participant variability in letter pronunciation. Retraining the network with videos from both participants improved accuracy on the second participant's videos to 93%. Analysis of misclassified videos indicated that factors such as low video quality and disproportionate head size affected accuracy.
These results highlight the potential of CNN-assisted speech recognition and speech therapy using articulation flows, although challenges remain in expanding the letter set and the participant cohort.
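The pipeline described above (a CNN front end extracting per-frame features from schlieren images, followed by an LSTM aggregating a 60-frame, 1-s sequence into a four-way classification) can be sketched at toy scale. The sketch below is a minimal numpy illustration under assumed shapes: the filter bank, hidden size, 32x32 frame size, and untrained random weights are illustrative placeholders, not the paper's actual network or parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_features(frame, kernels):
    """Tiny CNN stand-in: valid 2-D convolution + ReLU + global average pooling.
    frame: (H, W) grayscale schlieren image; kernels: (K, kh, kw) filter bank."""
    K, kh, kw = kernels.shape
    H, W = frame.shape
    oh, ow = H - kh + 1, W - kw + 1
    feats = np.empty(K)
    for k in range(K):
        out = np.zeros((oh, ow))
        for i in range(kh):
            for j in range(kw):
                out += kernels[k, i, j] * frame[i:i + oh, j:j + ow]
        feats[k] = np.maximum(out, 0.0).mean()  # ReLU, then pool to one scalar per filter
    return feats

def cnn_lstm_classify(video, kernels, Wx, Wh, b, Wo, bo):
    """Feed per-frame CNN features through one LSTM cell; classify the final hidden state."""
    Hd = Wh.shape[1]                      # LSTM hidden size
    h, c = np.zeros(Hd), np.zeros(Hd)
    for frame in video:                   # 60 frames = one 1-s clip at 60 fps
        x = conv_features(frame, kernels)
        z = Wx @ x + Wh @ h + b           # stacked gate pre-activations: i, f, o, g
        i, f, o = 1.0 / (1.0 + np.exp(-z[:3 * Hd].reshape(3, Hd)))
        g = np.tanh(z[3 * Hd:])
        c = f * c + i * g
        h = o * np.tanh(c)
    logits = Wo @ h + bo                  # 4 classes: /A/, /B/, /C/, /D/
    e = np.exp(logits - logits.max())
    return e / e.sum()                    # softmax class probabilities

# Illustrative, untrained parameters (shapes only; hypothetical, not the paper's weights).
K, Hd = 4, 8
kernels = rng.standard_normal((K, 3, 3))
Wx = 0.1 * rng.standard_normal((4 * Hd, K))
Wh = 0.1 * rng.standard_normal((4 * Hd, Hd))
b = np.zeros(4 * Hd)
Wo = 0.1 * rng.standard_normal((4, Hd))
bo = np.zeros(4)

video = rng.standard_normal((60, 32, 32))  # stand-in for one 1-s schlieren clip
probs = cnn_lstm_classify(video, kernels, Wx, Wh, b, Wo, bo)
```

The design choice mirrored here is the one stated in the abstract: the CNN summarizes each frame's spatial flow pattern, and the LSTM models how that pattern evolves over the clip, so the classifier sees articulation dynamics rather than single snapshots.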