Du Zhaohui, Zhao Xiaofeng, Li Lin, Yu Baohua, Miao Lijiang
School of Information Science and Technology, Shihezi University, Shihezi, China.
Xinjiang Uygur Autonomous Region Education Examination Centre, Urumqi, China.
PLoS One. 2025 May 23;20(5):e0324048. doi: 10.1371/journal.pone.0324048. eCollection 2025.
In recent years, empowered by artificial intelligence technologies, computer-assisted language learning systems have gradually become a hot research topic. Mainstream pronunciation assessment models currently rely on advanced speech recognition technology, converting speech into phoneme sequences and then identifying mispronounced phonemes through sequence comparison. To optimize the phoneme recognition task in pronunciation evaluation, this paper proposes a Chinese pronunciation phoneme recognition model based on an improved Zipformer-RNN-T (Pruned) architecture, aiming to improve recognition accuracy and reduce the parameter count. First, the AISHELL1-PHONEME and ST-CMDS-PHONEME datasets for Mandarin phoneme recognition are constructed through data preprocessing. Then, three layers of the Zipformer Block architecture are introduced into the Zipformer encoder, significantly enhancing model performance. In the stateless prediction network (Pred Network), the GELU activation function is adopted to help prevent neuron deactivation. Furthermore, a hybrid Pruned RNN-T/CTC loss fusion strategy is proposed, further optimizing recognition performance. The experimental results demonstrate that the method performs excellently on the phoneme recognition task, achieving a Word Error Rate (WER) of 1.92% (Dev) and 2.12% (Test) on the AISHELL1-PHONEME dataset, and 4.28% (Dev) and 4.51% (Test) on the ST-CMDS-PHONEME dataset. Moreover, the model requires only 61.1M parameters, striking a balance between performance and efficiency.
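The sequence-comparison step described in the abstract, aligning a reference phoneme sequence against a recognized one to locate mispronunciations, is typically a standard edit-distance alignment, and the same dynamic program yields the WER figures reported above. A minimal sketch (the function names and the example phoneme tokens are ours, not the paper's):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # i deletions to reach empty hypothesis
    for j in range(n + 1):
        dp[0][j] = j  # j insertions from empty reference
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match/substitution
    return dp[m][n]

def wer(ref, hyp):
    """Token-level error rate: total edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)

# Example: one substituted phoneme out of four reference tokens -> WER = 0.25
ref = ["zh", "ong1", "g", "uo2"]
hyp = ["zh", "ong1", "k", "uo2"]
print(wer(ref, hyp))  # 0.25
```

In a pronunciation assessment setting, the backtrace of the same table (not shown) identifies which phonemes were substituted, inserted, or deleted.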
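The hybrid loss strategy in the abstract combines the pruned RNN-T loss with an auxiliary CTC loss. A common way to fuse two such objectives is a fixed-weight interpolation; the sketch below assumes that form, and the weight value and function name are illustrative, not taken from the paper:

```python
def fused_loss(pruned_rnnt_loss, ctc_loss, lam=0.7):
    """Weighted fusion of pruned RNN-T and CTC losses.

    lam is an illustrative interpolation weight; the paper's actual
    weighting scheme may differ. Both inputs are scalar loss values
    (e.g. batch-averaged losses from the two heads).
    """
    return lam * pruned_rnnt_loss + (1.0 - lam) * ctc_loss

print(fused_loss(1.0, 2.0))  # ~1.3 with the default lam = 0.7
```

In practice the CTC branch shares the encoder output with the transducer head, so the auxiliary term regularizes encoder training at little extra cost.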