Suyanto Suyanto, Ade Romadhony, Febryanti Sthevanie, Rezza Nafi Ismail
School of Computing, Telkom University, Bandung, Indonesia.
Heliyon. 2021 Oct 5;7(10):e08115. doi: 10.1016/j.heliyon.2021.e08115. eCollection 2021 Oct.
Recent deep learning-based syllabification models generally achieve low error rates for high-resource languages with large datasets but can produce high error rates for low-resource languages. In this paper, two procedures, massive data augmentation and validation, are proposed to improve a deep learning-based syllabification model that combines bidirectional long short-term memory (BiLSTM), convolutional neural networks (CNN), and conditional random fields (CRF) for the low-resource Indonesian language. The massive data augmentation comprises four methods: transposing nuclei, swapping consonant graphemes, flipping onsets, and creating acronyms. Meanwhile, the validation is implemented using a phonotactic-based scheme. A preliminary investigation on 50k Indonesian words shows that these augmentation methods enlarge the dataset by 12.8M words that are valid under the phonotactic rules. An examination using 5-fold cross-validation then shows that the augmentation methods significantly improve the BiLSTM-CNN-CRF model on both a 50k formal-word dataset and a 100k named-entity dataset. A detailed analysis indicates that augmenting the training set reduces the word error rate (WER) stemming from long formal words and named entities.
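The nucleus-transposition augmentation described above can be sketched as follows. This is an illustrative assumption of how such a method might operate on dot-syllabified words, not the paper's actual implementation; the regex-based onset/nucleus/coda split and all function names are hypothetical:

```python
import re

def split_syllable(syl):
    """Split a syllable into (onset, nucleus, coda) around its vowel run.

    Assumes a simple five-vowel inventory; syllables with no vowel are
    returned unchanged as the onset.
    """
    m = re.match(r"([^aeiou]*)([aeiou]+)([^aeiou]*)$", syl)
    return m.groups() if m else (syl, "", "")

def transpose_nuclei(word):
    """Swap the nuclei of the first two syllables of a dot-syllabified word,
    producing a new candidate word for the augmented dataset."""
    syls = word.split(".")
    if len(syls) < 2:
        return word  # nothing to transpose in a monosyllable
    o1, n1, c1 = split_syllable(syls[0])
    o2, n2, c2 = split_syllable(syls[1])
    syls[0], syls[1] = o1 + n2 + c1, o2 + n1 + c2
    return ".".join(syls)

print(transpose_nuclei("ki.ta"))  # → "ka.ti"
```

In the paper's pipeline, each candidate generated this way would then pass through the phonotactic validation step, so only words whose syllables obey Indonesian phonotactic rules are kept.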