Suyanto Suyanto, Ade Romadhony, Febryanti Sthevanie, Rezza Nafi Ismail
School of Computing, Telkom University, Bandung, Indonesia.
Heliyon. 2021 Oct 5;7(10):e08115. doi: 10.1016/j.heliyon.2021.e08115. eCollection 2021 Oct.
Recent deep learning-based syllabification models generally achieve low error rates for high-resource languages with large datasets but can produce high error rates for low-resource languages. In this paper, two procedures, massive data augmentation and validation, are proposed to improve a deep learning-based syllabification model that combines bidirectional long short-term memory (BiLSTM), convolutional neural networks (CNN), and conditional random fields (CRF) for the low-resource Indonesian language. The massive data augmentation comprises four methods: transposing nuclei, swapping consonant graphemes, flipping onsets, and creating acronyms. Meanwhile, the validation is implemented using a phonotactic-based scheme. A preliminary investigation on 50k Indonesian words shows that these augmentation methods enlarge the dataset by 12.8M words that are valid under the phonotactic rules. An examination using 5-fold cross-validation then shows that the augmentation methods significantly improve the BiLSTM-CNN-CRF model on both a 50k formal-word dataset and a 100k named-entity dataset. A detailed analysis indicates that augmenting the training set reduces the word error rate (WER) stemming from long formal words and named entities.
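The nucleus-transposition augmentation described above can be sketched as follows. This is an illustrative assumption of how such a method might operate on dot-syllabified words, not the paper's actual implementation; the regex-based onset/nucleus/coda split and all function names are hypothetical:

```python
import re

def split_syllable(syl):
    """Split a syllable into (onset, nucleus, coda) around its vowel run.

    Assumes a simple five-vowel inventory; syllables with no vowel are
    returned unchanged as the onset.
    """
    m = re.match(r"([^aeiou]*)([aeiou]+)([^aeiou]*)$", syl)
    return m.groups() if m else (syl, "", "")

def transpose_nuclei(word):
    """Swap the nuclei of the first two syllables of a dot-syllabified word,
    producing a new candidate word for the augmented dataset."""
    syls = word.split(".")
    if len(syls) < 2:
        return word  # nothing to transpose in a monosyllable
    o1, n1, c1 = split_syllable(syls[0])
    o2, n2, c2 = split_syllable(syls[1])
    syls[0], syls[1] = o1 + n2 + c1, o2 + n1 + c2
    return ".".join(syls)

print(transpose_nuclei("ki.ta"))  # → "ka.ti"
```

In the paper's pipeline, each candidate generated this way would then pass through the phonotactic validation step, so only words whose syllables obey Indonesian phonotactic rules are kept.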