Hidayatullah Ahmad Fathan, Apong Rosyzie Anna, Lai Daphne T C, Qazi Atika
School of Digital Science, Universiti Brunei Darussalam, Bandar Seri Begawan, Brunei.
Department of Informatics, Universitas Islam Indonesia, Sleman, Yogyakarta, Indonesia.
PeerJ Comput Sci. 2023 Jun 22;9:e1312. doi: 10.7717/peerj-cs.1312. eCollection 2023.
With the massive use of social media today, mixing languages within social media text is prevalent. In linguistics, this phenomenon is known as code-mixing. The prevalence of code-mixing raises various concerns and challenges in natural language processing (NLP), including language identification (LID) tasks. This study presents a word-level language identification model for code-mixed Indonesian, Javanese, and English tweets. First, we introduce a code-mixed corpus for Indonesian-Javanese-English language identification (IJELID). To ensure reliable dataset annotation, we provide full details of the data collection procedure and the construction of the annotation standards. Some challenges encountered during corpus creation are also discussed in this paper. Then, we investigate several strategies for developing code-mixed language identification models, such as fine-tuning BERT, BLSTM-based models, and CRF. Our results show that fine-tuned IndoBERTweet models identify languages better than the other techniques. This reflects BERT's ability to understand each word's context from the given text sequence. Finally, we show that sub-word language representation in BERT models can provide a reliable model for identifying languages in code-mixed texts.
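The word-level LID setup the abstract describes hinges on aligning one language label per word with the sub-word tokens a BERT-style model actually sees. The sketch below illustrates that alignment step; the toy sub-word tokenizer and the example sentence are illustrative assumptions, not the paper's actual pipeline (which fine-tunes IndoBERTweet with its own WordPiece vocabulary):

```python
# Sketch: aligning word-level language labels (ID/JV/EN) to sub-word tokens,
# as is typical when fine-tuning BERT-style models for token classification.

def toy_subword_tokenize(word):
    """Hypothetical sub-word split: break words longer than 4 characters.
    A real WordPiece tokenizer splits by vocabulary, not fixed length."""
    if len(word) <= 4:
        return [word]
    return [word[:4]] + ["##" + word[i:i + 4] for i in range(4, len(word), 4)]

def align_labels(words, labels, ignore_index=-100):
    """Assign each word's language label to its first sub-word only; mask the
    remaining sub-words with ignore_index so the loss function skips them
    (a common recipe for token-classification fine-tuning)."""
    tokens, token_labels = [], []
    for word, label in zip(words, labels):
        pieces = toy_subword_tokenize(word)
        tokens.extend(pieces)
        token_labels.append(label)
        token_labels.extend([ignore_index] * (len(pieces) - 1))
    return tokens, token_labels

# A code-mixed Indonesian-English example (illustrative).
words = ["aku", "sedang", "belajar", "programming"]
labels = ["ID", "ID", "ID", "EN"]  # word-level language labels
tokens, token_labels = align_labels(words, labels)
# tokens and token_labels stay the same length, so each sub-word
# carries either a language label or the ignore marker.
```

At prediction time the inverse mapping applies: the label predicted for a word's first sub-word is taken as the label for the whole word, which is one way the sub-word representations mentioned in the abstract yield word-level language decisions.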