College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al Kharj, Saudi Arabia.
Faculty of Computers and Artificial Intelligence, Helwan University, Cairo, Egypt.
Comput Intell Neurosci. 2022 Oct 30;2022:3214255. doi: 10.1155/2022/3214255. eCollection 2022.
The Arabic syntactic diacritics restoration problem is often solved using long short-term memory (LSTM) networks. Handcrafted features are used to augment these LSTM networks or taggers to improve performance. A transformer-based machine learning technique known as bidirectional encoder representations from transformers (BERT) has become the state-of-the-art method for natural language understanding in recent years. In this paper, we present a novel tagger based on BERT models to restore Arabic syntactic diacritics. We formulate syntactic diacritics restoration as a token sequence classification task similar to named-entity recognition (NER). Using the Arabic TreeBank (ATB) corpus, the developed BERT tagger achieves a 1.36% absolute case-ending error rate (CEER) improvement over other systems.
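The abstract frames restoration as NER-style token classification, evaluated by case-ending error rate (CEER). A minimal sketch of that metric is shown below; the label names are hypothetical placeholders for Arabic case-ending diacritics, not the paper's actual tag set.

```python
# Hedged sketch: syntactic diacritics restoration viewed as token
# classification (one case-ending label per word, analogous to NER tags).
# The label inventory below is illustrative, not the paper's exact set.
LABELS = ["DAMMA", "FATHA", "KASRA", "DAMMATAN", "FATHATAN", "KASRATAN", "SUKUN"]

def ceer(gold, pred):
    """Case-ending error rate: fraction of tokens whose predicted
    case-ending label differs from the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be aligned token-for-token")
    errors = sum(g != p for g, p in zip(gold, pred))
    return errors / len(gold)

# Toy example: one of four tokens is mislabeled.
gold = ["DAMMA", "KASRA", "FATHA", "SUKUN"]
pred = ["DAMMA", "FATHA", "FATHA", "SUKUN"]
print(ceer(gold, pred))  # 0.25
```

In this framing, a BERT-based tagger would emit one label per word (e.g. via a token-classification head), and CEER is simply the per-token error over case-ending positions.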