Musyafa Ahmad, Gao Ying, Solyman Aiman, Khan Siraj, Cai Wentian, Khan Muhammad Faizan
School of Computer Science and Engineering, South China University of Technology, Guangzhou, China.
Department of Informatics Engineering, Pamulang University, South Tangerang, Indonesia.
PeerJ Comput Sci. 2024 Jul 5;10:e2122. doi: 10.7717/peerj-cs.2122. eCollection 2024.
Grammar error correction (GEC) systems are pivotal in natural language processing (NLP). Their primary task is to identify and correct grammatical errors in written text, which is crucial for both language learning and formal communication. Recently, neural machine translation (NMT) has emerged as a promising and much sought-after approach to GEC. However, this approach faces significant challenges, particularly the scarcity of training data and the complexity of the task itself, especially for low-resource languages such as Indonesian. To address these challenges, we propose InSpelPoS, a confusion method that combines two synthetic data generation methods: the Inverted Spellchecker and Patterns+POS. Furthermore, we introduce an adapted seq2seq framework equipped with a dynamic decoding method and state-of-the-art Transformer-based neural language models to enhance the accuracy and efficiency of GEC. The dynamic decoding method can handle the complexities of GEC and correct a wide range of errors, including contextual and grammatical errors. The proposed model leverages the contextual information of words and sentences to generate corrected output. To assess the effectiveness of the proposed framework, we conducted experiments using synthetic data and compared its performance with existing GEC systems. The results demonstrate a significant improvement in the accuracy of Indonesian GEC compared to existing methods.
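As a rough illustration of confusion-based synthetic data generation, the sketch below corrupts a clean sentence to produce a (noisy source, clean target) training pair. It is not the InSpelPoS implementation: the confusion tables, the toy POS lexicon, the function names `corrupt` and `make_pair`, and the probabilities `p_spell` and `p_pos` are all assumptions made for illustration. A real pipeline would presumably derive the spelling confusions from a spellchecker's suggestion lists and the POS-based substitutions from a tagger and observed error patterns.

```python
import random

# Hypothetical spelling confusions: correct word -> plausible misspellings.
# An "inverted spellchecker" would obtain these from a spellchecker's
# suggestion lists rather than from a hand-written table.
SPELL_CONFUSIONS = {
    "saya": ["sya", "saia"],
    "pergi": ["pegi", "prgi"],
    "sekolah": ["sekola", "skolah"],
}

# Toy POS lexicon and per-tag substitution sets (assumed, not from the paper).
POS_TAGS = {"di": "ADP", "ke": "ADP", "dari": "ADP"}
POS_CONFUSIONS = {"ADP": ["di", "ke", "dari"]}


def corrupt(tokens, rng, p_spell=0.3, p_pos=0.3):
    """Return a noisy copy of tokens; the input list is left unchanged."""
    noisy = []
    for tok in tokens:
        if tok in SPELL_CONFUSIONS and rng.random() < p_spell:
            # Spelling-style error: swap in a confusable misspelling.
            noisy.append(rng.choice(SPELL_CONFUSIONS[tok]))
        elif tok in POS_TAGS and rng.random() < p_pos:
            # POS-style error: swap in a different word with the same tag.
            alternatives = [w for w in POS_CONFUSIONS[POS_TAGS[tok]] if w != tok]
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(tok)
    return noisy


def make_pair(sentence, seed=0):
    """Build one (noisy source, clean target) pair from a clean sentence."""
    rng = random.Random(seed)
    tokens = sentence.split()
    return " ".join(corrupt(tokens, rng)), sentence


if __name__ == "__main__":
    print(make_pair("saya pergi ke sekolah"))
```

Pairs produced this way could then serve as training data for a seq2seq corrector, with the noisy sentence as input and the clean sentence as the reference output.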