Systems Engineering and Computer Science Program/COPPE, Universidade Federal do Rio de Janeiro (UFRJ) - Caixa Postal 68511, Cidade Universitária, Rio de Janeiro, Rio de Janeiro 21941-972, Brazil.
Instituto Tércio Pacitti de Aplicações e Pesquisas Computacionais (NCE), Universidade Federal do Rio de Janeiro (UFRJ) - Av. Athos da Silveira Ramos, 274 - Edifício do Centro de Ciências Matemáticas e da Natureza, Bloco E, Cidade Universitária, Rio de Janeiro, Rio de Janeiro 21941-916, Brazil.
Neural Netw. 2015 Jun;66:11-21. doi: 10.1016/j.neunet.2015.02.012. Epub 2015 Mar 2.
Training part-of-speech taggers (POS-taggers) requires iterative, time-consuming, convergence-dependent steps: expectation maximization for stochastic taggers, or weight balancing for neural ones. Because of the complexity of these steps, multilingual part-of-speech tagging can become intractable, since the time they demand grows with the number of languages. WiSARD (Wilkie, Stonham and Aleksander's Recognition Device), a weightless artificial neural network architecture that has proved both robust and efficient in classification tasks, has previously been used to speed up the training phase. WiSARD is a RAM-based system that requires only one memory write operation to train on each sentence. Additionally, the mechanism can learn new tagged sentences incrementally during the classification phase. Nevertheless, parameters such as RAM size, context window, and probability bit mapping make the multilingual part-of-speech tagging task hard. This article proposes mWANN-Tagger (multilingual Weightless Artificial Neural Network tagger), a WiSARD-based POS-tagger. Its one-pass learning capability allows language-specific parameter configurations to be searched thoroughly and quickly. Experimental evaluation indicates that mWANN-Tagger either outperforms or matches state-of-the-art methods in accuracy, with a very low standard deviation (below 0.25%). Experimental results also suggest that the vast majority of languages can benefit from this architecture.
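The one-write training and incremental learning described above can be illustrated with a minimal, generic WiSARD sketch in Python. This is not the authors' mWANN-Tagger: the class names (`Discriminator`, `WiSARD`), the `tuple_size` parameter (standing in for RAM address size), and the toy 8-bit inputs are assumptions made for illustration, and the encoding of context windows and probability bit mapping into bits is not shown.

```python
import random


class Discriminator:
    """One class (tag) in a WiSARD network: a list of RAM nodes.

    Each RAM is modelled as a set of addresses that have been written.
    """

    def __init__(self, input_size, tuple_size, mapping):
        self.tuple_size = tuple_size
        self.mapping = mapping  # pseudo-random permutation of input positions
        self.rams = [set() for _ in range(input_size // tuple_size)]

    def _addresses(self, bits):
        # Split the permuted input into fixed-size tuples, one per RAM.
        for i in range(len(self.rams)):
            positions = self.mapping[i * self.tuple_size:(i + 1) * self.tuple_size]
            yield i, tuple(bits[p] for p in positions)

    def train(self, bits):
        # Training is a single write per RAM: no iteration, no weights.
        for i, address in self._addresses(bits):
            self.rams[i].add(address)

    def score(self, bits):
        # Count the RAMs whose addressed location was written during training.
        return sum(address in self.rams[i] for i, address in self._addresses(bits))


class WiSARD:
    """Hypothetical minimal WiSARD classifier: one discriminator per label."""

    def __init__(self, input_size, tuple_size, seed=0):
        if input_size % tuple_size != 0:
            raise ValueError("input_size must be a multiple of tuple_size")
        self.input_size = input_size
        self.tuple_size = tuple_size
        self.mapping = list(range(input_size))
        random.Random(seed).shuffle(self.mapping)
        self.discriminators = {}

    def train(self, bits, label):
        if label not in self.discriminators:
            self.discriminators[label] = Discriminator(
                self.input_size, self.tuple_size, self.mapping
            )
        self.discriminators[label].train(bits)

    def classify(self, bits):
        # The label whose discriminator has the most responding RAMs wins.
        return max(
            self.discriminators,
            key=lambda label: self.discriminators[label].score(bits),
        )


net = WiSARD(input_size=8, tuple_size=2)
net.train([1, 1, 1, 1, 0, 0, 0, 0], "NOUN")
net.train([0, 0, 0, 0, 1, 1, 1, 1], "VERB")
print(net.classify([1, 1, 1, 1, 0, 0, 0, 1]))  # one bit away from the NOUN pattern
```

Because `train` only writes memory locations, it can be called again at any point after classification has started, which is the incremental behaviour the abstract refers to. In this sketch `tuple_size` plays the role of a language-specific parameter; retraining from scratch for each candidate value is a single pass over the data.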