Gómez-Adorno Helena, Markov Ilia, Sidorov Grigori, Posadas-Durán Juan-Pablo, Sanchez-Perez Miguel A, Chanona-Hernandez Liliana
Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Mexico City, Mexico.
Instituto Politécnico Nacional (IPN), Escuela Superior de Ingeniería Mecánica y Eléctrica Unidad Zacatenco (ESIME-Zacatenco), Mexico City, Mexico.
Comput Intell Neurosci. 2016;2016:1638936. doi: 10.1155/2016/1638936. Epub 2016 Oct 3.
We introduce a lexical resource for preprocessing social media data and show that using it enhances a neural network-based feature representation. In experiments on the PAN 2015 and PAN 2016 author profiling corpora, we obtained better results when the data were preprocessed with the developed resource. The resource includes dictionaries of slang words, contractions, abbreviations, and emoticons commonly used in social media. Each dictionary was built for English, Spanish, Dutch, and Italian. The resource is freely available.
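As an illustration of the kind of dictionary-based preprocessing the abstract describes, the following is a minimal sketch and not the authors' implementation. The dictionary entries, the `normalize` function, and the choice to replace an emoticon with a plain word are all assumptions made for this example; the actual resource's contents and file format are not given here.

```python
# Hypothetical miniature dictionaries standing in for the four
# dictionary types named in the abstract. Entries are invented
# for illustration only.
SLANG = {"gr8": "great", "u": "you"}
CONTRACTIONS = {"can't": "cannot", "i'm": "i am"}
ABBREVIATIONS = {"btw": "by the way", "idk": "i do not know"}
EMOTICONS = {":)": "happy", ":(": "sad"}


def normalize(text, dictionaries=(SLANG, CONTRACTIONS, ABBREVIATIONS, EMOTICONS)):
    """Replace whitespace-separated tokens found in any dictionary.

    Lookup is case-insensitive. If the whole token is not found,
    trailing sentence punctuation is set aside and the remainder
    is looked up; unknown tokens are kept unchanged.
    """
    # Merge all dictionaries into one lookup table.
    lookup = {}
    for d in dictionaries:
        lookup.update(d)

    result = []
    for token in text.split():
        key = token.lower()
        if key in lookup:
            # Exact match (this also covers emoticons such as ":(").
            result.append(lookup[key])
            continue
        # Retry without trailing punctuation, then re-attach it.
        core = key.rstrip(".,!?")
        tail = key[len(core):]
        if core in lookup:
            result.append(lookup[core] + tail)
        else:
            result.append(token)
    return " ".join(result)


if __name__ == "__main__":
    print(normalize("btw u can't be late :("))
```

Normalizing tokens this way before feature extraction reduces the number of surface variants of the same word, which is one plausible reason such preprocessing could help a learned representation; the sketch makes no claim about how the published experiments were configured.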