

Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media.

Author Information

Albalawi Yahya, Buckley Jim, Nikolov Nikola S

Affiliations

Department of Computer Science and Information Systems, University of Limerick, Limerick, Ireland.

Department of Computer and Information Sciences, College of Arts and Science, University of Taibah, Al-Ula, Saudi Arabia.

Publication Information

J Big Data. 2021;8(1):95. doi: 10.1186/s40537-021-00488-w. Epub 2021 Jul 2.

Abstract

This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four out of the 26 pre-processing techniques improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F-score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model, with an F-score of 75.2% and accuracy of 90.7%, compared to an F-score of 90.8% achieved by Mazajak CBOW for the same architecture but with a lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to that of the deep learning methods on the first dataset, but significantly worse on the second dataset.
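The abstract does not enumerate the 26 pre-processing techniques evaluated, nor which four proved significant. As a generic illustration only (not the paper's pipeline), a few steps commonly applied when normalizing Arabic social-media text can be sketched in Python; the `normalize_arabic` helper below is a hypothetical name:

```python
import re

# Matches Arabic diacritics (tashkeel) and related combining marks.
ARABIC_DIACRITICS = re.compile(r"[\u0610-\u061A\u064B-\u065F\u0670]")

def normalize_arabic(text: str) -> str:
    """Apply a few pre-processing steps commonly used for Arabic tweets.

    Illustrative only: the paper evaluates 26 techniques and finds that
    just four significantly improve accuracy; this sketch does not claim
    to reproduce that set.
    """
    text = ARABIC_DIACRITICS.sub("", text)                  # strip diacritics
    text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0629", "\u0647")                 # taa marbuta -> haa
    text = text.replace("\u0649", "\u064A")                 # alef maqsura -> yaa
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)              # squeeze elongated repeats
    return text
```

Steps like these reduce the sparsity of the vocabulary before feature extraction (for the traditional classifiers) or embedding lookup (for BLSTM/CNN), which is why pre-processing choices can shift classification accuracy.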


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e03/8253467/95fcd756c3df/40537_2021_488_Fig1_HTML.jpg
