

Preprocessing Arabic text on social media.

Author information

Hegazi Mohamed Osman, Al-Dossari Yasser, Al-Yahy Abdullah, Al-Sumari Abdulaziz, Hilal Anwer

Affiliations

Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia.

Department of Computer and Self Development, Preparatory Year Deanship, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia.

Publication information

Heliyon. 2021 Feb 13;7(2):e06191. doi: 10.1016/j.heliyon.2021.e06191. eCollection 2021 Feb.

Abstract

Currently, social media plays an important role in daily life. Millions of people use social media for different purposes. Large amounts of data flow through online networks every second, and these data contain valuable information that can be extracted if the data are properly processed and analyzed. However, most processing results are affected by preprocessing difficulties. This paper presents an approach to extracting information from Arabic text on social media. It provides an integrated solution to the challenges of preprocessing Arabic social media text in four stages: data collection, cleaning, enrichment, and availability. The preprocessed Arabic text is stored in structured database tables to provide a useful corpus to which information extraction and data analysis algorithms can be applied. The experiment in this study shows that implementing the proposed approach yields a useful, full-featured dataset and valuable information. The resulting dataset presents the Arabic text at three structured levels with more than 20 features. Additionally, the experiment provides valuable information and processed results such as topic classification and sentiment analysis.
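As a rough illustration of the cleaning, enrichment, and availability stages, the following minimal Python sketch cleans a raw post, derives two simple features, and stores the result in a database table. The abstract does not list the paper's actual cleaning rules, features, or schema, so everything here is an assumption made for illustration: the rules (dropping URLs, mentions, diacritics, and tatweel), the two features, the `posts` table, and the function names are hypothetical, and data collection is left out entirely.

```python
import re
import sqlite3

# Assumed cleaning rules (not taken from the paper):
# Arabic short-vowel diacritics U+064B..U+0652 and the tatweel U+0640.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
TATWEEL = "\u0640"
URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")


def clean(text: str) -> str:
    """Remove URLs, mentions, diacritics, and tatweel; keep hashtag words."""
    text = URL_OR_MENTION.sub(" ", text)
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = text.replace("#", " ")
    return " ".join(text.split())  # collapse runs of whitespace


def enrich(cleaned: str) -> dict:
    """Derive two illustrative features from the cleaned text."""
    return {"n_tokens": len(cleaned.split()), "n_chars": len(cleaned)}


def init_db(conn: sqlite3.Connection) -> None:
    """Create the hypothetical table holding raw text, cleaned text, features."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, raw TEXT, cleaned TEXT, "
        "n_tokens INTEGER, n_chars INTEGER)"
    )


def preprocess_and_store(conn: sqlite3.Connection, raw: str) -> int:
    """Clean and enrich one post, insert it, and return its row id."""
    cleaned = clean(raw)
    features = enrich(cleaned)
    cur = conn.execute(
        "INSERT INTO posts (raw, cleaned, n_tokens, n_chars) VALUES (?, ?, ?, ?)",
        (raw, cleaned, features["n_tokens"], features["n_chars"]),
    )
    conn.commit()
    return cur.lastrowid
```

Keeping the raw text next to the cleaned text means later analysis steps can be rerun with different cleaning rules without collecting the data again.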


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f4a5/7895730/ac90b318d0d1/gr1.jpg
