Awan Malik Daler Ali, Kajla Nadeem Iqbal, Firdous Amnah, Husnain Mujtaba, Missen Malik Muhammad Saad
Department of Software Engineering, Faculty of Computing, The Islamia University of Bahawalpur, Bahawalpur, Punjab, Pakistan.
Department of Computer Science, Muhammad Nawaz Sharif University of Agriculture, Multan, Multan, Punjab, Pakistan.
PeerJ Comput Sci. 2021 Nov 18;7:e775. doi: 10.7717/peerj-cs.775. eCollection 2021.
The real-time availability of the Internet has engaged millions of users around the world. Users increasingly prefer regional languages for effective and easy communication, which produces multilingual data on social networks and news channels. People share ideas, opinions, and events happening globally, such as sports, inflation, protest, explosion, and sexual assault, in regional (local) languages on social media. Extraction and classification of events from multilingual data have become bottlenecks because of a lack of resources. In this research paper, we present an event classification task for Urdu-language text from social media and news channels, using machine learning classifiers. The dataset contains 102,962 labeled instances of twelve (12) different types of events. The title, its length, and the last four words of a sentence are used as features to classify the events. Term Frequency-Inverse Document Frequency (TF-IDF) showed the best results as a feature vector for evaluating the performance of six popular machine learning classifiers. Random Forest (RF) and K-Nearest Neighbor (KNN) outperformed the other classifiers, achieving 98.00% and 99.00% accuracy, respectively. The novelty lies in the fact that, to the best of our knowledge, these features have not previously been applied to event extraction from text written in the Urdu language.
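As a rough illustration of the features named in the abstract (title, its length, last four words) and of TF-IDF weighting, the following minimal sketch uses only the Python standard library. It is not the authors' implementation: the function names `extract_features` and `tf_idf`, whitespace tokenization, measuring length in tokens, and the particular TF-IDF variant (relative term frequency times log(N/df)) are all assumptions made for this example.

```python
import math
from collections import Counter
from typing import Dict, List


def extract_features(title: str, n_last: int = 4) -> Dict[str, object]:
    """Hypothetical helper: the three features mentioned in the abstract."""
    tokens = title.split()  # assumption: simple whitespace tokenization
    return {
        "title": title,
        "length": len(tokens),            # assumption: length counted in tokens
        "last_words": tokens[-n_last:],   # last four words by default
    }


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """One common TF-IDF variant: tf = count / doc length, idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


if __name__ == "__main__":
    # Invented toy headlines, not taken from the paper's dataset.
    titles = [
        "team wins final cricket match today",
        "fuel prices rise again this week",
    ]
    features = [extract_features(t) for t in titles]
    weights = tf_idf([t.split() for t in titles])
    print(features[0]["last_words"])
    print(sorted(weights[0]))
```

The resulting sparse weight dictionaries could then be fed to any classifier, for example a Random Forest or K-Nearest Neighbor model as mentioned in the abstract; that step is omitted here to keep the sketch dependency-free.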