
Event classification from the Urdu language text on social media.

Author information

Awan Malik Daler Ali, Kajla Nadeem Iqbal, Firdous Amnah, Husnain Mujtaba, Missen Malik Muhammad Saad

Affiliations

Department of Software Engineering, Faculty of Computing, The Islamia University of Bahawalpur, Bahawalpur, Punjab, Pakistan.

Department of Computer Science, Muhammad Nawaz Sharif University of Agriculture, Multan, Multan, Punjab, Pakistan.

Publication information

PeerJ Comput Sci. 2021 Nov 18;7:e775. doi: 10.7717/peerj-cs.775. eCollection 2021.

DOI:10.7717/peerj-cs.775
PMID:34901431
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8627225/
Abstract

The real-time availability of the Internet has engaged millions of users around the world. Regional languages are preferred for effective and easy communication, which produces multilingual data on social networks and news channels. People share ideas, opinions, and events happening globally (e.g., sports, inflation, protest, explosion, and sexual assault) in regional (local) languages on social media. Extracting and classifying events from multilingual data has become a bottleneck because of a lack of resources. In this research paper, we present the event classification task for Urdu-language text on social media and news channels using machine learning classifiers. The dataset contains more than 0.1 million (102,962) labeled instances of twelve (12) different types of events. The title, its length, and the last four words of a sentence are used as features to classify the events. Term Frequency-Inverse Document Frequency (TF-IDF) showed the best results as a feature vector for evaluating the performance of six popular machine learning classifiers. Random Forest (RF) and K-Nearest Neighbor (KNN) outperformed the other classifiers, achieving 98.00% and 99.00% accuracy, respectively. The novelty lies in the fact that, to the best of our knowledge, these features have not previously been applied to event extraction from text written in the Urdu language.
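The feature and classifier combination described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not say how the title, its length, and the last four words are combined, so the token encoding in `build_features` (a `len_N` token and `last_`-prefixed tokens) is a hypothetical choice, as are the whitespace tokenizer, the smoothed IDF formula, cosine similarity, and `k=3`. The toy English titles and labels stand in for the Urdu dataset, which is not reproduced here.

```python
import math
from collections import Counter


def build_features(title):
    """Title tokens + a length token + the last four words (hypothetical encoding)."""
    words = title.split()
    last_four = [f"last_{w}" for w in words[-4:]]
    return words + [f"len_{len(words)}"] + last_four


def fit_idf(docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf_vector(doc, idf):
    """L2-normalized sparse TF-IDF vector; tokens unseen in training are dropped."""
    tf = Counter(tok for tok in doc if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {tok: v / norm for tok, v in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two normalized sparse vectors, i.e. their cosine similarity."""
    return sum(v * b.get(tok, 0.0) for tok, v in a.items())


def knn_predict(query_vec, train_vecs, train_labels, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query_vec, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy stand-in data (assumption): English titles with three of the event types.
TRAIN = [
    ("team wins cricket match in final over", "sports"),
    ("captain scores century in cricket match", "sports"),
    ("fuel prices rise as inflation climbs", "inflation"),
    ("food prices rise again this month", "inflation"),
    ("workers stage protest outside parliament", "protest"),
    ("students join protest over fee hike", "protest"),
]

train_docs = [build_features(title) for title, _ in TRAIN]
train_labels = [label for _, label in TRAIN]
idf = fit_idf(train_docs)
train_vecs = [tfidf_vector(doc, idf) for doc in train_docs]


def classify(title, k=3):
    """Classify a new title with the toy TF-IDF + KNN model built above."""
    return knn_predict(tfidf_vector(build_features(title), idf), train_vecs, train_labels, k)


print(classify("cricket team wins match"))
```

With a real corpus, the same steps would more likely be run through an established library, with a held-out split to measure accuracy; the hand-rolled version here is only meant to make each step visible.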


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f96/8627225/634691597498/peerj-cs-07-775-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f96/8627225/a53a9da954cf/peerj-cs-07-775-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f96/8627225/eefc88b1887a/peerj-cs-07-775-g002.jpg


