Benchmarking Hook and Bait Urdu news dataset for domain-agnostic and multilingual fake news detection using large language models.

Authors

Harris Sheetal, Liu Jinshuo, Hadi Hassan Jalil, Ahmad Naveed, Alshara Mohammed Ali

Affiliations

School of Cyber Science and Engineering, Wuhan University, Wuhan, China.

Prince Sultan University, Riyadh, Saudi Arabia.

Publication information

Sci Rep. 2025 May 3;15(1):15553. doi: 10.1038/s41598-025-98271-x.

DOI:10.1038/s41598-025-98271-x

PMID:40319160

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12049497/

Abstract

The prevalence of Fake News (FN) on Online Social Networks (OSNs) and websites is a worldwide issue. Previous studies on Fake News Detection (FND) have focused on resource-rich languages, which limits their relevance to users who are not native speakers of those languages. Despite progress in multilingual approaches, FND in low-resource languages remains underexplored due to a lack of large annotated corpora of real-world news. Large Language Models (LLMs) have emerged as a promising solution for multilingual FND. This study leverages LLMs to build an automated detection mechanism, in contrast to traditional feature extraction methods. We curate the first large multi-domain corpus, the Hook and Bait Urdu dataset, with 78,409 fake and true news items, and fine-tune the LLaMA 2 model for our proposed approach. We use the curated dataset in two experiments. First, we evaluate the dataset for unimodal text-based Urdu FND, where the proposed LLaMA-based approach achieves an accuracy of 0.978 and an F1-score of 0.971. Second, we fine-tune the LLaMA 2-based framework for multilingual FND using the curated dataset (in Urdu) and the ISOT Fake News dataset (in English). Analytical and prediction performance comparisons with previous studies validate the efficacy of the proposed framework, which achieves an accuracy of 0.984 and an F1-score of 0.980. The lightweight LoRA fine-tuning method, with 0.032% trainable parameters, provides robust data handling and computational efficiency, while early stopping and optimized hyperparameters support reliable, high-performing monolingual and multilingual FND. The real-world news dataset is publicly available for developing automated FND mechanisms to curb the threat of FN and related cybercrimes.
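
To make the "trainable parameters" figure concrete, the following minimal sketch shows how the share of trainable parameters under LoRA can be estimated. It is illustrative only: the hidden size, layer count, rank, number of adapted matrices, and base parameter count are hypothetical values chosen for the example, not the configuration reported in the paper, and the helper names are our own.

```python
def lora_trainable_params(d_in, d_out, rank, n_adapted_matrices):
    """Parameters added by LoRA.

    Each adapted weight of shape (d_out x d_in) stays frozen and gets two
    low-rank factors: A (rank x d_in) and B (d_out x rank).
    """
    return n_adapted_matrices * rank * (d_in + d_out)


def trainable_fraction(trainable, frozen):
    """Share of trainable parameters among all parameters."""
    return trainable / (trainable + frozen)


# Hypothetical configuration (an assumption, NOT taken from the paper).
hidden = 4096                 # width of each adapted square projection
layers = 32                   # number of transformer blocks
rank = 8                      # LoRA rank
adapted = 2 * layers          # e.g. two projections adapted per block
base_params = 7_000_000_000   # round illustrative size of the frozen model

added = lora_trainable_params(hidden, hidden, rank, adapted)
pct = 100 * trainable_fraction(added, base_params)
print(f"{added:,} trainable parameters ({pct:.3f}% of all parameters)")
```

Because the added parameter count grows linearly with the rank and with the number of adapted matrices, while the frozen base model dominates the total, the trainable share stays at a small fraction of a percent for typical settings.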
