Imran Abdullah Al, Shovon Md Sakib Hossain, Mridha M F
Department of Computer Science, American International University-Bangladesh, Dhaka, Bangladesh.
Data Brief. 2024 Feb 27;53:110239. doi: 10.1016/j.dib.2024.110239. eCollection 2024 Apr.
This study presents a large multi-modal Bangla YouTube clickbait dataset of 253,070 data points, collected automatically using the YouTube API and Python web automation frameworks. The dataset contains 18 features, grouped into metadata, primary content, engagement statistics, and labels, for individual videos from 58 Bangla YouTube channels. A rigorous preprocessing step was applied to denoise, deduplicate, and remove bias from the features to support reliable analysis. As the largest and most robust clickbait corpus in Bangla to date, the dataset is of significant value to natural language processing and data science researchers working on clickbait modeling in low-resource languages. Its multi-modal nature supports analysis of clickbait across content, user interaction, and linguistic dimensions, with the aim of developing more sophisticated detection methods that also apply across languages.
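The abstract mentions a preprocessing step that denoises and deduplicates the collected records but does not describe how. As a minimal sketch only, and not the authors' actual pipeline, the following shows one plausible way to drop repeated videos and tidy titles. The field names `video_id` and `title`, the helper names, and the sample rows are assumptions made for illustration.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", title)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(records):
    """Keep the first record seen for each video ID, with a cleaned title.

    `video_id` and `title` are hypothetical field names; the published
    dataset may use different ones.
    """
    seen = set()
    unique = []
    for record in records:
        video_id = record["video_id"]
        if video_id in seen:
            continue  # a later copy of a video that was already kept
        seen.add(video_id)
        unique.append({**record, "title": normalize_title(record["title"])})
    return unique


# Made-up sample rows, not taken from the dataset.
sample = [
    {"video_id": "a1", "title": "  খবর   আজ "},
    {"video_id": "a1", "title": "খবর আজ"},
    {"video_id": "b2", "title": "Breaking\tnews"},
]
cleaned = deduplicate(sample)
```

Keying on a video identifier rather than on the title avoids merging distinct videos that happen to share a headline, which may matter for clickbait data where titles are often formulaic.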