Arnob Noor Mairukh Khan, Faiyaz A, Fuad Md Mubtasim, Al Masud Shah Murtaza Rashid, Das Baivab, Mridha M F
Department of Computer Science and Engineering, University of Asia Pacific, Dhaka, Bangladesh.
Department of Computer Science, American International University-Bangladesh, Dhaka, Bangladesh.
Data Brief. 2024 Jul 3;55:110690. doi: 10.1016/j.dib.2024.110690. eCollection 2024 Aug.
The languages of the Indian subcontinent are underrepresented in current NLP literature. To address this gap, we present the IndicDialogue dataset, which contains subtitles and dialogues in 10 major Indic languages: Hindi, Bengali, Marathi, Telugu, Tamil, Urdu, Odia, Sindhi, Nepali, and Assamese. The dataset is sourced from OpenSubtitles.org. Subtitles are pre-processed to remove irrelevant tags, timestamps, square brackets, and links, so that only the relevant dialogues are retained and stored in JSONL files. The IndicDialogue dataset comprises 7,750 raw subtitle files (SRT), 11 JSONL files, 6,853,518 dialogues, and 42,188,569 words. It is designed to serve as a foundation for pre-training language models for low-resource languages, enabling a wide range of downstream tasks including word embeddings, topic modeling, conversation synthesis, neural machine translation, and text summarization.
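The pre-processing described above (dropping tags, timestamps, square-bracketed text, and links, then keeping the remaining dialogue in JSONL) might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the regular expressions, the function names `srt_to_dialogues` and `to_jsonl`, and the JSONL field name `dialogue` are all hypothetical, and the dataset's actual cleaning rules and schema may differ.

```python
import json
import re

# Hypothetical cleaning rules; the dataset's actual rules may differ.
TIMESTAMP = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
TAG = re.compile(r"<[^>]+>|\{[^}]*\}")   # <i>...</i> tags and {\an8}-style overrides
BRACKETED = re.compile(r"\[[^\]]*\]")    # e.g. [music], [laughs]
LINK = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def srt_to_dialogues(srt_text):
    """Return the cleaned dialogue lines of one SRT file, in order."""
    dialogues = []
    for raw in srt_text.splitlines():
        line = raw.strip().lstrip("\ufeff")  # also drop a leading byte-order mark
        # Skip blank lines, cue counters, and timestamp lines.
        # Limitation: a dialogue line made only of digits is skipped too.
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        line = TAG.sub("", line)
        line = BRACKETED.sub("", line)
        line = LINK.sub("", line)
        line = " ".join(line.split())  # collapse leftover whitespace
        if line:
            dialogues.append(line)
    return dialogues


def to_jsonl(dialogues):
    """Serialize dialogues as JSONL, one object per line (assumed schema)."""
    return "\n".join(
        json.dumps({"dialogue": d}, ensure_ascii=False) for d in dialogues
    )


if __name__ == "__main__":
    sample = (
        "1\n00:00:01,000 --> 00:00:03,500\n<i>नमस्ते, आप कैसे हैं?</i>\n\n"
        "2\n00:00:04,000 --> 00:00:06,000\n[music] मैं ठीक हूँ।\n\n"
        "3\n00:00:07,000 --> 00:00:09,000\nwww.example.com\n"
    )
    print(to_jsonl(srt_to_dialogues(sample)))
```

Passing `ensure_ascii=False` keeps Indic scripts readable in the output file instead of escaping them as `\uXXXX` sequences. The third cue in the sample contains only a link, so it leaves no dialogue line behind.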