IndicDialogue: A dataset of subtitles in 10 Indic languages for Indic language modeling.

Authors

Arnob Noor Mairukh Khan, Faiyaz A, Fuad Md Mubtasim, Al Masud Shah Murtaza Rashid, Das Baivab, Mridha M F

Affiliations

Department of Computer Science and Engineering, University of Asia Pacific, Dhaka, Bangladesh.

Department of Computer Science, American International University-Bangladesh, Dhaka, Bangladesh.

Publication information

Data Brief. 2024 Jul 3;55:110690. doi: 10.1016/j.dib.2024.110690. eCollection 2024 Aug.

Abstract

The languages of the Indian subcontinent are underrepresented in current NLP literature. To help narrow this gap, we present the IndicDialogue dataset, which contains subtitles and dialogues in 10 major Indic languages: Hindi, Bengali, Marathi, Telugu, Tamil, Urdu, Odia, Sindhi, Nepali, and Assamese. The dataset is sourced from OpenSubtitles.org. Subtitles are pre-processed to remove irrelevant tags, timestamps, square brackets, and links, and the remaining dialogues are stored in JSONL files. The IndicDialogue dataset comprises 7750 raw subtitle files (SRT), 11 JSONL files, 6,853,518 dialogues, and 42,188,569 words. It is designed to serve as a foundation for language model pre-training for low-resource languages, enabling a wide range of downstream tasks including word embeddings, topic modeling, conversation synthesis, neural machine translation, and text summarization.
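The pre-processing described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the regular expressions, the function names `srt_to_dialogues` and `to_jsonl`, and the JSONL field names `language` and `dialogue` are all assumptions made for illustration, and the abstract does not specify the actual schema or cleaning rules in this detail.

```python
import json
import re

# Illustrative patterns only; the authors' exact cleaning rules are not given.
TIMESTAMP = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
TAG = re.compile(r"<[^>]+>")             # markup such as <i>...</i>
BRACKETED = re.compile(r"\[[^\]]*\]")    # cues in square brackets, e.g. [music]
LINK = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def srt_to_dialogues(srt_text):
    """Return one cleaned dialogue string per SRT cue, skipping empty cues."""
    text = srt_text.replace("\r\n", "\n").strip()
    dialogues = []
    # SRT cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", text):
        lines = block.split("\n")
        # The first line of a cue is normally its sequence number.
        if lines[0].strip().isdigit():
            lines = lines[1:]
        kept = []
        for line in lines:
            line = line.strip()
            if not line or TIMESTAMP.match(line):
                continue
            for pattern in (TAG, BRACKETED, LINK):
                line = pattern.sub("", line)
            line = " ".join(line.split())  # collapse leftover whitespace
            if line:
                kept.append(line)
        if kept:
            dialogues.append(" ".join(kept))
    return dialogues


def to_jsonl(dialogues, language):
    """Serialize dialogues as JSON Lines; the field names are hypothetical."""
    return "\n".join(
        json.dumps({"language": language, "dialogue": d}, ensure_ascii=False)
        for d in dialogues
    )
```

Passing `ensure_ascii=False` keeps Indic scripts readable in the output rather than escaping them as `\uXXXX` sequences. Each line of the result can be parsed independently with `json.loads`.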


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f9c/11301086/896f53cfa8d0/gr1.jpg
