IndicDialogue: A dataset of subtitles in 10 Indic languages for Indic language modeling.

Authors

Arnob Noor Mairukh Khan, Faiyaz A, Fuad Md Mubtasim, Al Masud Shah Murtaza Rashid, Das Baivab, Mridha M F

Affiliations

Department of Computer Science and Engineering, University of Asia Pacific, Dhaka, Bangladesh.

Department of Computer Science, American International University-Bangladesh, Dhaka, Bangladesh.

Publication information

Data Brief. 2024 Jul 3;55:110690. doi: 10.1016/j.dib.2024.110690. eCollection 2024 Aug.

DOI:10.1016/j.dib.2024.110690

PMID:39109169

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11301086/

Abstract

The languages of the Indian subcontinent are underrepresented in current NLP literature. To help close this gap, we present the IndicDialogue dataset, which contains subtitles and dialogues in 10 major Indic languages: Hindi, Bengali, Marathi, Telugu, Tamil, Urdu, Odia, Sindhi, Nepali, and Assamese. The dataset is sourced from OpenSubtitles.org, and the subtitles are pre-processed to remove irrelevant tags, timestamps, square brackets, and links, so that only the relevant dialogues are retained in JSONL files. The IndicDialogue dataset comprises 7,750 raw subtitle files (SRT), 11 JSONL files, 6,853,518 dialogues, and 42,188,569 words. It is designed to serve as a foundation for language model pre-training for low-resource languages, enabling a wide range of downstream tasks including word embeddings, topic modeling, conversation synthesis, neural machine translation, and text summarization.
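
The pre-processing described in the abstract (dropping tags, timestamps, square brackets, and links from SRT files, then storing dialogues as JSONL) could be sketched roughly as follows. This is a minimal illustration and not the authors' actual pipeline: the function names `srt_to_dialogues` and `to_jsonl`, the regular expressions, and the JSONL field names `language` and `dialogue` are all assumptions, as is the choice to delete the text inside square brackets along with the brackets themselves.

```python
import json
import re

# All patterns below are illustrative assumptions, not taken from the paper.
TIMESTAMP = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}.*$"
)
TAG = re.compile(r"<[^>]+>|\{\\[^}]*\}")   # HTML-like tags and {\an8}-style tags
BRACKETED = re.compile(r"\[[^\]]*\]")      # e.g. [music]; assumed removed with content
LINK = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def srt_to_dialogues(srt_text):
    """Return one cleaned dialogue string per SRT cue (hypothetical helper)."""
    text = srt_text.replace("\r\n", "\n").strip()
    dialogues = []
    for block in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in block.split("\n")]
        # The first line of a cue is normally its numeric index; skip it.
        if lines and lines[0].isdigit():
            lines = lines[1:]
        kept = []
        for line in lines:
            if not line or TIMESTAMP.match(line):
                continue
            line = TAG.sub("", line)
            line = BRACKETED.sub("", line)
            line = LINK.sub("", line)
            line = " ".join(line.split())  # collapse leftover whitespace
            if line:
                kept.append(line)
        if kept:
            dialogues.append(" ".join(kept))
    return dialogues


def to_jsonl(dialogues, language):
    """Serialize dialogues as JSONL text; the field names are assumptions."""
    return "\n".join(
        json.dumps({"language": language, "dialogue": d}, ensure_ascii=False)
        for d in dialogues
    )


sample = (
    "1\n00:00:01,000 --> 00:00:03,000\n<i>नमस्ते</i> [music]\n\n"
    "2\n00:00:04,000 --> 00:00:06,000\nHello there http://example.com\n"
)
cleaned = srt_to_dialogues(sample)
print(to_jsonl(cleaned, "hi"))
```

Passing `ensure_ascii=False` to `json.dumps` keeps Indic scripts readable in the output file instead of escaping them as `\uXXXX` sequences.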
