Kastrati Zenun, Kurti Arianit, Imran Ali Shariq
Dept. of Computer Science and Media Technology, Linnaeus University, Växjö, Sweden.
Dept. of Computer Science, Norwegian University of Science and Technology, Trondheim, Norway.
Data Brief. 2020 Jan 3;28:105090. doi: 10.1016/j.dib.2019.105090. eCollection 2020 Feb.
In this article, we present a dataset containing word embeddings and document topic distribution vectors generated from MOOC video lecture transcripts. Transcripts of 12,032 video lectures from 200 courses were collected from the Coursera learning platform. This large corpus of transcripts was used as input to two well-known NLP techniques, Word2Vec and Latent Dirichlet Allocation (LDA), to generate word embeddings and topic vectors, respectively. We used the Word2Vec and LDA implementations in the Gensim package for Python. The data presented in this article are related to the research article entitled "Integrating word embeddings and document topics with deep learning in a video classification framework" [1]. The dataset is hosted in the Mendeley Data repository [2].