WET: Word embedding-topic distribution vectors for MOOC video lectures dataset.

Authors

Kastrati Zenun, Kurti Arianit, Imran Ali Shariq

Affiliations

Dept. of Computer Science and Media Technology, Linnaeus University, Växjö, Sweden.

Dept. of Computer Science, Norwegian University of Science and Technology, Trondheim, Norway.

Publication information

Data Brief. 2020 Jan 3;28:105090. doi: 10.1016/j.dib.2019.105090. eCollection 2020 Feb.

Abstract

In this article, we present a dataset containing word embeddings and document topic distribution vectors generated from MOOC video lecture transcripts. Transcripts of 12,032 video lectures from 200 courses were collected from the Coursera learning platform. This large corpus of transcripts was used as input to two well-known NLP techniques, Word2Vec and Latent Dirichlet Allocation (LDA), to generate word embeddings and topic vectors, respectively. We used the Word2Vec and LDA implementations in the Gensim package for Python. The data presented in this article are related to the research article entitled "Integrating word embeddings and document topics with deep learning in a video classification framework" [1]. The dataset is hosted in the Mendeley Data repository [2].
