Instituto de Computação, Universidade Federal Fluminense, Niterói 24210-330, RJ, Brazil.
Sensors (Basel). 2021 Nov 3;21(21):7328. doi: 10.3390/s21217328.
The popularization of the Internet has increased the amount of content produced and consumed on the web. To take advantage of this new market, major content providers such as Netflix and Amazon Prime have emerged, focusing on video streaming services. However, despite the large number and diversity of videos these providers make available, only a few attract the attention of most users. For example, in the data explored in this article, the most popular 6% of videos account for 85% of total views. Predicting in advance which videos will be popular is not trivial, especially given the many variables that influence popularity. Nevertheless, a tool with this ability would be of great value for dimensioning network infrastructure and for recommending new content to users. This manuscript therefore examines the machine learning-based approaches that have been proposed for predicting web content popularity. To this end, we first survey the literature and develop a taxonomy that classifies models according to their predictive features and describes the state-of-the-art features and techniques used for this task. While analyzing previous work, we saw an opportunity to use textual features for video popularity prediction. We therefore also present a case study that combines features obtained through feature engineering with word embeddings to predict the popularity of a video. The first approach relies on predictive attributes defined through feature engineering. The second takes advantage of word embeddings of video titles and descriptions. We evaluated the proposed techniques on a set of videos from GloboPlay, the largest provider of video streaming services in Latin America. Combining engineered features and embeddings with the Random Forest algorithm achieved the best result, with an accuracy of 87%.
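The combination of engineered features and text embeddings described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embedding table, the three hand-crafted attributes (title length, description length, duration in minutes), and all function names are hypothetical, chosen only to show how the two feature groups might be concatenated into one vector per video.

```python
from typing import Dict, List

# Hypothetical toy embedding table (assumption); a real system would load
# pretrained word vectors with far more dimensions.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "final": [0.9, 0.1, 0.0],
    "goal": [0.8, 0.2, 0.1],
    "recipe": [0.0, 0.7, 0.6],
    "episode": [0.1, 0.3, 0.9],
}
EMBEDDING_DIM = 3


def mean_embedding(text: str, table: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of the known words in `text`; zeros if none are known."""
    vectors = [table[word] for word in text.lower().split() if word in table]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def engineered_features(title: str, description: str, duration_s: float) -> List[float]:
    """Hypothetical hand-crafted attributes, not the ones used in the paper."""
    return [
        float(len(title.split())),        # number of words in the title
        float(len(description.split())),  # number of words in the description
        duration_s / 60.0,                # duration in minutes
    ]


def build_feature_vector(
    title: str,
    description: str,
    duration_s: float,
    table: Dict[str, List[float]] = TOY_EMBEDDINGS,
    dim: int = EMBEDDING_DIM,
) -> List[float]:
    """Concatenate engineered attributes with the mean embedding of the text."""
    text = f"{title} {description}"
    return engineered_features(title, description, duration_s) + mean_embedding(text, table, dim)


# One combined vector: 3 engineered attributes followed by 3 embedding dimensions.
vector = build_feature_vector("Final goal", "Best goal of the final episode", 120.0)
print(vector)
```

Vectors built this way, one per video and paired with a popularity label, could then be passed to a Random Forest classifier such as scikit-learn's `RandomForestClassifier`. The sketch stops before that step so that it runs with the standard library alone.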