WMG, University of Warwick, Coventry, UK.
System1 Group PLC, London, UK.
Sci Rep. 2024 Nov 2;14(1):26382. doi: 10.1038/s41598-024-76968-9.
Understanding and predicting viewers' emotional responses to videos has emerged as a pivotal challenge due to its multifaceted applications in video indexing, summarization, personalized content recommendation, and effective advertisement design. A major roadblock in this domain has been the lack of expansive datasets with videos paired with viewer-reported emotional annotations. We address this challenge by employing a deep learning methodology trained on a dataset derived from the application of System1's proprietary methodologies on over 30,000 real video advertisements, each annotated by an average of 75 viewers. This equates to over 2.3 million emotional annotations across eight distinct categories: anger, contempt, disgust, fear, happiness, sadness, surprise, and neutral, coupled with the temporal onset of these emotions. Leveraging 5-second video clips, our approach aims to capture pronounced emotional responses. Our convolutional neural network, which integrates both video and audio data, predicts salient 5-second emotional clips with an average balanced accuracy of 43.6%, and shows particularly high performance for detecting happiness (55.8%) and sadness (60.2%). When applied to full advertisements, our model achieves a strong average AUC of 75% in determining emotional undertones. To facilitate further research, our trained networks are freely available upon request for research purposes. This work not only overcomes previous data limitations but also provides an accurate deep learning solution for video emotion understanding.
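For context on the reported figures: balanced accuracy in the multiclass setting is the mean of per-class recalls, so chance level over the eight emotion categories is 1/8 = 12.5%, and the reported 43.6% average is well above chance. A minimal sketch of the metric (the labels below are made up for illustration and are not from the paper's dataset):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall: for each class present in y_true,
    compute the fraction of its instances predicted correctly,
    then average those recalls over the classes."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Toy example with three of the eight emotion categories:
y_true = ["happiness", "happiness", "sadness", "sadness", "neutral", "neutral"]
y_pred = ["happiness", "sadness", "sadness", "sadness", "neutral", "happiness"]
print(balanced_accuracy(y_true, y_pred))  # recalls 0.5, 1.0, 0.5 -> 2/3
```

Unlike plain accuracy, this metric is not inflated by a dominant class (e.g. many "neutral" clips), which is why it is the appropriate summary for an imbalanced emotion-annotation dataset.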