Department of Computer Science and Engineering, Jeonbuk National University, Jeonju, South Korea.
Sci Rep. 2021 Oct 6;11(1):19834. doi: 10.1038/s41598-021-98856-2.
Affective computing has suffered from imprecise annotation because emotions are highly subjective and vague. Music video emotion is complex owing to the diverse textual, acoustic, and visual information involved, which can take the form of lyrics, the singer's voice, sounds from different instruments, and visual representations. This may be one reason why research in this domain has been limited and no standard dataset had been produced before now. In this study, we propose an unsupervised method for music video emotion analysis using music video content from the Internet. We also produced a labelled dataset and compared supervised and unsupervised methods for emotion classification. The music and video information are processed through a multimodal architecture with audio-video information exchange and a boosting method. General 2D and 3D convolution networks were compared with a slow-fast network using filter- and channel-separable convolutions within the multimodal architecture. Several supervised and unsupervised networks were trained in an end-to-end manner, and the results were evaluated using various evaluation metrics. The proposed method used a large dataset for unsupervised emotion classification and interpreted the results quantitatively and qualitatively, which had never been done for music videos before. The results show a large increase in classification score when unsupervised features and information-sharing techniques are applied to the audio and video networks. Our best classifier attained 77% accuracy, an F1-score of 0.77, and an area-under-the-curve score of 0.94 with minimal computational cost.
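The abstract's multimodal pipeline with audio-video information exchange can be illustrated with a minimal sketch. This is not the paper's actual architecture (the real backbones are 2D/3D CNNs and a slow-fast network with separable convolutions); the encoders, dimensions, and mixing scheme below are illustrative assumptions standing in for the described design, where each modality's features are shared with the other stream before fusion and classification into emotion classes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear encoders standing in for the paper's audio and
# video backbones; d_e is an assumed shared embedding size.
def encode(x, W):
    return np.tanh(x @ W)

n, d_a, d_v, d_e, n_classes = 8, 40, 60, 16, 6
Wa = rng.standard_normal((d_a, d_e))        # audio encoder weights
Wv = rng.standard_normal((d_v, d_e))        # video encoder weights
Wc = rng.standard_normal((2 * d_e, n_classes))  # fused classifier head

audio = rng.standard_normal((n, d_a))       # toy audio features
video = rng.standard_normal((n, d_v))       # toy video features

za = encode(audio, Wa)                      # (n, d_e)
zv = encode(video, Wv)                      # (n, d_e)

# "Information exchange": each stream is mixed with the other stream's
# embedding before fusion (a simple averaging stand-in for the paper's
# exchange mechanism).
za_x = 0.5 * (za + zv)
zv_x = 0.5 * (zv + za)

fused = np.concatenate([za_x, zv_x], axis=1)  # (n, 2*d_e)
logits = fused @ Wc                           # (n, n_classes)
pred = logits.argmax(axis=1)                  # predicted emotion class ids
print(pred.shape)  # (8,)
```

In the actual system the exchange happens between trained convolutional streams and the boosted classifier operates on learned features; the sketch only shows the data flow of encode, exchange, fuse, classify.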