State Key Laboratory of Media Convergence and Communication, Beijing 100024, China.
Key Laboratory of Acoustic Visual Technology and Intelligent Control System, Ministry of Culture and Tourism, Beijing 100024, China.
Sensors (Basel). 2024 Aug 31;24(17):5681. doi: 10.3390/s24175681.
Most existing intelligent editing tools for music and video rely on cross-modal matching techniques based on affective consistency or the similarity of feature representations. However, these methods do not transfer well to complex audiovisual matching scenarios: ambiguous matching rules and confounding factors lead to low matching accuracy and suboptimal perceptual effects for audiences. To address these limitations, this paper focuses on both the similarity and the integration of affective distributions in artistic audiovisual works, specifically film and television video and music. Drawing on rich emotional perception elements, we propose a hybrid matching model based on feature-level canonical correlation analysis (CCA) and fine-grained affective similarity. The model refines kernel CCA (KCCA) fusion features by analyzing both matched and unmatched music-video pairs. It then employs XGBoost to predict relevance and computes similarity from both the fine-grained affective semantic distance and the affective factor distance. The final matching prediction is obtained through weight allocation. Experimental results on a self-built dataset demonstrate that the proposed affective matching model balances feature parameters and affective semantic cognition, yielding relatively high prediction accuracy and a better subjective experience of audiovisual association. This work is relevant to exploring the affective association mechanisms of audiovisual objects from a sensory perspective and to improving related intelligent tools, thereby offering a novel technical approach to retrieval and matching in music-video editing.
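The abstract does not give implementation details, but the KCCA fusion step can be illustrated with a minimal sketch. The code below assumes a simplified regularized kernel CCA (RBF kernel, matrix-inverse formulation) that projects paired music and video feature matrices into a shared correlated subspace and concatenates the projections as fusion features; the kernel choice, regularization, component count, and concatenation-style fusion are all illustrative assumptions, not the paper's confirmed configuration.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Pairwise RBF (Gaussian) kernel matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def center_kernel(K):
    """Double-center a kernel matrix (zero mean in feature space)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca_fuse(X_music, X_video, n_components=8, reg=1e-2, gamma=1.0):
    """Regularized KCCA: returns concatenated projections of both views.

    Solves (Kx + reg*I)^-1 Ky (Ky + reg*I)^-1 Kx a = rho^2 a, a standard
    simplified form of the regularized kernel CCA eigenproblem.
    """
    Kx = center_kernel(rbf_kernel(X_music, gamma))
    Ky = center_kernel(rbf_kernel(X_video, gamma))
    n = Kx.shape[0]
    I = np.eye(n)
    A = np.linalg.solve(Kx + reg * I, Ky)   # (Kx + reg*I)^-1 Ky
    B = np.linalg.solve(Ky + reg * I, Kx)   # (Ky + reg*I)^-1 Kx
    vals, vecs = np.linalg.eig(A @ B)
    order = np.argsort(-vals.real)[:n_components]
    alpha = vecs[:, order].real             # music-view dual weights
    beta = B @ alpha                        # video-view dual weights (up to scale)
    f_music = Kx @ alpha                    # projection of music features
    f_video = Ky @ beta                     # projection of video features
    return np.hstack([f_music, f_video])    # fused representation per pair

# Toy usage: 100 music-video pairs with 20-D audio and 30-D visual features.
rng = np.random.default_rng(0)
fused = kcca_fuse(rng.normal(size=(100, 20)), rng.normal(size=(100, 30)))
print(fused.shape)  # (100, 16)
```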
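Relevance prediction over the fused features can then be framed as binary classification on matched (label 1) versus unmatched (label 0) pairs. The sketch below uses the public xgboost package; the hyperparameters and the use of the predicted match probability as the relevance score are assumptions for illustration, since the abstract does not specify them.

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(1)
n_pairs = 200

# Assumed setup: KCCA fusion features for matched and unmatched pairs.
# Random data stands in for real fused features here.
X_fused = rng.normal(size=(n_pairs, 16))
y = rng.integers(0, 2, size=n_pairs)  # 1 = matched pair, 0 = unmatched pair

# Gradient-boosted trees as the relevance predictor (hyperparameters illustrative).
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                    eval_metric="logloss")
clf.fit(X_fused, y)

# Relevance score for candidate music-video pairs = predicted match probability.
relevance = clf.predict_proba(X_fused[:5])[:, 1]
print(relevance)
```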
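Finally, the abstract describes combining a fine-grained affective semantic distance with an affective factor distance and merging the result with the learned relevance through weight allocation. The sketch below assumes emotion-category distributions for the semantic term, valence-arousal coordinates for the factor term, and a fixed convex weighting; all three choices are hypothetical stand-ins, as the abstract does not specify the distance measures or the weights.

```python
import numpy as np

def affective_similarity(p_music, p_video, va_music, va_video, w_sem=0.6):
    """Fine-grained affective similarity between a music clip and a video clip.

    p_*  : emotion-category probability distributions (sum to 1)      [assumed]
    va_* : (valence, arousal) affective-factor coordinates in [-1, 1] [assumed]
    """
    # Semantic distance: total variation between the two emotion
    # distributions (half the L1 distance), which lies in [0, 1].
    d_sem = 0.5 * np.abs(p_music - p_video).sum()
    # Affective factor distance: Euclidean distance in valence-arousal space,
    # normalized by the diagonal of the [-1, 1]^2 square.
    d_fac = np.linalg.norm(va_music - va_video) / (2 * np.sqrt(2))
    # Convert distances to similarities and mix with an assumed weight.
    return w_sem * (1 - d_sem) + (1 - w_sem) * (1 - d_fac)

def match_score(xgb_relevance, aff_sim, w_rel=0.5):
    """Weight allocation between learned relevance and affective similarity
    (the 0.5/0.5 split is an assumption; the paper's weights are not given)."""
    return w_rel * xgb_relevance + (1 - w_rel) * aff_sim

# Toy usage with made-up affective annotations for one candidate pair.
p_m = np.array([0.6, 0.3, 0.1])   # music emotion distribution
p_v = np.array([0.5, 0.4, 0.1])   # video emotion distribution
sim = affective_similarity(p_m, p_v, np.array([0.4, 0.2]), np.array([0.3, 0.1]))
print(match_score(xgb_relevance=0.8, aff_sim=sim))
```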