Ma Mingyang, Mei Shaohui, Wan Shuai, Wang Zhiyong, Hua Xian-Sheng, Feng David Dagan
IEEE Trans Image Process. 2022;31:1789-1804. doi: 10.1109/TIP.2022.3146012. Epub 2022 Feb 10.
Video Summarization (VS) has become one of the most effective solutions for quickly understanding a large volume of video data. Dictionary selection with self representation and sparse regularization has demonstrated its promise for VS by formulating the VS problem as a sparse selection task on video frames. However, existing dictionary selection models are generally designed only for data reconstruction, which results in the neglect of the inherent structured information among video frames. In addition, the sparsity commonly constrained by L2,1 norm is not strong enough, which causes the redundancy of keyframes, i.e., similar keyframes are selected. Therefore, to address these two issues, in this paper we propose a general framework called graph convolutional dictionary selection with L2,p (0 < p ≤ 1) norm (GCDS2,p) for both keyframe selection and skimming based summarization. Firstly, we incorporate graph embedding into dictionary selection to generate the graph embedding dictionary, which can take the structured information depicted in videos into account. Secondly, we propose to use L2,p (0 < p ≤ 1) norm constrained row sparsity, in which p can be flexibly set for two forms of video summarization. For keyframe selection, 0 < p < 1 can be utilized to select diverse and representative keyframes; and for skimming, p = 1 can be utilized to select key shots. In addition, an efficient iterative algorithm is devised to optimize the proposed model, and the convergence is theoretically proved. Experimental results including both keyframe selection and skimming based summarization on four benchmark datasets demonstrate the effectiveness and superiority of the proposed method.
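The row-sparsity argument in the abstract can be illustrated with a small sketch. It is not the authors' optimization algorithm; it only evaluates a row-wise penalty of the form sum_i ||w_i||_2^p on toy coefficient matrices, assuming each row corresponds to one candidate frame. The function names `row_norms`, `l2p_penalty`, and `select_rows`, and the toy matrices, are hypothetical and chosen for illustration.

```python
import math


def row_norms(W):
    """Euclidean (L2) norm of each row of a matrix given as a list of lists."""
    return [math.sqrt(sum(x * x for x in row)) for row in W]


def l2p_penalty(W, p):
    """Sum over rows of (row L2 norm) ** p, for 0 < p <= 1.

    With p = 1 this is the L2,1 norm; with p < 1 it is the p-th power
    of the L2,p quasi-norm.
    """
    return sum(n ** p for n in row_norms(W))


def select_rows(W, k):
    """Indices of the k rows with the largest L2 norm, in ascending order.

    Assumption for illustration: a frame is 'selected' when its row of
    coefficients has large energy.
    """
    norms = row_norms(W)
    top = sorted(range(len(W)), key=lambda i: norms[i], reverse=True)[:k]
    return sorted(top)


# Two toy coefficient matrices with the same L2,1 value.
concentrated = [[2.0, 0.0], [0.0, 0.0]]  # one active row
spread = [[1.0, 0.0], [0.0, 1.0]]        # two active rows

print(l2p_penalty(concentrated, 1.0), l2p_penalty(spread, 1.0))
print(l2p_penalty(concentrated, 0.5), l2p_penalty(spread, 0.5))

# Rows 0 and 2 carry all the energy in this toy matrix.
W = [[3.0, 4.0], [0.0, 0.0], [0.6, 0.8], [0.0, 0.0]]
print(select_rows(W, 2))
```

With p = 1 the two toy matrices receive the same penalty, whereas with p = 0.5 the matrix with a single active row is penalized less. This is one way to read the claim that p < 1 enforces stronger row sparsity and so discourages selecting redundant frames.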