IEEE Trans Pattern Anal Mach Intell. 2024 Jul;46(7):4579-4596. doi: 10.1109/TPAMI.2024.3356548. Epub 2024 Jun 5.
Almost all digital videos are coded into compact representations before being transmitted. Such compact representations must be decoded back to pixels before being displayed to humans and, typically, before being enhanced or analyzed by machine vision algorithms. Intuitively, it is more efficient to enhance or analyze the coded representations directly, without decoding them into pixels. Therefore, we propose a versatile neural video coding (VNVC) framework that learns compact representations supporting both reconstruction and direct enhancement/analysis, making it versatile for both human and machine vision. Our VNVC framework has a feature-based compression loop. In the loop, each frame is encoded into compact representations and decoded to an intermediate feature obtained before reconstruction. This intermediate feature serves as the reference for motion estimation and motion compensation, via feature-based temporal context mining and a cross-domain motion encoder-decoder, to compress subsequent frames. To evaluate its effectiveness, the intermediate feature is fed directly into video reconstruction, video enhancement, and video analysis networks. The evaluation shows that our framework, operating on the intermediate feature, achieves high compression efficiency for video reconstruction and satisfactory task performance at lower complexity.
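The sketch below illustrates the feature-based compression loop described above, assuming a simplified form: a frame is encoded into a compact latent conditioned on the previous intermediate feature, the latent is decoded back to an intermediate feature (not pixels), and that feature both serves as the next reference and feeds reconstruction or task heads directly. All module names, channel sizes, and the concatenation-based conditioning are illustrative assumptions, not the authors' actual VNVC architecture.

```python
# Hypothetical minimal sketch of a feature-based compression loop; not the paper's implementation.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Encodes a frame, conditioned on the reference feature, into a compact latent."""
    def __init__(self, feat_ch=64, latent_ch=96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + feat_ch, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, latent_ch, 3, stride=2, padding=1),
        )
    def forward(self, frame, ref_feat):
        return self.net(torch.cat([frame, ref_feat], dim=1))

class FeatureDecoder(nn.Module):
    """Decodes the compact latent into an intermediate feature (before reconstruction)."""
    def __init__(self, feat_ch=64, latent_ch=96):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, feat_ch, 4, stride=2, padding=1),
        )
    def forward(self, latent):
        return self.net(latent)

class ReconstructionHead(nn.Module):
    """Maps the intermediate feature to pixels, only needed for human viewing."""
    def __init__(self, feat_ch=64):
        super().__init__()
        self.net = nn.Conv2d(feat_ch, 3, 3, padding=1)
    def forward(self, feat):
        return self.net(feat)

class TaskHead(nn.Module):
    """Stand-in for an enhancement/analysis network fed directly with the feature."""
    def __init__(self, feat_ch=64, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(feat_ch, num_classes))
    def forward(self, feat):
        return self.net(feat)

# Feature-based compression loop over a toy 4-frame clip.
enc, dec = FrameEncoder(), FeatureDecoder()
recon_head, task_head = ReconstructionHead(), TaskHead()
frames = torch.rand(4, 1, 3, 64, 64)       # T x B x C x H x W
ref_feat = torch.zeros(1, 64, 64, 64)       # empty reference for the first frame
for frame in frames:
    latent = enc(frame, ref_feat)           # compact representation (would be entropy-coded)
    feat = dec(latent)                      # intermediate feature, reused as the next reference
    pixels = recon_head(feat)               # reconstruct pixels only when humans need them
    logits = task_head(feat)                # machine-vision task runs on the feature directly
    ref_feat = feat
```

The key design point this sketch captures is that decoding stops at the intermediate feature: reconstruction to pixels becomes an optional head for human viewing, while enhancement and analysis networks consume the feature directly, avoiding the cost of full pixel decoding.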