Shin Chajin, Kim Yonghwan, Choi KwangPyo, Lee Sangyoun
School of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, Republic of Korea.
Samsung Seoul R&D Campus, Seoul 06765, Republic of Korea.
Sensors (Basel). 2025 Jul 17;25(14):4460. doi: 10.3390/s25144460.
In neural video compression, an approximation of the target frame is predicted, and a mask is subsequently applied to it. The masked predicted frame is then subtracted from the target frame, and the resulting residual is fed into the encoder along with the conditional information. However, this structure has two limitations. First, in the pixel domain, even if the mask is perfectly predicted, the residuals cannot be significantly reduced. Second, reconstructed features with abundant temporal context information cannot be used as references for compressing the next frame. To address these problems, we propose Conditional Masked Feature Residual (CMFR) Coding. Using neural networks, we extract features from the target frame and obtain predicted features. We then predict a mask and subtract the masked predicted features from the target features. This difference is fed into the encoder together with the conditional information. Moreover, to more effectively remove conditional information from the target frame, we introduce a Scaled Feature Fusion (SFF) module. In addition, we introduce a Motion Refiner to enhance the quality of the decoded optical flow. Experimental results show that our model achieves an 11.76% bit saving over the model without the proposed methods, averaged over all HEVC test sequences, demonstrating the effectiveness of the proposed methods.
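The core CMFR step described above (extract target features, predict a mask, subtract the masked predicted features, and encode the residual with the conditional information) can be sketched as follows. This is a minimal illustration assuming a PyTorch-style implementation; the module names (FeatureExtractor, MaskPredictor, encoder layers) and tensor shapes are hypothetical and are not taken from the paper or its released code.

```python
# Minimal sketch of the Conditional Masked Feature Residual (CMFR) step.
# All module definitions and channel sizes are illustrative assumptions.
import torch
import torch.nn as nn


class CMFRCodingStep(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # Extracts target features from the current (target) frame.
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Predicts a soft mask in [0, 1] from the predicted (temporal) features.
        self.mask_predictor = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid(),
        )
        # Encodes the feature residual together with conditional information.
        self.encoder = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1),
        )

    def forward(self, target_frame, predicted_features, condition):
        # Features of the target frame.
        target_features = self.feature_extractor(target_frame)
        # Mask applied to the predicted features.
        mask = self.mask_predictor(predicted_features)
        # Masked feature residual: target features minus masked predicted features.
        feature_residual = target_features - mask * predicted_features
        # Encode the residual concatenated with the conditional information.
        latent = self.encoder(torch.cat([feature_residual, condition], dim=1))
        return latent
```

In contrast to pixel-domain masked residual coding, the subtraction here happens between learned feature maps, which is what allows the reconstructed features (rich in temporal context) to serve as references for the next frame.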