IEEE Trans Image Process. 2020;29:214-224. doi: 10.1109/TIP.2019.2925550. Epub 2019 Jul 17.
Compositing is one of the most important editing operations for images and videos. The process of improving the realism of composite results is often called harmonization. Previous approaches to harmonization have focused mainly on images. In this paper, we take one step further and tackle the problem of video harmonization. Specifically, we train a convolutional neural network in an adversarial manner, exploiting a pixel-wise disharmony discriminator to achieve more realistic harmonized results and introducing a temporal loss to increase temporal consistency between consecutive harmonized frames. Thanks to the pixel-wise disharmony discriminator, we are also able to relieve the need for input foreground masks. Since existing video datasets with ground-truth foreground masks and optical flows are not sufficiently large, we propose a simple yet effective method to build a synthetic dataset that supports supervised training of the proposed adversarial network. The experiments show that training on our synthetic dataset generalizes well to a real-world composite dataset. In addition, our method successfully incorporates temporal consistency during training and achieves more harmonious visual results than previous methods.
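The abstract describes a temporal loss that penalizes differences between consecutive harmonized frames after aligning them with optical flow. The paper's exact formulation is not given here; the following is a minimal NumPy sketch of the general idea, using a hypothetical nearest-neighbor flow warp (real systems typically use bilinear sampling and an occlusion mask):

```python
import numpy as np

def warp_with_flow(frame, flow):
    """Warp a frame (H, W, C) toward the next time step using a backward
    optical flow field (H, W, 2); nearest-neighbor sampling for simplicity."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return frame[src_y, src_x]

def temporal_loss(curr, prev, flow):
    """Mean squared difference between the current harmonized frame and
    the previous harmonized frame warped to the current time step.
    Illustrative only -- not the authors' exact loss."""
    return float(np.mean((curr - warp_with_flow(prev, flow)) ** 2))

# Identical frames under zero flow should incur no temporal penalty.
f = np.random.rand(4, 4, 3)
zero_flow = np.zeros((4, 4, 2))
print(temporal_loss(f, f, zero_flow))  # 0.0
```

Minimizing such a term during training discourages flicker between consecutive harmonized frames without constraining each frame's appearance on its own.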