
Unsupervised Low-Light Video Enhancement With Spatial-Temporal Co-Attention Transformer

Authors

Lv Xiaoqian, Zhang Shengping, Wang Chenyang, Zhang Weigang, Yao Hongxun, Huang Qingming

Publication

IEEE Trans Image Process. 2023;32:4701-4715. doi: 10.1109/TIP.2023.3301332. Epub 2023 Aug 16.

Abstract

Existing low-light video enhancement methods are dominated by Convolutional Neural Networks (CNNs) that are trained in a supervised manner. Due to the difficulty of collecting paired dynamic low/normal-light videos in real-world scenes, they are usually trained on synthetic, static, and uniform-motion videos, which undermines their generalization to real-world scenes. Additionally, these methods typically suffer from temporal inconsistency (e.g., flickering artifacts and motion blur) when handling large-scale motions, since the local perception property of CNNs limits their ability to model long-range dependencies in both the spatial and temporal domains. To address these problems, we propose, to the best of our knowledge, the first unsupervised method for low-light video enhancement, named LightenFormer, which models long-range intra- and inter-frame dependencies with a spatial-temporal co-attention transformer to enhance brightness while maintaining temporal consistency. Specifically, an effective yet lightweight S-curve Estimation Network (SCENet) is first proposed to estimate pixel-wise S-shaped non-linear curves (S-curves) that adaptively adjust the dynamic range of an input video. Next, to model the temporal consistency of the video, we present a Spatial-Temporal Refinement Network (STRNet) to refine the enhanced video. The core module of STRNet is a novel Spatial-Temporal Co-attention Transformer (STCAT), which exploits multi-scale self- and cross-attention interactions to capture long-range correlations in both the spatial and temporal domains among frames for implicit motion estimation. To achieve unsupervised training, we further propose two non-reference loss functions based on the invertibility of the S-curve and the noise independence among frames. Extensive experiments on the SDSD and LLIV-Phone datasets demonstrate that our LightenFormer outperforms state-of-the-art methods.
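The abstract only outlines the pipeline, so the following is a minimal, illustrative PyTorch sketch of the two ideas it names: a pixel-wise invertible S-curve adjustment (the kind of output SCENet produces) and a single-scale cross-frame attention layer of the flavor STCAT builds on for implicit motion estimation. The specific curve family, module names, tensor shapes, and hyperparameters here (apply_s_curve, CrossFrameAttention, the quadratic curve form) are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: the paper's exact S-curve parameterization and
# STCAT layer layout are not given in the abstract, so everything below is
# an assumed, simplified stand-in.
import torch
import torch.nn as nn


def apply_s_curve(frame: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Apply a pixel-wise, invertible tone curve to a frame.

    frame: (B, 3, H, W) in [0, 1]; alpha: (B, 3, H, W) per-pixel curve
    strength predicted by an SCENet-like estimator. The quadratic form
    x + a*x*(1-x) is one common invertible curve family (monotonic for
    |a| <= 1); the paper's actual S-curve may differ.
    """
    return frame + alpha * frame * (1.0 - frame)


class CrossFrameAttention(nn.Module):
    """Cross-attention between a target frame and a neighboring frame.

    Queries come from the target-frame features, keys/values from the
    neighbor, so temporal correspondences are captured without explicit
    optical flow (the "implicit motion estimation" the abstract mentions).
    """

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, target_feat: torch.Tensor, neighbor_feat: torch.Tensor) -> torch.Tensor:
        # target_feat, neighbor_feat: (B, C, H, W) feature maps.
        b, c, h, w = target_feat.shape
        q_tokens = target_feat.flatten(2).transpose(1, 2)      # (B, HW, C)
        kv_tokens = neighbor_feat.flatten(2).transpose(1, 2)   # (B, HW, C)
        out, _ = self.attn(self.norm_q(q_tokens),
                           self.norm_kv(kv_tokens),
                           self.norm_kv(kv_tokens))            # cross-attention
        fused = q_tokens + out                                 # residual connection
        return fused.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    frame = torch.rand(1, 3, 64, 64)        # a low-light frame in [0, 1]
    alpha = torch.rand(1, 3, 64, 64)        # hypothetical SCENet output
    enhanced = apply_s_curve(frame, alpha)

    feats_t = torch.rand(1, 32, 16, 16)     # target-frame features
    feats_n = torch.rand(1, 32, 16, 16)     # neighbor-frame features
    fused = CrossFrameAttention(dim=32)(feats_t, feats_n)
    print(enhanced.shape, fused.shape)
```

In the full method, self-attention within a frame and cross-attention across frames would be combined at multiple scales; the single layer above is only meant to make the co-attention idea concrete.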

