Cao Yuqin, Min Xiongkuo, Sun Wei, Zhai Guangtao
IEEE Trans Image Process. 2023;32:1882-1896. doi: 10.1109/TIP.2023.3251695. Epub 2023 Mar 21.
With the popularity of the mobile Internet, audio and video (A/V) have become a primary medium for people's daily entertainment and social interaction. However, to reduce the cost of media storage and transmission, service providers compress A/V signals before transmitting them to end-users, which inevitably distorts the signals and degrades the end-user's Quality of Experience (QoE). This motivates us to study objective audio-visual quality assessment (AVQA). Most previous work in AVQA focuses only on single-modality audio or visual signals, ignoring that users' perceptual quality depends on both the audio and the video signal. We therefore propose an objective AVQA architecture for multi-modal signals based on attentional neural networks. Specifically, we first utilize an attention prediction model to extract the salient regions of video frames. Then, a pre-trained convolutional neural network extracts short-time features from the salient regions and the corresponding audio signals. Next, the short-time features are fed into Gated Recurrent Unit (GRU) networks to model the temporal relationship between adjacent frames. Finally, fully connected layers fuse the temporally modeled A/V features produced by the GRU networks into the final quality score. The proposed architecture is flexible and can be applied to both full-reference and no-reference AVQA. Experimental results on the LIVE-SJTU Database and the UnB-AVC Database demonstrate that our model outperforms state-of-the-art AVQA methods. The code of the proposed method will be made publicly available to promote the development of the AVQA field.
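The pipeline described above (per-modality short-time features → GRU temporal modeling → fully connected fusion into a single quality score) can be sketched as follows. This is a minimal, untrained NumPy illustration, not the authors' implementation: it assumes the CNN features of the salient video regions and of the audio signal are already extracted, and all dimensions, weight initializations, and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell with random (untrained) weights, used here only to
    illustrate temporal modeling over per-frame feature vectors."""
    def __init__(self, in_dim, hid_dim):
        s = 1.0 / np.sqrt(hid_dim)
        def w(*shape):
            return rng.uniform(-s, s, shape)
        self.Wz, self.Uz = w(hid_dim, in_dim), w(hid_dim, hid_dim)  # update gate
        self.Wr, self.Ur = w(hid_dim, in_dim), w(hid_dim, hid_dim)  # reset gate
        self.Wh, self.Uh = w(hid_dim, in_dim), w(hid_dim, hid_dim)  # candidate state

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h)
        r = sigmoid(self.Wr @ x + self.Ur @ h)
        h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h))
        return (1.0 - z) * h + z * h_tilde

def run_gru(cell, seq, hid_dim):
    """Unroll the GRU over a (T, in_dim) feature sequence; return final state."""
    h = np.zeros(hid_dim)
    for x in seq:
        h = cell.step(x, h)
    return h

# Hypothetical dimensions: T frames, CNN feature sizes for video and audio.
T, v_dim, a_dim, hid = 16, 128, 64, 32

# Stand-ins for the short-time CNN features of salient regions / audio.
video_feats = rng.standard_normal((T, v_dim))
audio_feats = rng.standard_normal((T, a_dim))

# One GRU branch per modality, as in the described architecture.
v_gru, a_gru = GRUCell(v_dim, hid), GRUCell(a_dim, hid)
hv = run_gru(v_gru, video_feats, hid)
ha = run_gru(a_gru, audio_feats, hid)

# Fully connected layer fusing both temporal representations into one score.
W_fc = rng.uniform(-0.1, 0.1, (1, 2 * hid))
score = (W_fc @ np.concatenate([hv, ha])).item()
print(score)
```

In the actual model the GRU and fusion weights would be learned by regressing against subjective quality scores; the sketch only shows how the two modality branches are unrolled in time and concatenated before the final prediction.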