Xiongkuo Min, Guangtao Zhai, Jiantao Zhou, Mylene C. Q. Farias, Alan Conrad Bovik
IEEE Trans Image Process. 2020 Apr 21. doi: 10.1109/TIP.2020.2988148.
The topics of visual and audio quality assessment (QA) have been widely researched for decades, yet nearly all of this prior work has focused only on single-mode visual or audio signals. However, visual signals are rarely presented without accompanying audio, including in heavy-bandwidth video streaming applications. Moreover, the distortions that may separately (or conjointly) afflict the visual and audio signals collectively shape the user-perceived quality of experience (QoE). This motivated us to conduct a subjective study of audio and video (A/V) quality, which we then used to develop and compare A/V quality measurement models and algorithms. The new LIVE-SJTU Audio and Video Quality Assessment (A/V-QA) Database includes 336 A/V sequences that were generated from 14 original source contents by applying 24 different A/V distortion combinations to them. We then conducted a subjective A/V quality perception study on the database to better understand how humans perceive the overall combined quality of A/V signals. We also designed four different families of objective A/V quality prediction models, using a multimodal fusion strategy. The model families differ both in the unimodal audio and video quality prediction models that supply the direct signal measurements and in the way the two perceptual signal modes are combined. The objective models are built using both existing state-of-the-art audio and video quality prediction models and some new prediction models, as well as quality-predictive features delivered by a deep neural network. The fusion methods considered range from simple product combinations to learned mappings. Using the new subjective A/V database as a tool, we validated and tested all of the objective A/V quality prediction models. We will make the database publicly available to facilitate further research.
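A minimal sketch of the two fusion strategies named in the abstract may help make them concrete: a simple product of unimodal audio and video quality predictions, and a learned mapping (here a support vector regressor, one plausible choice for such a mapping) from the two unimodal scores to the overall A/V quality. The score arrays, the MOS model, and the SVR hyperparameters below are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of product-based vs. learned fusion of audio and video quality
# scores, evaluated against subjective scores with Spearman correlation
# (SROCC), a standard figure of merit in quality assessment studies.
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Hypothetical unimodal predictions for 336 distorted A/V sequences,
# each normalized to [0, 1] (higher = better quality).
n = 336
q_video = rng.uniform(0.2, 1.0, n)  # e.g., from a video quality model
q_audio = rng.uniform(0.2, 1.0, n)  # e.g., from an audio quality model

# Hypothetical subjective mean opinion scores (MOS) on roughly [0, 100].
mos = 100 * (q_video * q_audio) + rng.normal(0, 5, n)

# Strategy 1: simple product combination of the two unimodal scores.
q_product = q_video * q_audio
rho_product, _ = spearmanr(q_product, mos)

# Strategy 2: learned mapping from (audio, video) scores to MOS.
X = np.column_stack([q_audio, q_video])
X_train, X_test, y_train, y_test = train_test_split(
    X, mos, test_size=0.2, random_state=0)
svr = SVR(kernel="rbf", C=10.0, gamma="scale").fit(X_train, y_train)
rho_learned, _ = spearmanr(svr.predict(X_test), y_test)

print(f"product fusion SROCC: {rho_product:.3f}")
print(f"learned fusion SROCC: {rho_learned:.3f}")
```

In practice the learned mapping would be trained and evaluated with content-separated cross-validation splits so that no source content appears in both training and test sets; the simple split above is only for brevity.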