Madhusudana Pavan C, Birkbeck Neil, Wang Yilin, Adsumilli Balu, Bovik Alan C
IEEE Trans Image Process. 2023;32:5138-5152. doi: 10.1109/TIP.2023.3310344. Epub 2023 Sep 15.
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms. Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner. Distortion type identification and degradation level determination are employed as auxiliary tasks to train a deep learning model containing a deep Convolutional Neural Network (CNN) that extracts spatial features, as well as a recurrent unit that captures temporal information. The model is trained using a contrastive loss, and we therefore refer to this training framework and the resulting model as CONtrastive VIdeo Quality EstimaTor (CONVIQT). During testing, the weights of the trained model are frozen, and a linear regressor maps the learned features to quality scores in a no-reference (NR) setting. We conduct comprehensive evaluations of the proposed model against leading algorithms on multiple VQA databases containing wide ranges of spatial and temporal distortions. We analyze the correlations between model predictions and ground-truth quality ratings, and show that CONVIQT achieves competitive performance when compared to state-of-the-art NR-VQA models, even though it is not trained on those databases. Our ablation experiments demonstrate that the learned representations are highly robust and generalize well across synthetic and realistic distortions. Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
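To make the auxiliary task concrete, the following is a minimal sketch of a generic supervised contrastive loss in which clips sharing the same (distortion type, degradation level) label are treated as positives. It is an illustration only: the abstract does not specify the loss formulation, so the function name `supervised_contrastive_loss`, the `temperature` value, and the toy embeddings and labels are all assumptions, and the real model would produce its embeddings with the CNN and recurrent unit rather than by hand.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean InfoNCE-style loss; samples that share a label are positives.

    Hypothetical helper for illustration, not the paper's exact loss.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a positive contributes nothing
        # Temperature-scaled similarity of anchor i to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    if anchors == 0:
        raise ValueError("no anchor has a positive; need repeated labels")
    return total / anchors


# Toy labels: (distortion type, degradation level). Assumed values.
labels = [("blur", 1), ("blur", 1), ("compression", 3), ("compression", 3)]
# Embeddings that group same-label clips together ...
clustered = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# ... versus embeddings that place same-label clips far apart.
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

loss_clustered = supervised_contrastive_loss(clustered, labels)
loss_mixed = supervised_contrastive_loss(mixed, labels)
print(loss_clustered < loss_mixed)
```

In this sketch the loss is lower when clips with the same distortion type and level lie close together in embedding space, which is the property a contrastive objective of this kind rewards.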