IEEE Trans Image Process. 2017 Feb;26(2):1004-1016. doi: 10.1109/TIP.2016.2631888. Epub 2016 Nov 22.
The great content diversity of real-world digital images poses a grand challenge to image quality assessment (IQA) models, which are traditionally designed and validated on a handful of commonly used IQA databases with very limited content variation. To test the generalization capability and to facilitate the wide usage of IQA techniques in real-world applications, we establish a large-scale database named the Waterloo Exploration Database, which in its current state contains 4744 pristine natural images and 94,880 distorted images created from them. Instead of collecting the mean opinion score for each image via subjective testing, which is extremely difficult if not impossible at this scale, we present three alternative test criteria to evaluate the performance of IQA models, namely, the pristine/distorted image discriminability test, the listwise ranking consistency test, and the pairwise preference consistency test (P-test). We compare 20 well-known IQA models using the proposed criteria, which not only provide a stronger test in a more challenging testing environment for existing models, but also demonstrate the additional benefits of using the proposed database. For example, in the P-test, even for the best performing no-reference IQA model, more than 6 million failure cases against the model are "discovered" automatically out of over 1 billion test pairs. Furthermore, we discuss how the new database may be exploited using innovative approaches in the future, to reveal the weaknesses of existing IQA models, to provide insights on how to improve the models, and to shed light on how the next-generation IQA models may be developed. The database and code are made publicly available at: https://ece.uwaterloo.ca/~k29ma/exploration/.
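The idea behind a pairwise preference consistency check can be illustrated with a minimal sketch. The abstract does not define the P-test procedure in detail, so everything below is an assumption made for illustration: the function name `pairwise_preference_consistency`, the convention that a higher model score means better predicted quality, the choice to count ties as failures, and the toy data. It is not the authors' released code.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Image = TypeVar("Image")


def pairwise_preference_consistency(
    score: Callable[[Image], float],
    pairs: Iterable[Tuple[Image, Image]],
) -> Tuple[float, List[Tuple[Image, Image]]]:
    """Return (consistency ratio, failure cases) for an IQA model.

    Each pair is (better, worse): the first image is assumed to be of
    clearly higher quality than the second. `score` is a hypothetical
    model interface where a higher value means better predicted quality.
    """
    total = 0
    failures: List[Tuple[Image, Image]] = []
    for better, worse in pairs:
        total += 1
        # Assumption: a tie is counted as a failure, because the model
        # did not express the known preference.
        if score(better) <= score(worse):
            failures.append((better, worse))
    if total == 0:
        raise ValueError("at least one test pair is required")
    return 1.0 - len(failures) / total, failures


# Toy usage: each "image" is just a noise level, and the stand-in model
# scores quality as the negative of that level.
toy_pairs = [(0.1, 0.5), (0.2, 0.9), (0.7, 0.3)]  # third pair is mislabeled on purpose
ratio, failed = pairwise_preference_consistency(lambda noise: -noise, toy_pairs)
print(f"consistency = {ratio:.3f}, failures = {failed}")
```

Returning the failing pairs alongside the ratio mirrors the point made in the abstract: the failure cases themselves are of interest, since they expose where a model disagrees with a known quality preference.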