Zhang Zhichao, Sun Wei, Zhai Guangtao
Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200030, China.
Sensors (Basel). 2025 Jul 28;25(15):4668. doi: 10.3390/s25154668.
Recent breakthroughs in AI-generated content (AIGC) have transformed video creation, enabling systems to translate text, images, or audio into visually compelling stories. Yet reliable evaluation of these machine-crafted videos remains elusive, because quality is governed not only by spatial fidelity within individual frames but also by temporal coherence across frames and precise semantic alignment with the intended message. Sensor technologies also play a foundational role, as they determine the physical plausibility of AIGC outputs. In this perspective, we argue that multimodal large language models (MLLMs) are poised to become the cornerstone of next-generation video quality assessment (VQA). By jointly encoding cues from multiple modalities such as vision, language, sound, and even depth, MLLMs can leverage their powerful language understanding capabilities to assess scene composition, motion dynamics, and narrative consistency, overcoming the fragmentation of hand-engineered metrics and the poor generalization of CNN-based methods. We further provide a comprehensive analysis of current methodologies for assessing AIGC video quality, covering the evolution of generation models, dataset design, quality dimensions, and evaluation frameworks. Finally, we argue that advances in sensor fusion enable MLLMs to combine low-level physical constraints with high-level semantic interpretation, further improving the accuracy of visual quality assessment.
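To make the proposed evaluation paradigm concrete, the sketch below illustrates one plausible way an MLLM could score an AIGC video along the three quality axes the abstract names (spatial fidelity, temporal coherence, semantic alignment). This is a minimal illustration, not the authors' method: frame sampling uses the real OpenCV API, while `MLLMClient` and its `score` method are hypothetical placeholders standing in for any vision-language chat API.

```python
# Minimal sketch: MLLM-based VQA for AI-generated video.
# Assumptions: `client` is a hypothetical MLLM wrapper exposing
# score(images=..., text=...) -> dict; frame sampling via OpenCV is real.
import cv2


def sample_frames(path: str, num_frames: int = 8):
    """Uniformly sample frames so the model sees spatial and temporal cues."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames


# Prompt the language side of the MLLM to rate the three quality dimensions
# discussed above and return a machine-readable verdict.
PROMPT = (
    "Rate this AI-generated video from 1 (bad) to 5 (excellent) on: "
    "(a) spatial fidelity within frames, "
    "(b) temporal coherence across frames, "
    "(c) semantic alignment with the generation prompt: '{prompt}'. "
    'Reply as JSON: {{"spatial": x, "temporal": y, "alignment": z}}.'
)


def assess_video(client, video_path: str, text_prompt: str) -> dict:
    """Score one video; `client.score` stands in for any MLLM API call."""
    frames = sample_frames(video_path)
    return client.score(images=frames, text=PROMPT.format(prompt=text_prompt))
```

Uniform frame sampling is one simple design choice here; denser or motion-aware sampling would expose temporal artifacts (flicker, identity drift) at the cost of a longer multimodal context.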