Kellogg School of Management, Northwestern University, Evanston, IL, USA.
Media Lab, Massachusetts Institute of Technology, Cambridge, MA, USA.
Nat Commun. 2024 Sep 2;15(1):7629. doi: 10.1038/s41467-024-51998-z.
Recent advances in technology for hyper-realistic visual and audio effects provoke the concern that deepfake videos of political speeches will soon be indistinguishable from authentic video. We conduct 5 pre-registered randomized experiments with N = 2215 participants to evaluate how accurately humans distinguish real political speeches from fabrications across base rates of misinformation, audio sources, question framings with and without priming, and media modalities. We do not find that base rates of misinformation have statistically significant effects on discernment. We find that deepfakes with audio produced by state-of-the-art text-to-speech algorithms are harder to discern than the same deepfakes with voice-actor audio. Moreover, across all experiments and question framings, we find that audio and visual information enables more accurate discernment than text alone: human discernment relies more on how something is said (the audio-visual cues) than on what is said (the speech content).