White David, Dunn James D, Schmid Alexandra C, Kemp Richard I
School of Psychology, The University of New South Wales, Sydney, Australia.
School of Psychology, The University of Sydney, Sydney, Australia.
PLoS One. 2015 Oct 14;10(10):e0139827. doi: 10.1371/journal.pone.0139827. eCollection 2015.
In recent years, wide deployment of automatic face recognition systems has been accompanied by substantial gains in algorithm performance. However, benchmarking tests designed to evaluate these systems do not account for the errors of human operators, who are often an integral part of face recognition solutions in forensic and security settings. This causes a mismatch between evaluation tests and operational accuracy. We address this by measuring user performance in a face recognition system used to screen passport applications for identity fraud. Experiment 1 measured target detection accuracy in algorithm-generated 'candidate lists' selected from a large database of passport images. Accuracy was notably poorer than in previous studies of unfamiliar face matching: participants had error rates above 50% for adult target faces, and above 60% when matching images of children. Experiment 2 then compared performance of student participants to trained passport officers (who use the system in their daily work) and found equivalent performance in these groups. Encouragingly, a group of highly trained and experienced "facial examiners" outperformed these groups by 20 percentage points. We conclude that human performance curtails accuracy of face recognition systems, potentially reducing benchmark estimates by 50% in operational settings. Mere practice does not attenuate these limits, but superior performance of trained examiners suggests that recruitment and selection of human operators, in combination with effective training and mentorship, can improve the operational accuracy of face recognition systems.