IEEE Trans Pattern Anal Mach Intell. 2022 May;44(5):2689-2697. doi: 10.1109/TPAMI.2020.3038760. Epub 2022 Apr 1.
When demonstrating the effectiveness of a new algorithm, researchers are traditionally encouraged to compare its performance against existing algorithms on well-studied benchmark test suites. In the absence of more nuanced methodologies, algorithm performance is typically summarized as an average across the test suite examples. This paper highlights the potential bias of conclusions drawn by analyzing "on average" performance, and the opportunities offered by a recent testing methodology known as instance space analysis. To illustrate, we revisit our 2007 comparative study of algorithms for facial age estimation, and rigorously stress-test the algorithms to challenge the original conclusions. The case study demonstrates how the visualizations offered by instance space analysis give greater insight into each algorithm's unique strengths and weaknesses, and into which algorithm should be used when and why. Inspired by these insights, a new algorithm is proposed, and its unique advantage is demonstrated. The case study also illustrates the bias often hidden in well-studied datasets, and the ramifications of drawing biased conclusions. While focused on facial age estimation, the methodology and lessons learned from the case study are broadly applicable to any study seeking to draw conclusions about algorithm performance based on empirical results.