IEEE Trans Pattern Anal Mach Intell. 2020 Sep;42(9):2065-2081. doi: 10.1109/TPAMI.2019.2910523. Epub 2019 Apr 11.
Deep learning has revolutionized data science, and its popularity has recently grown exponentially, as has the number of papers employing deep networks. Vision tasks, such as human pose estimation, have not escaped this trend. There is a large number of deep models in which small changes in the network architecture or in the data pre-processing, together with the stochastic nature of the optimization procedures, produce notably different results, making it extremely difficult to identify methods that significantly outperform others. This situation motivates the current study, in which we perform a systematic evaluation and statistical analysis of vanilla deep regression, i.e., convolutional neural networks with a linear regression top layer. This is the first comprehensive analysis of deep regression techniques. We perform experiments on four vision problems, and report confidence intervals for the median performance as well as the statistical significance of the results, if any. Surprisingly, the variability due to different data pre-processing procedures generally eclipses the variability due to modifications in the network architecture. Our results reinforce the hypothesis that, in general, a general-purpose network (e.g., VGG-16 or ResNet-50), adequately tuned, can yield results close to the state of the art without resorting to more complex and ad hoc regression models.
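To make the idea of reporting a confidence interval for the median performance concrete, the following is a minimal sketch using a percentile bootstrap. The abstract does not say which interval procedure the authors used, so the bootstrap here is an assumption chosen for illustration, and the function name `bootstrap_median_ci`, its parameters, and the sample error values are all hypothetical.

```python
import random
import statistics


def bootstrap_median_ci(errors, confidence=0.95, n_resamples=2000, seed=0):
    """Percentile-bootstrap confidence interval for the median of `errors`.

    Illustrative only: the bootstrap is an assumed procedure, not
    necessarily the one used in the paper.
    Returns (lower bound, sample median, upper bound).
    """
    if not errors:
        raise ValueError("errors must be non-empty")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(errors)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(errors, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return medians[lo_idx], statistics.median(errors), medians[hi_idx]


if __name__ == "__main__":
    # Hypothetical per-sample errors of one trained network (made-up values).
    sample_errors = [4.1, 3.8, 5.2, 4.7, 3.9, 6.0, 4.4, 4.9, 5.5, 4.2]
    lower, median, upper = bootstrap_median_ci(sample_errors)
    print(f"median error {median:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Under this sketch, two configurations (for example, two pre-processing choices) could be compared by checking whether their intervals overlap, although a paired significance test would be needed for a formal conclusion.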