Huang Mengqing, Yu Hongchuan, Zhang Jianjun
National Centre for Computer Animation, Bournemouth University, Poole, BH12 5BB, UK.
Sci Rep. 2025 Mar 21;15(1):9747. doi: 10.1038/s41598-025-93005-5.
There is an ongoing and dedicated effort to estimate bounds on the generalization error of deep learning models, coupled with an increasing interest with practical metrics that can be used to experimentally evaluate a model's ability to generalize. This interest is not only driven by practical considerations but is also vital for theoretical research, as theoretical estimations require practical validation. However, there is currently a lack of research on benchmarking the generalization capacity of various deep networks and verifying these theoretical estimations. This paper aims to introduce a practical generalization metric for benchmarking different deep networks and proposes a novel testbed for the verification of theoretical estimations. Our findings indicate that a deep network's generalization capacity in classification tasks is contingent upon both classification accuracy and the diversity of unseen data. The proposed metric system is capable of quantifying the accuracy of deep learning models and the diversity of data, providing an intuitive and quantitative evaluation method - a trade-off point. Furthermore, we compare our practical metric with existing generalization theoretical estimations using our benchmarking testbed. It is discouraging to note that most of the available generalization estimations do not correlate with the practical measurements obtained using our testbed. On the other hand, this finding is significant as it exposes the shortcomings of theoretical estimations and inspires new exploration.
人们正在持续且专注地努力估计深度学习模型泛化误差的界限,同时,对于可用于通过实验评估模型泛化能力的实用指标的兴趣也与日俱增。这种兴趣不仅受到实际考量的驱动,对于理论研究也至关重要,因为理论估计需要实际验证。然而,目前缺乏关于对各种深度网络的泛化能力进行基准测试并验证这些理论估计的研究。本文旨在引入一种用于对不同深度网络进行基准测试的实用泛化指标,并提出一个用于验证理论估计的新型测试平台。我们的研究结果表明,深度网络在分类任务中的泛化能力取决于分类准确率和未见数据的多样性。所提出的指标系统能够量化深度学习模型的准确性和数据的多样性,提供一种直观且定量的评估方法——一个权衡点。此外,我们使用我们的基准测试平台将我们的实用指标与现有的泛化理论估计进行比较。值得注意的是,大多数现有的泛化估计与使用我们的测试平台获得的实际测量结果不相关,这令人沮丧。另一方面,这一发现意义重大,因为它揭示了理论估计的缺点并激发了新的探索。