Pernot Pascal, Savin Andreas
Institut de Chimie Physique, UMR8000, CNRS, Université Paris-Saclay, 91405 Orsay, France.
Laboratoire de Chimie Théorique, CNRS and UPMC Université Paris 06, Sorbonne Universités, 75252 Paris, France.
J Chem Phys. 2020 Apr 30;152(16):164108. doi: 10.1063/5.0006202.
The comparison of benchmark error sets is an essential tool for the evaluation of theories in computational chemistry. The standard ranking of methods by their mean unsigned error is unsatisfactory for several reasons linked to the non-normality of the error distributions and the presence of underlying trends. Complementary statistics have recently been proposed to palliate such deficiencies, such as quantiles of the absolute error distribution or the mean prediction uncertainty. We introduce here a new score, the systematic improvement probability, based on the direct system-wise comparison of absolute errors. Whatever the chosen scoring rule, the uncertainty of the statistics due to the incompleteness of the benchmark datasets is also generally overlooked. However, this uncertainty is essential to appreciate the robustness of rankings. In the present article, we develop two indicators based on robust statistics to address this problem: P_inv, the inversion probability between two values of a statistic, and P_r, the ranking probability matrix. We also demonstrate the essential contribution of the correlations between error sets in these score comparisons.
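The abstract names three quantities without giving formulas, so the following Python sketch only illustrates the ideas under stated assumptions and is not the authors' estimator. It assumes that the systematic improvement probability can be read as the fraction of systems for which one method has a smaller absolute error than another, and that the inversion probability and the ranking probability matrix can be approximated with a paired bootstrap over systems, using the mean unsigned error as the statistic. Resampling the same systems for all methods is what preserves the correlations between error sets. The function names, the choice of the mean unsigned error, and the bootstrap settings are all illustrative assumptions.

```python
import numpy as np


def sip(err_a, err_b):
    """Fraction of systems where method B has a smaller absolute error than A.

    Assumed reading of the 'systematic improvement probability'.
    """
    a = np.abs(np.asarray(err_a, dtype=float))
    b = np.abs(np.asarray(err_b, dtype=float))
    return float(np.mean(b < a))


def mue(err):
    """Mean unsigned error of one error set."""
    return float(np.mean(np.abs(err)))


def inversion_probability(err_a, err_b, stat=mue, n_boot=2000, seed=0):
    """Bootstrap estimate of how often the ordering of stat(A) and stat(B) flips.

    Paired resampling: the same system indices are drawn for both methods,
    so correlations between the two error sets are kept.
    """
    a = np.asarray(err_a, dtype=float)
    b = np.asarray(err_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = a.size
    observed = stat(a) - stat(b)
    inversions = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        # An inversion is a resampled difference of opposite sign.
        if (stat(a[idx]) - stat(b[idx])) * observed < 0:
            inversions += 1
    return inversions / n_boot


def ranking_probability_matrix(errors, n_boot=2000, seed=0):
    """Entry [i, k] estimates the probability that method i takes rank k.

    `errors` has shape (n_systems, n_methods); rank 0 is the smallest MUE.
    """
    e = np.abs(np.asarray(errors, dtype=float))
    n, m = e.shape
    rng = np.random.default_rng(seed)
    counts = np.zeros((m, m))
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        scores = e[idx].mean(axis=0)
        # Double argsort converts scores into ranks (0 = best).
        ranks = np.argsort(np.argsort(scores, kind="stable"), kind="stable")
        counts[np.arange(m), ranks] += 1
    return counts / n_boot
```

With signed error arrays `err_a` and `err_b` of equal length, `sip(err_a, err_b)` gives the system-wise score, while `inversion_probability(err_a, err_b)` indicates how fragile a ranking by mean unsigned error is for that benchmark size. A value close to 0.5 would suggest that the two methods cannot be reliably ordered.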