Zollanvari Amin, Dougherty Edward R
Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843; Department of Statistics, Texas A&M University, College Station, TX 77843.
Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843; Translational Genomics Research Institute (TGEN), Phoenix, AZ 85004.
Pattern Recognit. 2014 Jun 1;47(6):2178-2192. doi: 10.1016/j.patcog.2013.11.022.
The most important aspect of any classifier is its error rate, because this quantifies its predictive capacity. Thus, the accuracy of error estimation is critical. Error estimation is problematic in small-sample classifier design because the error must be estimated using the same data from which the classifier has been designed. Prior knowledge can help. When it is expressed as a prior distribution on an uncertainty class of feature-label distributions to which the true, but unknown, feature-label distribution belongs, it can facilitate accurate error estimation (in the mean-square sense) in circumstances where accurate, completely model-free error estimation is impossible. This paper provides analytic, asymptotically exact finite-sample approximations for various performance metrics of the resulting Bayesian Minimum Mean-Square-Error (MMSE) error estimator in the case of linear discriminant analysis (LDA) in the multivariate Gaussian model. These performance metrics include the first and second moments of the Bayesian MMSE error estimator and its cross moment with the true error of LDA, and therefore the Root-Mean-Square (RMS) error of the estimator. We lay down the theoretical groundwork for Kolmogorov double-asymptotics in a Bayesian setting, which enables us to derive asymptotic expressions of the desired performance metrics. From these we produce analytic finite-sample approximations and demonstrate their accuracy via numerical examples. Various examples illustrate the behavior of these approximations and their use in determining the sample size necessary to achieve a desired RMS error. The Supplementary Material contains derivations for some equations and additional figures.
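The link between the moments and the RMS error mentioned above is the identity RMS^2 = E[est^2] - 2 E[est * true] + E[true^2], where est is the estimated error and true is the true error of the designed classifier. The following is a minimal Monte Carlo sketch of that relationship, not the analytic approximations of the paper. It makes several assumptions that are not in the abstract: resubstitution is swapped in as the error estimator (the closed form of the Bayesian MMSE estimator is not reproduced here), the two classes are equally likely with a common identity covariance, and the dimension, means, sample size, and all function names are illustrative.

```python
import math
import numpy as np


def std_normal_cdf(z):
    """Standard normal CDF evaluated at a scalar z."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def lda_true_error(m0, m1, S, mu0, mu1, Sigma):
    """Exact error of the designed LDA rule under the true Gaussian model
    (equal class priors assumed)."""
    a = np.linalg.solve(S, m0 - m1)        # discriminant direction
    mid = 0.5 * (m0 + m1)                  # decision threshold point
    s = math.sqrt(a @ Sigma @ a)           # std. dev. of the discriminant
    e0 = std_normal_cdf(-((mu0 - mid) @ a) / s)  # class 0 labeled as 1
    e1 = std_normal_cdf(((mu1 - mid) @ a) / s)   # class 1 labeled as 0
    return 0.5 * (e0 + e1)


def resubstitution_error(X0, X1, m0, m1, S):
    """Fraction of training points misclassified by the designed LDA rule.
    Used here only as a stand-in for the Bayesian MMSE estimator."""
    a = np.linalg.solve(S, m0 - m1)
    mid = 0.5 * (m0 + m1)
    wrong0 = np.sum((X0 - mid) @ a <= 0)
    wrong1 = np.sum((X1 - mid) @ a > 0)
    return (wrong0 + wrong1) / (len(X0) + len(X1))


def simulate_moments(n_per_class=15, p=3, n_trials=2000, seed=0):
    """Monte Carlo estimates of the first, second, and cross moments of an
    error estimator and the true error, plus the RMS error they imply."""
    rng = np.random.default_rng(seed)
    mu0, mu1, Sigma = np.zeros(p), np.full(p, 0.8), np.eye(p)  # assumed model
    est, true = np.empty(n_trials), np.empty(n_trials)
    for t in range(n_trials):
        X0 = rng.multivariate_normal(mu0, Sigma, size=n_per_class)
        X1 = rng.multivariate_normal(mu1, Sigma, size=n_per_class)
        m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
        # Pooled sample covariance with 2n - 2 degrees of freedom.
        S = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / (2 * n_per_class - 2)
        est[t] = resubstitution_error(X0, X1, m0, m1, S)
        true[t] = lda_true_error(m0, m1, S, mu0, mu1, Sigma)
    moments = {
        "E[est]": est.mean(),
        "E[true]": true.mean(),
        "E[est^2]": np.mean(est ** 2),
        "E[true^2]": np.mean(true ** 2),
        "E[est*true]": np.mean(est * true),
    }
    mse = moments["E[est^2]"] - 2.0 * moments["E[est*true]"] + moments["E[true^2]"]
    rms = math.sqrt(max(mse, 0.0))
    return moments, rms, est, true


if __name__ == "__main__":
    moments, rms, _, _ = simulate_moments()
    for name, value in moments.items():
        print(f"{name}: {value:.4f}")
    print(f"RMS: {rms:.4f}")
```

Rerunning `simulate_moments` over a range of `n_per_class` values and picking the smallest one whose RMS falls below a target mimics, by brute force, the sample-size determination that the analytic approximations are meant to make cheap.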