Cheap robust learning of data anomalies with analytically solvable entropic outlier sparsification.

Author information

Horenko Illia

Affiliation

Faculty of Informatics, Institute of Computing, Università della Svizzera Italiana, TI-6900 Lugano, Switzerland

Publication information

Proc Natl Acad Sci U S A. 2022 Mar 1;119(9). doi: 10.1073/pnas.2119659119.

DOI:10.1073/pnas.2119659119

PMID:35197293

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8917346/

Abstract

Entropic outlier sparsification (EOS) is proposed as a cheap and robust computational strategy for learning in the presence of data anomalies and outliers. EOS dwells on the derived analytic solution of the (weighted) expected loss minimization problem subject to Shannon entropy regularization. An identified closed-form solution is proven to impose additional costs that depend linearly on statistics size and are independent of data dimension. Obtained analytic results also explain why the mixtures of spherically symmetric Gaussians, used heuristically in many popular data analysis algorithms, represent an optimal and least-biased choice for the nonparametric probability distributions when working with squared Euclidean distances. The performance of EOS is compared to a range of commonly used tools on synthetic problems and on partially mislabeled supervised classification problems from biomedicine. Applying EOS for coinference of data anomalies during learning is shown to allow reaching an accuracy of [Formula: see text] when predicting patient mortality after heart failure, statistically significantly outperforming predictive performance of common learning tools for the same data.
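
The abstract does not state the closed-form solution itself, so the following is only a minimal sketch of the general idea, under the assumption that the regularized problem takes the standard form: minimize sum_i w_i * loss_i + eps * sum_i w_i * log(w_i) over probability weights w. For that assumed form, the minimizer is the well-known softmax of the negative losses scaled by 1/eps. The function name `entropic_weights`, the parameter `eps`, and the toy loss values are hypothetical and not taken from the paper.

```python
import numpy as np

def entropic_weights(losses, eps):
    """Closed-form minimizer of sum_i w_i*loss_i + eps*sum_i w_i*log(w_i)
    over the probability simplex (assumed problem form, not the paper's code)."""
    losses = np.asarray(losses, dtype=float)
    # Subtract the minimum loss before exponentiating for numerical stability;
    # the shift cancels in the normalization below.
    z = -(losses - losses.min()) / eps
    w = np.exp(z)
    return w / w.sum()

# Toy per-sample losses: the last sample behaves like an outlier.
losses = np.array([0.1, 0.2, 0.15, 5.0])
w = entropic_weights(losses, eps=0.5)
```

Given the per-sample losses, this computation takes one pass over the samples and never touches the feature dimension, which is in line with the cost statement in the abstract. In this sketch, a small `eps` pushes the weight of high-loss samples toward zero, while a very large `eps` approaches uniform weights.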
