de Bodt Cyril, Mulders Dounia, Verleysen Michel, Lee John Aldo
IEEE Trans Neural Netw Learn Syst. 2019 Apr;30(4):1166-1179. doi: 10.1109/TNNLS.2018.2861891.
Dimensionality reduction (DR) aims at faithfully and meaningfully representing high-dimensional (HD) data in a low-dimensional (LD) space. Recently developed neighbor embedding DR methods deliver outstanding performance thanks to their ability to foil the curse of dimensionality. Unfortunately, they cannot be directly employed on incomplete data sets, which have become ubiquitous in machine learning. Discarding samples with missing features prevents the computation of their LD coordinates and degrades the treatment of the complete samples. Common missing data imputation schemes are not appropriate in the nonlinear DR context either. Indeed, even if they model the data distribution in the feature space, they can, at best, enable the application of a DR scheme to the expected data set. In practice, one would, instead, like to obtain the LD embedding whose cost function value is, on average, closest to that of the complete data case. As the state-of-the-art DR techniques are nonlinear, the latter embedding results from minimizing the expected cost function on the incomplete database, not from considering the expected data set. This paper addresses these limitations by developing a general methodology for nonlinear DR with missing data, which is directly applicable with any DR scheme optimizing some criterion. In order to model the feature dependences, an HD extension of Gaussian mixture models is first fitted on the incomplete data set. It is afterward employed under the multiple imputation paradigm to obtain a single relevant LD embedding, thus minimizing the cost function expectation. Extensive experiments demonstrate the superiority of the suggested framework over alternative approaches.
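The pipeline described in the abstract (fit a Gaussian mixture on the incomplete data, draw multiple imputations from it, and produce a single embedding reflecting the expected HD geometry rather than a single imputed data set) can be illustrated with a minimal sketch. This is not the paper's method: it substitutes an ordinary `sklearn` `GaussianMixture` fitted on mean-imputed data for the paper's HD GMM extension, samples missing entries from each point's most likely component marginal instead of proper conditional distributions, and averages pairwise distance matrices across imputations as a crude proxy for minimizing the expected cost function.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.mixture import GaussianMixture
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Toy HD data set with 10% of entries missing completely at random.
X = rng.normal(size=(200, 10))
mask = rng.random(X.shape) < 0.1
X_miss = X.copy()
X_miss[mask] = np.nan

# Stand-in for the paper's HD GMM fitted on incomplete data:
# an ordinary GMM fitted on a mean-imputed copy.
col_means = np.nanmean(X_miss, axis=0)
X_mean = np.where(np.isnan(X_miss), col_means, X_miss)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X_mean)
labels = gmm.predict(X_mean)

# Multiple imputation: draw M completions by sampling each missing
# entry from the marginal of the point's most likely mixture component
# (a simplification of conditioning on the observed features).
M = 5
rows, cols = np.where(np.isnan(X_miss))
D_avg = np.zeros((len(X), len(X)))
for m in range(M):
    X_imp = X_mean.copy()
    for i, j in zip(rows, cols):
        k = labels[i]
        mu = gmm.means_[k, j]
        sd = np.sqrt(gmm.covariances_[k, j, j])
        X_imp[i, j] = rng.normal(mu, sd)
    D_avg += squareform(pdist(X_imp))
D_avg /= M  # averaged pairwise distances over the M imputations

# One LD embedding computed from the averaged (expected) HD geometry,
# instead of one embedding per imputed data set.
emb = TSNE(n_components=2, metric="precomputed",
           init="random", random_state=0).fit_transform(D_avg)
```

The key design point mirrored here is that the M imputations are merged *before* the nonlinear DR step, so a single embedding is optimized against the expected geometry; running DR separately on each completion would yield M incomparable embeddings with no canonical way to combine them.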