Zhang Min, Mishne Gal, Chi Eric C
Department of Statistics, North Carolina State University, Raleigh, North Carolina, USA.
Halıcıoğlu Data Science Institute, University of California, San Diego, California, USA.
Stat Anal Data Min. 2022 Jun;15(3):303-313. doi: 10.1002/sam.11561. Epub 2021 Nov 5.
Many machine learning algorithms depend on weights that quantify row and column similarities of a data matrix. The choice of weights can dramatically impact the effectiveness of the algorithm. Nonetheless, the problem of choosing weights has arguably not been given enough study. When a data matrix is completely observed, Gaussian kernel affinities can be used to quantify the local similarity between pairs of rows and pairs of columns. Computing weights in the presence of missing data, however, becomes challenging. In this paper, we propose a new method that builds on a co-clustering technique to construct row and column affinities even when data are missing. The method solves the optimization problem for multiple pairs of cost parameters and fills in the missing values with increasingly smooth estimates, exploiting the coupled similarity structure among the rows and columns of the data matrix. We show that these affinities can be used to perform tasks such as data imputation, clustering, and matrix completion on graphs.
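To illustrate the fully observed case mentioned in the abstract, the following is a minimal sketch of Gaussian kernel affinities between pairs of rows and pairs of columns. It does not reproduce the paper's method for missing data. The function names, the bandwidth parameters `sigma_row` and `sigma_col`, and the kernel scaling `exp(-d^2 / sigma^2)` are assumptions chosen for illustration; the abstract does not specify them.

```python
import math


def gaussian_affinity(vectors, sigma=1.0):
    """Return w[i][j] = exp(-||v_i - v_j||^2 / sigma^2) for a list of vectors.

    The scaling by sigma^2 is one common convention (an assumption here).
    """
    n = len(vectors)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between vectors i and j.
            sq_dist = sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
            w[i][j] = math.exp(-sq_dist / sigma ** 2)
    return w


def row_and_column_affinities(X, sigma_row=1.0, sigma_col=1.0):
    """Compute row and column affinity matrices of a fully observed matrix X."""
    columns = [list(col) for col in zip(*X)]  # transpose to get column vectors
    return gaussian_affinity(X, sigma_row), gaussian_affinity(columns, sigma_col)


# Toy data: rows 0 and 1 are close to each other, row 2 is far from both.
X = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [5.0, 0.0, 1.0],
]
W_row, W_col = row_and_column_affinities(X)
```

In this toy matrix, the affinity between rows 0 and 1 is close to 1 and the affinity between rows 0 and 2 is close to 0. The sketch requires every entry of `X` to be observed, which is the limitation the paper sets out to address.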