Multi-scale affinities with missing data: Estimation and applications.

Author information

Zhang Min, Mishne Gal, Chi Eric C

Affiliations

Department of Statistics, North Carolina State University, Raleigh, North Carolina, USA.

Halıcıoğlu Data Science Institute, University of California, San Diego, California, USA.

Publication information

Stat Anal Data Min. 2022 Jun;15(3):303-313. doi: 10.1002/sam.11561. Epub 2021 Nov 5.

DOI:10.1002/sam.11561

PMID:35756358

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9216212/

Abstract

Many machine learning algorithms depend on weights that quantify row and column similarities of a data matrix. The choice of weights can dramatically impact the effectiveness of the algorithm. Nonetheless, the problem of choosing weights has arguably not been given enough study. When a data matrix is completely observed, Gaussian kernel affinities can be used to quantify the local similarity between pairs of rows and pairs of columns. Computing weights in the presence of missing data, however, becomes challenging. In this paper, we propose a new method to construct row and column affinities even when data are missing by building off a co-clustering technique. This method takes advantage of solving the optimization problem for multiple pairs of cost parameters and filling in the missing values with increasingly smooth estimates. It exploits the coupled similarity structure among both the rows and columns of a data matrix. We show these affinities can be used to perform tasks such as data imputation, clustering, and matrix completion on graphs.
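As a point of reference for the fully observed case mentioned in the abstract, the following is a minimal sketch of Gaussian kernel affinities between pairs of rows and pairs of columns. It is not the authors' implementation and does not cover the missing-data method the paper proposes. The function name `gaussian_affinity`, the median-based bandwidth, and the random test matrix are illustrative assumptions, not details from the paper.

```python
import numpy as np


def gaussian_affinity(X, sigma=None):
    """Gaussian kernel affinities between the rows of a fully observed matrix.

    Hypothetical helper for illustration: W[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    """
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances between rows.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    np.fill_diagonal(sq_dists, 0.0)       # distance from a row to itself is zero

    if sigma is None:
        # Assumption: median heuristic for the bandwidth (not taken from the paper).
        off_diag = sq_dists[~np.eye(X.shape[0], dtype=bool)]
        sigma = np.sqrt(np.median(off_diag))
        if sigma == 0.0:
            sigma = 1.0  # degenerate case: all rows identical

    return np.exp(-sq_dists / (2.0 * sigma ** 2))


# Illustrative data: a small, completely observed 6 x 4 matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

W_rows = gaussian_affinity(X)    # 6 x 6 affinities between pairs of rows
W_cols = gaussian_affinity(X.T)  # 4 x 4 affinities between pairs of columns
```

Column affinities are obtained by applying the same function to the transpose. When entries are missing, the distances above cannot be computed directly, which is the difficulty the paper addresses.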
