Stanley Jay S, Gigante Scott, Wolf Guy, Krishnaswamy Smita
Yale University, Appl. Math. Prog.
Yale University, Comp. Bio. & Bioinf. Prog.
Proc SIAM Int Conf Data Min. 2020;2020:316-324. doi: 10.1137/1.9781611976236.36.
We propose a novel framework for combining datasets via alignment of their intrinsic geometry. This alignment can be used to fuse data originating from disparate modalities, or to correct batch effects while preserving intrinsic data structure. Importantly, we do not assume any pointwise correspondence between datasets, but instead rely on correspondence between a (possibly unknown) subset of data features. We leverage this assumption to construct an isometric alignment between the data. This alignment is obtained by relating the expansion of data features in harmonics derived from diffusion operators defined over each dataset. These expansions encode each feature as a function of the data geometry. We use these encodings to relate the diffusion coordinates of each dataset through our assumption of partial feature correspondence. Then, a unified diffusion geometry is constructed over the aligned data, which can also be used to correct the original data measurements. We demonstrate our method on several datasets, showing in particular its effectiveness in biological applications, including fusion of single-cell RNA sequencing (scRNA-seq) and single-cell ATAC sequencing (scATAC-seq) data measured on the same population of cells, and removal of batch effects between biological samples.
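To make the idea of aligning datasets through feature expansions in diffusion harmonics more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It makes several simplifying assumptions that do not come from the abstract: all features correspond between the two datasets, the harmonics are the leading eigenvectors of a symmetrically normalized Gaussian affinity matrix with a median-distance bandwidth, and the isometric alignment is taken to be the orthogonal Procrustes solution between the two sets of feature coefficients. The function names `diffusion_harmonics` and `align_harmonics` are hypothetical.

```python
import numpy as np


def diffusion_harmonics(data, n_harmonics):
    """Leading eigenvectors of a symmetrically normalized Gaussian affinity matrix."""
    sq_dists = np.sum((data[:, None, :] - data[None, :, :]) ** 2, axis=-1)
    bandwidth = np.median(sq_dists)  # assumed heuristic, not taken from the paper
    kernel = np.exp(-sq_dists / bandwidth)
    inv_sqrt_degree = 1.0 / np.sqrt(kernel.sum(axis=1))
    operator = kernel * inv_sqrt_degree[:, None] * inv_sqrt_degree[None, :]
    _, vectors = np.linalg.eigh(operator)  # eigenvalues come back in ascending order
    return vectors[:, ::-1][:, :n_harmonics]  # keep the top n_harmonics


def align_harmonics(data_a, data_b, n_harmonics):
    """Rotate the harmonics of data_b into the frame of data_a via shared features."""
    phi_a = diffusion_harmonics(data_a, n_harmonics)
    phi_b = diffusion_harmonics(data_b, n_harmonics)
    # Expand every feature (column) in the harmonics of its own dataset.
    coeffs_a = phi_a.T @ data_a  # shape (n_harmonics, n_features)
    coeffs_b = phi_b.T @ data_b
    # Orthogonal Procrustes: the rotation minimizing ||coeffs_a - R @ coeffs_b||_F.
    u, _, vt = np.linalg.svd(coeffs_a @ coeffs_b.T)
    rotation = u @ vt
    return phi_a, phi_b @ rotation.T


# Toy check: the second dataset is the first with its rows shuffled, so no
# pointwise correspondence is given to the alignment, only shared features.
rng = np.random.default_rng(0)
x = rng.normal(size=(60, 8))
perm = rng.permutation(60)
y = x[perm]
coords_x, coords_y = align_harmonics(x, y, n_harmonics=4)
print("max coordinate mismatch:", np.abs(coords_y - coords_x[perm]).max())
```

In this toy setting the two datasets share exactly the same geometry, so the rotated coordinates of the shuffled copy should coincide with the correspondingly shuffled coordinates of the original. Real applications with only partial feature correspondence, noise, or different modalities would need more than this sketch provides.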