Zhu Qin, Jiang Zuzhi, Thomson Matt, Gartner Zev
Department of Pharmaceutical Chemistry, University of California San Francisco; San Francisco, CA 94158, USA.
Tetrad Graduate Program, University of California San Francisco; San Francisco, CA 94158, USA.
bioRxiv. 2025 Apr 11:2025.03.13.643146. doi: 10.1101/2025.03.13.643146.
Batch integration, denoising, and dimensionality reduction remain fundamental challenges in single-cell data analysis. While many machine learning tools aim to overcome these challenges by engineering model architectures, we use a different strategy, building on the insight that optimized mini-batch sampling during training can profoundly influence learning outcomes. We present CONCORD, a self-supervised learning approach that implements a unified, probabilistic data sampling scheme combining neighborhood-aware and dataset-aware sampling: the former enhances resolution, while the latter removes batch effects. Using only a minimalist one-hidden-layer neural network and contrastive learning, CONCORD achieves state-of-the-art performance without relying on deep architectures, auxiliary losses, or supervision. It generates high-resolution cell atlases that seamlessly integrate data across batches, technologies, and species, without relying on prior assumptions about data structure. The resulting latent representations are denoised, interpretable, and biologically meaningful: they capture gene co-expression programs, resolve subtle cellular states, and preserve both local geometric relationships and global topological organization. We demonstrate CONCORD's broad applicability across diverse datasets, establishing it as a general-purpose framework for learning unified, high-fidelity representations of cellular identity and dynamics.
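The abstract does not spell out the sampling scheme, so the following is only a minimal sketch of one plausible reading of "neighborhood-aware plus dataset-aware" mini-batch sampling, not CONCORD's actual implementation. It assumes that each mini-batch is drawn from a single dataset (so contrastive comparisons never cross batches) and that a fraction of the mini-batch is taken from the k-nearest neighbors of a random seed cell (so the model also sees hard, local contrasts). The function names `knn_indices` and `sample_minibatch` and the parameters `k` and `p_neighbor` are hypothetical.

```python
import numpy as np

def knn_indices(X, k):
    """Brute-force k nearest neighbours (squared Euclidean), excluding self."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)  # a cell is never its own neighbour
    return np.argsort(d2, axis=1)[:, :k]

def sample_minibatch(dataset_ids, knn, batch_size, p_neighbor, rng):
    """Draw one mini-batch of cell indices (illustrative assumption only)."""
    # Dataset-aware step: choose one dataset, weighted by its size, and
    # restrict the whole mini-batch to cells from that dataset.
    labels, counts = np.unique(dataset_ids, return_counts=True)
    chosen = rng.choice(labels, p=counts / counts.sum())
    pool = np.flatnonzero(dataset_ids == chosen)

    # Neighborhood-aware step: take part of the batch from the neighbours
    # of a random seed cell, keeping only neighbours in the chosen dataset.
    seed = rng.choice(pool)
    local = knn[seed][dataset_ids[knn[seed]] == chosen]
    n_local = min(int(round(p_neighbor * batch_size)), len(local))
    local_part = rng.permutation(local)[:n_local]

    # Fill the rest of the batch uniformly from the same dataset.
    remaining = np.setdiff1d(pool, local_part)
    n_global = min(batch_size - n_local, len(remaining))
    global_part = rng.permutation(remaining)[:n_global]
    return np.concatenate([local_part, global_part])

# Toy usage: two synthetic "datasets" separated by a constant shift.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 5)), rng.normal(3.0, 1.0, (200, 5))])
dataset_ids = np.repeat([0, 1], 200)
knn = knn_indices(X, k=10)
batch = sample_minibatch(dataset_ids, knn, batch_size=32, p_neighbor=0.25, rng=rng)
```

In this sketch, `p_neighbor` trades off local against global contrasts. Each returned index array would then be fed to an encoder trained with a contrastive loss, a step that is omitted here.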