Wang Zexuan, Zhan Qipeng, Yang Shu, Mu Shizhuo, Chen Jiong, Garai Sumita, Orzechowski Patryk, Wagenaar Joost, Shen Li
Graduate Group in Applied Mathematics and Computational Science, University of Pennsylvania.
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania.
bioRxiv. 2024 Feb 6:2024.02.06.578032. doi: 10.1101/2024.02.06.578032.
Single-cell technologies have emerged as transformative tools, enabling high-dimensional characterization of cell populations at an unprecedented scale. The complexity and sheer volume of the resulting data pose significant computational and analytical challenges, especially in comparative studies that delineate cellular architectures across biological conditions (i.e., the generation of sample-level distance matrices). Optimal Transport (OT) is a mathematical tool that captures the intrinsic geometric structure of data and has been applied to many bioinformatics tasks. In this paper, we propose QOT (Quantized Optimal Transport), a new method that uses a quantization step to enable efficient computation of sample-level distance matrices from large-scale single-cell omics data. We apply our algorithm to real-world single-cell genomics and pathomics datasets, aiming to extrapolate cell-level insights to inform sample-level categorizations. Our empirical study shows that QOT outperforms other OT-based algorithms in accuracy and robustness when obtaining a sample-level distance matrix from high-throughput single-cell measurements. Moreover, the sample-level distance matrix can be used in downstream analyses (e.g., uncovering the trajectory of disease progression), highlighting its usefulness in biomedical informatics and data science.
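The general idea of quantizing each sample before running OT can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which quantizer or OT solver QOT uses, so plain k-means and entropy-regularized Sinkhorn iterations stand in here as assumptions. All function names (`quantize`, `sinkhorn_cost`, `make_sample`), the parameter values, and the synthetic two-dimensional "cells" are hypothetical.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def assign(cells, centroids):
    """Group cells by their nearest centroid."""
    groups = [[] for _ in centroids]
    for c in cells:
        nearest = min(range(len(centroids)), key=lambda i: sq_dist(c, centroids[i]))
        groups[nearest].append(c)
    return groups


def quantize(cells, k, iters=20, seed=0):
    """Summarise one sample as k weighted centroids (assumed quantizer: k-means)."""
    centroids = random.Random(seed).sample(cells, k)
    for _ in range(iters):
        groups = assign(cells, centroids)
        # An empty cluster keeps its previous centroid.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    weights = [len(g) / len(cells) for g in assign(cells, centroids)]
    return centroids, weights


def sinkhorn_cost(xa, wa, xb, wb, reg=0.05, iters=500):
    """Transport cost between two weighted point sets (assumed solver: Sinkhorn)."""
    cost = [[sq_dist(p, q) for q in xb] for p in xa]
    scale = max(max(row) for row in cost) or 1.0  # keeps exp() well-conditioned
    kernel = [[math.exp(-c / (reg * scale)) for c in row] for row in cost]
    na, nb = len(wa), len(wb)
    v = [1.0] * nb
    for _ in range(iters):
        u = [wa[i] / sum(kernel[i][j] * v[j] for j in range(nb)) for i in range(na)]
        v = [wb[j] / sum(kernel[i][j] * u[i] for i in range(na)) for j in range(nb)]
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j] for i in range(na) for j in range(nb)
    )


def make_sample(centers, n_cells, seed):
    """Synthetic sample: 2-D 'cells' drawn around the given population centers."""
    rng = random.Random(seed)
    return [
        tuple(rng.gauss(m, 0.5) for m in rng.choice(centers)) for _ in range(n_cells)
    ]


# Samples 0 and 1 share the same cell populations; sample 2 is shifted.
samples = [
    make_sample([(0, 0), (5, 5)], 200, seed=1),
    make_sample([(0, 0), (5, 5)], 200, seed=2),
    make_sample([(10, 10), (15, 15)], 200, seed=3),
]

# Quantize once per sample, then run OT only on the small summaries.
quantized = [quantize(s, k=4) for s in samples]
n = len(samples)
D = [[0.0] * n for _ in range(n)]  # diagonal left at zero by convention
for i in range(n):
    for j in range(i + 1, n):
        D[i][j] = D[j][i] = sinkhorn_cost(*quantized[i], *quantized[j])

for row in D:
    print(["%.2f" % d for d in row])
```

In this sketch each OT problem is solved on 4 centroids per sample rather than 200 cells, so the pairwise step no longer scales with the number of cells. The resulting matrix `D` could then be passed to clustering or embedding methods for sample-level analysis.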