Wang Zexuan, Zhan Qipeng, Yang Shu, Mu Shizhuo, Chen Jiong, Garai Sumita, Orzechowski Patryk, Wagenaar Joost, Shen Li
Graduate Group in Applied Mathematics and Computational Science, University of Pennsylvania, Philadelphia, PA 19104, United States.
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, United States.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae713.
Single-cell technologies have enabled the high-dimensional characterization of cell populations at an unprecedented scale. The innate complexity and increasing volume of these data pose significant computational and analytical challenges, especially in comparative studies that delineate cellular architectures across biological conditions (i.e., the generation of sample-level distance matrices). Optimal Transport (OT) is a mathematical tool that captures the intrinsic geometric structure of data and has been applied to many bioinformatics tasks. In this paper, we propose QOT (Quantized Optimal Transport), a new method that uses a quantization step to efficiently compute a sample-level distance matrix from large-scale single-cell omics data. We apply our algorithm to real-world single-cell genomics and pathomics datasets, aiming to extrapolate cell-level insights to inform sample-level categorizations. Our empirical study shows that QOT outperforms two existing OT-based algorithms in accuracy and robustness when deriving a sample-level distance matrix from high-throughput single-cell measurements. Moreover, the sample-level distance matrix can be used in downstream analysis (e.g., uncovering the trajectory of disease progression), highlighting its utility in biomedical informatics and data science.
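The abstract does not specify how the quantization step or the transport problem is solved, so the following is only a minimal sketch of the general idea of quantizing each sample before comparing samples with OT. It assumes k-means quantization and an entropic (Sinkhorn) approximation of the OT cost; both are stand-ins chosen for illustration, not the authors' confirmed method, and all function names and parameters (`quantize`, `sinkhorn_cost`, `sample_distance_matrix`, `k`, `eps`) are hypothetical.

```python
import numpy as np


def quantize(cells, k, rng, n_iter=20):
    """Summarize one sample's cells as at most k weighted centroids (Lloyd's k-means).

    Assumption: k-means is used here as a simple quantizer for illustration.
    """
    centroids = cells[rng.choice(len(cells), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((cells[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = cells[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    # Final assignment gives each centroid's share of the sample's cells.
    d2 = ((cells[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    weights = np.bincount(labels, minlength=k) / len(cells)
    keep = weights > 0  # drop empty clusters so all weights are positive
    return centroids[keep], weights[keep]


def sinkhorn_cost(a, x, b, y, eps=0.05, n_iter=500):
    """Entropic approximation of the OT cost between two weighted point sets.

    Ground cost is squared Euclidean distance; eps is relative to the largest cost.
    """
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    scale = cost.max()
    if scale == 0:
        return 0.0
    kernel = np.exp(-cost / (scale * eps))
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum())


def sample_distance_matrix(samples, k=8, seed=0):
    """Pairwise approximate OT costs between quantized samples."""
    rng = np.random.default_rng(seed)
    summaries = [quantize(np.asarray(s, dtype=float), k, rng) for s in samples]
    n = len(samples)
    D = np.zeros((n, n))  # diagonal is left at zero by construction
    for i in range(n):
        for j in range(i + 1, n):
            (xi, wi), (xj, wj) = summaries[i], summaries[j]
            D[i, j] = D[j, i] = sinkhorn_cost(wi, xi, wj, xj)
    return D


# Synthetic data (not from the paper): two samples per simulated condition.
rng = np.random.default_rng(42)
samples = [rng.normal(loc=m, size=(300, 5)) for m in (0.0, 0.0, 3.0, 3.0)]
D = sample_distance_matrix(samples)
print(np.round(D, 2))
```

The point of quantizing first is that each pairwise OT problem is solved over at most `k` centroids rather than over every cell, so the cost of building the matrix depends on `k` and the number of samples instead of the raw cell counts. On the synthetic data above, samples drawn from the same simulated condition should end up closer to each other than to samples from the other condition.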