Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA.
Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA.
BMC Bioinformatics. 2024 Sep 18;25(1):305. doi: 10.1186/s12859-024-05926-z.
Many approaches have been developed to overcome technical noise in single cell RNA-sequencing (scRNAseq). As researchers dig deeper into data (looking for rare cell types, subtleties of cell states, and details of gene regulatory networks), there is a growing need for algorithms with controllable accuracy and fewer ad hoc parameters and thresholds. Impeding this goal is the fact that an appropriate null distribution for scRNAseq cannot simply be extracted from data in which ground truth about biological variation is unknown, which is usually the case.
We approach this problem analytically, assuming that scRNAseq data reflect only cell heterogeneity (what we seek to characterize), transcriptional noise (temporal fluctuations randomly distributed across cells), and sampling error (i.e., Poisson noise). We analyze scRNAseq data without normalization (a step that skews distributions, particularly for sparse data) and calculate p values associated with key statistics. We develop an improved method for selecting features for cell clustering and for identifying gene-gene correlations, both positive and negative. Using simulated data, we show that this method, which we call BigSur (Basic Informatics and Gene Statistics from Unnormalized Reads), captures even weak yet significant correlation structures in scRNAseq data. Applying BigSur to data from a clonal human melanoma cell line, we identify thousands of correlations that, when clustered without supervision into gene communities, align with known cellular components and biological processes, and highlight potentially novel cell biological relationships.
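The general idea of working with unnormalized counts against a sampling-noise null can be illustrated with a minimal sketch. This is not the BigSur implementation: it assumes a pure Poisson null, ignores the transcriptional noise term, and does not compute p values. The function names (`pearson_residuals`, `fano_like`, `residual_correlation`) and all simulation parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def pearson_residuals(counts):
    """Residuals of raw counts against a Poisson expectation.

    Assumption: the expected count for gene j in cell i is
    (cell total * gene total) / grand total, with Poisson variance.
    """
    counts = np.asarray(counts, dtype=float)
    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    expected = cell_totals * gene_totals / counts.sum()
    return (counts - expected) / np.sqrt(expected)

def fano_like(counts):
    """Mean squared residual per gene; close to 1 under the Poisson null."""
    r = pearson_residuals(counts)
    return (r ** 2).mean(axis=0)

def residual_correlation(counts):
    """Gene-gene correlation computed on residuals rather than normalized data."""
    r = pearson_residuals(counts)
    r = r / np.sqrt((r ** 2).sum(axis=0, keepdims=True))
    return r.T @ r

# Simulated data (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n_cells, n_genes = 2000, 50
depth = rng.lognormal(mean=0.0, sigma=0.3, size=n_cells)  # per-cell sampling depth
gene_means = rng.uniform(1.0, 5.0, size=n_genes)
gene_means[:3] = 4.0

lam = depth[:, None] * gene_means[None, :]
group = rng.random(n_cells) < 0.5            # hidden cell heterogeneity
lam[:, 0] *= np.where(group, 1.6, 0.4)       # gene 0 follows the hidden state
lam[:, 1] *= np.where(group, 1.6, 0.4)       # gene 1 co-varies with gene 0
lam[:, 2] *= np.where(group, 0.4, 1.6)       # gene 2 varies in the opposite direction

counts = rng.poisson(lam)                    # sampling error only

fano = fano_like(counts)
corr = residual_correlation(counts)

print("median Fano-like statistic, null genes:", np.median(fano[3:]))
print("Fano-like statistic, heterogeneous genes:", fano[:3])
print("corr(gene0, gene1):", corr[0, 1])
print("corr(gene0, gene2):", corr[0, 2])
print("corr(gene10, gene11):", corr[10, 11])
```

In this sketch, genes that carry only sampling noise give a statistic near 1 and near-zero residual correlations, while the three genes tied to the hidden cell state stand out in both variability and in positive or negative correlation. A full method would replace these heuristic comparisons with p values derived from an explicit null distribution.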
New insights into functionally relevant gene regulatory networks can be obtained using a statistically grounded approach to the identification of gene-gene correlations.