Leveraging gene correlations in single cell transcriptomic data.

Authors

Silkwood Kai, Dollinger Emmanuel, Gervin Josh, Atwood Scott, Nie Qing, Lander Arthur D

Affiliations

Center for Complex Biological Systems, University of California, Irvine, Irvine CA.

Department of Developmental and Cell Biology, University of California, Irvine, Irvine CA.

Publication information

bioRxiv. 2023 Nov 1:2023.03.14.532643. doi: 10.1101/2023.03.14.532643.

DOI:10.1101/2023.03.14.532643

PMID:36993765

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10055147/

Abstract

BACKGROUND

Many approaches have been developed to overcome technical noise in single cell RNA-sequencing (scRNAseq). As researchers dig deeper into data, looking for rare cell types, subtleties of cell states, and details of gene regulatory networks, there is a growing need for algorithms with controllable accuracy and fewer parameters and thresholds. Impeding this goal is the fact that an appropriate null distribution for scRNAseq cannot simply be extracted from data when ground truth about biological variation is unknown (i.e., usually).

RESULTS

We approach this problem analytically, assuming that scRNAseq data reflect only cell heterogeneity (what we seek to characterize), transcriptional noise (temporal fluctuations randomly distributed across cells), and sampling error (i.e., Poisson noise). We analyze scRNAseq data without normalization (a step that skews distributions, particularly for sparse data) and calculate p-values associated with key statistics. We develop an improved method for selecting features for cell clustering and identifying gene-gene correlations, both positive and negative. Using simulated data, we show that this method, which we call BigSur (Basic Informatics and Gene Statistics from Unnormalized Reads), captures even weak yet significant correlation structures in scRNAseq data. Applying BigSur to data from a clonal human melanoma cell line, we identify thousands of correlations that, when clustered without supervision into gene communities, align with known cellular components and biological processes, and highlight potentially novel cell biological relationships.

CONCLUSIONS

New insights into functionally relevant gene regulatory networks can be obtained using a statistically grounded approach to the identification of gene-gene correlations.
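
The noise model described in the Results can be made concrete with a small simulation. The following is a minimal sketch, not the BigSur implementation: it assumes pure Poisson sampling noise and omits the transcriptional-noise term the abstract mentions. The function names `poisson`, `pearson_residuals`, and `fano_like`, and all simulation parameters, are illustrative assumptions. It computes Pearson residuals directly from raw counts, with expected values built from gene and cell totals rather than from normalized data. The variance-to-mean style statistic should stay near 1 for genes with no biological heterogeneity and rise well above 1 for a gene whose expression differs between two groups of cells.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson variate with mean lam (Knuth's method; fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def pearson_residuals(counts):
    """Residuals (x - mu) / sqrt(mu) for raw counts[gene][cell].

    mu = gene_total * cell_total / grand_total, i.e. the expectation if a gene
    were expressed uniformly and cells differed only in sequencing depth.
    """
    gene_totals = [sum(row) for row in counts]
    cell_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(gene_totals)
    residuals = []
    for row, g_tot in zip(counts, gene_totals):
        res_row = []
        for x, c_tot in zip(row, cell_totals):
            mu = g_tot * c_tot / grand_total
            res_row.append((x - mu) / math.sqrt(mu) if mu > 0 else 0.0)
        residuals.append(res_row)
    return residuals


def fano_like(residual_row):
    """Mean squared Pearson residual; close to 1 under pure Poisson sampling."""
    return sum(r * r for r in residual_row) / len(residual_row)


rng = random.Random(0)
n_cells, n_null_genes = 500, 50

# Hypothetical per-cell depth factors (cells are sequenced to different depths).
depth = [rng.uniform(0.5, 1.5) for _ in range(n_cells)]

# Null genes: same underlying expression in every cell, Poisson sampling only.
counts = []
for _ in range(n_null_genes):
    mean = rng.uniform(0.5, 3.0)
    counts.append([poisson(rng, mean * d) for d in depth])

# One heterogeneous gene: four-fold higher expression in the first half of the cells.
counts.append(
    [poisson(rng, (8.0 if c < n_cells // 2 else 2.0) * depth[c]) for c in range(n_cells)]
)

residuals = pearson_residuals(counts)
null_mean_fano = sum(fano_like(r) for r in residuals[:n_null_genes]) / n_null_genes
hetero_fano = fano_like(residuals[-1])

print(f"mean statistic, null genes:    {null_mean_fano:.2f}")
print(f"statistic, heterogeneous gene: {hetero_fano:.2f}")
```

Genes whose statistic clearly exceeds 1 would be candidates for clustering features. Assigning p-values, as the abstract describes, requires a fuller null model than this sketch provides.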
