Department of Biostatistics, Harvard T.H. Chan School of Public Health, Building 2 435, 655 Huntington Ave, Boston, MA 02115, United States.
Department of Statistics and Data Science, Carnegie Mellon University, Baker Hall 228B, 4909 Frew St, Pittsburgh, PA 15213, United States.
Biostatistics. 2024 Oct 1;25(4):1254-1272. doi: 10.1093/biostatistics/kxae010.
CRISPR genome engineering and single-cell RNA sequencing have accelerated biological discovery. Single-cell CRISPR screens unite these two technologies, linking genetic perturbations in individual cells to changes in gene expression and illuminating the regulatory networks underlying diseases. Despite their promise, single-cell CRISPR screens present considerable statistical challenges. We demonstrate through theoretical and real data analyses that a standard method for estimation and inference in single-cell CRISPR screens, "thresholded regression", exhibits attenuation bias and a bias-variance tradeoff as a function of an intrinsic, challenging-to-select tuning parameter. To overcome these difficulties, we introduce GLM-EIV ("GLM-based errors-in-variables"), a new method for single-cell CRISPR screen analysis. GLM-EIV extends the classical errors-in-variables model to responses and noisy predictors that are exponential family-distributed and potentially impacted by the same set of confounding variables. We develop a computational infrastructure to deploy GLM-EIV across hundreds of processors on clouds (e.g., Microsoft Azure) and high-performance clusters. Leveraging this infrastructure, we apply GLM-EIV to analyze two recent, large-scale, single-cell CRISPR screen datasets, yielding several new insights.
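The attenuation bias of thresholded regression can be illustrated with a small simulation. The sketch below is not GLM-EIV and is not taken from the paper. It assumes a simplified model with no confounders, in which a latent binary perturbation shifts the Poisson means of both gene expression and gRNA counts. The function names (`simulate_screen`, `log_fold_change`) and all parameter values are hypothetical choices made for illustration. With a single binary predictor, the Poisson regression estimate of the log fold change equals the log ratio of group means, so no model-fitting library is needed.

```python
import numpy as np


def simulate_screen(n_cells, beta1, rng, pi=0.1, beta0=2.0,
                    gamma0=0.0, gamma1=np.log(5.0)):
    """Simulate a toy single-cell CRISPR screen (hypothetical model, no confounders).

    perturbed  : latent Bernoulli(pi) perturbation indicator (unobserved in practice)
    expression : Poisson gene expression with log-mean beta0 + beta1 * perturbed
    grna       : Poisson gRNA count with log-mean gamma0 + gamma1 * perturbed
    """
    perturbed = rng.random(n_cells) < pi
    expression = rng.poisson(np.exp(beta0 + beta1 * perturbed))
    grna = rng.poisson(np.exp(gamma0 + gamma1 * perturbed))
    return perturbed, expression, grna


def log_fold_change(expression, indicator):
    """Poisson-regression estimate of the effect for one binary predictor.

    For a single binary covariate this is the log ratio of group means.
    """
    return float(np.log(expression[indicator].mean()
                        / expression[~indicator].mean()))


rng = np.random.default_rng(0)
true_beta1 = -1.0  # assumed true log fold change
perturbed, expression, grna = simulate_screen(200_000, true_beta1, rng)

# Oracle estimate: uses the latent indicator, which real data do not provide.
print("oracle estimate:", log_fold_change(expression, perturbed))

# Thresholded regression: impute the indicator as (gRNA count >= threshold).
for threshold in (1, 3, 5, 8):
    imputed = grna >= threshold
    estimate = log_fold_change(expression, imputed)
    print(f"threshold {threshold}: estimate {estimate:.3f}, "
          f"cells called perturbed: {int(imputed.sum())}")
```

Under these assumptions, low thresholds misclassify many unperturbed cells as perturbed, which pulls the estimate toward zero. Higher thresholds reduce that bias but leave fewer cells in the perturbed group, which increases variance. This is the tradeoff that makes the threshold difficult to select.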