Department of Biostatistics, University of Florida, Gainesville, Florida, United States of America.
Department of Human Genetics, Emory University, Atlanta, Georgia, United States of America.
PLoS Comput Biol. 2024 Aug 2;20(8):e1011854. doi: 10.1371/journal.pcbi.1011854. eCollection 2024 Aug.
Single-cell ATAC-seq (scATAC-seq) data have been widely used to investigate chromatin accessibility at the single-cell level. One important application of scATAC-seq data analysis is differential chromatin accessibility (DA) analysis. However, the characteristics of scATAC-seq data, such as excessive zeros and large variability of chromatin accessibility across cells, pose a unique challenge for DA analysis. Existing statistical methods focus on detecting differences in the mean accessibility of chromatin regions while overlooking differences in distribution. Motivated by real-data exploration showing that distribution differences exist among cell types, we introduce a novel composite statistical test named "scaDA", which is based on the zero-inflated negative binomial (ZINB) model and performs differential distribution analysis of chromatin accessibility by jointly testing abundance, prevalence and dispersion. Benefiting from both dispersion shrinkage and iterative refinement of the mean and prevalence parameter estimates, scaDA outperforms both ZINB-based likelihood ratio tests and published methods, achieving the highest power and best FDR control in a comprehensive simulation study. In addition to demonstrating the highest power in three real sc-multiome data analyses, scaDA successfully identifies differentially accessible regions in microglia from sc-multiome data for an Alzheimer's disease (AD) study; these regions are the most enriched in GO terms related to neurogenesis and the clinical phenotype of AD, and in AD-associated GWAS SNPs.
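For readers unfamiliar with the ZINB model, the composite null hypothesis described above can be written out as a sketch. The notation here is an assumption, not taken from the abstract: $\pi$ denotes the zero-inflation probability (so prevalence corresponds to $1-\pi$), $\mu$ the negative binomial mean (abundance), $\phi$ the dispersion, and subscripts 1 and 2 the two cell groups being compared. The statistic shown is the generic ZINB likelihood ratio test, which the abstract names as a baseline; it does not include scaDA's dispersion shrinkage or iterative refinement.

```latex
% ZINB probability mass function for a count Y at one accessible region
% (assumed notation: \pi = zero-inflation probability, \mu = mean, \phi = dispersion)
\[
P(Y = y \mid \pi, \mu, \phi)
  = \pi \,\mathbf{1}\{y = 0\}
  + (1 - \pi)\,
    \frac{\Gamma\!\left(y + \phi^{-1}\right)}{\Gamma\!\left(\phi^{-1}\right)\, y!}
    \left(\frac{1}{1 + \phi\mu}\right)^{\phi^{-1}}
    \left(\frac{\phi\mu}{1 + \phi\mu}\right)^{y},
  \qquad y = 0, 1, 2, \dots
\]

% Composite null: the two groups share all three parameters
\[
H_0:\; \mu_1 = \mu_2,\quad \pi_1 = \pi_2,\quad \phi_1 = \phi_2
\qquad \text{vs.} \qquad
H_1:\; \text{at least one equality fails.}
\]

% Generic likelihood ratio statistic, with three constraints under H_0
\[
\Lambda = 2\left[\ell_1(\hat\pi_1, \hat\mu_1, \hat\phi_1)
               + \ell_2(\hat\pi_2, \hat\mu_2, \hat\phi_2)
               - \ell_{0}(\hat\pi_0, \hat\mu_0, \hat\phi_0)\right]
\;\overset{\text{approx.}}{\sim}\; \chi^2_{3} \quad \text{under } H_0 .
\]
```

Here $\ell_1$ and $\ell_2$ are the ZINB log-likelihoods maximized separately within each group, and $\ell_0$ is the log-likelihood of the pooled data under shared parameters. A test of the mean alone would constrain only $\mu_1 = \mu_2$, which is why it can miss regions whose distributions differ in prevalence or dispersion.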