Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States.
Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, United States.
Bioinformatics. 2024 Sep 2;40(9). doi: 10.1093/bioinformatics/btae531.
Systems biology analyses often use correlations in gene expression profiles to infer co-expression networks, which are then used as input for gene regulatory network inference or to identify functional modules of co-expressed or putatively co-regulated genes. While systematic biases, including batch effects, are known to induce spurious associations and confound differential gene expression (DE) analyses, the impact of batch effects on gene co-expression has not been fully explored. Methods have been developed to adjust expression values so that the mean and variance of each gene are conditionally independent of batch or other covariates, which improves the fidelity of DE analysis. However, such adjustments do not address the potential for spurious differential co-expression (DC) between groups. Consequently, uncorrected, artifactual DC can skew the correlation structure and lead to the identification of false, non-biological associations, even when the input data have been corrected using standard batch correction.
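The point that mean and variance adjustment leaves co-expression untouched can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes two hypothetical batches of two genes each, and uses per-batch standardization of each gene as a simplified stand-in for location and scale batch correction. After the adjustment, both genes have zero mean and unit variance in both batches, yet the gene-gene correlation still differs between batches.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # samples per batch (illustrative choice)


def simulate_batch(rho, shift, scale):
    """Draw two genes with correlation rho, then apply a batch shift and scale."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x = rng.multivariate_normal(np.zeros(2), cov, size=n)
    return x * scale + shift


def standardize(x):
    """Per-gene location/scale adjustment within one batch (simplified stand-in)."""
    return (x - x.mean(axis=0)) / x.std(axis=0)


# Batch A: genes are co-expressed; batch B: genes are independent.
batch_a = simulate_batch(rho=0.8, shift=2.0, scale=3.0)
batch_b = simulate_batch(rho=0.0, shift=-1.0, scale=0.5)

corrected_a = standardize(batch_a)
corrected_b = standardize(batch_b)

# Mean and variance now match across batches, but correlation does not.
corr_a = np.corrcoef(corrected_a, rowvar=False)[0, 1]
corr_b = np.corrcoef(corrected_b, rowvar=False)[0, 1]
print(f"correlation in batch A: {corr_a:.2f}, in batch B: {corr_b:.2f}")
```

Because standardization rescales each gene separately, it cannot change the correlation between genes, which is why the batch-specific difference survives the correction.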
In this work, we demonstrate the persistence of confounders in covariance after standard batch correction using synthetic and real-world gene expression data examples. We then introduce Co-expression Batch Reduction Adjustment (COBRA), a method for computing a batch-corrected gene co-expression matrix based on estimating a conditional covariance matrix. COBRA estimates a reduced set of parameters expressing the co-expression matrix as a function of the sample covariates, allowing control for continuous and categorical covariates. COBRA is computationally efficient, leveraging the inherently modular structure of genomic data to estimate accurate gene regulatory associations and facilitate functional analysis for high-dimensional genomic data.
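One way to express a co-expression matrix as a function of sample covariates with a reduced set of parameters is sketched below. This is an illustrative reading of the description above, not the netZoo implementation: the function names `covariate_covariance_components` and `covariance_for` are hypothetical, and the approach shown (fixing the eigenvectors of the pooled covariance and regressing the per-sample squared projections onto a design matrix) is an assumption about how such a model could be set up.

```python
import numpy as np


def covariate_covariance_components(expression, design):
    """Hypothetical helper, not the netZoo API.

    expression: genes x samples array.
    design: samples x covariates array, including an intercept column.
    Returns the shared eigenvectors and one row of eigenvalue
    coefficients per covariate.
    """
    centered = expression - expression.mean(axis=1, keepdims=True)
    # Left singular vectors are the eigenvectors of the pooled covariance.
    q_mat, _, _ = np.linalg.svd(centered, full_matrices=False)
    # Squared projection of every sample onto every eigenvector.
    proj_sq = (q_mat.T @ centered) ** 2          # components x samples
    # Least-squares fit of the squared projections on the covariates.
    psi = np.linalg.pinv(design) @ proj_sq.T     # covariates x components
    return q_mat, psi


def covariance_for(q_mat, psi, covariate_values):
    """Covariance implied by the model at a given covariate vector."""
    eigvals = covariate_values @ psi
    return (q_mat * eigvals) @ q_mat.T


# Toy data: genes 0 and 1 are co-expressed only in batch 0.
rng = np.random.default_rng(1)
batch = np.repeat([0, 1], 100)
shared = rng.normal(size=batch.size)
expression = rng.normal(size=(4, batch.size))
expression[0] += 2.0 * shared * (batch == 0)
expression[1] += 2.0 * shared * (batch == 0)

design = np.column_stack([np.ones(batch.size), batch])
q_mat, psi = covariate_covariance_components(expression, design)

cov_batch0 = covariance_for(q_mat, psi, np.array([1.0, 0.0]))
cov_batch1 = covariance_for(q_mat, psi, np.array([1.0, 1.0]))
pooled = covariance_for(q_mat, psi, design.mean(axis=0))
```

In this sketch only one coefficient per covariate and eigenvector is estimated, rather than a full gene-by-gene matrix per covariate. Evaluating the model at the average covariate vector recovers the pooled sample covariance, while evaluating it at a specific batch value gives a batch-conditional estimate.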
COBRA is available under the GPL-3 open source license in R and Python in netZoo (https://netzoo.github.io).