Zhang Yuqing, Parmigiani Giovanni, Johnson W Evan
Department of Bioinformatics and Clinical Data Science, Gilead Sciences, Inc., 333 Lakeside Dr, Foster City, CA 94404, USA.
Department of Data Sciences, Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA 02215, USA.
NAR Genom Bioinform. 2020 Sep;2(3):lqaa078. doi: 10.1093/nargab/lqaa078. Epub 2020 Sep 21.
Integrating batches of genomic data can increase statistical power, but this benefit is often hindered by batch effects: unwanted variation in the data caused by differences in technical factors across batches. Addressing batch effects effectively is therefore critical. Many existing methods for batch effect adjustment assume the data follow a continuous, bell-shaped Gaussian distribution. However, in RNA-seq studies the data are typically skewed, over-dispersed counts, so this assumption is not appropriate and may lead to erroneous results. Negative binomial regression models have been used previously to better capture the properties of counts. We developed a batch correction method, ComBat-seq, using a negative binomial regression model that retains the integer nature of count data in RNA-seq studies, making the batch-adjusted data compatible with common differential expression software packages that require integer counts. We show in realistic simulations that ComBat-seq adjusted data result in better statistical power and control of false positives in differential expression analysis compared to data adjusted by other available methods. We further demonstrate in a real data example that ComBat-seq successfully removes batch effects and recovers the biological signal in the data.
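The contrast drawn above between Gaussian-based adjustment and count data can be illustrated with a small simulation. The following is a minimal sketch in Python and is not the ComBat-seq implementation: the function names `simulate_nb_counts` and `location_scale_adjust` are hypothetical, the means and dispersion are arbitrary illustrative values, and the simple mean and standard deviation matching stands in for a generic Gaussian-style correction. It shows that negative binomial counts have variance well above their mean, and that a location-scale adjustment returns non-integer values.

```python
import numpy as np


def simulate_nb_counts(mean, dispersion, size, rng):
    """Draw negative binomial counts with variance = mean + dispersion * mean**2."""
    n = 1.0 / dispersion          # NumPy's "number of successes" parameter
    p = n / (n + mean)            # success probability that yields the requested mean
    return rng.negative_binomial(n, p, size=size)


def location_scale_adjust(batch, reference):
    """Gaussian-style adjustment (illustrative assumption): match the mean and
    standard deviation of `batch` to those of `reference`."""
    z = (batch - batch.mean()) / batch.std()
    return z * reference.std() + reference.mean()


rng = np.random.default_rng(0)

# Two hypothetical batches measuring the same gene, differing only in mean level.
batch1 = simulate_nb_counts(mean=100, dispersion=0.2, size=5000, rng=rng)
batch2 = simulate_nb_counts(mean=250, dispersion=0.2, size=5000, rng=rng)

# Over-dispersion: the variance is far larger than the mean (a Poisson model would have them equal).
print("batch1 mean:", batch1.mean(), "variance:", batch1.var())

# The Gaussian-style adjustment aligns the batches but leaves non-integer values.
adjusted = location_scale_adjust(batch2.astype(float), batch1.astype(float))
print("first adjusted values:", adjusted[:5])
```

Values adjusted this way would need rounding or another transformation before being passed to software that expects integer counts, which is the compatibility issue a count-preserving adjustment avoids.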