Adamer Michael F, Brüningk Sarah C, Tejada-Arranz Alejandro, Estermann Fabienne, Basler Marek, Borgwardt Karsten
Department of Biosystems Science and Engineering, ETH Zurich, Basel 4058, Switzerland.
Swiss Institute for Bioinformatics (SIB), Lausanne 1015, Switzerland.
Bioinform Adv. 2022 Oct 6;2(1):vbac071. doi: 10.1093/bioadv/vbac071. eCollection 2022.
Omics data produced worldwide under vastly different experimental conditions are accumulating steadily in public databases, making data integration a crucial step in many data-driven bioinformatics applications. The challenge of batch-effect removal for entire databases lies in the large number of batches and in biological variation, which together can make the design matrix singular. No common batch-correction algorithm currently solves this problem satisfactorily.
We present reComBat, a regularized version of the empirical Bayes method, to overcome this limitation, and benchmark it against popular approaches for the harmonization of public gene-expression data (both microarray and bulk RNA-seq) of a human opportunistic pathogen. Batch effects are successfully mitigated while biologically meaningful gene-expression variation is retained. reComBat fills the gap in batch-correction approaches applicable to large-scale, public omics databases and opens up new avenues for data-driven analysis of complex biological processes beyond the scope of a single study.
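The singularity problem and the effect of regularization can be illustrated with a minimal sketch. The toy data, the variable names, and the choice of a ridge penalty are all assumptions made for illustration; this is not the package's implementation, and the penalty used here is only one simple form of regularization.

```python
import numpy as np

# Hypothetical toy data: 6 samples in 3 batches, with one biological
# condition that occurs only in batch 2 (fully confounded with that batch).
batch = np.array([0, 0, 1, 1, 2, 2])
condition = np.array([0, 0, 0, 0, 1, 1])

B = np.eye(3)[batch]                  # one-hot batch indicators, shape (6, 3)
X = np.column_stack([B, condition])   # design matrix, shape (6, 4)

# The condition column duplicates the batch-2 column, so X is rank deficient
# and X^T X cannot be inverted for an ordinary least-squares fit.
rank = np.linalg.matrix_rank(X)
is_singular = rank < X.shape[1]

# Expression values of one hypothetical gene.
rng = np.random.default_rng(0)
y = rng.normal(size=6)

# Ridge regularization (an illustrative assumption): adding lam * I makes
# the system positive definite, so a unique solution exists.
lam = 1.0  # arbitrary penalty strength
gram = X.T @ X
beta_ridge = np.linalg.solve(gram + lam * np.eye(X.shape[1]), X.T @ y)

print("rank:", rank, "of", X.shape[1], "columns; singular:", is_singular)
print("ridge coefficients:", beta_ridge)
```

In this sketch the penalty splits the shared signal evenly between the two identical columns rather than failing outright, which is why a regularized fit remains well defined when batches and biological covariates are confounded.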
The code is available at https://github.com/BorgwardtLab/reComBat; all data and evaluation code can be found at https://github.com/BorgwardtLab/batchCorrectionPublicData.
Supplementary data are available online.