Genomics and Computational Biology Graduate Program, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA, 19104, USA.
Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA, 19104, USA.
Gigascience. 2020 Nov 3;9(11). doi: 10.1093/gigascience/giaa117.
In the past two decades, scientists in different laboratories have assayed gene expression from millions of samples. These experiments can be combined into compendia and analyzed collectively to extract novel biological patterns. Technical variability, or "batch effects," may result from combining samples collected and processed at different times and in different settings. Such variability can compromise our ability to extract the true underlying biological patterns. As more integrative analysis methods arise and data collections grow larger, we must determine how technical variability affects our ability to detect desired patterns when many experiments are combined.
To determine the extent to which an underlying signal is masked by technical variability, we simulated compendia comprising data aggregated across multiple experiments.
We developed a generative multi-layer neural network to simulate compendia of gene expression experiments from large-scale microbial and human datasets. We compared simulated compendia before and after introducing varying numbers of sources of undesired variability.
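The abstract does not detail the simulator, so the following is only a loose, self-contained sketch of the general idea: sample a latent space, decode the samples into expression space, and then layer experiment-specific shifts on top. The Python toy below stands in a fixed random linear map for the trained generative network; the dimensions and the noise_scale parameter are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_genes, latent_dim = 500, 10
samples_per_exp = 40

# Stand-in for a trained generative model's decoder: a fixed random
# linear map from latent space to gene-expression space (illustrative).
decoder = rng.normal(size=(latent_dim, n_genes))

def simulate_compendium(n_experiments, noise_scale=2.0):
    """Sample latent vectors, decode them to expression profiles, then
    add one experiment-specific shift per experiment to mimic a
    'source of undesired variability'."""
    blocks, labels = [], []
    for exp in range(n_experiments):
        z = rng.normal(size=(samples_per_exp, latent_dim))
        expression = z @ decoder                        # baseline biological signal
        shift = noise_scale * rng.normal(size=n_genes)  # technical variability
        blocks.append(expression + shift)
        labels.append(np.full(samples_per_exp, exp))
    return np.vstack(blocks), np.concatenate(labels)

compendium, experiment_ids = simulate_compendium(n_experiments=5)
print(compendium.shape)  # (200, 500): 5 experiments x 40 samples, 500 genes
```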
The signal from a baseline compendium was obscured when the number of added sources of variability was small. Applying statistical correction methods rescued the underlying signal in these cases. However, as the number of sources of variability increased, it became easier to detect the original signal even without correction. In fact, statistical correction reduced our power to detect the underlying signal.
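The abstract names neither the correction methods nor the similarity measure, so the sketch below only illustrates the shape of the comparison protocol: simulate a baseline compendium, add experiment-specific shifts, correct, and compare each version back to the baseline. Per-experiment mean-centering is used as a stand-in for real correction tools (it is a stripped-down analogue of what tools such as limma's removeBatchEffect do), and mean per-gene correlation as a stand-in for the paper's similarity analysis; none of this should be read as the authors' actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, samples_per_exp = 500, 40

def add_experiment_shifts(baseline, ids, noise_scale=2.0):
    """Add a distinct technical shift to each experiment's samples."""
    noisy = baseline.copy()
    for exp in np.unique(ids):
        noisy[ids == exp] += noise_scale * rng.normal(size=baseline.shape[1])
    return noisy

def mean_center_by_experiment(data, ids):
    """Simplest member of the batch-correction family: remove each
    experiment's mean expression profile."""
    corrected = data.copy()
    for exp in np.unique(ids):
        corrected[ids == exp] -= corrected[ids == exp].mean(axis=0)
    return corrected

def similarity(a, b):
    """Mean per-gene Pearson correlation between two compendia --
    a crude proxy for a compendium-level similarity analysis."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    num = (a * b).sum(axis=0)
    denom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    return float(np.mean(num / denom))

# Vary the number of experiments (i.e., sources of variability) and
# compare uncorrected vs. corrected compendia against the baseline.
for n_experiments in (2, 10, 50):
    ids = np.repeat(np.arange(n_experiments), samples_per_exp)
    baseline = rng.normal(size=(ids.size, n_genes))
    noisy = add_experiment_shifts(baseline, ids)
    print(f"{n_experiments:>3} experiments | "
          f"uncorrected: {similarity(baseline, noisy):.3f} | "
          f"corrected: {similarity(baseline, mean_center_by_experiment(noisy, ids)):.3f}")
```

Note that mean-centering removes each experiment's share of the real signal along with the technical shift, which illustrates how aggressive correction can also discard underlying patterns, consistent with the loss of power described above.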
When combining a modest number of experiments, it is best to correct for experiment-specific noise. However, when many experiments are combined, statistical correction reduces our ability to extract underlying patterns.