Genomics and Computational Biology Graduate Program, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA, 19104, USA.
Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA, 19104, USA.
Gigascience. 2020 Nov 3;9(11). doi: 10.1093/gigascience/giaa117.
In the past two decades, scientists in different laboratories have assayed gene expression from millions of samples. These experiments can be combined into compendia and analyzed collectively to extract novel biological patterns. Technical variability, or "batch effects," may result from combining samples collected and processed at different times and in different settings. Such variability can compromise our ability to extract the true underlying biological patterns. As more integrative analysis methods arise and data collections grow larger, we must determine how technical variability affects our ability to detect desired patterns when many experiments are combined.
To determine the extent to which an underlying signal is masked by technical variability, we simulated compendia comprising data aggregated across multiple experiments.
We developed a generative multi-layer neural network to simulate compendia of gene expression experiments from large-scale microbial and human datasets. We compared simulated compendia before and after introducing varying numbers of sources of undesired variability.
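The abstract does not detail the simulator, so the following is only a loose, self-contained sketch of the general idea: sample a latent space, decode the samples into expression space, and then layer experiment-specific shifts on top. The Python toy below stands in a fixed random linear map for the trained generative network; the dimensions and the noise_scale parameter are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_genes, latent_dim = 500, 10
samples_per_exp = 40

# Stand-in for a trained generative model's decoder: a fixed random
# linear map from latent space to gene-expression space (illustrative).
decoder = rng.normal(size=(latent_dim, n_genes))

def simulate_compendium(n_experiments, noise_scale=2.0):
    """Sample latent vectors, decode them to expression profiles, then
    add one experiment-specific shift per experiment to mimic a
    'source of undesired variability'."""
    blocks, labels = [], []
    for exp in range(n_experiments):
        z = rng.normal(size=(samples_per_exp, latent_dim))
        expression = z @ decoder                        # baseline biological signal
        shift = noise_scale * rng.normal(size=n_genes)  # technical variability
        blocks.append(expression + shift)
        labels.append(np.full(samples_per_exp, exp))
    return np.vstack(blocks), np.concatenate(labels)

compendium, experiment_ids = simulate_compendium(n_experiments=5)
print(compendium.shape)  # (200, 500): 5 experiments x 40 samples, 500 genes
```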
The signal from a baseline compendium was obscured when the number of added sources of variability was small. Applying statistical correction methods rescued the underlying signal in these cases. However, as the number of sources of variability increased, it became easier to detect the original signal even without correction. In fact, statistical correction reduced our power to detect the underlying signal.
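The abstract names neither the correction methods nor the similarity measure, so the sketch below only illustrates the shape of the comparison protocol: simulate a baseline compendium, add experiment-specific shifts, correct, and compare each version back to the baseline. Per-experiment mean-centering is used as a stand-in for real correction tools (it is a stripped-down analogue of what tools such as limma's removeBatchEffect do), and mean per-gene correlation as a stand-in for the paper's similarity analysis; none of this should be read as the authors' actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, samples_per_exp = 500, 40

def add_experiment_shifts(baseline, ids, noise_scale=2.0):
    """Add a distinct technical shift to each experiment's samples."""
    noisy = baseline.copy()
    for exp in np.unique(ids):
        noisy[ids == exp] += noise_scale * rng.normal(size=baseline.shape[1])
    return noisy

def mean_center_by_experiment(data, ids):
    """Simplest member of the batch-correction family: remove each
    experiment's mean expression profile."""
    corrected = data.copy()
    for exp in np.unique(ids):
        corrected[ids == exp] -= corrected[ids == exp].mean(axis=0)
    return corrected

def similarity(a, b):
    """Mean per-gene Pearson correlation between two compendia --
    a crude proxy for a compendium-level similarity analysis."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    num = (a * b).sum(axis=0)
    denom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    return float(np.mean(num / denom))

# Vary the number of experiments (i.e., sources of variability) and
# compare uncorrected vs. corrected compendia against the baseline.
for n_experiments in (2, 10, 50):
    ids = np.repeat(np.arange(n_experiments), samples_per_exp)
    baseline = rng.normal(size=(ids.size, n_genes))
    noisy = add_experiment_shifts(baseline, ids)
    print(f"{n_experiments:>3} experiments | "
          f"uncorrected: {similarity(baseline, noisy):.3f} | "
          f"corrected: {similarity(baseline, mean_center_by_experiment(noisy, ids)):.3f}")
```

Note that mean-centering removes each experiment's share of the real signal along with the technical shift, which illustrates how aggressive correction can also discard underlying patterns, consistent with the loss of power described above.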
When combining a modest number of experiments, it is best to correct for experiment-specific noise. However, when many experiments are combined, statistical correction reduces our ability to extract underlying patterns.