
Practical impacts of genomic data "cleaning" on biological discovery using surrogate variable analysis.

Authors

Jaffe Andrew E, Hyde Thomas, Kleinman Joel, Weinberger Daniel R, Chenoweth Joshua G, McKay Ronald D, Leek Jeffrey T, Colantuoni Carlo

Affiliations

Lieber Institute for Brain Development, 855 N Wolfe St, Ste 300, Baltimore, MD, 21205, USA.

Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N Wolfe St, Baltimore, MD, 21205, USA.

Publication information

BMC Bioinformatics. 2015 Nov 6;16:372. doi: 10.1186/s12859-015-0808-5.

DOI:10.1186/s12859-015-0808-5
PMID:26545828
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4636836/
Abstract

BACKGROUND

Genomic data production is at its highest level and continues to increase, making available novel primary data and existing public data to researchers for exploration. Here we explore the consequences of "batch" correction for biological discovery in two publicly available expression datasets. We consider this to include the estimation of and adjustment for wide-spread systematic heterogeneity in genomic measurements that is unrelated to the effects under study, whether it be technical or biological in nature.

METHODS

We present three illustrative data analyses using surrogate variable analysis (SVA) and describe how to perform artifact discovery in light of natural heterogeneity within biological groups, secondary biological questions of interest, and non-linear treatment effects in a dataset profiling differentiating pluripotent cells (GSE32923) and another from human brain tissue (GSE30272).

RESULTS

Careful specification of biological effects of interest is very important to factor-based approaches like SVA. We demonstrate greatly sharpened global and gene-specific differential expression across treatment groups in stem cell systems. Similarly, we demonstrate how to preserve major non-linear effects of age across the lifespan in the brain dataset. However, the gains in precisely defining known effects of interest come at the cost of much other information in the "cleaned" data, including sex, common copy number effects and sample or cell line-specific molecular behavior.
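The trade-off described in these results can be illustrated with a deliberately simplified sketch. It is not the SVA algorithm and not the authors' code (their R code is in the repository linked in the Conclusions). It assumes a single surrogate variable, approximated as the leading direction of the residuals left after fitting the effect of interest, and it uses simulated data. All function names (`fit_ols`, `leading_factor`, `clean`, `mean_diff`) and all effect sizes are hypothetical.

```python
import random

def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_ols(X, y):
    """Least-squares coefficients for y ~ X via the normal equations."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def residuals(X, y):
    """Residuals of y after removing the least-squares fit on X."""
    beta = fit_ols(X, y)
    return [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]

def leading_factor(R, iters=50, seed=0):
    """Unit-norm leading sample-space direction of R (genes x samples), by power iteration."""
    rng = random.Random(seed)
    n = len(R[0])
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        scores = [sum(r * x for r, x in zip(row, v)) for row in R]            # R v
        v = [sum(s * row[j] for s, row in zip(scores, R)) for j in range(n)]  # R^T (R v)
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v

def clean(expr, treatment):
    """Remove one estimated surrogate variable while keeping the modeled treatment effect."""
    X0 = [[1.0, t] for t in treatment]           # model containing only the effect of interest
    R = [residuals(X0, y) for y in expr]         # heterogeneity not explained by that model
    sv = leading_factor(R)                       # simplified stand-in for a surrogate variable
    X1 = [[1.0, t, s] for t, s in zip(treatment, sv)]
    cleaned = []
    for y in expr:
        b_sv = fit_ols(X1, y)[2]                 # coefficient on the surrogate variable
        cleaned.append([yi - b_sv * s for yi, s in zip(y, sv)])
    return cleaned

def mean_diff(y, group):
    """Mean of y in group 1 minus mean of y in group 0."""
    a = [v for v, g in zip(y, group) if g == 1]
    b = [v for v, g in zip(y, group) if g == 0]
    return sum(a) / len(a) - sum(b) / len(b)

# Simulated data (assumption): 30 genes, 20 samples, balanced treatment and sex.
rng = random.Random(1)
n = 20
treatment = [float(i % 2) for i in range(n)]
sex = [float((i // 2) % 2) for i in range(n)]    # never supplied to clean()
expr = []
for g in range(30):
    sex_eff = 2.0 if g < 20 else 0.0             # genes 0-19 carry a sex effect
    trt_eff = 1.0 if g >= 15 else 0.0            # genes 15-29 carry a treatment effect
    expr.append([5.0 + sex_eff * s + trt_eff * t + rng.gauss(0, 0.1)
                 for s, t in zip(sex, treatment)])

cleaned = clean(expr, treatment)
print("sex difference, gene 0, before:", round(mean_diff(expr[0], sex), 3))
print("sex difference, gene 0, after: ", round(mean_diff(cleaned[0], sex), 3))
print("treatment difference, gene 29, before:", round(mean_diff(expr[29], treatment), 3))
print("treatment difference, gene 29, after: ", round(mean_diff(cleaned[29], treatment), 3))
```

In this toy setting, sex is the dominant source of variation left out of the model, so the estimated factor tracks it and the sex difference is largely removed from the "cleaned" values, while the treatment difference named in the model is retained. This mirrors the caution above: only effects written into the model are protected.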

CONCLUSIONS

Our analyses indicate that data "cleaning" can be an important component of high-throughput genomic data analysis when interrogating explicitly defined effects in the context of data affected by robust technical artifacts. However, caution should be exercised to avoid removing biological signal of interest. It is also important to note that open data exploration is not possible after such supervised "cleaning", because effects beyond those stipulated by the researcher may have been removed. With the goal of making these statistical algorithms more powerful and transparent to researchers in the biological sciences, we provide exploratory plots and accompanying R code for identifying and guiding the "cleaning" process (https://github.com/andrewejaffe/StemCellSVA). The impact of these methods is significant enough that we have made newly processed data available for the brain dataset at http://braincloud.jhmi.edu/plots/ and under GSE30272.

Figures

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2ca6/4636836/7105514b4329/12859_2015_808_Fig1_HTML.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2ca6/4636836/bf372fb080fa/12859_2015_808_Fig2_HTML.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2ca6/4636836/d17c964658d5/12859_2015_808_Fig3_HTML.jpg
Fig. 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2ca6/4636836/bd20cb3a7f40/12859_2015_808_Fig4_HTML.jpg
Fig. 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2ca6/4636836/fea8f4af27c8/12859_2015_808_Fig5_HTML.jpg
